Pattern Recognition 33 (2000) 533
Editorial

Energy minimization methods represent a fundamental methodology in computer vision and pattern recognition, with roots in such diverse disciplines as physics, psychology, and statistics. Recent manifestations of the idea include Markov random fields, deformable models and templates, relaxation labelling, various types of neural networks, etc. These techniques are now finding application in almost every area of computer vision, from early to high-level processing.

This edition of Pattern Recognition contains some of the best papers presented at the International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR'97) held at the University of Venice, Italy, from May 21 through May 23, 1997. Our primary motivation in organizing this workshop was to offer researchers the chance to report their work in a forum that allowed for both consolidation of efforts and intensive informal discussions. Although the subject was hitherto well represented in major international conferences in the fields of computer vision, pattern recognition and neural networks, there had been no attempt to organize a specialized meeting on energy minimization methods.

The papers appearing in this special edition fall into a number of distinct areas. There are two papers on contour detection. Zucker and Miller take a biologically plausible approach by providing a theory of line detection based on cortical cliques. Thornber and Williams, on the other hand, describe a stochastic contour completion process and provide an analysis of its characteristics. The next block of papers use Markov random fields. Molina et al. compare stochastic and deterministic methods for blurred image restoration. Perez and Laferte provide a means of sampling graph representations of energy functions. Barker and Rayner provide an image segmentation algorithm which uses Markov chain Monte Carlo for sampling.

Turning our attention to deterministic methods, Yuille and Coughlan provide a framework for comparing heuristic search procedures including twenty questions and the A-star algorithm. Hoffman et al. show how deterministic annealing can be used for texture segmentation. Rangarajan provides a new framework called self-annealing which unifies some of the features of deterministic annealing and relaxation labelling. The topic of deterministic annealing is also central to the paper of Klock and Buhmann, who show how it can be used for multidimensional scaling.

Next there are papers on object recognition. Zhong and Jain show how localization can be effected in large databases using deformable models based on shape, texture and colour. Myers and Hancock provide a genetic algorithm that can be used to explore the ambiguity structure of line labelling and graph matching. Lastly, under this heading, Kittler shows some theoretical relationships between relaxation labelling and the Hough transform. The final pair of papers are concerned with maximum a posteriori probability estimation. Li provides a recombination strategy for population-based search. Gelgon and Bouthemy develop a graph representation for motion tracking.

We hope this special edition will prove useful to practitioners in the field. A sequel to the workshop will take place in July, 1999 and we hope a second compendium of papers will result.

Edwin R. Hancock
Department of Computer Science
University of York
York YO1 5DD, England
E-mail address:
[email protected]

Marcello Pelillo
Università "Ca' Foscari", Venezia, Italy
Pattern Recognition 33 (2000) 535–542
Cliques, computation, and computational tractability

Douglas A. Miller, Steven W. Zucker*,¹

Center for Computational Vision and Control, Departments of Computer Science and Electrical Engineering, Yale University, P.O. Box 208285, New Haven, CT, USA

Received 15 March 1999
Abstract

We describe a class of computations that is motivated by a model of line and edge detection in primary visual cortex, although the computations here are expressed in general, abstract terms. The model consists of a collection of processing units (artificial neurons) that are organized into cliques of tightly inter-connected units. Our analysis is based on a dynamic analog model of computation, a model that is classically used to motivate gradient descent algorithms that seek extrema of energy functionals. We introduce a new view of these equations, however, and explicitly use discrete techniques from game theory to show that such cliques can achieve equilibrium in a computationally efficient manner. Furthermore, we are able to prove that the equilibrium is the same as that which would have been found by a gradient descent algorithm. The result is a new class of computations that, while related to traditional gradient-following computations such as relaxation labeling and Hopfield artificial neural networks, enjoys a different and provably efficient dynamics. The implications of the model extend beyond efficient artificial neural networks to (i) timing considerations in biological neural networks; (ii) building reliable networks from less-reliable elements; (iii) building accurate representations from less-accurate components; and, most generally, (iv) demonstrating an interplay between continuous "dynamical systems" and discrete "pivoting" algorithms. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Relaxation labeling; Energy minimization; Linear complementarity problem; Game theory; Early vision; Polymatrix games; Complexity; Dynamical system
1. Introduction

How are visual computations to be structured? The most popular approach is to define an energy functional that represents key aspects of the problem structure, and to formulate solutions as minima (or maxima) of this functional. Solutions are sought by a gradient-descent procedure, iteratively formulated, and differences
Portions of this material were presented at the Snowbird Workshop on Neural Computing, April, 1992, and at the Workshop on Computational Neuroscience, Marine Biological Laboratories, Woods Hole, MA, in August, 1993.
* Corresponding author. E-mail address: [email protected] (S.W. Zucker).
¹ Research supported by AFOSR, NSERC, NSF, and Yale University.
between algorithms often center on the space over which the minimization takes place, as well as on the type of functional being extremized. In curve detection, for example, one can define the functional in terms of curvature, and then seek "curves of least bending energy" (e.g., [1]). We have developed a related, but different, approach in which the functional varies in proportion to the residual between estimates of tangents and curvatures [2,3]. By beginning with those points that are well-informed by initial (non-linear) operators [4], we have been able to find consistent points [5] in a small number of iterations.

These computations exemplify the popular "stable state" view of neural computation [6,7], and the energy-minimization view in computer vision. Its attraction is that, when suitable basins can be defined and an energy or potential function exists over a proper labeling space, the resultant algorithms that seek extremal points can be
formulated in gradient descent terms. The design of such networks lies mainly in the specification of a connection architecture and in specifying the synaptic weights or compatibilities between processing units, from which the energy form follows. When the compatibilities or synaptic connections between processing units are asymmetric, no such energy form exists, but a more general variational inequality can be defined to drive the evolution toward consistent points [5]. Pelillo [8] has used the Baum–Eagon inequalities to analyze the dynamics of such processes.

There is another, on the surface very different, perspective toward such processes. Relaxation labeling can be viewed from a game-theoretic perspective: consider nodes in the relaxation graph as players, labels associated with nodes as pure strategies for the players, and the compatibility function $r_{ij}(\lambda, \lambda')$ as the "payoff" that player $i$ receives from player $j$ when $i$ plays pure strategy $\lambda$ and $j$ plays strategy $\lambda'$. The probability distribution over labels, $p_i(\lambda)$, then defines the mixed strategy for each player $i$. Properly completed, such structures are instances of polymatrix games, and the variational inequality defining consistent points is related to the Nash equilibrium for such games [9,10]. Other investigations into using game theory for computer vision problems include Duncan [11,12] and Berthod [13].

The above relationship between relaxation labeling and game theory opens a curious connection between an analog system for finding stationary points of an energy functional (or, in relaxation terms, of the average local potential) and the discrete algorithms normally employed to find equilibria of games. This connection between continuous dynamical systems and discrete "pivoting" algorithms is exploited below in Section 3, and provides an example of the extremely interesting area of computation over real numbers [14].

In addition to these formal connections between energy functions, variational inequalities, and game theory, such networks are exciting because of the possibility that they actually might model the behaviour of real neurons. This connection arises from a simple model of computation in neurons that is additive and is modeled by voltage/current relationships [6,15]. Neurons fire in proportion to their membrane potential, and three factors are considered in deriving it: (i) changes in currents induced by neuronal activity in pre-synaptic neurons; (ii) leakage through the neuronal membrane; and (iii) additional input or bias currents. These considerations can be modeled as a differential equation (see below), and it was this equation that led Hopfield [16] to study the stable state view of neural computation. Hopfield and Tank [6] suggest that such a view is relevant for a wide range of problems in biological perception, as well as others in robotics, engineering, and commerce. This equation also corresponds to the support computation in relaxation labeling [10],
which has also been applied to a wide range of such problems.

The relationship between neural computation and the modeling of visual behaviour is exciting, but it raises a deep question. Consider, for instance, the following. Although we can readily interpret line drawings, it has been shown that these problems can be NP-hard [17]. The question is then whether it is possible that biological (or other physical) systems are solving problems that are NP-hard for Turing machines. Contrary to other trends in neuromodeling, we would like to suggest that there may be no need to assume the brain attempts to find heuristic solutions to NP-hard problems, but rather that it has reduced the problems it is trying to solve to a polynomial class.

In a companion paper ([18]; see also [19,20]) we have described an analog network model for the reliable detection of short line contours and edges on hyperacuity scales in primary visual cortex of primates. In our theory and model this is accomplished by the rapid saturation-level responses of highly interconnected self-excitatory groups of superficial-layer pyramidal neurons. We call these self-excitatory groups cliques, and our theory implies that they are a fundamental unit of visual representation in the brain. In this previous work we have shown that this theory is consistent with key aspects of both cortical neurobiology and the psychophysics of hyperacuity, particularly orientation hyperacuity. In this paper we shall describe this theory from a more computational viewpoint, and in particular we shall show that the clique-based computations which we have theorized as being consistent with the observed neurobiology of the primary visual cortex in effect solve a class of computational problems which are solvable in polynomial time as a function of the problem size.
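The game-theoretic reading of relaxation labeling sketched above is easy to state concretely. The following toy example is only an illustration under assumed values (the compatibility tensor and label probabilities are random, and the names r, p and support are ours, not from the paper): it computes the support each label receives and tests the consistency condition that every label used by a node attains the maximal support at that node.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_labels = 4, 3

# r[i, lam, j, mu]: compatibility ("payoff") node/player i receives from node j
# when i takes label lam and j takes label mu (arbitrary toy values).
r = rng.uniform(-1.0, 1.0, size=(n_nodes, n_labels, n_nodes, n_labels))
for i in range(n_nodes):
    r[i, :, i, :] = 0.0                        # no self-compatibility

# p[i, lam]: label probabilities, i.e. the mixed strategy of player i.
p = rng.dirichlet(np.ones(n_labels), size=n_nodes)

def support(p):
    # s_i(lam) = sum_j sum_mu r[i, lam, j, mu] * p_j(mu)
    return np.einsum('iljm,jm->il', r, p)

def is_consistent(p, tol=1e-9):
    # Consistent labeling (Nash-like condition): every label with nonzero
    # probability at node i achieves the maximal support value at node i.
    s = support(p)
    best = s.max(axis=1, keepdims=True)
    return bool(np.all((p <= tol) | (s >= best - tol)))

print(support(p).round(2))
print(is_consistent(p))
```

A random labeling will normally fail the test; relaxation-style updates are what drive p toward consistent points.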
2. Computation by cliques: a model

We shall not present the full biological motivation here, because a simple cartoon of the early primate visual system suffices to present the intuition. To set the stage anatomically, recall that the retina projects mainly to the lateral geniculate nucleus (LGN), and the LGN projects to the visual cortex (V1). Physiologically, orientation selectivity emerges in visual cortex, but the orientation tuning is rather broad (typically 10–20°). This presents a problem because, behaviourally, we are able to distinguish orientations to a hyperacuity level [21,22]. Somehow networks of neurons working in concert must be involved to achieve this added precision, but the question is how. One recalls Hebb's [23] classical hypothesis about cell assemblies, and more recent contributions to cell assemblies by Braitenberg [24,25] and Palm [26]. However, Hebb's hypothesis was purely intuitive, and
did not address concrete questions in vision. Nor did Braitenberg and Palm consider analog systems. One part of our project is to develop a view of neural computation sufficiently rich to explain the above hyperacuity performance based on low acuity measurements; another part, which we expand upon below, is to show that these computations are computationally tractable. It is this latter analysis, we believe, that is of interest to research in computer vision, because it leads to alternative methods for solving energy minimization problems as they arise in vision and pattern recognition.

Several basic facts about neuroanatomy are relevant to motivating our model. First, recall that the projection from the LGN to V1 is dominated by excitatory synapses, and most intra-cortical connections are excitatory. Second, inhibition is much less specific, and, finally, is additive rather than multiplicative (extensive references in support of these observations are in [18,27,28]).

We take the observations about hyperacuity and cell assemblies to indicate that the substrate for neural computation might not be simple networks of neurons, but rather might involve groups of tightly interconnected neurons considered as a unit. We formally define such units as cliques of neurons, where the term from graph theory is invoked to suggest that neurons in a clique are densely interconnected. The dominance of excitatory interactions over short distances further suggests that neurons within a clique could form dense excitatory feedback circuits [28], and the natural operation of these circuits is to bring all neurons to saturation response levels rapidly. Neuronal biophysics then limits the process (since regular spiking neurons cannot burst for very long). This model of groups of neurons raising themselves to saturation level and then hitting their biophysical limit has been studied [18,19]; the result is that a short burst of about 4 spikes in about 25 ms is achievable for each neuron, and a "computation", we submit, is achieved when all neurons within the clique fire as a cohort at this rate within this time period. Note that such firing rates are well beyond the average to be expected in visual cortex. The computation is initiated with a modest initial afferent current, as would occur, e.g., when the LGN projection stimulates a subset of neurons in the clique.

This mode of computation differs from the classical view, as discussed above, because the local circuit computations within the cortex are characterized by saturated responses, indicated by a rapid burst of spikes, rather than by following a gradient to an equilibrium. This cohort burst signals the "binding" of those neurons into a clique, and the excited clique represents the stimulus orientation to high precision.

More generally, the above computation is modeled as a two phase process. In the first, saturating phase, the input current drives all neurons in the clique to saturation, and in the second, inhibiting phase, the input current is removed and all neurons not enjoying positive
feedback decay to their base level. We believe this model is relevant to circuits other than the cartoon model of the LGN to V1 projection used as an introductory example, especially to intra- and inter-columnar circuits, and shall be pursuing them elsewhere. Such neurophysiological modeling is not necessary for the theoretical developments in this paper.

A description of this computation is developed more fully in [18], but for completeness we now list several of its advantages. First, there is the question of how to obtain the precise representation underlying (orientation) hyperacuity from the coarse (orientation) tuning of individual simple cells. Our solution is to form a type of distributed code over collections of simple cells, and this collection is the "clique". Roughly, the idea is that different cells would cover the stimulus with slight variation in position and orientation of their receptive fields; the increased sensitivity to orientation derives from the composite covering; see Fig. 1. The organization is revealed by the barrage of action potentials from the cells comprising the clique. The conceptual appeal of the model is indicated from this example: highly accurate computations are derived from a "clique" of coarse ones. Although there are limits to the following analogy, the situation resembles the addition of "bits" to the accumulator in a digital computer: more bits leads to higher accuracy.

The second advantage of this model is reliability (cf. [29]), and here it differs substantially from the architectural considerations underlying digital accumulators. Each neuron can participate in many cliques, and the system is highly redundant. It is shown in [18] that responses at the clique level remain reliable to the 90% level even when individual neurons are only reliable to the 60% level. Here redundancy improves reliability AND accuracy, which is very different from typical uses of redundancy to only improve reliability; see also [30]. The final advantage is computational efficiency, and proving this forms the remainder of this paper.
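A rough sense of how redundancy within a clique can raise reliability can be had from a simple counting argument. The sketch below is only an illustration: it assumes independent units and a bare threshold (vote-style) readout, which is not the dynamical analysis of [18]; the clique size of about 33 cells and the 60% single-neuron reliability figure are taken from the text and the Fig. 1 caption, while the thresholds are hypothetical.

```python
from math import comb

def prob_at_least(n, k, p):
    """P(at least k of n independent units respond), each with probability p."""
    return sum(comb(n, j) * (p ** j) * ((1 - p) ** (n - j)) for j in range(k, n + 1))

n, p = 33, 0.60            # approximate clique size and assumed single-neuron reliability
for k in (17, 20, 23):     # hypothetical response thresholds (majority and stricter)
    print(f"threshold {k}: clique responds with probability {prob_at_least(n, k, p):.3f}")
```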
3. A polynomial-time algorithm for determining system response to input bias

This section contains the main contribution of this paper, and it is here that the primary differences from standard energy minimization computations are developed. In particular, we do not compute the trajectory that our dynamical system will follow to find an equilibrium, but rather the equilibrium itself. Miller and Zucker's [10] paper is helpful as background reading.

In analog artificial neuronal networks, "neurons" are modeled as amplifiers, and "synapses" between "neurons" are modeled as conductances. In symbols, let $u_i$ denote the input voltage and $V_i$ the output voltage of a "neuron" $i$, and let $V_i = g(u_i)$ denote its input–output
relationship. While this is often taken as sigmoidal, we have argued piecewise-linear models work as well, and perhaps even offer advantages [10]. Further, if we let $C_i$ denote the input capacitance to amplifier $i$, $I_i$ be a fixed input bias for $i$, and if we define $R_i$ by the relationship

$$1/R_i = 1/\rho_i + \sum_{j \ne i} |T_{ij}|,$$

where $\rho_i$ is the resistance across $C_i$ and $T_{ij}$ is the conductance between amplifier $i$ and $j$, then the system's dynamics are governed by the system (e.g., [16]):

$$C_i \frac{du_i}{dt} = \sum_{j \ne i} T_{ij} V_j - u_i/R_i + I_i, \qquad V_i = g_i(u_i). \tag{1}$$

If this system starts out from the state in which all amplifier input and output voltages are zero, then by the form of the piecewise-linear amplifier functions $g_i(u_i): [a_i, b_i] \to \mathbb{R}$ given by

$$g_i(u_i) = \begin{cases} 0, & u_i < a_{i,1},\\ c_{i,1} u_i + d_{i,1}, & a_{i,1} \le u_i \le b_{i,1},\\ \;\;\vdots & \\ c_{i,\omega(i)} u_i + d_{i,\omega(i)}, & a_{i,\omega(i)} \le u_i \le b_{i,\omega(i)},\\ 1, & u_i > b_{i,\omega(i)}, \end{cases} \tag{2}$$

where

$$a_i < a_{i,1} < b_{i,1} = a_{i,2} < \cdots < a_{i,\omega(i)} < b_{i,\omega(i)} < b_i,$$
$$c_{i,k} = [g_i(b_{i,k}) - g_i(a_{i,k})]/[b_{i,k} - a_{i,k}], \qquad d_{i,k} = g_i(a_{i,k}) - c_{i,k} a_{i,k}$$

for all integers $k$, $1 \le k \le \omega(i)$, this is an asymptotically stable equilibrium if the bias terms $I_i$ are all zero. However, if the bias terms are nonzero (as for example if they were to represent depolarizing input current originating from the LGN) then the system will evolve monotonically to a new equilibrium state in which some amplifier outputs may be nonzero. If we then remove the bias terms, the system output variables will then monotonically decrease to a final equilibrium state, which we may view as the final output or computation of the system.

It is our purpose here to show that we can determine this final state, whatever it might be, in a number of computational steps which is polynomial in the number of bits needed to specify the system, thus showing the problem is in class P [31], as opposed, for example, to NP-hard problems such as the traveling salesman, which very likely are not. Note that we are not computing the trajectory which the system (1) may follow to get to an equilibrium, but only the equilibrium state itself. We shall do this by in effect computing a parametrized continuum of equilibria, as in the parametric simplex method [32]. These equilibria will correspond, first, to slowly increasing the upper bounds on all variables in the presence of bias (Phase I), followed by slowly removing the bias (Phase II). We shall first show that this procedure is computable in polynomial time, and then show that the solutions obtained are in fact those which would have resulted from a time evolution of Eq. (1).

We stress that we are especially interested in (nonempty) sets of amplifiers S which achieve an asymptotically stable unbiased equilibrium in which the amplifiers S have output 1, and all other amplifiers have output 0. Such sets of amplifiers are called self-excitatory sets; conditions for their existence and certain of their properties are described in [18–20]. Loosely, we shall require the conductances and biases to be non-negative, and the resting state to be stable [33].

Fig. 1. Distributed representation for a thin line contour derives from a family of receptive fields covering it. Each of these receptive fields comes from a single cortical neuron, and a clique consists of about 33 neurons. In this example receptive fields are represented by rectangles, and a white slit contour stimulus (heavy black outline) excites a highly interconnected clique of simple cortical (S) cells to maximal saturated feedback response by crossing, in the appropriate direction and within a narrow time interval, the edge response region of a sufficiently large proportion of the clique's cells. Three such receptive fields, out of the approximately 33 required, are illustrated here.
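To make Eqs. (1) and (2) concrete, the sketch below integrates a tiny network of this form with forward Euler steps: a biased subset of units recruits the rest of a mutually excitatory group to saturation, and the group remains saturated after the bias is removed. All parameter values (conductances, capacitances, bias currents, and a single-segment amplifier) are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def g(u, a=0.0, b=0.5):
    """Single-segment piecewise-linear amplifier (a minimal instance of Eq. (2)):
    output 0 below a, 1 above b, linear in between."""
    return np.clip((u - a) / (b - a), 0.0, 1.0)

n = 5
T = 0.3 * (np.ones((n, n)) - np.eye(n))          # excitatory conductances T_ij (toy values)
rho = np.ones(n)                                  # leak resistance rho_i
C = np.ones(n)                                    # input capacitance C_i
R = 1.0 / (1.0 / rho + np.abs(T).sum(axis=1))     # 1/R_i = 1/rho_i + sum_j |T_ij|
I = np.zeros(n)
I[:3] = 0.6                                       # bias current to a subset (e.g. LGN drive)

u = np.zeros(n)
dt = 0.01
for step in range(4000):
    if step == 2000:
        I[:] = 0.0                                # remove the bias and let the network settle
    V = g(u)
    dudt = (T @ V - u / R + I) / C                # Eq. (1)
    u = u + dt * dudt

print(np.round(g(u), 3))                          # outputs at the final (unbiased) equilibrium
```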
To begin, note that we have previously shown [10] that for piecewise-linear amplifiers of the form (2) we may represent Eq. (1) as a constrained linear dynamical system

$$p' = Rp + c + \delta\tilde{c}, \qquad 0 \le p \le e, \tag{3}$$

where $R$ is an $n \times n$ connectivity matrix, $c$ is a vector of bias terms $c_i$ not including $I_i$ as a factor, $\tilde{c}$ is a vector of bias terms $\tilde{c}_i$ which do include $I_i$ as a factor, $\delta \in [0, 1]$ is a scalar, and $e$ is a vector of 1's. We can thus let $\delta = 0$ and $\delta = 1$ correspond to zero bias and a bias $I_i$, respectively. It can be shown [9] as a variant of the well known Kuhn–Tucker theorem [34] that $p$ is an equilibrium for Eq. (3) if and only if there also exist vectors $y, u, v$ such that $p, y, u, v$ satisfy the system

$$\begin{bmatrix} R & -I_n \\ I_n & 0 \end{bmatrix}\begin{bmatrix} p \\ y \end{bmatrix} + \begin{bmatrix} I_n & 0 \\ 0 & I_n \end{bmatrix}\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} -c \\ e \end{bmatrix} + \delta\begin{bmatrix} -\tilde{c} \\ 0 \end{bmatrix},$$
$$p, y, u, v \ge 0, \qquad p^\top u + y^\top v = 0. \tag{4}$$

Here $I_n$ is the $n \times n$ identity matrix.

The above system of equations is an example of a linear complementarity problem, which in general is NP-complete, but is polynomial in several important special cases, including linear and convex quadratic programming [35,36]. An important technique for solving these and other special cases of linear complementarity problems is called Lemke's algorithm [37], and we show [10] that it may be used to find an equilibrium for any system of Eq. (1) with piecewise-linear amplifiers. We shall use a variation of Lemke's algorithm here as well, although one which is different from that which we have described previously. As opposed to the version of Lemke's algorithm which we have described previously [9,10], where it is assumed that the practical behavior is polynomial, based on previous experience with it and related algorithms such as the simplex method for linear programming, in this case we can actually show that this version of Lemke's algorithm must terminate in a number of steps which is linear in the number of model neurons. These steps are called pivots, and each pivot amounts to solving a $2n \times 2n$ subsystem of the $2n \times 4n$ system of linear equations in the first line of Eq. (4) (cf. any text on linear programming, e.g. [32], for a detailed description of pivoting). Therefore, if we assume the coefficients other than $\delta$ of Eq. (4) to be integers (or equivalently rationals with a single common denominator) each pivot can be shown to require at most a polynomial number of computational steps, and hence this will also be true of the entire procedure.

To describe this pivoting procedure, a more useful version of Eq. (4) will be

$$\begin{bmatrix} R & -I_n \\ I_n & 0 \end{bmatrix}\begin{bmatrix} p \\ y \end{bmatrix} + \begin{bmatrix} I_n & 0 \\ 0 & I_n \end{bmatrix}\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} -c \\ 0 \end{bmatrix} + \delta_1\begin{bmatrix} 0 \\ e \end{bmatrix} + \delta_2\begin{bmatrix} -\tilde{c} \\ 0 \end{bmatrix},$$
$$p, y, u, v \ge 0, \qquad p^\top u + y^\top v = 0. \tag{5}$$

The procedure will have two phases. In Phase I we shall assume $\delta_2 = 1$, and $\delta_1$ will increase from 0 to 1. In Phase II, $\delta_1 = 1$, and $\delta_2$ will decrease from 1 to 0.

To describe Phase I, it will be convenient to rewrite Eq. (5) in yet another form

$$\begin{bmatrix} 0 & R & -I_n & I_n & 0 \\ -e & I_n & 0 & 0 & I_n \end{bmatrix}\begin{bmatrix} \delta_1 \\ p \\ y \\ u \\ v \end{bmatrix} = \begin{bmatrix} -c \\ 0 \end{bmatrix} + \delta_2\begin{bmatrix} -\tilde{c} \\ 0 \end{bmatrix},$$
$$\delta_1, p, y, u, v \ge 0, \qquad p^\top u + y^\top v = 0. \tag{6}$$

For $\delta_1 = 0$ we can trivially find a solution for the other variables of Eq. (6) by letting $p, v = 0$, and, for $i = 1, \ldots, n$,

$$u_i = \begin{cases} -c_i - \tilde{c}_i, & \text{if } -c_i - \tilde{c}_i > 0,\\ 0, & \text{else,} \end{cases} \qquad y_i = \begin{cases} c_i + \tilde{c}_i, & \text{if } c_i + \tilde{c}_i > 0,\\ 0, & \text{else.} \end{cases}$$

Observe that by multiplying by $-1$ the rows of the first line of Eq. (6) which correspond to nonzero values of $y_i$ we obtain a basic feasible tableau, i.e. there is a subset of $2n$ columns which is a permuted identity matrix, and the trivial solution to these equations whose nonzero elements correspond to these columns is also a solution to the second line of (6), i.e. is nonnegative. The identity matrix or basis of this tableau consists of those columns corresponding to the nonzero elements of $y$ and $u$, with the remainder of the identity columns taken from those corresponding to $v$.

Note however that a nondegenerate solution corresponding to this basis (i.e. a solution with no basic
variables equal to zero) would not satisfy the third line of Eq. (6), i.e. would not be complementary. In order to obtain a basic feasible complementary tableau for Eq. (6) we can proceed by pivoting from each of the rightmost $n$ columns which violates complementarity into the corresponding one of the leftmost $n$ columns. We remark at this point that even though $p$ is constrained to be zero, the basic feasible complementary solution which we have constructed does correspond to a nondegenerate solution for an infinitesimal relaxation of the constraints. That is, we can, keeping the same basis, add to the right-hand side of the first line of (6) a $2n$-vector $(\varepsilon, \varepsilon^2, \ldots, \varepsilon^{2n})^\top$, where $\varepsilon$ is treated as arbitrarily small but positive. This infinitesimal relaxation to produce nondegeneracy is in fact the standard lexicographic pivoting method [32].

We can now begin the complementary pivoting procedure which characterizes Lemke's algorithm by pivoting into the leftmost column of the linear equations corresponding to $\delta_1$, thus allowing $\delta_1$ to become positive. This causes a column $i$ to leave the basis, thus creating a complementary pair $i, \tilde{i}$ outside the basis, where either $\tilde{i} = i + n$ or $\tilde{i} = i - n$. Our next choice for a pivot column, in order to maintain complementarity, is therefore $\tilde{i}$. We can continue this procedure until $\delta_1 = 1$ or we reach a column where there is no positive element to pivot on (geometrically an infinite ray), which represents a basis for which $\delta_1$ may be arbitrarily large. Note that this is actually Lemke's algorithm in reverse pivoting sequence, since usually we start on an infinite ray. However in all other respects it is the same as the algorithm described by Lemke.

What does each pivot represent? From our construction of Eq. (3), the larger we make $\delta_1$, the larger the possible voltage outputs that each amplifier may have. Since all connections are nonnegative, by increasing $\delta_1$ we can either increase an amplifier output through biasing, or by outputs from other amplifiers. Thus each pivot either represents an amplifier output going from 0 to positive, or from positive to 1, its upper boundary. Altogether there can be at most two pivots for each amplifier. Therefore the result of Phase I can be computed in a number of steps which is polynomial in the number of bits needed to specify (1). In particular if (as is the natural assumption for modeling the brain) the maximum specification size of individual components (resistors, capacitors, amplifiers) given in Eq. (3) is bounded and not a function of the number of components, this implies that Phase I can be computed in polynomial time in the number of model neurons.

To begin Phase II, we keep the tableau we had at the end of Phase I, but move the leftmost column of Eq. (6) back to the right-hand side, as in Eq. (5), with $\delta_1 = 1$ and $\delta_2 = 1$. This leaves us with a feasible basic complementary solution to Eq. (5). We can then rewrite
Eq. (5) as

$$\begin{bmatrix} -\tilde{c} & R & -I_n & I_n & 0 \\ 0 & I_n & 0 & 0 & I_n \end{bmatrix}\begin{bmatrix} \delta_2 \\ p \\ y \\ u \\ v \end{bmatrix} = \begin{bmatrix} -c \\ 0 \end{bmatrix} + \delta_1\begin{bmatrix} 0 \\ e \end{bmatrix},$$
$$\delta_2, p, y, u, v \ge 0, \qquad p^\top u + y^\top v = 0, \tag{7}$$
and pivot into the leftmost column of the first line, which we know from our termination of Phase I must be a basic feasible complementary solution with one missing pair. There will either be two possible pivot column choices, one increasing $\delta_2$ and the other decreasing it, or else there will be one nonpositive column (an infinite ray, for which solutions may be constructed for arbitrarily large $\delta_2$), and one pivoting column which will decrease $\delta_2$. In either case we can pivot into a unique column which will decrease $\delta_2$, thus initiating a unique complementary pivoting procedure which will reduce the bias parameter $\delta_2$ from 1 to 0, at which point there is no complementary missing pair and Phase II terminates. The argument that this phase is also polynomial is the same as that for Phase I, except that now each pivot monotonically decreases the voltage outputs.

It remains to show that the solutions obtained for Phase I and Phase II of the above procedure correspond to those which would be obtained from a time evolution of the system (3), first starting from the zero state in the presence of bias, and then removing the bias. With regard to Phase I, let $x$ be the time evolution equilibrium, and $\bar{x}$ correspond to the above parametric solution for $\delta_1 = 1$. If for some $i$, $x_i > \bar{x}_i$, then the only way this can happen is if there exists a $j \ne i$ such that $x_j > \bar{x}_j$. But now let us watch the time evolution of $x$ from zero and compare it to $\bar{x}$, and suppose that $i$ is the first index in time such that $x_i > \bar{x}_i$. Then this is an obvious contradiction, since another such index $j$ must already have existed (excluding degenerate solutions). Therefore $x \le \bar{x}$. But exactly the same argument can be used to show $\bar{x} \le x$, using the evolution of $\bar{x}$ with respect to an increase in $\delta_1$ instead of the evolution of $x$ with respect to time. Therefore $x = \bar{x}$. A similar argument applies to the end result of Phase II.
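As a concrete check of the complementarity conditions used above, the sketch below constructs the initial solution of Eq. (6) at $\delta_1 = 0$ exactly as described (p = v = 0, with u and y absorbing the sign of c + c̃ componentwise) and verifies both block rows, nonnegativity, and p⊤u + y⊤v = 0. The matrix R and the bias vectors are arbitrary toy values; this illustrates the conditions being pivoted on, not the pivoting procedure itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
R = rng.uniform(0.0, 0.5, size=(n, n))      # nonnegative connectivity (toy values)
np.fill_diagonal(R, 0.0)
c = rng.uniform(-1.0, 1.0, size=n)           # bias terms c_i (no I_i factor)
c_tilde = rng.uniform(-1.0, 1.0, size=n)     # bias terms c~_i (carrying I_i)

# Initial basic solution of Eq. (6) at delta_1 = 0, delta_2 = 1:
p = np.zeros(n)
v = np.zeros(n)
u = np.maximum(-(c + c_tilde), 0.0)
y = np.maximum(c + c_tilde, 0.0)

# Block rows of Eq. (6):  R p - y + u = -c - c~   and   p + v = delta_1 * e = 0.
row1_ok = np.allclose(R @ p - y + u, -(c + c_tilde))
row2_ok = np.allclose(p + v, 0.0)
nonneg = all(np.all(z >= 0) for z in (p, y, u, v))
complementary = abs(p @ u + y @ v) < 1e-12

print(row1_ok, row2_ok, nonneg, complementary)
```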
4. Conclusions

The field of neural computation is dominated by a stable-attractor viewpoint in which energy forms are minimized. This viewpoint is attractive because of the gradient-descent interpretation of computations and the relevance for modeling perceptual and other complex
phenomena. However, serious questions about computational complexity arise for these models, such as how biological systems can actually compute such trajectories.

An alternative view of stable attractors and energy minimization is obtained by interpreting the relevant structures into game theory. This sets up a duality between continuous dynamical systems and discrete pivoting algorithms for finding equilibria. We exploit this duality, and the biological metaphor, to motivate an alternative interpretation of what a "neural energy minimizing computation" might be. Starting with the standard Hopfield equations, we consider computations that are organized into excitatory cliques of neurons. The main result in this paper was to show how efficiently these neurons can bring each other to saturation response levels, and how these responses agree with the end result of gradient-descent computations. The result suggests that artificial neural networks can be designed for efficient and reliable computation using these techniques, and perhaps that biological neural networks have discovered a reliable and efficient approach to finding equilibria that differs substantially from common practice in computer vision and pattern recognition.
References

[1] S. Ullman, High Level Vision, MIT Press, Cambridge, MA, 1996.
[2] P. Parent, S.W. Zucker, Trace inference, curvature consistency, and curve detection, IEEE Trans. Pattern Anal. Machine Intell. 11 (1989) 823–839.
[3] S.W. Zucker, A. Dobbins, L. Iverson, Two stages of curve detection suggest two styles of visual computation, Neural Comput. 1 (1989) 68–81.
[4] L. Iverson, S.W. Zucker, Logical/linear operators for image curves, IEEE Trans. Pattern Anal. Machine Intell. 17 (10) (1995) 982–996.
[5] R.A. Hummel, S.W. Zucker, On the foundations of relaxation labeling processes, IEEE Trans. Pattern Anal. Machine Intell. PAMI-5 (1983) 267–287.
[6] J.J. Hopfield, D.W. Tank, Neural computation of decisions in optimization problems, Biol. Cybernet. 52 (1985) 141–152.
[7] D.J. Amit, Modeling Brain Function: the World of Attractor Neural Networks, Cambridge University Press, Cambridge, 1989.
[8] M. Pelillo, The dynamics of nonlinear relaxation labeling processes, J. Math. Imag. Vision 7 (1997) 309–323.
[9] D.A. Miller, S.W. Zucker, Copositive-plus Lemke algorithm solves polymatrix games, Oper. Res. Lett. 10 (1991) 285–290.
[10] D.A. Miller, S.W. Zucker, Efficient simplex-like methods for equilibria of nonsymmetric analog networks, Neural Comput. 4 (1992) 167–190.
[11] H.I. Bozma, J.S. Duncan, A game-theoretic approach to integration of modules, IEEE Trans. Pattern Anal. Machine Intell. 16 (1994) 1074–1086.
[12] A. Chakraborty, J.S. Duncan, Game theoretic integration for image segmentation, IEEE Trans. Pattern Anal. Machine Intell. 21 (1999) 12–30.
[13] S. Yu, M. Berthod, A game strategy approach for image labeling, Comput. Vision Image Understanding 61 (1995) 32–37.
[14] L. Blum, F. Cucker, M. Shub, S. Smale, Complexity and Real Computation, Springer, New York, 1998.
[15] T.J. Sejnowski, Skeleton filters in the brain, in: G.E. Hinton, J.A. Anderson (Eds.), Parallel Models of Associative Memory, Lawrence Erlbaum, Hillsdale, NJ, 1981.
[16] J.J. Hopfield, Neurons with graded response have collective computational properties like those of two-state neurons, Proc. Natl. Acad. Sci. USA 81 (1984) 3088–3092.
[17] L.M. Kirousis, C.H. Papadimitriou, The complexity of recognizing polyhedral scenes, J. Comput. System Sci. 37 (1988) 14–38.
[18] D.A. Miller, S.W. Zucker, Computing with self-excitatory cliques: a model and an application to hyperacuity-scale computation in visual cortex, Neural Comput. 11 (1) (1999) 21–66.
[19] D.A. Miller, S.W. Zucker, A model of hyperacuity-scale computation in visual cortex by self-excitatory cliques of pyramidal cells, Technical Report TR-CIM-93-13, Center for Intelligent Machines, McGill University, Montreal, August, 1994.
[20] D. Miller, S.W. Zucker, Reliable computation and related games, in: M. Pelillo, E. Hancock (Eds.), Energy Minimization Methods in Computer Vision and Pattern Recognition, Lecture Notes in Computer Science, vol. 1223, Springer, Berlin, 1997, pp. 3–18.
[21] G. Westheimer, The spatial grain of the perifoveal visual field, Vision Res. 22 (1982) 157–162.
[22] G. Westheimer, S.P. McKee, Spatial configurations for visual hyperacuity, Vision Res. 17 (1977) 941–947.
[23] D.O. Hebb, The Organization of Behaviour, Wiley, New York, 1949.
[24] V. Braitenberg, Cell assemblies in the cerebral cortex, in: R. Heim, G. Palm (Eds.), Theoretical Approaches to Complex Systems, Lecture Notes in Biomathematics, vol. 21, Springer, New York, 1978, pp. 171–188.
[25] V. Braitenberg, A. Schuez, Anatomy of the Cortex: Statistics and Geometry, Springer, Berlin, 1991.
[26] G. Palm, Neural Assemblies: An Alternative Approach to Artificial Intelligence, Springer, Berlin, 1982.
[27] R.J. Douglas, K.A.C. Martin, Neocortex, in: G.M. Shepherd (Ed.), The Synaptic Organization of the Brain, 3rd ed., Oxford University Press, New York, 1990, pp. 389–438.
[28] R.J. Douglas, C. Koch, K.A.C. Martin, H. Suarez, Recurrent excitation in neocortical circuits, Science 269 (1995) 981–985.
[29] E.F. Moore, C.E. Shannon, Reliable circuits using less reliable relays, J. Franklin Inst. 262 (1956) 191–208, 281–297.
[30] S. Winograd, J.D. Cowan, Reliable Computation in the Presence of Noise, MIT Press, Cambridge, MA, 1963.
[31] M.R. Garey, D.S. Johnson, Computers and Intractability, Freeman, San Francisco, 1979.
[32] G.B. Dantzig, Linear Programming and Extensions, Princeton University Press, Princeton, NJ, 1963.
[33] M.W. Hirsch, S. Smale, Differential Equations, Dynamical Systems, and Linear Algebra, Academic Press, New York, 1974.
[34] H.W. Kuhn, A.W. Tucker, Nonlinear programming, in: J. Neyman (Ed.), 2nd Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, CA, 1951, pp. 481–492.
[35] K. Murty, Linear Complementarity, Linear and Nonlinear Programming, Heldermann, Berlin, 1988.
[36] R.W. Cottle, J.-S. Pang, R. Stone, The Linear Complementarity Problem, Academic Press, New York, 1992.
[37] C.E. Lemke, Bimatrix equilibrium points and mathematical programming, Management Sci. 11 (1965) 681–689.
About the Author: STEVEN W. ZUCKER is the David and Lucile Packard Professor of Computer Science and Electrical Engineering at Yale University. Before moving to Yale in 1996, he was Professor of Electrical Engineering at McGill University and Director of the Program in Artificial Intelligence and Robotics of the Canadian Institute for Advanced Research. He was elected a Fellow of the Canadian Institute for Advanced Research (1983), a Fellow of the IEEE (1988), and By-Fellow of Churchill College, Cambridge (1993). Dr. Zucker obtained his education at Carnegie-Mellon University in Pittsburgh and at Drexel University in Philadelphia, and was a post-doctoral Research Fellow in Computer Science at the University of Maryland, College Park. He was Professeur Invité at the Institut National de Recherche en Informatique et en Automatique, Sophia-Antipolis, France, in 1989, a Visiting Professor of Computer Science at Tel Aviv University in January, 1993, and an SERC Fellow of the Isaac Newton Institute for Mathematical Sciences, University of Cambridge. Prof. Zucker has authored or co-authored more than 130 papers on computational vision, biological perception, artificial intelligence, and robotics, and serves on the editorial boards of 8 journals.

About the Author: DOUGLAS MILLER obtained his Ph.D. at the University of California, Berkeley, in Operations Research. Following a brief period in industry with Pacific Gas and Electric, in California, he became a Post-Doctoral Fellow at the Center for Intelligent Machines, McGill University, in 1990. Douglas A. Miller passed away in 1994.
Pattern Recognition 33 (2000) 543–553
Characterizing the distribution of completion shapes with corners using a mixture of random processes

Karvel K. Thornber a,*, Lance R. Williams b

a NEC Research Institute, 4 Independence Way, Princeton, NJ 08540, USA
b Department of Computer Science, University of New Mexico, Albuquerque, NM 87131, USA

Received 15 March 1999
Abstract

We derive an analytic expression for the distribution of contours $x(t)$ generated by fluctuations in $\dot{x}(t) = \partial x(t)/\partial t$ due to random impulses of two limiting types. The first type are frequent but weak while the second are infrequent but strong. The result has applications in computational theories of figural completion and illusory contours because it can be used to model the prior probability distribution of short, smooth completion shapes punctuated by occasional discontinuities in orientation (i.e., corners). This work extends our previous work on characterizing the distribution of completion shapes which dealt only with the case of frequently acting weak impulses. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
1. Introduction

In a previous paper [1] we derived an analytic expression characterizing a distribution of short, smooth contours. This result has applications in ongoing work on figural completion [2] and perceptual saliency [3]. The idea that the prior probability distribution of boundary completion shapes can be characterized by a directional random walk is first described by Mumford [4]. A similar idea is implicit in Cox et al.'s use of the Kalman filter in their work on grouping of contour fragments [5]. More recently, Williams and Jacobs [6] introduced a representation they called a stochastic completion field: the probability that a particle undergoing a directional random walk will pass through any given position and orientation in the image plane on a path bridging a pair of boundary fragments. They argued that the mode, magnitude and variance of the stochastic completion field are related to the perceived shape, salience and sharpness of illusory contours.
* Corresponding author. Fax: 00609-951-2482
Both Mumford [4] and Williams and Jacobs [6] show that the maximum likelihood path followed by a particle undergoing a directional random walk between two positions and directions is a curve of least energy (see [7]). This is the curve that is commonly assumed to model the shape of illusory contours, and is widely used for semi-automatic region segmentation in many computer vision applications (see [8]).

The distribution of shapes considered by [1,4–6] basically consists of smooth, short contours. Yet there are many examples in human vision where completion shapes perceived by humans contain discontinuities in orientation (i.e., corners). Fig. 1 shows a display by Kanizsa [9]. This display illustrates the completion of a circle and square under a square occluder. The completion of the square is significant because it includes a discontinuity in orientation. Fig. 2 shows a pair of "Koffka Crosses". When the width of the arms of the Koffka Cross is increased, observers generally report that the percept changes from an illusory circle to an illusory square [10]. Although the distribution of completion shapes with corners has not previously been characterized analytically, the idea of including corners in completion shapes
is not new. For example, the functionals of Kass et al. [8] and Mumford and Shah [11] permit orientation discontinuities accompanied by large (but fixed size) penalties. This follows work by Blake [12] and others [13–16] on interpolation of smooth surfaces with creases from sparse depth (or brightness) measurements. More recently, Belheumer [17] (working with stereo pairs) used a similar functional for interpolation of disparity along epipolar lines. Belheumer's approach is especially related because he derives his functional by considering a distribution of surface cross-section shapes characterized by a mixture of random processes: smoothly varying disparity is modeled by a one-dimensional Brownian motion while depth discontinuities are modeled by a Poisson process.

In this paper, we derive a very general integral–differential equation underlying a family of contour shape distributions. This family is based on shapes traced by particles following any of several default paths modified by random impulses drawn from a mixture of distributions (e.g., different magnitudes, directions, rates). For our figural completion application, we are especially interested in a shape distribution based upon straight-line base-trajectories in two dimensions modified by random impulses drawn from a mixture of two limiting distributions. The first distribution consists of weak but frequently acting impulses (we call this the Gaussian-limit). The distribution of these weak random impulses has zero mean and variance equal to $\sigma_g^2$. The weak impulses act at Poisson times with rate $R_g$. The second consists of strong but infrequently acting impulses (we call this the Poisson-limit). The distribution of these strong random impulses has zero mean and variance equal to $\sigma_p^2$ (where $\sigma_p^2 \gg \sigma_g^2$), and these strong impulses act at Poisson times with rate $R_p$.

¹ There are four (not five) because we combine $\sigma_g^2$ and $R_g$ into a single parameter, $T = R_g \sigma_g^2$.

Fig. 1. Amodal completion of a partially occluded circle and square (redrawn from [9]). In both cases, completion is accomplished in a manner which preserves tangent and curvature continuity at the ends of the occluded boundaries.

Fig. 2. When the width of the arms of the Koffka Cross is increased, observers generally report that the percept changes from an illusory circle to an illusory square (see [2]).
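The generative model just described, a straight-line base-trajectory whose velocity is perturbed by frequent weak impulses and, rarely, by a single strong one, is easy to simulate, which gives a feel for the shape distribution before the analysis below. The sketch is illustrative only: the parameter values are invented, the weak process is approximated by Gaussian velocity increments of variance rate T = R_g σ_g², and particle decay is omitted.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_contour(x0, v0, t_max=1.0, dt=1e-3, T=0.05, R_p=2.0, sigma_p=1.0):
    """Sample one contour from the mixed process described above:
    a straight-line base trajectory whose velocity accumulates
    (i) weak, frequent velocity fluctuations with variance rate T, and
    (ii) rare, strong velocity impulses (Poisson rate R_p, std sigma_p).
    All parameter values are assumptions for illustration only."""
    x, v = np.array(x0, float), np.array(v0, float)
    pts = [x.copy()]
    for _ in range(int(t_max / dt)):
        v = v + rng.normal(0.0, np.sqrt(T * dt), size=2)   # Gaussian limit
        if rng.random() < R_p * dt:                        # rare strong impulse: a corner
            v = v + rng.normal(0.0, sigma_p, size=2)
        x = x + v * dt
        pts.append(x.copy())
    return np.array(pts)

contour = sample_contour(x0=(0.0, 0.0), v0=(1.0, 0.0))
print(contour.shape, contour[-1])
```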
2. Approach

Suppose we are given a collection of contour segments, for example, as would be present in an image of objects occluded by other objects. Our goal is to predict all reasonably likely completions of these contours and their relative likelihoods. If $x_1$ is the location of the end of one contour segment, and $x_2$ is the beginning of another, then one candidate for the most general prior between $x_1$ and $x_2$ would be given by

$$x(t) = x_1 + \sum_l \Delta x_l\, u(t - t_l), \qquad t_1 < t < t_2, \qquad x(t_2) = x_2,$$

where $u(\cdot)$ is the unit step function, i.e., $u(t) = 0$ when $t < 0$ and $u(t) = 1$ when $t > 0$, and the displacements $\Delta x_l$ are stochastic with some zero-mean distribution. The times $t_l$ would also be stochastic, e.g., Poisson with some (possibly time varying) rate $R(t)$. Such curves would resemble the tracks of classical Brownian particles connecting $x_1$ and $x_2$. While this represents the most general prior for a continuous curve, lacking any bias, the expected contour will be simply

$$\langle x(t)\rangle = x_1 + x_{21}(t - t_1)/(t_2 - t_1), \qquad t_1 \le t \le t_2,$$

which in space is independent of $t_1$ and $t_2$, and the details of the distribution of the $(\Delta x_l, t_l)$. For ordinary diffusion in one and two dimensions, all points can be reached with probability one. For this reason, such completions are both degenerate and sterile, and will not be considered further.

Except at isolated points (corners), most boundaries are continuous in position and orientation. Thus, if in the
above example, $\theta_1$ and $\theta_2$ are the directions at $x_1$ and $x_2$, then two additional boundary conditions enter: $dx/ds = [\cos\theta_1, \sin\theta_1]$ at $x_1$ and $dx/ds = [\cos\theta_2, \sin\theta_2]$ at $x_2$, where $s$ is the distance along the curve $x(s)$ as in differential geometry. Between $x_1$ and $x_2$ one would like to write $dx/ds = [\cos\theta(s), \sin\theta(s)]$ where $\dot{\theta}(s)$ is a normally distributed random variable with zero mean (i.e., $\theta(s)$ is a Brownian motion). See Mumford [4] and Williams and Jacobs [6]. Note that speed is assumed to be constant.

In this paper, we consider more general contours $x(t)$ with arbitrary parameterization $(t)$, and for each component $q$, set

$$dx_q(t)/dt = \dot{x}_q(t), \qquad q = 1, \ldots, d,$$

where the $\dot{x}_q(t)$ are independent random variables. Each $\dot{x}_q$ changes by $\Delta\dot{x}_{ql}$ at $t_{ql}$ according to a zero-mean distribution on $\Delta\dot{x}_q$, while the $t_{ql}$ occur at a mean rate of $R_q(t)$, for example, according to Poisson statistics. This results in what is probably the least constrained, simple prior which captures the essential properties of the missing contour and its relative likelihood:

$$x(t) = x_1 + \dot{x}_1(t - t_1) + \int_{t_1}^{t} dt' \sum_l \Delta\dot{x}_l\, u(t' - t_l), \qquad t_1 \le t' \le t_2,$$

where $x(t_2) = x_2$, $\dot{x}(t_2) = \dot{x}_2$.² Clearly $\dot{x}_1$ and $\dot{x}_2$ will have directions $\theta_1$ and $\theta_2$.

² More generally, and for those who eschew abruptness, we will broaden the $\Delta\dot{x}_l$ impulses by employing a stochastic "force" of the form $F_t = \sum_l f(t - t_l)$, where the $f(\cdot)$ are "smooth" functions of $t$.
³ We use this term (adopted from [18]) as a generic term to denote position–velocity constraints derived from the image.

3. Prior distribution of smooth completion shapes

We define $P(2\,|\,1)$ to be the likelihood that a contour, $x(t)$, is at $x_2$ with $\dot{x}_2$ for $t = t_2$ given that it was at $x_1$ with $\dot{x}_1$ for $t = t_1$, averaged over all $x(t)$ subjected to random impulses. While we calculate $P(2\,|\,1)$ directly in the next section (for a mixture of frequent-weak impulses and infrequent-strong impulses), it is of value to derive an integral–differential equation for $P(2\,|\,1)$ which includes all types of impulses. The transition probability, $P(2\,|\,1)$, embodies three aspects of contour distributions: (1) boundary conditions; (2) base-trajectories; and (3) impulse statistics. Boundary conditions constrain possible contours at keypoints³ by
specifying at least one of $x_i$, $\dot{x}_i$, $\ddot{x}_i$, etc. We find the choice of $(x_i, \dot{x}_i)$ to be most useful, but other applications may require other combinations. A variety of base-trajectories can represent the contour at times between the arrival of random impulses. We have found straight lines to be the most useful in our application. In addition to boundary conditions and base-trajectories, the contour distributions are defined by impulse statistics. Random impulses have the form $\dot{x}(t) = \sum_{kl} v_k\, u(t - t_{kl})$, which includes process $k$, with impulses $v_k$, occurring at Poisson distributed times $t_{kl}$, with mean rate $R_k(t)$. While we focus here on only two processes, frequent-weak (Gaussian) and infrequent-strong (Poisson), an entire spectrum between these limits is also available (see Fig. 3).

We now turn to the calculation of an integral–differential equation for $P(2\,|\,1)$. Recall that $P(2\,|\,1)$ is the probability that given $x(t_1) = x_1$ and $\dot{x}(t_1) = \dot{x}_1$, then $x(t_2) = x_2$ and $\dot{x}(t_2) = \dot{x}_2$ for $t_2 > t_1$:

$$P(2\,|\,1) \equiv P(x_2, \dot{x}_2, t_2\,|\,x_1, \dot{x}_1, t_1) = \langle \delta(x(t_2) - x_2)\,\delta(\dot{x}(t_2) - \dot{x}_2)\rangle_1,$$

where $\langle\,\cdot\,\rangle_1$ is an average over all contours matching the boundary condition $x_1, \dot{x}_1$ at $t_1$. For conciseness, in the following expressions we write $P(x, \dot{x}, t)$ instead of $P(x_2, \dot{x}_2, t_2\,|\,x_1, \dot{x}_1, t_1)$. In these expressions, $(x, \dot{x})$ refers to $(x_2, \dot{x}_2)$, and the boundary conditions, $(x_1, \dot{x}_1)$, are implicit. We find (using $\delta'(t) = \partial\delta(t)/\partial t$):

$$\partial_t P(x, \dot{x}, t) = \langle \dot{x}_t\, \delta'(x_t - x)\,\delta(\dot{x}_t - \dot{x})\rangle_1 + \langle \ddot{x}_t\, \delta(x_t - x)\,\delta'(\dot{x}_t - \dot{x})\rangle_1 = -\dot{x}\cdot\partial_x \langle\delta(x_t - x)\,\delta(\dot{x}_t - \dot{x})\rangle_1 - \partial_{\dot{x}}\cdot\langle\ddot{x}_t\,\delta(x_t - x)\,\delta(\dot{x}_t - \dot{x})\rangle_1,$$

where $x_t = x(t)$, $\dot{x}_t = \dot{x}(t)$ and $\ddot{x}_t = \ddot{x}(t)$. While the delta function, $\delta(\dot{x}_t - \dot{x})$, forces the first expectation value to be $\dot{x}$, the second expectation value is less readily specifiable.⁴

⁴ In Appendix A we calculate the second expectation value in terms of the probability distribution of the number of times an elementary "force", $f_k(t - t_{kl})$, acts in a given interval of time. Though exact, for our present purposes, the expression is unnecessarily complicated.

From the form and distribution of the random impulses, it follows that $\ddot{x}_t = \sum_{kl} v_k\,\delta(t - t_{kl})$. As the occurrences of the fluctuations are independent, it is reasonable to take their average instantaneous rate, $R_k(t)$, to be independent of the number of fluctuations, $N_k$, of type, $k$, in the interval of interest. This leads us to the standard Poisson distribution for the probability of $N_k$ independent fluctuations, and simplifies the resulting equation. Should this be an over-simplification, we could still work out $P(2\,|\,1)$ as discussed in [1]. These considerations (together with a decay term to reduce the contribution of long contours)
result in the following integral–differential equation for $P(x, \dot{x}, t)$:

$$\partial_t P(x, \dot{x}, t) = -\dot{x}\cdot\partial_x P(x, \dot{x}, t) - \ddot{x}_{1t}\cdot\partial_{\dot{x}} P(x, \dot{x}, t) - P(x, \dot{x}, t)/\tau - \sum_k \int d\xi\, R_k(\xi)\, f_k(t-\xi)\cdot\partial_{\dot{x}} P\!\left(x - \int dt'\, G_{t,t'}\, f_k(t'-\xi),\; \dot{x} - \int dt'\, \dot{G}_{t,t'}\, f_k(t'-\xi),\; t\right),$$

where $x_{1t}$ is $x(t)$ for impulse-free conditions, $x(t_1) = x_1$ and $\dot{x}(t_1) = \dot{x}_1$. For straight line base-trajectories, $G_{t,t'} = (t - t')\,u(t - t')\,I$ and therefore $\ddot{x}_{1t} = 0$. This causes the second term to drop out. Then (since the details of the elementary force, $f_k(t)$, are unlikely to matter) we assume that $v_k = \int dt\, f_k(t)$, and obtain:

$$\partial_t P(x, \dot{x}, t) = -\dot{x}\cdot\partial_x P(x, \dot{x}, t) - P(x, \dot{x}, t)/\tau - \sum_k R_k(t)\, v_k\cdot\partial_{\dot{x}} P(x, \dot{x} - v_k/2, t).$$

The time evolution of the conditional probability at $(x, \dot{x})$ is governed by three factors: (1) advection; (2) decay; and (3) diffusion. The effect of the advection term is that probability mass at $(x, \dot{x})$ is transported with velocity $\dot{x}$. The effect of the decay term is that the total probability mass decreases exponentially with increasing time. It is the diffusion term which is most important. It expresses the fact that the random impulses drive the distribution through its velocity-gradient averaged over the duration of the impulse. In the limit of impulses of short duration, this average can be approximated by the velocity-gradient evaluated at $\dot{x} - v_k/2$ (i.e., the average velocity before and after the fluctuation). For small impulses, $v_k$, and for stationary processes (i.e., for constant $R_k$), we obtain the following simple differential equation for the conditional probability:

$$\partial_t P_0 = -\dot{x}\cdot\partial_x P_0 + \mathbf{T}\,\partial^2_{\dot{x}} P_0/2 - P_0/\tau, \qquad \mathbf{T} = \sum_k R_k\, v_k v_k,$$

which, for isotropic impulses, becomes

$$\partial_t P_0 = -\dot{x}\cdot\partial_x P_0 + T\,\partial^2_{\dot{x}} P_0/2 - P_0/\tau,$$

where $T$ is the velocity-variance-rate parameter, analogous to the position-variance-rate parameter (diffusion coefficient) of transport theories.⁵ This Fokker–Planck equation is similar to the equation described by Mumford [4]. The only difference is that the diffusion term in our equation involves velocity rather than orientation. One consequence of this difference is that our equation separates, so that $P_0(2\,|\,1) = \exp(-t/\tau)\,P_{0x}(2\,|\,1)\,P_{0y}(2\,|\,1)\cdots$, the product extending to as many dimensions as are of interest. Solving, we find for $P_{0x}(2\,|\,1)$:

$$P_{0x}(2\,|\,1; t) = \frac{\exp[-\dot{x}_{21}^2/2Tt]}{\sqrt{2\pi T t}}\cdot\frac{\exp[-6(x_{21} - v_x t)^2/T t^3]}{\sqrt{\pi T t^3/6}}$$

and similarly for $y, z, \ldots$, etc. Here $v_x = (\dot{x}_2 + \dot{x}_1)/2$, $t = t_2 - t_1$, $\dot{x}_{21} = \dot{x}_2 - \dot{x}_1$, $x_{21} = x_2 - x_1$, and $T$ arises from setting $T_{ij} = T_i\,\delta_{ij}$. The expression shows clearly the contribution of velocity diffusion, and the persistence of the initial velocity and anticipation of the final velocity in the position dependence.

⁵ In these expressions, the zero subscript denotes the fact that the probability, $P_0$, is averaged over trajectories modified by zero impulses of the large-infrequent type (i.e., the pure Gaussian case). In the next section, we will consider probabilities averaged over trajectories modified by a mixture of weak, frequent impulses and a single impulse of the large, infrequent type. This mixed probability will be denoted $P_1$.

Fig. 3. Examples of stochastic curves for various impulse rates, $R_k$, and impulse magnitudes, $|v_k|$. The most relevant region of this shape space is spanned by a mixture of frequent-weak (Gaussian) and infrequent-strong (Poisson) processes.
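Because $P_{0x}$ factors into closed-form Gaussian terms, it is straightforward to evaluate numerically. The sketch below simply codes the formula above; the particular endpoint values and the choice of T and the decay constant are arbitrary, for illustration only.

```python
import numpy as np

def P0_1d(x21, xdot21, vbar, t, T):
    """Closed-form 1-D factor P_0x(2|1; t) given above: a Gaussian in the
    velocity difference times a Gaussian in the position residual."""
    vel = np.exp(-xdot21**2 / (2 * T * t)) / np.sqrt(2 * np.pi * T * t)
    pos = np.exp(-6 * (x21 - vbar * t)**2 / (T * t**3)) / np.sqrt(np.pi * T * t**3 / 6)
    return vel * pos

def P0(x1, x2, v1, v2, t, T=0.1, tau=5.0):
    """P_0(2|1; t) in two dimensions: decay factor times the product of 1-D factors."""
    vbar = 0.5 * (np.asarray(v1) + np.asarray(v2))      # v = (xdot_1 + xdot_2)/2
    x21 = np.asarray(x2) - np.asarray(x1)
    vd21 = np.asarray(v2) - np.asarray(v1)
    factors = [P0_1d(x21[q], vd21[q], vbar[q], t, T) for q in range(2)]
    return np.exp(-t / tau) * np.prod(factors)

# Example: a nearly straight completion is far more probable than one that
# must reverse its direction over the same interval.
p_straight = P0(x1=(0, 0), x2=(1, 0), v1=(1, 0), v2=(1, 0), t=1.0)
p_bent     = P0(x1=(0, 0), x2=(1, 0), v1=(1, 0), v2=(0, 1), t=1.0)
print(p_straight, p_bent)
```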
4. Inclusion of corners in prior distribution

This formulation is clearly not just limited to frequent-weak impulses; it contains the full spectrum of stochastic contributions to $\dot{x}_t$. While some problems may call for this flexibility in full, it turns out that corners (discontinuities in orientation) can be included by using a mixture of frequent-weak and infrequent-strong impulses. In fact, for our purposes, it will suffice to include only zero or one large impulse per contour.

Previously [1] we showed how $P(2\,|\,1) = \langle\delta(x(t_2) - x_2)\,\delta(\dot{x}(t_2) - \dot{x}_2)\rangle_1$ could be obtained from an evaluation of the characteristic functional $\Phi(k_t) = \langle\exp i\!\int dt\, k_t\cdot x_t\rangle_1$ which, since
P
x(t)"x #G(t, t )x5 # 1 1 1
t
dt@G(t, t@)F(t@)
t1
T AP
/(p )" exp i dt p ) F t t t
T AP
where we used F "+ +lf (t!t l), the t l's being govk k t k k erned by a Poisson process of rate R (t). Although one k can use this expression as given (as we do in Appendix A) several limiting cases provide signi"cant and useful simpli"cations. The most common is the Gaussian limit of small-frequent impulses, which we have already developed, and in which case
T AP
BU
exp i dt p ) F t t
A P
B
1 "exp ! dt p ) T ) p , t t 2
where T"+ R v v when f (t)"v d(t).6 The opposite k k k k k k limit is large-infrequent impulses. These are necessary if we are to include discontinuities in dx(t)/ dt (i.e., corners). For the case in which a single, large impulse can act in addition to the numerous, small impulses of the Gaussian case, we proceed as follows. Returning to the more general of the above expressions for Sexp(i: dt p ) F )T, we t t expand the exponential in + R (m) to "rst order in those k k processes with low rates, i.e., + R (m) where R (m) is small. s s s In this way, we include only zero and one large scattering events, obtaining the factor: 1#+ : dm R (m)exp[i: dt p ) f (t!m)] s s t s . 1#+ : dm R (m) s s Again taking the impulses, f , to be of relatively short s duration and including an average over a Gaussian distribution of variance p2 (i.e., the least constrained zerop mean distribution of a given variance) we obtain the factor:
A
B
1 1#: dm R (m)exp ! p ) p2 ) p p 2 m p m 1#: dm R (m) p normalized to unity for p "0 and where R (m) is the m p mean rate of this process, i.e., R (m)"+ R (m). Here p s s p2 has the dimensions of (velocity)2 while T has the p dimensions of (velocity)2/(time). So "nally we "nd that
BU
exp(!1: dt p ) T ) p )[1#: dm R (m)exp(!1p ) p2 ) p )] t t p 2 m p m . 2 " 1#: dm R (m) p
in which F(t@) is the stochastic force, it su$ced to determine /(p )" exp i dt@ p F t{ t{ t{
547
BU
For the probability of two or more impulses to be negligible, it is necessary that : dm R (m);1. Using p the corresponding result in Appendix A for the Poisson case, we "nd for P(2 D 1) in two dimensions, the
with p ": dt k G(t, t@) where G(t, t@)"0 for t)t@. This t{ t results in the following expression:
T AP
BU
exp i dt p ) F t t
CP
"exp
A AP
B BD
dm+ R (m) exp i dt p .f (t!m) !1 k t k k
,
^6 Since the v_k are small, the exponential in which they occur can be expanded to second order: the zero-order term cancels the −1, and the first-order term is zero since the total force has zero mean.
expression:

P(2 | 1; t) = \frac{ P_{0x}(2 | 1; t)\, P_{0y}(2 | 1; t) + \int_0^t d\xi\, R_p(\xi)\, P_{1x}(2 | 1; t, \xi)\, P_{1y}(2 | 1; t, \xi) }{ 1 + \int_0^t d\xi\, R_p(\xi) } \cdot \exp(−t/\tau),

P_{1x}(2 | 1; t, \xi) = \frac{ \exp[ −(A_x^2/D + B_x^2/H)/2T ] }{ 2\pi T \sqrt{DH} },

A_x = x_{21} − \frac{ v_x t^2 + \xi_p [\dot x_2 (t − \xi) + \dot x_1 \xi] }{ t + \xi_p }, \qquad B_x = \dot x_{21},

D = \frac{ (t^4/12) + \xi_p (t^3/3 − t^2\xi + \xi^2 t) }{ t + \xi_p }, \qquad H = t + \xi_p,

where \xi_p = \sigma_p^2/T and with analogous expressions for P_{1y}. Note that in the expression for \phi(p_t), \sigma_p^2 is taken to be diagonal. The time, \xi, is the time of the single, large, scattering event. We observe that the Poisson process will dominate for t \ll \xi_p.

The above result for zero and one rare event exhibits several important dependencies. First, recall that T is the velocity-fluctuation rate. Thus, for Tt < \sigma_p^2 (i.e., t < \xi_p), the frequent-weak process will be less effective than a single strong but rare event. This will be the case for higher velocities (i.e., smaller time intervals). Second, if R_p(\xi) is taken to be a constant, R_p, then R_p t will control the number of rare events (preferably R_p t \ll 1). We note that it would be possible to derive the distribution for one rare event by appropriate joining of two Gaussian processes, but integrals over all intermediate positions and velocities would be required.

Although the above expression for P(2 | 1) involves a number of symbols, there are in fact only four basic parameters: T, \xi_p, R_p and \tau. The values of these four parameters determine the shape distribution and remain constant for a given application. The first parameter, T = R_g \sigma_g^2, is the velocity diffusion coefficient.^7 The smaller T becomes, the more the most likely completion will dominate the distribution. Consequently, T controls sharpness. The second parameter, \xi_p, is equal to \sigma_p^2/T. As mentioned above, if the time associated with the most likely completion is significantly less than \xi_p, then there is not enough time for the frequent-weak impulses to modify the particle's path to match the second boundary condition. Since this factor is in the exponent of the probability expression, infrequent-strong impulses become favored with rate proportional to the third parameter, R_p. Finally, the fourth parameter, \tau, determines the rate at which particles decay: the smaller the value of
^7 Ironically, since only the product of R_g and \sigma_g^2 appears in the equations, in the Gaussian limit the distribution of the random impulses, v_g, is arbitrary, while in the Poisson limit the distribution of the random impulses, v_p, is Gaussian.
\tau, the smaller the contribution of long contours. In practice, we first set T to achieve the desired sharpness in P_0. We then adjust \xi_p to suppress the Gaussian component in P_1 and adjust R_p to achieve the desired mixture of smooth completions and corners.
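As a rough illustration of how the four parameters enter, the sketch below combines the Gaussian and single-corner terms into the mixed transition probability. It reuses the p0x factor from the previous sketch, assumes a constant rate R_p, and the helper names are hypothetical.

```python
import numpy as np
from scipy.integrate import quad  # used only to integrate over the corner time xi

def p1x(x1, xdot1, x2, xdot2, t, xi, T, xi_p):
    """Single-large-impulse factor P_1x(2|1; t, xi) in one dimension."""
    x21, xdot21 = x2 - x1, xdot2 - xdot1
    vx = 0.5 * (xdot1 + xdot2)
    A = x21 - (vx * t**2 + xi_p * (xdot2 * (t - xi) + xdot1 * xi)) / (t + xi_p)
    B = xdot21
    D = (t**4 / 12.0 + xi_p * (t**3 / 3.0 - t**2 * xi + xi**2 * t)) / (t + xi_p)
    H = t + xi_p
    return np.exp(-(A**2 / D + B**2 / H) / (2.0 * T)) / (2.0 * np.pi * T * np.sqrt(D * H))

def p_mixed(start, end, t, T, xi_p, R_p, tau):
    """P(2|1; t): weighted mixture of the smooth (P_0) and one-corner (P_1) terms."""
    (x1, y1, xd1, yd1), (x2, y2, xd2, yd2) = start, end
    p0 = p0x(x1, xd1, x2, xd2, t, T) * p0x(y1, yd1, y2, yd2, t, T)
    corner = quad(lambda xi: p1x(x1, xd1, x2, xd2, t, xi, T, xi_p)
                             * p1x(y1, yd1, y2, yd2, t, xi, T, xi_p), 0.0, t)[0]
    return (p0 + R_p * corner) / (1.0 + R_p * t) * np.exp(-t / tau)
```

Making T smaller sharpens the smooth term, while increasing R_p shifts weight toward completions with a single corner, which is the qualitative behavior described above.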
5. Sampling and scale invariance

Strictly speaking, the above expressions apply only to the continuum. In practice, the expressions must be evaluated on a discrete grid. To reduce aliasing and other artifacts associated with discrete sampling, the transition probabilities can be convolved with broadening functions, for example

P(x_2, y_2, \dot x_2, \dot y_2 | x_1, y_1, \dot x_1, \dot y_1) = \int_{−\infty}^{\infty}\! dx' \int_{−\infty}^{\infty}\! dy' \int_{−\infty}^{\infty}\! d\dot x' \int_{−\infty}^{\infty}\! d\dot y'\; E(x', \sigma_x^2)\, E(y', \sigma_y^2)\, E(\dot x', \sigma_{\dot x}^2)\, E(\dot y', \sigma_{\dot y}^2)\; p(x_2 − x', y_2 − y', \dot x_2 − \dot x', \dot y_2 − \dot y' | x_1, y_1, \dot x_1, \dot y_1),

where E(r', \sigma_r^2) = \exp[ −r'^2/2\sigma_r^2 ] / \sqrt{2\pi\sigma_r^2}.

In addition to sampling, a second consideration is scale-invariance. Ideally, the transition probabilities should remain invariant as the scene is uniformly scaled, that is, P(2 | 1; t) should remain constant as |x_2 − x_1| \rightarrow c\,|x_2 − x_1|. Unfortunately, the expressions given previously do not possess this property. However, if the speeds are increased by a factor of c and the velocity diffusion coefficient is increased by a factor of c^2, then the expression for P(2 | 1; t) is invariant except for an overall factor of c^4. Stated differently, P(2 | 1; t) \rightarrow c^4 P(2 | 1; t) when \dot x_1 \rightarrow c\dot x_1, \dot x_2 \rightarrow c\dot x_2 and T \rightarrow c^2 T. While this may at first seem rather ad hoc, it is actually quite reasonable: increased displacements require correspondingly higher speeds. Likewise, the diffusion coefficient must increase to effect the equivalent change in particle trajectories. The factor of c^4 is simply the ratio of sampling volumes in
the scaled and unscaled systems (i.e., the Jacobian):

P(x_2, y_2, \dot x_2, \dot y_2 | x_1, y_1, \dot x_1, \dot y_1) = c^4 \int_{−\infty}^{\infty}\! dx' \int_{−\infty}^{\infty}\! dy' \int_{−\infty}^{\infty}\! d\dot x' \int_{−\infty}^{\infty}\! d\dot y'\; E(x', c^2\sigma_x^2)\, E(y', c^2\sigma_y^2)\, E(\dot x', c^2\sigma_{\dot x}^2)\, E(\dot y', c^2\sigma_{\dot y}^2)\; p(x_2 − x', y_2 − y', \dot x_2 − \dot x', \dot y_2 − \dot y' | x_1, y_1, \dot x_1, \dot y_1),

where E(r', c^2\sigma_r^2) = \exp[ −r'^2/2c^2\sigma_r^2 ] / \sqrt{2\pi c^2\sigma_r^2}. These considerations lead to the following scale-invariant expressions for P_{0x}(2 | 1; t) and P_{1x}(2 | 1; t, \xi):

P_{0x}(2 | 1; t) = \frac{ \exp[ −(6/c^2 T t^3)(x_{21} − v_x t)^2 (1 − (1 + T t^3/12\sigma_x^2)^{−1}) ] }{ \sqrt{2\pi(\sigma_x^2 + T t^3/12)} } \times \frac{ \exp[ −(1/2c^2 T t)\, \dot x_{21}^2 (1 − (1 + T t/\sigma_{\dot x}^2)^{−1}) ] }{ \sqrt{2\pi(\sigma_{\dot x}^2 + T t)} } \times \exp(−t/\tau),

P_{1x}(2 | 1; t, \xi) = \frac{ \exp[ −(A_x^2/2c^2 T D)(1 − (1 + T D/\sigma_x^2)^{−1}) ] }{ \sqrt{2\pi(\sigma_x^2 + T D)} } \times \frac{ \exp[ −(B_x^2/2c^2 T H)(1 − (1 + T H/\sigma_{\dot x}^2)^{−1}) ] }{ \sqrt{2\pi(\sigma_{\dot x}^2 + T H)} } \times \exp(−t/\tau).

From these and the corresponding expressions for the y dimension we obtain P_0(2 | 1; t) and P_1(2 | 1; t). We note that the c^2 Jacobian factors in the above expressions are cancelled by c^2 factors in the denominators.

We must now address the problem of choosing c. Recall that there are four parameters which define the distribution: T, R_p, \xi_p, and \tau. The last three are purely time-like and remain invariant. Only T, which is proportional to the square of the magnitude of the random impulses, must be scaled by a factor of c^2. If |\dot x_1| and |\dot x_2| represent the spatial scales of the filters used to compute the keypoints, then letting c = (|\dot x_1| + |\dot x_2|)/2 is a natural choice since the larger the velocity, the larger the random impulses which must modify it. The result is that the transition probabilities remain invariant as both the scene and the filters are uniformly scaled (see Fig. 4).
Fig. 4. Two completion fields related through a scale transformation. If the distance between the keypoints, |x_2 − x_1|, and the scale of the filters used to compute the keypoints, |\dot x_1| and |\dot x_2|, are increased by a factor of c, then T must be increased by a factor of c^2.
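A quick way to see what scaling does to the raw (non-broadened) expressions is to evaluate the unnormalized two-dimensional Gaussian-limit probability before and after rescaling; the sketch below reuses the p0x factor defined earlier, the endpoint values are illustrative, and the broadening convolution is ignored.

```python
import numpy as np

def p0_2d(start, end, t, T):
    """Unnormalized 2-D Gaussian-limit transition probability (no decay term)."""
    (x1, y1, xd1, yd1), (x2, y2, xd2, yd2) = start, end
    return p0x(x1, xd1, x2, xd2, t, T) * p0x(y1, yd1, y2, yd2, t, T)

c = 2.0
start = (0.0, 0.0, 1.0, 0.0)
end = (8.0, 3.0, 0.0, 1.0)
scale = lambda s: (c * s[0], c * s[1], c * s[2], c * s[3])  # scale positions and speeds

p_orig = p0_2d(start, end, t=6.0, T=0.0005)
p_scaled = p0_2d(scale(start), scale(end), t=6.0, T=c**2 * 0.0005)
# The ratio is a pure power of c (the sampling-volume / Jacobian factor);
# the shape of the distribution itself is unchanged by the rescaling.
print(p_scaled / p_orig)
```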
6. Stochastic completion fields

The magnitude of the stochastic completion field at (g, \dot g) is defined as the probability that a completion, with a distribution of shapes given by P(2 | 1), will connect two keypoints via a path through (g, \dot g). The stochastic completion field originating in an arbitrary set of n keypoints can (in turn) be expressed as the sum of n^2 pairwise fields. In this section, we describe the problem of computing the completion field for a pair of keypoints, and give example completion fields for a range of speeds.

To accomplish this, we first consider the distribution of contours which begin at the first keypoint, (x_1, \dot x_1), at time t_1. We then consider the fraction of those contours which pass through the fieldpoint, (g, \dot g), at t and then through the second keypoint, (x_2, \dot x_2), at time t_2 (where t_2 > t > t_1). Integrating over all t_1 (−\infty < t_1 < t) and t_2 (t < t_2 < \infty), we find the relative probability that a completion from (x_1, \dot x_1) to (x_2, \dot x_2) includes (g, \dot g), giving the value of the stochastic completion field, C(g, \dot g). Since the entire history of the contour at t is summarized by (g, \dot g) (the velocity fluctuations being independent), the probability of (x_1, \dot x_1) \rightarrow (g, \dot g) \rightarrow (x_2, \dot x_2) factors into a product of probabilities for (x_1, \dot x_1) \rightarrow (g, \dot g) and for (g, \dot g) \rightarrow (x_2, \dot x_2):

P((x_2, \dot x_2, t_2), (g, \dot g, t) | (x_1, \dot x_1, t_1)) = P((x_2, \dot x_2, t_2) | (g, \dot g, t))\; P((g, \dot g, t) | (x_1, \dot x_1, t_1)).

Fig. 5a–d shows four completion fields due to a pair of keypoints positioned on a horizontal line and separated by a distance of 80 pixels. In each subfigure, the orientation of the right keypoint is 130° and the orientation of the left keypoint is 50°. The speeds, c, in Fig. 5a–d are 1, 2, 4 and 8 respectively. The values of the four parameters defining the contour shape distribution are: T = 0.0005, \tau = 9.5, \xi_p = 100 and R_p = 1.0 \times 10^{−6}. The completion fields were computed using the expression for P(2 | 1) given in Section 4 and using the integral approximations for P'(2 | 1) described in Appendix B. Fig. 5a–d displays images of size 256 \times 256 where brightness codes the logarithm of the sum of the completion field magnitude evaluated at 36 discrete orientations (i.e., at 10° increments). As the speed is increased, the relative contribution of P_0 and P_1 reverses. This results in a transition from a distribution dominated by smooth contours to a distribution consisting predominantly of straight (or nearly straight) contours containing a single orientation discontinuity. When the distribution is dominated by P_1, the effect of aliasing in orientation becomes evident.
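The factorization above is what makes the field cheap to compute: a source field from keypoint 1 and a sink field into keypoint 2 can be evaluated independently and multiplied at every field point. The sketch below is a minimal illustration of that idea on a coarse grid, assuming the p_mixed transition probability from the earlier sketch and a plain sum over discretized travel times in place of the time integrals; the grid layout and helper names are illustrative.

```python
import numpy as np

def completion_field(key1, key2, times, T, xi_p, R_p, tau, grid, speeds):
    """Relative completion-field magnitude C(g, g_dot) at a list of field points.

    For each field point (position, velocity) the contribution is the product of
    the probability of reaching it from keypoint 1 and the probability of then
    reaching keypoint 2, accumulated over the discretized travel times.
    """
    field = np.zeros(len(grid))
    for i, (pos, vel) in enumerate(zip(grid, speeds)):
        mid = (pos[0], pos[1], vel[0], vel[1])
        for t1 in times:        # time spent on the first leg
            for t2 in times:    # time spent on the second leg
                field[i] += (p_mixed(key1, mid, t1, T, xi_p, R_p, tau) *
                             p_mixed(mid, key2, t2, T, xi_p, R_p, tau))
    return field
```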
7. Conclusion In our previous paper [1], we assumed that the statistics of occluded shapes could be modeled by
Fig. 5. Stochastic completion fields (logarithm of magnitude) due to a pair of keypoints positioned on a horizontal line and separated by a distance of 80 pixels. In each subfigure, the orientation of the left keypoint is 130° and the orientation of the right keypoint is 50°. The speeds, c, in (a–d) are 1, 2, 4 and 8, respectively. The values of the four parameters defining the contour shape distribution are: T = 0.0005, \tau = 9.5, \xi_p = 100 and R_p = 1.0 \times 10^{−6}.
minimally-constrained distributions over all paths. We derived an analytic expression for the shape, salience and sharpness of illusory contours in terms of the characteristic function of the simplest of these distributions (i.e., Gaussian) and applied this expression to well known examples from the visual psychology literature. In this
paper, we extended our work in several important directions. First, we have derived a general integral–differential equation including the full spectrum of random impulses and which we believe will be useful for modeling a broad family of shape distributions. We have also derived an analytic expression characterizing the
distribution of completion shapes with corners using a mixture of Gaussian and Poisson limiting cases. Finally, we have presented scale-invariant forms for these expressions.
Appendix A. Expected distributions

For x_t = x_{1t} + \int dt'\, G_{t,t'} F_{t'}, with F_{t'} stochastic, F_t = \sum_k \sum_l f_k(t − t_{kl}), and x_{1t} = x_1 + G_{t,t_1}\dot x_1, we desire

\Phi(k_t) = \Big\langle \exp\Big( i\!\int dt\, k_t \cdot x_t \Big) \Big\rangle = \exp\Big( i\!\int dt\, k_t \cdot x_{1t} \Big) \Big\langle \exp\Big( i\!\int dt'\, p_{t'} \cdot F_{t'} \Big) \Big\rangle

with p_{t'} = \int dt\, k_t G_{t,t'}. So long as the t_{kl}'s are independent, we can write

\phi(p_t) = \Big\langle \exp\Big( i\!\int dt\, p_t \cdot F_t \Big) \Big\rangle = \prod_k \sum_{N_k=0}^{\infty} P_{N_k} \Big[ \int dt_k\, R_k(t_k) \exp\Big( i\!\int dt\, p_t f_k(t − t_k) \Big) \Big/ \lambda_k \Big]^{N_k},
where j ": dt R (t ) (see [19}21]). Here P k is the probk k k k N ability of N events in the time interval of interest, and k R the average rate of these events. To derive k SxK 2d(x(t )!x )d(x5 (t )!x5 )T, for the integral}di!erent 2 2 2 2 tial equation in the text, we must calculate at g"0: !i
L Lg
P
P
di dj e~ii> x2 e~*j> x2'(k ) t (2n)3 (2n)3
for k "k (t)#gdA(t!t ) and k (t)"id(t!t ) t r 2 r 2 !jd@(t!t ). Alternatively, p "p (t)#gd(t !t). Set2 t r 2 ting p(t)"p (t)"iG 2 #jL 2G 2 and k(t)"k (t) results r t ,t t t ,t r in P(x , x5 , t D x , x5 , t ). Inserting these p(t) and k(t) in 2 2 2 1 1 1 the expressions for ' and / above and working out the g items, we "nd for the case that the P are Poisson: N SxK d(x(t )!x )d(x5 (t )!x5 )T t2 2 2 2 2
tion in an integral}di!erential equation is unnecessarily complex. We note that the above is valid for general Poisson processes-weak/strong, frequent/infrequent and anything in between. To evaluate P(2 D 1) for the Poisson case one simply calculates the above without the (!iL/Lg) and for g"0, proceeding in a manner analogous to [1].
Appendix B. Integral approximations

The expressions we have derived (thus far) depend on time, i.e., they give the probability that a particle will be at some position and velocity, (x_2, \dot x_2), at time t given that the particle was observed at some other position and velocity, (x_1, \dot x_1), at time 0. We refer to this quantity as P(2 | 1; t). However, if we are really interested in computing the probability that two boundary fragments are part of the same object, then we are more interested in the integral of P(2 | 1; t) over all future times. We refer to this quantity as P'(2 | 1) = \int_0^{\infty} dt\, P(2 | 1; t). To derive an expression for P'(2 | 1), we must not only approximate this integral analytically, we must also approximate the integral over the time of the single large scattering event, \xi, in the expression for P(2 | 1; t).

To begin, we divide the expression for P'(2 | 1) into two terms, I_0 and I_1, and use the method of steepest descent separately on each part. This requires one steepest descent approximation for I_0 and two steepest descent approximations for I_1 (which contains the additional dependency on \xi):

P'(2 | 1) = \int_0^{t_{\max}} dt\, P(2 | 1; t) = \int_0^{t_{\max}} dt\, \frac{ P_0(2 | 1; t) + R_p \int_0^t d\xi\, P_1(2 | 1; t, \xi) }{ 1 + R_p t } = \int_0^{t_{\max}} dt\, \frac{ P_0(2 | 1; t) }{ 1 + R_p t } + \int_0^{t_{\max}} dt'\, \frac{ R_p }{ 1 + R_p t' } \int_0^{t'} d\xi\, P_1(2 | 1; t', \xi) = I_0 + I_1.
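Before approximating these integrals by steepest descent, it can be useful to have a brute-force reference value; the following sketch (hypothetical, reusing p_mixed from the earlier sketches, with the cutoff t_max and the quadrature handled by scipy) evaluates P'(2 | 1) by direct numerical integration.

```python
from scipy.integrate import quad

def p_prime(start, end, T, xi_p, R_p, tau, t_max=200.0):
    """Numerical reference for P'(2|1): the integral of P(2|1; t) over future times."""
    value, _ = quad(lambda t: p_mixed(start, end, t, T, xi_p, R_p, tau),
                    1e-3, t_max, limit=200)
    return value
```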
The first integral to be dealt with is that over the time of the single large scattering event:

I_R(t') = R_p \int_0^{t'} d\xi\, P_1(2 | 1; t', \xi).
Here \xi enters the integrand only through the D and A terms in the expression for P_1(2 | 1; t', \xi). A more accurate, steepest-descent approximation would include the \xi-dependence in both. However, since the dependence in the A term dominates the behavior of the integral, we
ignore the dependence in the D term in determining the local maximum. We find

\xi_{\rm opt} = −\frac{ a_x a'_x + a_y a'_y }{ a'^2_x + a'^2_y },

a_x = x_{21} − \frac{ v_x t'^2 + \xi_p \dot x_2 t' }{ t' + \xi_p }, \qquad a'_x = \frac{ \xi_p \dot x_{21} }{ t' + \xi_p },

and similarly for a_y and a'_y. The approximate result for the integral is then

I_R(t') \approx F_R(\xi_{\rm opt})\, R_p\, P_1(2 | 1; t', \xi_{\rm opt}), \qquad F_R(\xi) = \sqrt{ 2\pi c^2 T D / (a'^2_x + a'^2_y) }, \quad 0 < \xi < t'.

When \xi < 0 or \xi > t' then we set F_R = 0. Of course, I_R is never actually zero. If its behavior for \xi_{\rm opt} < 0 or \xi_{\rm opt} > t' is important, we must simply approximate more carefully.

We now face \int_0^{t_{\max}} dt\, P(2 | 1; t) where t_{\max} is so large that P(2 | 1; t) for t > t_{\max} is comparatively negligible. There are two integrals with different local optima (i.e., t_{\rm opt} and t'_{\rm opt}). The first integral (which lacks the single large scattering event) is
I_0 = \int_0^{t_{\max}} dt\, P_0(2 | 1; t)/(1 + R_p t).

Here the dependence on t in the argument of the exponentials is:

−6( g_a/t − g_b/t^2 + g_c/3t^3 ) − 4\ln t,

g_a = \big( v_x^2 + v_y^2 + (\dot x_{21}^2 + \dot y_{21}^2)/12 \big)/c^2 T, \qquad g_b = 2( x_{21} v_x + y_{21} v_y )/c^2 T, \qquad g_c = 3( x_{21}^2 + y_{21}^2 )/c^2 T.

This argument has a local maximum at t_{\rm opt} > 0, where t_{\rm opt} satisfies

−2t^3 + 3( g_a t^2 − 2 g_b t + g_c ) = 0,

yielding the approximate value for I_0 of

I_0 \approx F_0\, P_0(2 | 1; t_{\rm opt})/(1 + R_p t_{\rm opt}), \qquad F_0 = \sqrt{ 2\pi t_{\rm opt}^5 / \big( 12( g_c − g_b t_{\rm opt} ) + 4 t_{\rm opt}^3 \big) }.

If more than one real t_{\rm opt} > 0 exists, the one yielding the largest P_0(2 | 1) is chosen. Generally, one must take all roots, real and complex, into account. However, for this problem, we found that choosing the largest real root sufficed. The second integral (which includes the single large scattering event) is
I_1 = \int_{t_1}^{t_{\max}} dt'\, I_R(t')/(1 + R_p t').
If this term is to be important then \xi_p = (\sigma_p^2/T)
\int_{−\infty}^{\infty}\! dx' \int_{−\infty}^{\infty}\! dy' \int_{−\infty}^{\infty}\! d\theta'\; E(x', c^2\sigma_x^2)\, E(y', c^2\sigma_y^2)\, E(\theta', \sigma_\theta^2)\; p(x_2 − x', y_2 − y', \theta_2 − \theta' | x_1, y_1, \theta_1),

where E(r', c^2\sigma_r^2) = \exp[ −r'^2/2c^2\sigma_r^2 ] / \sqrt{2\pi c^2\sigma_r^2}. This leads to the following scale-invariant expressions for P_0(2 | 1; t) and P_1(2 | 1; t, \xi):

P_0(2 | 1; t) = \frac{ \exp[ −(6/c^2 T t^3)(x_{21} − v_x t)^2 (1 − (1 + T t^3/12\sigma_x^2)^{−1}) ] }{ \sqrt{2\pi(\sigma_x^2 + T t^3/12)} } \times \frac{ \exp[ −(6/c^2 T t^3)(y_{21} − v_y t)^2 (1 − (1 + T t^3/12\sigma_y^2)^{−1}) ] }{ \sqrt{2\pi(\sigma_y^2 + T t^3/12)} } \times \frac{ \exp[ −(1/2c^2 T t)(\dot x_{21}^2 + \dot y_{21}^2)(1 − (1 + T t/\sigma_\theta^2)^{−1}) ] }{ \sqrt{2\pi(\sigma_\theta^2 + T t)} } \times \exp(−t/\tau),
P_1(2 | 1; t, \xi) = \frac{ \exp[ −(A_x^2/2c^2 T D)(1 − (1 + T D/\sigma_x^2)^{−1}) ] }{ \sqrt{2\pi(\sigma_x^2 + T D)} } \times \frac{ \exp[ −(A_y^2/2c^2 T D)(1 − (1 + T D/\sigma_y^2)^{−1}) ] }{ \sqrt{2\pi(\sigma_y^2 + T D)} } \times \frac{ \exp[ −(1/2c^2 T H)(\dot x_{21}^2 + \dot y_{21}^2)(1 − (1 + T H/\sigma_\theta^2)^{−1}) ] }{ \sqrt{\pi(\sigma_\theta^2 + T H)} } \times \exp(−t/\tau),

where \dot x_1 = c\cos\theta_1, \dot x_2 = c\cos\theta_2, \dot y_1 = c\sin\theta_1 and \dot y_2 = c\sin\theta_2. As before, the Jacobian factor is cancelled by equal factors in the denominators of the above expressions.

Finally, we note that for the constant speed case, there is a small change in the integral approximations described in Appendix B. The optimum time, t_{\rm opt}, in the steepest descent approximation of I_0 now solves:

−\tfrac{7}{4} t^3 + 3( g_a t^2 − 2 g_b t + g_c ) = 0.

The integral approximations are otherwise unchanged.

References

[1] K.K. Thornber, L.R. Williams, Analytic solution of stochastic completion fields, Biol. Cybernet. 75 (1996) 141–151.
[2] K.K. Thornber, L.R. Williams, Scale, orientation and discontinuity as emergent properties of illusory contour shape, Neural Information Processing Systems (NIPS '98), Denver, CO, 1998.
[3] L.R. Williams, K.K. Thornber, A comparison of measures for detecting natural shapes in cluttered backgrounds, Proc. 5th European Conf. on Computer Vision (ECCV'98), Freiburg, Germany, 1998.
[4] D. Mumford, Elastica and computer vision, in: Chandrajit Bajaj (Ed.), Algebraic Geometry and Its Applications, Springer, New York, 1994.
[5] I.J. Cox, J.M. Rehg, S. Hingorani, A Bayesian multiple hypothesis approach to edge grouping and contour segmentation, Int. J. Comput. Vision 11 (1993) 5–24.
[6] L.R. Williams, D.W. Jacobs, Stochastic completion fields: a neural model of illusory contour shape and salience, Neural Comput. 9 (1997) 837–858.
[7] B.K.P. Horn, The curve of least energy, MIT AI Lab Memo No. 612, MIT, Cambridge, MA, 1981.
[8] M. Kass, A. Witkin, D. Terzopolous, Snakes: active minimum energy seeking contours, Proc. 1st Int. Conf. on Computer Vision (ICCV), London, England, 1987, pp. 259–268.
[9] G. Kanizsa, Organization in Vision, Praeger, New York, 1979.
[10] M. Sambin, Angular margins without gradients, Ital. J. Psychol. 1 (1974) 355–361.
[11] D. Mumford, J. Shah, Boundary detection by minimizing functionals, Proc. IEEE Conf. on Comp. Vision and Pattern Recognition (CVPR), 1985.
[12] A. Blake, The least disturbance principle and weak constraints, Pattern Recognition Lett. 1 (1983) 393–399.
[13] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and Bayesian restoration of images, IEEE Trans. Pattern Anal. Machine Intell. 6 (1984) 721–741.
[14] J. Marroquin, Surface reconstruction preserving discontinuities, MIT AI Lab Memo No. 792, MIT, Cambridge, MA, 1984.
[15] W.E.L. Grimson, T. Pavlidis, Discontinuity detection for visual surface reconstruction, Comput. Vision Graph. Image Process. 30 (1985) 316–330.
[16] D. Terzopolous, Computing visible surface representations, MIT AI Lab Memo No. 612, MIT, Cambridge, MA, 1985.
[17] P.N. Belheumer, Bayesian models for reconstructing the scene geometry in a pair of stereo images, Proc. Conf. Information Sciences and Systems, Johns Hopkins University, Baltimore, MD, 1993.
[18] R. Heitger, R. von der Heydt, A computational model of neural contour processing, figure-ground and illusory contours, Proc. 4th Int. Conf. on Computer Vision, Berlin, Germany, 1993.
[19] R.P. Feynman, A.R. Hibbs, Quantum Mechanics and Path Integrals, McGraw-Hill, New York, 1965.
[20] K.K. Thornber, Treatment of microscopic fluctuations in noise theory, Bell System Tech. 53 (1974) 1041–1078.
[21] K.K. Thornber, A new approach for treating fluctuations in noise theory, J. Appl. Phys. 46 (1975) 2781–2787.
About the Author: KARVEL THORNBER received his PhD from Caltech in 1966 in EE, Physics, and Applied Mathematics. Having spent two years at Stanford, one year at Bristol, and 20 years at Bell Labs, he is currently a senior research scientist at the NEC Research Institute in Princeton, NJ. Prior to vision, his fields of interest included theoretical and device physics, and logic.

About the Author: LANCE WILLIAMS received his BS in Computer Science from Pennsylvania State University in 1985 and his MS and PhD in Computer Science from the University of Massachusetts at Amherst in 1988 and 1994. After graduate school, he spent four years as a postdoctoral scientist at NEC Research Institute in Princeton, NJ. He is now an Assistant Professor in the Computer Science Department at the University of New Mexico in Albuquerque. His principal research interest is in computational modeling of human visual information processing.
Pattern Recognition 33 (2000) 555–571
Restoration of severely blurred high range images using stochastic and deterministic relaxation algorithms in compound Gauss–Markov random fields

Rafael Molina^a,*, Aggelos K. Katsaggelos^b, Javier Mateos^a, Aurora Hermoso^c, C. Andrew Segall^b

^a Departamento de Ciencias de la Computación e I.A., Universidad de Granada, 18071 Granada, Spain
^b Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL 60208-3118, USA
^c Departamento de Estadística e I.O., Universidad de Granada, 18071 Granada, Spain

Received 15 March 1999
Abstract

Over the last few years, a growing number of researchers from varied disciplines have been utilizing Markov random field (MRF) models for developing optimal, robust algorithms for various problems, such as texture analysis, image synthesis, classification and segmentation, surface reconstruction, integration of several low level vision modules, sensor fusion and image restoration. However, not much work has been reported on the use of Simulated Annealing (SA) and Iterative Conditional Mode (ICM) algorithms for maximum a posteriori estimation in image restoration problems when severe blurring is present. In this paper we examine the use of compound Gauss–Markov random fields (CGMRF) to restore severely blurred high range images. For this deblurring problem, the convergence of the SA and ICM algorithms has not been established. We propose two new iterative restoration algorithms which can be considered as extensions of the classical SA and ICM approaches and whose convergence is established. Finally, they are tested on real and synthetic images and the results compared with the restorations obtained by other iterative schemes. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Image restoration; Compound Gauss–Markov random fields; Simulated annealing; Iterative conditional mode
1. Introduction

Image restoration refers to the problem of recovering an image, f, from its blurred and noisy observation, g, for the purpose of improving its quality or obtaining some type of information that is not readily available from the degraded image.

It is well known that translation linear shift invariant (LSI) image models do not, in many circumstances, lead to appropriate restoration methods. Their main problem

This work has been supported by the "Comisión Nacional de Ciencia y Tecnología" under contract TIC97-0989.
* Corresponding author. Tel.: +34-958-244019; fax: +34-958-243317.
E-mail address:
[email protected] (R. Molina)
is their inability to preserve discontinuities. To move away from simple LSI models several methods have been proposed. The compound Gauss–Markov random field (CGMRF) theory provides us with a means to control changes in the image model using a hidden random field. A compound random field has two levels: an upper level, which is the real image, that has certain translation invariant linear sub-models to represent image characteristics like border regions, smoothness, texture, etc.; and a lower or hidden level, a finite range random field governing the transitions between the sub-models. The use of the underlying random field, called the line process, was introduced by Geman and Geman [1] in the discrete case and extended to the continuous case by Jeng [2], Jeng and Woods [3,4] and Chellapa et al. [5].
0031-3203/00/$20.00 © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S0031-3203(99)00072-2
Given the image and noise models, the process of finding the maximum a posteriori (MAP) estimate for the CGMRF is much more complex, since we no longer have a convex function to be minimized and methods like simulated annealing (SA) (see [1]) have to be used. Although this method, when blurring is not present, leads to the MAP estimate, it is a very computationally demanding method. A faster alternative is deterministic relaxation which results in local MAP estimation, also called iterative conditional mode (ICM) [6].

When blur is present, different approaches have been proposed to find the MAP. Blake and Zisserman [7] propose the use of graduated non-convexity, which can be extended to the blurring problem, Molina and Ripley [8] propose the use of a log-scale for the image model, and Green [9], Bouman and Sauer [10], Schultz and Stevenson [11] and Lange [12] use convex potentials in order to ensure uniqueness of the solution. Additional solutions are motivated by the application of nonlinear partial differential equations in image processing, particularly anisotropic diffusion [13]. These operators are designed to smooth an image while preserving discontinuities (e.g. [14–16]). While traditionally utilized for image enhancement, embedding the diffusion inhibitor within a variational framework facilitates image restoration. Restoration using different variants of the diffusion mechanism has been presented by Geman and McClure [17], Hebert and Leahy [18], Green [9], and Bouman and Sauer [10]. Design of the edge-preserving function is critical for solution convergence and quality. Selection of suitable expressions has been explored by Charbonnier et al. [19].

In this paper we extend the use of SA to restore high dynamic range images in the presence of blurring, a case where convergence of this method has not been shown (see Jeng [2] and Jeng and Woods [3,4] for the continuous case without blurring). In Section 2 we introduce the notation we use and the proposed model for the image and line processes as well as the noise model. Both stochastic and deterministic relaxation approaches to obtain the MAP estimate without blurring are presented in Section 3. Reasons why these algorithms may be unstable in the presence of blurring are studied in Section 4. In Section 5 we modify the SA algorithm and its corresponding relaxation approach in order to propose our modified algorithms. Convergence proofs are given in the Appendix. In Section 6 we test both algorithms on real images and compare the results with other existing methods. Section 7 concludes the paper.
2. Notation and model

2.1. Bayesian model

We will distinguish between f, the 'true' image which would be observed under ideal conditions, and g, the
observed image. The aim is to reconstruct f from g.

Bayesian methods start with a prior distribution, a probability distribution over images f, by which they incorporate information on the expected structure within an image. It is also necessary to specify p(g | f), the probability distribution of observed images g if f were the 'true' image. The Bayesian paradigm dictates that the inference about the true f should be based on p(f | g) given by

p(f | g) = p(g | f)\,p(f)/p(g) \propto p(g | f)\,p(f). \qquad (1)

Maximization of Eq. (1) with respect to f yields

\hat f = \arg\max_f\, p(f | g), \qquad (2)

the maximum a posteriori estimator.

For the sake of simplicity, we will denote by f(i) the intensity of the true image at the location of the pixel i on the lattice. We regard f as a p \times 1 column vector of values f(i). The convention applies equally to the observed image g.

2.2. Incorporation of the edge locations in the prior model

The use of CGMRF was first presented by Geman and Geman [1] using an Ising model to represent the upper level and a line process to model the abrupt transitions. Extensions to continuous range models using GMRF were presented by Jeng [2]. The CGMRF model used in this paper was proposed by Chellapa et al. [5] and it is an extension of Blake and Zisserman's weak membrane model [7] used for surface interpolation and edge detection. The convergence proof that will be given here can also be extended to the CGMRF defined by Jeng [2] and Jeng and Woods [3,4].

Let us first describe the prior model without any edges. Our prior knowledge about the smoothness of the object luminosity distribution makes it possible to model the distribution of f by a Conditional Auto-Regressive (CAR) model (see [20]). Thus,
p(f) \propto \exp\Big\{ −\frac{1}{2\sigma_w^2}\, f^{\rm T}(I − \phi N) f \Big\}, \qquad (3)

where N_{ij} = 1 if cells i and j are spatial neighbors (pixels at distance one), zero otherwise, and \phi is just less than 0.25. The parameters can be interpreted by the following expressions describing the conditional distribution:

E( f(i) | f(j), j \ne i ) = \phi \sum_{j\ {\rm nhbr}\ i} f(j), \qquad {\rm var}( f(i) | f(j), j \ne i ) = \sigma_w^2, \qquad (4)

where the suffix 'j nhbr i' denotes the four neighbor pixels at distance one from pixel i (see Fig. 1). The parameter \sigma_w^2 measures the smoothness of the 'true' image.
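A minimal sketch of this prior, assuming a toroidal (wrap-around) lattice and hypothetical helper names, is given below; it evaluates the CAR energy f^T(I − φN)f/(2σ_w²) directly from the four-neighbor sums.

```python
import numpy as np

def car_energy(f, phi, sigma_w2):
    """Negative log of the CAR prior (up to a constant) on a toroidal lattice.

    f is a 2-D array of intensities; N couples each pixel to its four neighbors.
    """
    neighbor_sum = (np.roll(f, 1, axis=0) + np.roll(f, -1, axis=0) +
                    np.roll(f, 1, axis=1) + np.roll(f, -1, axis=1))
    quad_form = np.sum(f * (f - phi * neighbor_sum))   # f^T (I - phi N) f
    return quad_form / (2.0 * sigma_w2)

# Conditional interpretation (Eq. (4)): given its neighbors, a pixel has
# mean phi * (sum of the four neighbors) and variance sigma_w2.
```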
From Eq. (3) we have

−\log p(f) = {\rm const} + \frac{1}{2\sigma_w^2} \sum_i f(i)\big( f(i) − \phi (Nf)(i) \big).

Then, if i:+1, i:+2, i:+3, i:+4 denote the four pixels around pixel i as described in Fig. 1 and we assume a 'toroidal edge correction', we have

−\log p(f) = {\rm const} + \frac{1}{2\sigma_w^2} \sum_i \big[ \phi( f(i) − f(i{:}{+}1) )^2 + \phi( f(i) − f(i{:}{+}2) )^2 + (1 − 4\phi) f^2(i) \big].

This expression can be rewritten as

−\log p(f, l) = {\rm const} + \frac{1}{2\sigma_w^2} \sum_i \big[ \phi( f(i) − f(i{:}{+}1) )^2 (1 − l([i, i{:}{+}1])) + \beta l([i, i{:}{+}1]) + \phi( f(i) − f(i{:}{+}2) )^2 (1 − l([i, i{:}{+}2])) + \beta l([i, i{:}{+}2]) + (1 − 4\phi) f^2(i) \big], \qquad (5)

where l([i, j]) \equiv 0 for all i and j. We now introduce a line process by simply redefining the function l([i, j]) as taking the value zero if pixels i and j are not separated by an active line and one otherwise. We then penalize the introduction of an active line element in the position [i, j] (see Fig. 1) by the term \beta l([i, j]), since otherwise the expression in Eq. (5) would obtain its minimum value by setting all line elements equal to one. The intuitive interpretation of this line process is simple; it acts as an activator or inhibitor of the relation between two neighbor pixels depending on whether or not the pixels are separated by an edge.

Fig. 1. Image and line sites.

2.3. Noise models

A simplified but very realistic noise model for many applications is to assume that it is Gaussian with mean zero and variance \sigma_n^2. This means that the observed image corresponds to the model g(i) = (Df)(i) + n(i) = \sum_j d(i − j) f(j) + n(i), where D is the p \times p matrix defining the systematic blur, assumed to be known and approximated by a block circulant matrix, n(i) is the additive Gaussian noise with zero mean and variance \sigma_n^2 and d(j) are the coefficients defining the blurring function. Then, the probability of the observed image g if f were the 'true' image is

p(g | f) \propto \exp\Big[ −\frac{1}{2\sigma_n^2}\, \|g − Df\|^2 \Big]. \qquad (6)

3. MAP estimation using stochastic and deterministic relaxation

The MAP estimates of f and l, \hat f, \hat l, are given by

\hat f, \hat l = \arg\max_{f, l}\, p(f, l | g). \qquad (7)

This is an obvious extension of (2) where now we have to estimate both the image and line processes. The modified simulated annealing (MSA) algorithm we are going to propose in this work ensures convergence to a local MAP estimate regardless of the initial solution.

We start by examining the SA procedure as defined by Jeng [2]. Since p(f, l | g) is nonlinear it is extremely difficult to find \hat f and \hat l by any conventional method. Simulated annealing is a relaxation technique to search for MAP estimates from degraded observations. It uses the distribution

p_T(f, l | g) = \frac{1}{Z_T} \exp\Big\{ −\frac{1}{T}\,\frac{1}{2\sigma_n^2}\, \|g − Df\|^2 − \frac{1}{T}\,\frac{1}{2\sigma_w^2} \sum_i \big[ \phi( f(i) − f(i{:}{+}1) )^2 (1 − l([i, i{:}{+}1])) + \beta l([i, i{:}{+}1]) + \phi( f(i) − f(i{:}{+}2) )^2 (1 − l([i, i{:}{+}2])) + \beta l([i, i{:}{+}2]) + (1 − 4\phi) f^2(i) \big] \Big\}, \qquad (8)

where T is the temperature and Z_T a normalization constant. We shall need to simulate the conditional a posteriori density function for l([i, j]), given the rest of l, f and g, and the conditional a posteriori density function for f(i) given the rest of f, l and g. To simulate the line process conditional a posteriori density function, p_T(l([i, j]) |
l([k, l]): ∀[k, l] ≠ [i, j], f, g), we have

p_T( l([i, j]) = 0 | l([k, l]): ∀[k, l] ≠ [i, j], f, g ) \propto \exp\Big[ −\frac{1}{T}\,\frac{\phi}{2\sigma_w^2} ( f(i) − f(j) )^2 \Big], \qquad (9)

p_T( l([i, j]) = 1 | l([k, l]): ∀[k, l] ≠ [i, j], f, g ) \propto \exp\Big[ −\frac{1}{T}\,\frac{\beta}{2\sigma_w^2} \Big]. \qquad (10)

Furthermore, for our Gaussian noise model,

p_T( f(i) | f(j): ∀j ≠ i, l, g ) \sim N\big( \mu^{l[i]}(i),\, T \sigma^2_{l[i]}(i) \big), \qquad (11)

where \mu^{l[i]}(i) and \sigma^2_{l[i]}(i) are given by

\mu^{l[i]}(i) = \lambda^{l[i]}(i)\, \phi \sum_{j\ {\rm nhbr}\ i} \frac{ f(j)(1 − l([i, j])) }{ nn^{l[i]}(i) } + (1 − \lambda^{l[i]}(i)) \Big( \frac{ (D^{\rm T}g)(i) − (D^{\rm T}Df)(i) }{ c } + f(i) \Big), \qquad (12)

\sigma^2_{l[i]}(i) = \frac{ \sigma_w^2 \sigma_n^2 }{ nn^{l[i]}(i)\, \sigma_n^2 + c\, \sigma_w^2 }, \qquad (13)

where c is the sum of the squares of the coefficients defining the blur function, that is, c = \sum_j d(j)^2,

nn^{l[i]}(i) = \phi \sum_{j\ {\rm nhbr}\ i} (1 − l([i, j])) + (1 − 4\phi), \qquad \lambda^{l[i]}(i) = \frac{ nn^{l[i]}(i)\, \sigma_n^2 }{ nn^{l[i]}(i)\, \sigma_n^2 + c\, \sigma_w^2 },

and l[i] is the four-dimensional vector representing the line process configuration around image pixel i. Then the sequential SA to find the MAP, with no blurring (D = I), proceeds as follows (see [2]):

Algorithm 1 (Sequential SA procedure). Let i_t, t = 1, 2, ..., be the sequence in which the sites are visited for updating.

1. Set t = 0 and assign an initial configuration denoted as f_{−1}, l_{−1} and initial temperature T(0) = 1.
2. The evolution l_{t−1} → l_t of the line process can be obtained by sampling the next point of the line process from the raster-scanning scheme based on the conditional probability mass function defined in Eqs. (9) and (10) and keeping the rest of l_{t−1} unchanged.
3. Set t = t + 1. Go back to step 2 until a complete sweep of the field l is finished.
4. The evolution f_{t−1} → f_t of the image f can be obtained by sampling the next value of the image process from the raster-scanning scheme based on the conditional probability mass function given in Eq. (11) and keeping the rest of f_{t−1} unchanged.
5. Set t = t + 1. Go back to step 4 until a complete sweep of the field f is finished.
6. Go to step 2 until t > t_f, where t_f is a specified integer.

Note that steps 2 and 3 consist in an exact and direct sampling from the independent conditional law p_T( l([i, j]) | f_{t−1}, g ) = \prod_{\langle i, j \rangle} p_T( l([i, j]) | f_{t−1}(i), f_{t−1}(j) ), while steps 4 and 5 consist in one sweep of Gibbs sampling from the conditional distribution p_T( f | l_t, g ) with initial configuration f_{t−1}.

The following theorem from Jeng [2] guarantees that the SA algorithm converges to the MAP estimate in the case of no blurring.

Theorem 1. If the following conditions are satisfied:

1. |\phi| < 0.25,
2. T(t) → 0 as t → ∞, such that,
3. T(t) ≥ C_T/\log(1 + k(t)),

then for any starting configuration f_{−1}, l_{−1}, we have

p( f_t, l_t | f_{−1}, l_{−1}, g ) → p_0( f, l ) as t → ∞,

where p_0(\cdot, \cdot) is the uniform probability distribution over the MAP solutions, C_T is a positive constant and k(t) is the sweep iteration number at time t.

Instead of using a stochastic approach, we can use a deterministic method to search for a local maximum. An advantage of the deterministic method is that its convergence is much faster than the stochastic approach, since instead of simulating the distributions, the mode of the corresponding conditional distribution is chosen. The disadvantage is the local nature of the solution obtained. This method can be seen as a particular case of SA where the temperature is always set to zero.
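As a rough illustration of one sweep of Algorithm 1 in the no-blur case (D = I, so c = 1), the sketch below, with hypothetical helper names and a toroidal lattice, samples the line process from Eqs. (9)–(10) and then Gibbs-samples each pixel from the Gaussian of Eqs. (11)–(13).

```python
import numpy as np

def sa_sweep(f, l_h, l_v, g, phi, beta, sw2, sn2, T, rng):
    """One sweep of the sequential SA procedure for the no-blur case (c = 1).

    l_h[i, j] is the line element between (i, j) and its right neighbor,
    l_v[i, j] the one between (i, j) and the neighbor below (toroidal lattice).
    """
    rows, cols = f.shape
    # Line-process step: Bernoulli draw from Eqs. (9)-(10).
    for lines, axis in ((l_h, 1), (l_v, 0)):
        diff2 = (f - np.roll(f, -1, axis=axis)) ** 2
        p_off = np.exp(-phi * diff2 / (2 * sw2 * T))
        p_on = np.exp(-beta / (2 * sw2 * T))
        lines[...] = (rng.random(f.shape) < p_on / (p_on + p_off)).astype(float)
    # Image step: Gibbs-sample each pixel from N(mu, T*sigma2), Eqs. (11)-(13).
    for i in range(rows):
        for j in range(cols):
            nbrs = [((i, (j + 1) % cols), l_h[i, j]),
                    ((i, (j - 1) % cols), l_h[i, (j - 1) % cols]),
                    (((i + 1) % rows, j), l_v[i, j]),
                    (((i - 1) % rows, j), l_v[(i - 1) % rows, j])]
            nn = phi * sum(1 - lij for _, lij in nbrs) + (1 - 4 * phi)
            lam = nn * sn2 / (nn * sn2 + sw2)
            prior_mean = phi * sum(f[p] * (1 - lij) for p, lij in nbrs) / nn
            mu = lam * prior_mean + (1 - lam) * g[i, j]   # with D = I the data term is g(i)
            sigma2 = sw2 * sn2 / (nn * sn2 + sw2)
            f[i, j] = rng.normal(mu, np.sqrt(T * sigma2))
    return f, l_h, l_v
```

The ICM variant replaces the two sampling steps by their modes, i.e., it picks the more probable line state and sets each pixel to mu.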
4. Instability of the SA and ICM solutions

Unfortunately, due to the presence of blurring, the convergence of SA has not been established for this problem. The main problem of these methods is that, if c is small, as is the case for severely blurred images, the term [(D^{\rm T}g)(i) − (D^{\rm T}Df)(i)]/c in (12) is highly unstable. For the ICM method the problem gets worse because sudden changes in the first stages, due to the line process, become permanent (see Ref. [21]).

Let us examine intuitively and formally why we may have convergence problems with Algorithm 1 and its deterministic relaxation approximation when severe blurring is present. Let us assume for simplicity no line process and examine the iterative procedure where we update the whole image at the same time; it is important to note that this is not the parallel version of SA but an
where a"1/p2w and b"1/p2n or
iterative procedure. We have f "j/Nf !(1!j) t t~1
C
D
DTD DTg f !f # (1!j) t~1 c t~1 c
"Af #const, (14) t~1 where t is the iteration number, understood as sweep of the whole image, and
C
A" I!j(I!/N)!(1!j)
For the method to converge, A must be a contraction mapping. However, this may not be the case. For instance, if the image suffers from severe blurring then c is close to zero and the matrix [D^{\rm T}D/c] has eigenvalues greater than one. Furthermore, if the image has a high dynamic range, like astronomical images where ranges [0, 7000] are common, it is natural to assume that \sigma_w^2 is big and thus (1 − \lambda)[D^{\rm T}D/c] has eigenvalues greater than one. Therefore, this iterative method may not converge. It is important to note that, when there is no blurring, c = 1 and A is a contraction mapping.

Let us modify A in order to have a contraction. Adding [(1 − \lambda)(1 − c)/c] f_t to both sides of Eq. (14) we have, in the iterative procedure,

( 1 + [(1 − \lambda)(1 − c)/c] ) f_t = [(1 − \lambda)(1 − c)/c] f_{t−1} + A f_{t−1} + {\rm const}

or

f_t = u f_{t−1} + (1 − u)[ A f_{t−1} + {\rm const} ],

with u = (1 − c)\sigma_w^2/(\sigma_n^2 + \sigma_w^2). We then have for this new iterative procedure

f_t = \tilde A f_{t−1} + (1 − u)\,{\rm const}, \quad where \quad \tilde A = [ I − \rho(I − \phi N) − (1 − \rho) D^{\rm T}D ],

with \rho = \sigma_n^2/(\sigma_n^2 + \sigma_w^2), is now a contraction mapping.

5. The modified simulated annealing algorithm

Let us now examine how to obtain a contraction for our iterative procedure. Let us rewrite Eq. (12) as an iterative procedure and add ( a(1 − nn^{l[i]}(i)) + b(1 − c) ) f_t(i) to each side of the equation; we have

(a + b) f_t(i) = ( a(1 − nn^{l[i]}(i)) + b(1 − c) ) f_{t−1}(i) + a\phi \sum_{j\ {\rm nhbr}\ i} f_{t−1}(j)(1 − l([i, j])) + b\big( (D^{\rm T}g)(i) − (D^{\rm T}Df_{t−1})(i) + c f_{t−1}(i) \big), \qquad (16)

where a = 1/\sigma_w^2 and b = 1/\sigma_n^2, or

f_t(i) = u^{l_{t−1}[i]}(i) f_{t−1}(i) + ( 1 − u^{l_{t−1}[i]}(i) ) \Big[ \lambda^{l[i]}(i)\, \phi \sum_{j\ {\rm nhbr}\ i} \frac{ f_{t−1}(j)(1 − l([i, j])) }{ nn^{l[i]}(i) } + ( 1 − \lambda^{l[i]}(i) ) \Big( \frac{ (D^{\rm T}g)(i) − (D^{\rm T}Df_{t−1})(i) }{ c } + f_{t−1}(i) \Big) \Big], \qquad (17)

where u^{l[i]}(i) = ( \sigma_n^2(1 − nn^{l[i]}(i)) + (1 − c)\sigma_w^2 )/( \sigma_n^2 + \sigma_w^2 ). So, in order to have a contraction, we update the whole image at the same time using the value of f_t(i) obtained in the previous iteration, f_{t_k−1}(i), and, instead of simulating from the normal distribution defined in Eq. (11) to obtain the new value of f_t(i), we simulate from the distribution

N\big( \mu_m^{l_{t−1}[i]}(i),\, T \sigma^2_{m, l_{t−1}[i]}(i) \big) \qquad (18)

with mean

\mu_m^{l_{t−1}[i]}(i) = u^{l_{t−1}[i]}(i)\, f_{t_k−1}(i) + ( 1 − u^{l_{t−1}[i]}(i) )\, \mu^{l_{t−1}[i]}(i)

and

\sigma^2_{m, l_{t−1}[i]}(i) = \big( 1 − ( u^{l_{t−1}[i]}(i) )^2 \big)\, \sigma^2_{l_{t−1}[i]}(i). \qquad (19)

The reason to use this modified variance is clear if we take into account that, if

X \sim N(m, \sigma^2)

and

Y | X \sim N\big( \lambda X + (1 − \lambda)m,\, (1 − \lambda^2)\sigma^2 \big),

where 0 < \lambda < 1, then

Y \sim N(m, \sigma^2).

We then have for this iterative method that the transition probabilities are

\pi_{T(t_k)}( f_{t_k} | f_{t_k−1}, l_{t_k}, g ) \propto \exp\Big( −\frac{1}{2T(t_k)} [ f_{t_k} − M^{l_{t_k}} f_{t_k−1} − Q^{l_{t_k}} g ]^{\rm T} [ Q_1^{l_{t_k}} ]^{−1} [ f_{t_k} − M^{l_{t_k}} f_{t_k−1} − Q^{l_{t_k}} g ] \Big), \qquad (20)

where

M^{l_{t_k}} = \Omega^{l_{t_k}} + ( I − \Omega^{l_{t_k}} )( C^{l_{t_k}} − (D^{\rm T}D)_H^{l_{t_k}} ), \qquad (21)

Q^{l_{t_k}} = ( I − \Omega^{l_{t_k}} ) B^{l_{t_k}}, \qquad (22)

where

C^{l_{t_k}}[i] * f_{t_k}(i) = \phi\, \lambda^{l_{t_k}[i]}(i) \sum_{j\ {\rm nhbr}\ i} \frac{ (1 − l([i, j])) f_{t_k}(j) }{ nn^{l_{t_k}[i]}(i) }

and

(D^{\rm T}D)_H^{l_{t_k}}[i] * f_{t_k}(i) = ( 1 − \lambda^{l_{t_k}[i]}(i) )\Big( \frac{ (D^{\rm T}Df)(i) }{ c } − f(i) \Big).
\Omega^{l_{t_k}} is a diagonal matrix with entries u^{l_{t_k}[i]}(i) and Q_1^{l_{t_k}} is a diagonal matrix with entries \sigma^2_{m, l[i]}(i).

In the next section we apply the modified SA and ICM algorithms, whose convergence is established in the Appendix, to restore astronomical images. The algorithms are the following:
The modi"ed ICM procedure is obtained by selecting in steps 2 and 4 of Algorithm 2 the mode of the corresponding transition probabilities.
Algorithm 2 (MSA procedure). Let i , t"1, 2,2, be the t sequence in which the sites are visited for updating.
Let us "rst examine how the modi"ed ICM algorithm works on a synthetic star image, blurred with an atmospherical point spread function (PSF), D, given by
1. Set t"0 and assign an initial conxguration denoted as f , l and initial temperature ¹(0)"1. ~1 ~1 2. The evolution l Pl of the line process can be obt~1 t tained by sampling the next point of the line process from the raster-scanning scheme based on the conditional probability mass function dexned in Eqs. (9) and (10) and keeping the rest of l unchanged. t~1 3. Set t"t#1. Go back to step 2 until a complete sweep of the xeld l is xnished. 4. The evolution f Pf of the image system can be obt~1 t tained by sampling the next value of the whole image based on the conditional probability mass function given in Eq. (17) 5. Go to step 2 until t't , where t is a specixed integer. f f The following theorem guarantees that the MSA algorithm converges to a local MAP estimate, even in the presence of blurring. Theorem 2. If the following conditions are satisxed: 1. D/D(0.25 2. ¹(t)P0 as tPR, such that 3. ¹(t)*C /log(1#k(t)), T then for any starting conxguration f , l , we have ~1 ~1 p( f , l D f , l , g)Pp ( f, l) as tPR, t t ~1 ~1 0 where p (. , .) is a probability distribution over local MAP 0 solutions, C is a constant and k(t) is the sweep iteration T number at time t. We notice that if the method converges to a con"guration ( f, l), then fM "arg max p( f DlM , g) f Furthermore, lM "arg max p(lD fM , g) l We conjecture that the method we are proposing converges to an distribution over global maxima. However, the di$culty of using synchronous models prevent us from proving that result (See Ref. [22]).
6. Test examples

Let us first examine how the modified ICM algorithm works on a synthetic star image, blurred with an atmospherical point spread function (PSF), D, given by

d(i) \propto ( 1 + (u^2 + v^2)/R^2 )^{−d}, \qquad (23)

with d = 3, R = 3.5, i = (u, v), and Gaussian noise with \sigma_n^2 = 64. If we use \sigma_w^2 = 24415, which is realistic for this image, and take into account that, for the PSF defined in Eq. (23), c = 0.02, A defined in Eq. (15) is not a contraction. Figs. 2a and b depict the original and corrupted image, respectively. Restorations from the original and modified ICM methods with \beta = 2 for 2500 iterations are depicted in Figs. 2c and d, respectively. Similar results are obtained with 500 iterations.

The proposed methods were also tested on real images and compared with ARTUR, the method proposed by Charbonnier et al. [19]. ARTUR minimizes energy functions of the form
J(f) = \lambda^2 \Big\{ \sum_i \varphi[ f(i) − f(i{:}{+}1) ] + \sum_i \varphi[ f(i) − f(i{:}{+}2) ] \Big\} + \| g − Df \|^2, \qquad (24)

where \lambda is a positive constant and \varphi is a potential function satisfying some edge-preserving conditions. The potential functions we used in our experiments, \varphi_{GM}, \varphi_{HL}, \varphi_{HS} and \varphi_{GR}, are shown in Table 1.

Charbonnier et al. [19] show that, for those \varphi functions, it is always possible to find a function J^* such that

J(f) = \inf_l J^*(f, l),

where J^* is a dual energy which is quadratic in f when l is fixed. l can be understood as a line process which, for those potential functions, takes values in the interval [0, 1]. To minimize Eq. (24), Charbonnier et al. propose the following iterative scheme:

1. n = 0, f^0 \equiv 0
2. Repeat
3. l^{n+1} = \arg\min_l [ J^*(f^n, l) ]
4. f^{n+1} = \arg\min_f [ J^*(f, l^{n+1}) ]
5. n = n + 1
6. Until convergence.
Fig. 2. (a) Original image, (b) observed image, (c) ICM restoration, (d) restoration with the proposed ICM method.
Table 1
Edge-preserving potential functions used with ARTUR

Potential function:        \varphi_{GM}        \varphi_{HL}       \varphi_{HS}              \varphi_{GR}
Expression of \varphi(t):  t^2/(1 + t^2)       \log(1 + t^2)      2\sqrt{1 + t^2} − 2       2\log[\cosh(t)]
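For reference, the four potentials of Table 1 are easy to express directly; the small sketch below defines them as plain functions (the dictionary and its keys are ours).

```python
import numpy as np

potentials = {
    "GM": lambda t: t**2 / (1.0 + t**2),               # Geman-McClure
    "HL": lambda t: np.log(1.0 + t**2),                # Hebert-Leahy
    "HS": lambda t: 2.0 * np.sqrt(1.0 + t**2) - 2.0,   # hypersurface potential
    "GR": lambda t: 2.0 * np.log(np.cosh(t)),          # Green
}
```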
In our experiments the convergence criterion used in step 6 above was \| f^{n+1} − f^n \|^2 / \| f^n \|^2 < 10^{−6}. The solution of step 4 was found by a Gauss–Seidel algorithm. The stopping criterion was \| f^{n+1,m+1} − f^{n+1,m} \|^2 / \| f^{n+1,m} \|^2 < 10^{−6}, where m is the iteration number.

We use images of Saturn which were obtained at the Cassegrain f/8 focus of the 1.52 m telescope at Calar Alto Observatory (Spain) in July 1991. Results are presented on an image taken through a narrow-band interference filter centered at the wavelength 9500 Å. The blurring function defined in Eq. (23) was used. The parameters d and R were estimated from the intensity profiles of satellites of Saturn that were recorded simulta-
neously with the planet and of stars that were recorded very close in time and airmass to the planetary images. We found d ≈ 3 and R ≈ 3.4 pixels.

Fig. 3 depicts the original image and the restorations after running the original ICM and our proposed ICM methods for 500 iterations and the original SA and our proposed SA methods for 5000 iterations. In all the images the improvement in spatial resolution is evident. In particular, ring light contribution has been successfully removed from equatorial regions close to the actual location of the rings and amongst the rings of Saturn, the Cassini division is enhanced in contrast, and the Encke division appears on the ansae of the rings in all deconvolved images.

To examine the quality of the MAP estimate of the line process we compared it with the position of the ring and disk of Saturn, obtained from the Astronomical Almanac, corresponding to our observed image. Although all the methods detect a great part of the ring and the disk, the ICM method (Fig. 4a) shows thick lines. The SA method, on the other hand, gives us thinner lines and the details are more resolved (Fig. 4b). Obviously, there are some gaps in the line process but better results would be obtained by using eight neighbors instead of four or, in general, adding more l-terms to the energy function.

Fig. 5 depicts the results after running ARTUR using potential functions \varphi_{GM}, \varphi_{HL}, \varphi_{HS} and \varphi_{GR} on the Saturn image together with the results obtained by the proposed
Fig. 3. (a) Original image, (b) restoration with the original ICM method and (c) its line process, (d) restoration with the original SA method and (e) its line process, ( f ) restoration with the proposed ICM method and (g) its line process, (h) restoration with the proposed SA method and (i) its line process.
Fig. 4. Comparison between the real edges (light) and the obtained line process (dark). (a) proposed ICM method, (b) proposed SA method.
ICM method. Note that line processes obtained by the potential functions used with ARTUR are presented in inverse gray levels. The results suggest that \varphi_{GM} and \varphi_{HL} capture the lines of the image better than \varphi_{HS} and \varphi_{GR}. Lines captured by all these functions are thicker than those obtained by the proposed ICM method; notice that the line process produced by these potential functions is continuous on the interval [0, 1]. Furthermore, \varphi_{GM} also captures some low-intensity lines, due to the noise, which create some artifacts on the restoration, especially on the Saturn rings, see Fig. 5b. Finally, the potential functions used with ARTUR have captured the totality of the planet contour although the line process intensity on the contour is quite low.

The methods were also tested on images of Jupiter which were also obtained at the Cassegrain f/8 focus of the 1.52 m telescope at Calar Alto Observatory (Spain) in August 1992. The blurring function was the same as in the previous experiment. Fig. 6 depicts the original image and the restorations after running the original ICM and our proposed ICM method for 500 iterations and our proposed SA method for 5000 iterations. In all the images the improvement in spatial resolution is evident. Features like the Equatorial plumes and great red spot are
Fig. 5. (a) Restoration with the proposed ICM method; (f) and (k) its corresponding horizontal and vertical line processes. (b)–(e) show the restoration when ARTUR is run for the potentials \varphi_{GM}, \varphi_{HL}, \varphi_{HS} and \varphi_{GR}, respectively. Their corresponding horizontal line processes are shown in (g)–(j) and their vertical line processes are shown in (l)–(o).
very well detected. ARTUR was also tested on these images, obtaining results similar to those obtained with Saturn.

In order to obtain a numerical comparison, ARTUR and our methods were tested and compared using the cameraman image. The image was blurred using the PSF defined in Eq. (23) with the parameters d = 3 and R = 4. Gaussian noise with variance 62.5 was added, obtaining an image with SNR = 20 dB. The original and observed images are shown in Fig. 7.
Fig. 6. (a) Original image, (b) restoration with the original ICM method and (c) its line process, (d) restoration with the proposed ICM method and (e) its line process, ( f ) restoration with the proposed SA method and (g) its line process.
In order to compare the quality of the restorations we used the peak signal-to-noise ratio (PSNR) which, for two images f and g of size M \times N, is defined as

{\rm PSNR} = 10 \log_{10} \Big[ \frac{ M \times N \times 255^2 }{ \| g − f \|^2 } \Big].

Fig. 7. (a) Original cameraman image, (b) observed image.
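A direct transcription of this measure (a small sketch; the function name is ours) is:

```python
import numpy as np

def psnr(reference, restored):
    """Peak signal-to-noise ratio in dB for 8-bit images of identical size."""
    mse = np.mean((np.asarray(reference, float) - np.asarray(restored, float)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```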
Figs. 8 and 9 depict the restorations after running our proposed SA method for 5000 iterations, our proposed ICM method for 500 iterations and ARTUR with different potential functions.
Results, shown in Table 2, are quite similar for all the methods but they suggest that better results are obtained when our proposed SA method is used. For the two best methods in terms of the PSNR, our proposed SA and ARTUR with \varphi_{GM}, we have included cross-sections of the original and restored images in Fig. 10. It can be observed that, although both profiles are quite similar, our proposed SA method obtains sharper edges than the ones obtained with \varphi_{GM}.
Fig. 8. (a) Restoration with the proposed SA method and (b), (c) its horizontal and vertical line process, (d) restoration with the proposed ICM method and (e), (f) its horizontal and vertical line process, (g) restoration with ARTUR with the \varphi_{GM} function and (h), (i) its horizontal and vertical line process.
Table 2
Comparison of the different restoration methods in terms of PSNR

                                                      ARTUR with
            Observed   Proposed ICM   Proposed SA    \varphi_{GM}   \varphi_{HL}   \varphi_{HS}   \varphi_{GR}
PSNR (dB)   18.89      20.72          21.11          20.75          20.64          20.72          20.51
Table 3 shows the total computing time of the studied methods after running them on one processor of a Silicon Graphics Power Challenge XL. It also shows the relative execution time referred to the computing time of the ICM
method. The little difference between the ICM and SA methods is due to the fact that most of the time is spent in convolving images.
Fig. 9. (a) Restoration with ARTUR with the \varphi_{HL} function and (b), (c) its horizontal and vertical line process, (d) restoration with ARTUR with the \varphi_{HS} function and (e), (f) its horizontal and vertical line process, (g) restoration with ARTUR with the \varphi_{GR} function and (h), (i) its horizontal and vertical line process.

Table 3
Total computing time of the methods and relative time per iteration referred to the ICM

                    Original              Proposed              ARTUR with
                    ICM       SA          ICM       SA          \varphi_{GM}   \varphi_{HL}   \varphi_{HS}   \varphi_{GR}
Total time (s)      1149      12852       140       2250        198            38             29             29
Relative time       1.00      1.13        0.12      0.20        0.17           0.17           0.17           0.17

7. Conclusions

In this paper we have presented two new methods that can be used to restore high dynamic range images in the presence of severe blurring. These methods extend the classical ICM and SA procedures, and the convergence of the algorithms is guaranteed. The experimental results verify the derived theoretical results. Further extensions of the algorithms are under consideration.
Fig. 10. Cross section of line 153 of the original cameraman image (solid line) and reconstructed images (dotted line) with (a) the proposed SA and (b) ARTUR with $\varphi_{GM}$.

Appendix. Convergence of the MSA procedure

In this section we examine the convergence of the MSA algorithm. It is important to make clear that in this new iterative procedure we simulate $f(i)$ using Eq. (17), and to simulate $l([i,j])$ we keep using Eqs. (9) and (10). We shall denote by $\pi_T$ the corresponding transition probabilities; that is, $\pi_{T(t_k)}(f_{t_k} \mid f_{t_k-1}, l_{t_k}, g)$ is obtained from Eq. (20) and $\pi_{T(t_k)}(l_{t_k} \mid f_{t_k-1}, l_{t_k-1})$ is obtained from Eqs. (9) and (10). Since updating the whole image at the same time prevents us from having a stationary distribution, we will not be able to show convergence to the global MAP estimates using the same proofs as in Geman and Geman [1] and Jeng and Woods [3]. To prove the convergence of the chain we need some lemmas and definitions, as in Jeng [2] and Jeng and Woods [3].

We assume a measure space $(\Omega, \Sigma, \mu)$ and a conditional density function $\pi_n(s_n \mid s_{n-1})$ which defines a Markov chain $s_1, s_2, \ldots, s_n, \ldots$. In our application, the $s_i$ are vector valued, with a number of elements equal to the number of pixels in the image. For simplicity, we assume $\Omega$ is $R^d$ and $\mu$ is the Lebesgue measure on $R^d$. Define a Markov operator $P_n : L^1 \to L^1$ as follows:
$$P_n \pi(s_n) = \int_\Omega \pi_n(s_n \mid s_{n-1})\, \pi(s_{n-1})\, ds_{n-1}. \qquad (A.1)$$
By $P_n^m$ we mean the composite operation $P_{n+m} P_{n+m-1} \cdots P_{n+2} P_{n+1}$. The convergence problem we are dealing with is the same as the convergence of $P_0^m$ as $m \to \infty$.

Definition A.1. Let $x$ be a vector with components $x(i)$ and $Q$ be a matrix with components $q(i,j)$. We define $\|x\|_2$ and $\|Q\|_2$ as follows:
$$\|x\|_2 = \Bigl(\sum_i |x(i)|^2\Bigr)^{1/2}, \qquad \|Q\|_2 = \sup_x \frac{\|Qx\|_2}{\|x\|_2} = \max_i (\rho(i))^{1/2},$$
where the $\rho(i)$ are the eigenvalues of the matrix $Q^T Q$.

Definition A.2. A continuous nonnegative function $V : \Omega \to R$ is a Lyapunov function if
$$\lim_{\|s\|\to\infty} V(s) = \infty, \qquad (A.2)$$
where $\|s\|$ is a norm of $s$.

Denote by $D$ the set of all pdf's with respect to Lebesgue measure, with the $L^1$ norm defined as
$$\|\pi\|_1 = \int_\Omega |\pi(s)|\, ds \qquad \forall\, \pi \in L^1.$$

Definition A.3. Let $P_n : L^1 \to L^1$ be a Markov operator. Then $\{P_n\}$ is said to be asymptotically stable if, for any $\pi_1, \pi_2 \in D$,
$$\lim_{m\to\infty} \|P_0^m(\pi_1 - \pi_2)\|_1 = 0. \qquad (A.3)$$

The following theorem from Jeng and Woods [3] gives sufficient conditions for the asymptotic stability of $P_0^m$ in terms of the transition density functions.

Theorem A.1. Let $(\Omega, \Sigma, \mu)$ be a measure space and $\mu$ be Lebesgue measure on $R^d$. If there exists a Lyapunov function $V : \Omega \to R$ such that
$$\int_\Omega V(s_n)\, \pi_n(s_n \mid s_{n-1})\, ds_n \le a\, V(s_{n-1}) + b \quad \text{for } 0 \le a < 1 \text{ and } b \ge 0, \qquad (A.4)$$
and
$$\sum_{i=1}^{\infty} \|h_{m_i}\|_1 = \infty, \qquad m_i = i\,\tilde m \text{ for any integer } \tilde m > 0, \qquad (A.5)$$
where
$$h_{m_i}(s_{m_i}) = \inf_{\|s_{m_i-1}\| \le r} \pi_{m_i}(s_{m_i} \mid s_{m_i-1}) \qquad (A.6)$$
and $r$ is a positive number satisfying the inequality
$$V(s) > 1 + \frac{b}{1-a} \quad \forall\, \|s\| > r,$$
then, for the Markov operator $P_n : L^1 \to L^1$ defined by Eq. (25), $P_0^m$ is asymptotically stable.

We are going to show that the sufficient conditions of Theorem A.1 are satisfied by the Markov chain defined by our MSA procedure when the parameters describing Algorithm 2 satisfy the conditions of Theorem 2. Since asymptotic stability of an inhomogeneous Markov sequence implies convergence, this will imply for our restoration problem that, for any starting configuration $f_{-1}, l_{-1}$, we have $p(f_t, l_t \mid f_{-1}, l_{-1}, g) \to p_0(f, l)$ as $t \to \infty$, where $p_0(\cdot,\cdot)$ is the probability distribution over the MAP solutions. Let us prove the following lemma.

Lemma A.1. If $|\phi| < 0.25$ then, $\forall l$,
$$\|M_l\|_2 < 1,$$
where $M_l$ has been defined in (21).

Proof. First we note that, from (16),
$$f_t(i) = f_{t-1}(i) - \rho\Bigl(\phi \sum_{j \in \partial i} \bigl(f_{t-1}(i) - f_{t-1}(j)\bigr)\bigl(1 - l([i,j])\bigr) + (1 - 4\phi)\, f_{t-1}(i)\Bigr) + (1-\rho)\bigl((D^T g)(i) - (D^T D f_{t-1})(i)\bigr),$$
where $\rho = a/(a+b)$. So $M_l$ is symmetric and, for any vector $x$,
$$x^T M_l x = \sum_i x(i)^2 - \rho\Bigl(\phi \sum_i \bigl(x(i) - x(i{:}1)\bigr)^2 \bigl(1 - l([i, i{:}1])\bigr) + \phi \sum_i \bigl(x(i) - x(i{:}2)\bigr)^2 \bigl(1 - l([i, i{:}2])\bigr) + (1-4\phi)\sum_i x(i)^2\Bigr) - (1-\rho)\, x^T D^T D x.$$
Obviously, if $|\phi| < 0.25$ then, $\forall x \ne 0$, $x^T M_l x < \sum_i x(i)^2$. Furthermore,
$$x^T M_l x \ge \sum_i x(i)^2 - \rho\Bigl(\phi \sum_i (x(i)-x(i{:}1))^2 + \phi \sum_i (x(i)-x(i{:}2))^2 + (1-4\phi)\sum_i x(i)^2\Bigr) - (1-\rho)\, x^T D^T D x = x^T\bigl(I - \rho(I - \phi N) - (1-\rho)D^T D\bigr)x,$$
and if $|\phi| < 0.25$, $-I < I - \rho(I - \phi N) - (1-\rho)D^T D$. So, if $|\phi| < 0.25$, $-I < M_l < I$ and $x^T M_l^T M_l x < x^T x$, which proves that $M_l$ is a contraction matrix for $|\phi| < 0.25$. $\square$

We shall also use the following lemma from Jeng [2] and Jeng and Woods [3].

Lemma A.2. Assume $B$ is a $d$-dimensional positive-definite matrix with eigenvalues $\rho(1) \ge \rho(2) \ge \cdots \ge \rho(d) > 0$ and $B = J^T D J$, where $D$ is a diagonal matrix consisting of the eigenvalues. Let $b > 0$; then
$$\frac{1}{\sqrt{(2\pi)^d |B|}} \int_{\|x\|_2 < b} \exp\bigl[-x^T B^{-1} x\bigr]\, dx \;\ge\; q\, \Bigl(\frac{b}{\sqrt{\rho(d)}}\Bigr)^{d-2} \exp\Bigl[-\frac{b^2}{2\rho(d)}\Bigr].$$

Using these two lemmas, let us now show that the sufficient conditions of Theorem A.1 are satisfied. The proof follows the same steps as the one given in Jeng [2] and Jeng and Woods [3]. Let $V(f, l)$ be the Lyapunov function
$$V(f, l) = \|f\|_2 + \|l\|_2. \qquad (A.7)$$

Step 1: Show that
$$\sum_{l_{t_k}} \int_\Omega V(f_{t_k}, l_{t_k})\, \pi_{T(t_k)}(f_{t_k}, l_{t_k} \mid f_{t_k-1}, l_{t_k-1}, g)\, df_{t_k} \;\le\; b + a\, V(f_{t_k-1}, l_{t_k-1}). \qquad (A.8)$$
First we show that
$$\int_\Omega \|f_{t_k}\|_2\, \pi_{T(t_k)}(f_{t_k} \mid f_{t_k-1}, l_{t_k}, g)\, df_{t_k} \le b_1 + a\, \|f_{t_k-1}\|_2 \quad \forall\, l_{t_k}. \qquad (A.9)$$
We have, by a change of variable,
$$\int_\Omega \|f_{t_k}\|_2\, \pi_{T(t_k)}(f_{t_k} \mid f_{t_k-1}, l_{t_k}, g)\, df_{t_k} = \int_\Omega \bigl\|\bar f_{t_k} + M_{l_{t_k}} f_{t_k-1} + Q_{l_{t_k}} g\bigr\|_2\, \mathrm{const}\, \exp\Bigl[-\tfrac{1}{2T(t_k)} \bar f_{t_k}^{\,T} [Q_{l_{t_k}}]^{-1} \bar f_{t_k}\Bigr]\, d\bar f_{t_k}$$
$$\le \int_\Omega \bigl(\|\bar f_{t_k}\|_2 + \|M_{l_{t_k}} f_{t_k-1}\|_2 + \|Q_{l_{t_k}} g\|_2\bigr)\, \mathrm{const}\, \exp\Bigl[-\tfrac{1}{2T(t_k)} \bar f_{t_k}^{\,T} [Q_{l_{t_k}}]^{-1} \bar f_{t_k}\Bigr]\, d\bar f_{t_k} \le b_1 + a\, \|f_{t_k-1}\|_2,$$
where
$$a = \max_l \|M_l\|_2, \qquad b_1 = \max_l \bigl\{[T(t_k)]^{1/2}[\mathrm{trace}(Q_l)]^{1/2} + \|Q_l g\|_2\bigr\},$$
with $a < 1$ since, by Lemma A.1, $M_l$ is a contraction $\forall l$. Furthermore, it can easily be shown that
$$\sum_{l_{t_k}} \pi_{T(t_k)}(l_{t_k} \mid f_{t_k-1}, l_{t_k-1})\, \|l_{t_k}\|_2 \le b_2 + a\, \|l_{t_k-1}\|_2; \qquad (A.10)$$
since $l_{t_k}$ has only a finite number of levels, choosing $b_2$ big enough, the above inequality obviously holds.

Let us now show that Eq. (A.8) holds. We have, using Eqs. (A.9) and (A.10),
$$\sum_{l_{t_k}} \int_\Omega V(f_{t_k}, l_{t_k})\, \pi_{T(t_k)}(f_{t_k}, l_{t_k} \mid f_{t_k-1}, l_{t_k-1}, g)\, df_{t_k} = \sum_{l_{t_k}} \int_\Omega \|f_{t_k}\|_2\, \pi_{T(t_k)}(f_{t_k}, l_{t_k} \mid f_{t_k-1}, l_{t_k-1}, g)\, df_{t_k} + \sum_{l_{t_k}} \int_\Omega \|l_{t_k}\|_2\, \pi_{T(t_k)}(f_{t_k}, l_{t_k} \mid f_{t_k-1}, l_{t_k-1}, g)\, df_{t_k}$$
$$\le b_1 + a\,\|f_{t_k-1}\|_2 + b_2 + a\,\|l_{t_k-1}\|_2 = b + a\, V(f_{t_k-1}, l_{t_k-1}).$$

Step 2: Show that if the temperature $T(t)$ decreases as $C_T/\log(1 + k(t))$, then for any $n_0 > 0$ we have
$$\sum_{m=1}^{\infty} \|h_{m n_0}\|_1 = \infty,$$
where $C_T$ is a positive constant and $k(t)$ is the number of sweeps up to time $t$.

This result can be obtained by establishing a lower bound for $\|h_{t_k}\|_1$. First, recall that
$$h_{t_k} = \inf_{(f_{t_k-1},\, l_{t_k-1}) \in R_a} \pi_{T(t_k)}(f_{t_k}, l_{t_k} \mid f_{t_k-1}, l_{t_k-1}, g),$$
where $R_a$ is defined as $R_a = \{(f, l) \mid V(f, l) \le a\}$. By definition of the $L^1$ norm, we have
$$\|h_{t_k}\|_1 = \sum_{l_{t_k}} \int \inf_{(f_{t_k-1},\, l_{t_k-1}) \in R_a} \pi_{T(t_k)}(f_{t_k}, l_{t_k} \mid f_{t_k-1}, l_{t_k-1}, g)\, df_{t_k}$$
and, from the definition of the iterative procedure,
$$\|h_{t_k}\|_1 = \sum_{l_{t_k}} \int \inf_{(f_{t_k-1},\, l_{t_k-1}) \in R_a} \pi_{T(t_k)}(f_{t_k} \mid f_{t_k-1}, l_{t_k}, g)\, \pi_{T(t_k)}(l_{t_k} \mid f_{t_k-1}, l_{t_k-1})\, df_{t_k}$$
$$\ge \inf_{l_{t_k}} \Bigl[\inf_{(f_{t_k-1},\, l_{t_k-1}) \in R_a} \pi_{T(t_k)}(l_{t_k} \mid f_{t_k-1}, l_{t_k-1}) \int \inf_{(f_{t_k-1},\, l_{t_k-1}) \in R_a} \pi_{T(t_k)}(f_{t_k} \mid f_{t_k-1}, l_{t_k}, g)\, df_{t_k}\Bigr]$$
$$\ge \inf_{(l_{t_k},\, l_{t_k-1})}\, \inf_{\|f_{t_k-1}\|_2 \le a} \pi_{T(t_k)}(l_{t_k} \mid f_{t_k-1}, l_{t_k-1}) \times \inf_{l_{t_k}} \Bigl[\int \inf_{\|f_{t_k-1}\|_2 \le a} \pi_{T(t_k)}(f_{t_k} \mid f_{t_k-1}, l_{t_k}, g)\, df_{t_k}\Bigr].$$
Let
$$d_{\max} = \max\Bigl\{\frac{\phi}{2\sigma_w^2} \sup_{\|f\|_2 \le a} (f(i) - f(j))^2,\; \frac{\beta}{2\sigma_w^2}\Bigr\} \quad \text{and} \quad d_{\min} = \min\Bigl\{\frac{\phi}{2\sigma_w^2} \inf_{\|f\|_2 \le a} (f(i) - f(j))^2,\; \frac{\beta}{2\sigma_w^2}\Bigr\};$$
then, from Eqs. (9) and (10),
$$\inf_{(l_{t_k},\, l_{t_k-1})}\, \inf_{\|f_{t_k-1}\|_2 \le a} \pi_{T(t_k)}(l_{t_k} \mid f_{t_k-1}, l_{t_k-1}) \ge \Bigl[\frac{1}{2} \exp\Bigl(-\frac{d_{\max} - d_{\min}}{T(t_k)}\Bigr)\Bigr]^{\tilde p} = C_2 \exp\Bigl(-\frac{C_1}{T(t_k)}\Bigr),$$
where $C_1 = \tilde p\,(d_{\max} - d_{\min})$ and $C_2 = (1/2)^{\tilde p}$, with $\tilde p$ the number of line sites. Hence,
$$\|h_{t_k}\|_1 \ge C_2 \exp\Bigl(-\frac{C_1}{T(t_k)}\Bigr)\, \inf_{l_{t_k}} \int \inf_{\|f_{t_k-1}\|_2 \le a} \pi_{T(t_k)}(f_{t_k} \mid f_{t_k-1}, l_{t_k}, g)\, df_{t_k}.$$
Now, from Eq. (20),
$$\pi_{T(t_k)}(f_{t_k} \mid f_{t_k-1}, l_{t_k}, g) = \mathrm{const}\, \exp\Bigl[-\tfrac{1}{2T(t_k)} \bigl[f_{t_k} - M_{l_{t_k}} f_{t_k-1} - Q_{l_{t_k}} g\bigr]^T [Q_{l_{t_k}}]^{-1} \bigl[f_{t_k} - M_{l_{t_k}} f_{t_k-1} - Q_{l_{t_k}} g\bigr]\Bigr].$$
It can be shown that there exists a finite number $b$ such that
$$\int \inf_{\|f_{t_k-1}\|_2 \le a} \pi_{T(t_k)}(f_{t_k} \mid f_{t_k-1}, l_{t_k}, g)\, df_{t_k} \;\ge\; \frac{1}{\bigl[(2\pi)^d\, T(t_k)\, |Q_{l_{t_k}}|\bigr]^{1/2}} \int_{\|f_{t_k}\|_2 < b} \exp\Bigl[-\tfrac{1}{2T(t_k)} f_{t_k}^{\,T} [Q_{l_{t_k}}]^{-1} f_{t_k}\Bigr]\, df_{t_k}.$$
Then, from Lemma A.2, we can establish the following inequality:
$$\frac{1}{\bigl[(2\pi)^d\, T(t_k)\, |Q_{l_{t_k}}|\bigr]^{1/2}} \int_{\|f_{t_k}\|_2 < b} \exp\Bigl[-\tfrac{1}{2T(t_k)} f_{t_k}^{\,T} [Q_{l_{t_k}}]^{-1} f_{t_k}\Bigr]\, df_{t_k} \;\ge\; q\, \Bigl(\frac{b}{\sqrt{T(t_k)\,\lambda_d}}\Bigr)^{d-2} \exp\Bigl[-\frac{b^2}{2T(t_k)\,\lambda_d}\Bigr],$$
where $\lambda_d$ is the smallest eigenvalue of $Q_{l_{t_k}}$ and $b$ is a positive constant. Obviously $\lambda_d$ and $b$ depend on $l_{t_k}$. Denoting by $\lambda_d^{*}$ and $b^{*}$ the values such that
$$\Bigl(\frac{b^{*}}{\sqrt{T(t_k)\,\lambda_d^{*}}}\Bigr)^{d-2} \exp\Bigl[-\frac{b^{*2}}{2T(t_k)\,\lambda_d^{*}}\Bigr] = \inf_{l_{t_k}} \Bigl(\frac{b}{\sqrt{T(t_k)\,\lambda_d}}\Bigr)^{d-2} \exp\Bigl[-\frac{b^2}{2T(t_k)\,\lambda_d}\Bigr],$$
we have
$$\|h_{t_k}\|_1 \ge C_2 \exp\Bigl[-\frac{C_1}{T(t_k)}\Bigr]\, q\, \Bigl(\frac{b^{*}}{\sqrt{T(t_k)\,\lambda_d^{*}}}\Bigr)^{d-2} \exp\Bigl[-\frac{b^{*2}}{2T(t_k)\,\lambda_d^{*}}\Bigr].$$
Then, if we take $T(t_k) = C_T/\log(1 + k)$, we have
$$\|h_{t_k}\|_1 \ge \frac{C^{*}\, (\log(1+k))^{(d-2)/2}}{1 + k},$$
where
$$C^{*} = C_2\, q\, \Bigl(\frac{b^{*}}{\sqrt{C_T\,\lambda_d^{*}}}\Bigr)^{d-2} \qquad \text{and} \qquad C_T = C_1 + \frac{b^{*2}}{2\lambda_d^{*}},$$
and so
$$\sum_{m=1}^{\infty} \|h_{m n_0}\|_1 = \infty.$$
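The divergence of this last series can be seen, for instance, by comparison with the harmonic series: since $d \ge 2$, the logarithmic factor is at least 1 for all sufficiently large $m$, and $1 + m n_0 \le (n_0 + 1)m$ for $m \ge 1$, so

$$\sum_{m=1}^{\infty} \bigl\|h_{m n_0}\bigr\|_1 \;\ge\; C^{*} \sum_{m \ge 2} \frac{\bigl(\log(1 + m n_0)\bigr)^{(d-2)/2}}{1 + m n_0} \;\ge\; \frac{C^{*}}{n_0 + 1} \sum_{m \ge 2} \frac{1}{m} \;=\; \infty.$$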
References

[1] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1984) 721–742.
[2] F.C. Jeng, Compound Gauss–Markov random fields for image estimation and restoration, Ph.D. Thesis, Rensselaer Polytechnic Institute, 1988.
[3] F.C. Jeng, J.W. Woods, Simulated annealing in compound Gaussian random fields, IEEE Trans. Inform. Theory 36 (1988) 94–107.
[4] F.C. Jeng, J.W. Woods, Compound Gauss–Markov models for image processing, in: A.K. Katsaggelos (Ed.), Digital Image Restoration, Springer Series in Information Science, vol. 23, Springer, Berlin, 1991.
[5] R. Chellapa, T. Simchony, Z. Lichtenstein, Image estimation using 2D noncausal Gauss–Markov random field models, in: A.K. Katsaggelos (Ed.), Digital Image Restoration, Springer Series in Information Science, vol. 23, Springer, Berlin, 1991.
[6] J. Besag, On the statistical analysis of dirty pictures, J. Roy. Statist. Soc. Ser. B 48 (1986) 259–302.
[7] A. Blake, A. Zisserman, Visual Reconstruction, MIT Press, Cambridge, 1987.
[8] R. Molina, B.D. Ripley, Using spatial models as priors in astronomical image analysis, J. Appl. Statist. 16 (1989) 193–206.
[9] P.J. Green, Bayesian reconstruction from emission tomography data using a modified EM algorithm, IEEE Trans. Med. Imaging 9 (1990) 84–92.
[10] C. Bouman, K. Sauer, A generalized Gaussian image model for edge-preserving MAP estimation, IEEE Trans. Image Process. 2 (1993) 296–230.
[11] R.R. Schultz, R.L. Stevenson, Stochastic modeling and estimation of multispectral image data, IEEE Trans. Image Process. 4 (1995) 1109–1119.
[12] K. Lange, Convergence of EM image reconstruction algorithms with Gibbs smoothing, IEEE Trans. Image Process. 4 (1990) 439–446.
[13] P. Perona, J. Malik, Scale-space and edge detection using anisotropic diffusion, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 629–639.
[14] Y.L. You, W. Xu, A. Tannenbaum, M. Kaveh, Behavioral analysis of anisotropic diffusion in image processing, IEEE Trans. Image Process. 5 (1996) 1539–1553.
[15] F. Catte, P.L. Lions, J.M. Morel, T. Coll, Image selective smoothing and edge detection by nonlinear diffusion, SIAM J. Numer. Anal. 29 (1992) 182–198.
[16] P. Saint-Marc, J.S. Chen, G. Medioni, Adaptive smoothing: a general tool for early vision, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 514–529.
[17] S. Geman, D.E. McClure, Bayesian image analysis: an application to single photon emission tomography, Proceedings of the Stat. Comput. Sect., Amer. Stat. Assoc., Washington, DC, 1985, pp. 12–18.
[18] T. Hebert, R. Leahy, A generalized EM algorithm for 3-D Bayesian reconstruction from Poisson data using Gibbs priors, IEEE Trans. Med. Imaging 8 (1989) 194–202.
[19] P. Charbonnier, L. Blanc-Féraud, G. Aubert, M. Barlaud, Deterministic edge-preserving regularization in computed imaging, IEEE Trans. Image Process. 5 (1997) 298–311.
[20] B.D. Ripley, Spatial Statistics, Wiley, New York, 1981.
[21] R. Molina, A.K. Katsaggelos, J. Mateos, J. Abad, Restoration of severely blurred high range images using compound models, Proceedings of ICIP-96, vol. 2, 1996, pp. 469–472.
[22] L. Younes, Synchronous random fields and image restoration, CMLA, École Normale Supérieure de Cachan, Technical Report, 1998.
About the Author—RAFAEL MOLINA was born in 1957 and received his degree in Mathematics (Statistics) in 1979. He completed his Ph.D. Thesis in 1984 on Optimal Design in Linear Models and became Associate Professor in Computer Science and Artificial Intelligence at the University of Granada in 1989. His areas of research interest are image restoration (applications to astronomy and medicine), parameter estimation, image compression, and blind deconvolution. He is a member of the IEEE, SPIE, the Royal Statistical Society and AERFAI (Asociación Española de Reconocimiento de Formas y Análisis de Imágenes), and currently the Dean of the Computer Engineering Faculty at the University of Granada.

About the Author—AGGELOS K. KATSAGGELOS received the Diploma degree in electrical and mechanical engineering from the Aristotelian University of Thessaloniki, Thessaloniki, Greece, in 1979 and the M.S. and Ph.D. degrees, both in electrical engineering, from the Georgia Institute of Technology, Atlanta, Georgia, in 1981 and 1985, respectively. In 1985 he joined the Department of Electrical Engineering and Computer Science at Northwestern University, Evanston, IL, where he is currently professor, holding the Ameritech Chair of Information Technology. During the 1986–1987 academic year he was an assistant professor at Polytechnic University, Department of Electrical Engineering and Computer Science, Brooklyn, NY. His current research interests include image and video recovery, video compression, motion estimation, boundary encoding, computational vision, and multimedia signal processing. Dr. Katsaggelos is a Fellow of the IEEE, an Ameritech Fellow, a member of the Associate Staff, Department of Medicine, at Evanston Hospital, and a member of SPIE. He is a member of the Steering Committee of the IEEE Transactions on Medical Imaging, the IEEE Technical Committees on Visual Signal Processing and Communications, Image and Multi-Dimensional Signal Processing, and Multimedia Signal Processing, and the editor-in-chief of the IEEE Signal Processing Magazine. He has served as an Associate Editor for the IEEE Transactions on Signal Processing (1990–1992), an area editor for the journal Graphical Models and Image Processing (1992–1995), and a member of the Steering Committee of the IEEE Transactions on Image Processing (1992–1997). He is the editor of Digital Image Restoration (Springer-Verlag, Heidelberg, 1991), co-author of Rate-Distortion Based Video Compression (Kluwer Academic Publishers, 1997), and co-editor of Recovery Techniques for Image and Video Compression (Kluwer Academic Publishers, 1998). He has served as the General Chairman of the 1994 Visual Communications and Image Processing Conference (Chicago, IL), and as technical program co-chair of the 1998 IEEE International Conference on Image Processing (Chicago, IL).

About the Author—JAVIER MATEOS was born in Granada (Spain) in 1968. He received his Diploma and M.S. degrees in Computer Science from the University of Granada in 1990 and 1991, respectively. He completed his Ph.D. in Computer Science at the University of Granada in July 1998. Since September 1992, he has been Assistant Professor at the Department of Computer Science and Artificial Intelligence of the University of Granada. His research interests include image restoration and image and video recovery and compression. He is a member of AERFAI (Asociación Española de Reconocimiento de Formas y Análisis de Imágenes).

About the Author—AURORA HERMOSO was born in 1957 and received her degree in Mathematics (Statistics) in 1979. She completed her Ph.D. in 1984 on Likelihood Tests in Log-Normal Processes and became Associate Professor in Statistics at the University of Granada in 1986. Her area of research interest is estimation in stochastic systems. She is a member of SEIO (Sociedad Española de Estadística e Investigación Operativa).

About the Author—ANDREW SEGALL is a Ph.D. student at Northwestern University, Evanston, Illinois, and a member of the Image and Video Processing Laboratory. His research interests are in image processing and include scale-space theory, nonlinear filtering and target tracking.
Pattern Recognition 33 (2000) 573–586
Noniterative manipulation of discrete energy-based models for image analysis
Patrick Pérez*, Annabelle Chardin, Jean-Marc Laferté
IRISA/INRIA, Campus de Beaulieu, F-35042 Rennes Cedex, France
Received 15 March 1999
Abstract

With emphasis on the graph structure of energy-based models devoted to image analysis, we investigate efficient procedures for sampling and inferring. We show that triangulated graphs, of which trees are simple instances, always support causal models for which noniterative procedures can be devised to minimize the energy, to extract probabilistic descriptions, to sample from the corresponding prior and posterior distributions, or to infer from local marginals. The relevance and efficiency of these procedures are illustrated for classification problems. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Energy-based models; Independence graph; Causality; Triangulated graphs; Trees; Noniterative procedures
1. Introduction and background

Many issues of image analysis can be modeled and coped with by designing an energy function $U(x, y)$ which captures the interaction between the unknown variables $x = (x_i)_i$ to be estimated and the observed variables (the measurements or data) $y = (y_j)_j$. A standard, although in general complex, problem is then the minimization of this function with respect to $x$, $y$ being known. Other intricate issues, such as the estimation of internal parameters or the validation of selected models, also arise within this framework. Depending on the number of variables, the nature of the single-variable state space (discrete or not), and the properties of the function (convex or not, local or not), various situations with specific types of difficulty occur. The number of variables, which may be extremely large in usual image analysis problems, remains however a generic source of concern.
* Corresponding author. Tel.: +33-299-847273; fax: +33-299-847171.
E-mail addresses: [email protected] (P. Pérez), [email protected] (A. Chardin).
Such energetic modeling is encountered in various fields (e.g., statistical physics, multivariate statistics, combinatorial optimization, artificial intelligence). We are here interested in its use in different approaches to image processing problems, such as Markov random field (MRF)-based approaches [1,2] and partial differential equation (PDE)-based approaches [3].1 In the former class of approaches, $x$ and $y$ are random vectors and the energy function is naturally related to the joint distribution2 through $P(x, y) \propto \exp\{-U(x, y)\}$. In the latter, $U$ is a functional of continuous functions $x$ and $y$ which are discretized afterward (e.g., within the function minimization process). In the following we shall only refer to energy functions of a finite number of variables,
1 Many aspects of energetic modeling do not need to be related to any probabilistic framework, as in PDE-based approaches, although we think it is a good thing to do. Thus, in order to remain quite general, we have tried in this paper to emphasize, when possible, the non-probabilistic aspect of the issues of interest. However, certain issues are intrinsically probabilistic, such as drawing samples for instance.
2 We adopt the convention that all probability masses will be denoted by $P(\cdot)$. We shall refer to them simply as "distributions".
while keeping in mind the strong (and sometimes straightforward) connection to continuous energetic models. We shall also assume that the $x_i$'s take values in a finite set $\Lambda$.

The first critical step toward energetic modeling obviously relies on the choice of the energy form. A tailor-made parameterized class of functions is generally chosen. An important ingredient of the functions usually used is their decomposition as a sum of simple interaction potentials depending on just a few variables. Thus, by specifying very simple local interactions (possibly nonlinear and involving variables of different natures) which sum up as an energy function of all variables and parameters, one can define a global model. This local/global duality is behind the flexibility and power of these modeling approaches. With such a setting, each variable only directly interacts with a few other "neighboring" variables. From a more global point of view, all variables are mutually dependent, but only through the combination of successive local interactions. This key notion of local functional dependencies is naturally captured by defining an independence graph associated to $U$. It is an undirected graph in which $i$ and $j$ are neighbors if $x_i$ and $x_j$ appear within a same local component of the chosen energy decomposition. This graph structure turns out to be a powerful tool to account for important local and global structural properties of the model. As we shall see, in some specific cases it suffices to deduce causality properties, thus allowing the design of efficient estimation algorithms. This paper is particularly dedicated to the exploration of this type of situation in the case of discrete models (i.e., the $x_i$'s take values in a finite set $\Lambda$).

After the specification of an energetic model, one deals with its actual use for modeling a class of problems and for solving them. At that point, three main general issues may be of interest:
1. Sampling: in order to evaluate the statistical properties of the specified energy-based model, one might want to draw samples from the prior and posterior distributions ($P(x)$ and $P(x \mid y)$, respectively) associated to the energy function. It is then a purely probabilistic issue;
2. Inferring: one of the primary goals in early vision problems is to infer the "best" estimate of $x$ given $y$, with respect to a criterion to be devised;
3. Learning: one also has to tune the parameters involved in the definition of $U$. The estimation of the optimal parameter vector is tricky since the whole energy landscape depends on it. Apart from manual tuning, consistent estimation procedures (e.g., EM-type algorithms [4,5]) exist, but they remain extremely heavy, if practicable.

In general, there is no way to directly draw samples from the prior and posterior distributions. Among other
problems, these distributions are known only up to proportionality constants (the partition functions) whose computation (by summing exponentials over all possible values of $x$) is not tractable in general. One has to use iterative Markov chain Monte Carlo (MCMC) methods (in which these constants do not appear) to get samples from distributions converging toward the target distribution. This Monte Carlo framework also allows one to compute approximations of partition functions or of any other expression involving sums over a very large set of configurations (like posterior marginals or posterior expectations). However, the overall procedure is computationally demanding due to slow convergence.

As for the estimation of $x$ in the case of a discrete model, there exist two standard estimators stemming from Bayesian estimation theory. The maximum a posteriori (MAP) estimator, which is the most widely used, takes as estimate the most probable $x$ given $y$: $\hat x = \arg\max_x P(x \mid y) = \arg\min_x U(x, y)$. It corresponds to the global minimizer of the energy function, and its estimation can therefore be seen as a non-probabilistic problem, the energy simply being a "cost" function to be minimized. The second estimator is intrinsically probabilistic: it is the so-called MPM (marginal posterior modes) estimator, which defines the site-wise estimate as the most probable $x_i$ given $y$: $\forall i$, $\hat x_i = \arg\max_{x_i} P(x_i \mid y)$.

The global minimization necessary to get the MAP estimate is not possible in general. Various iterative algorithms can be devised to cope with the problem, but in general they only provide approximate estimates (i.e., "local" energy minima). As for MPM estimates, they rely on the computation of the posterior marginals, which is not tractable in general. The aforementioned MCMC iterative techniques can provide us with approximations of these marginals. The problem of parameter estimation is even more complicated. Standard Expectation–Maximization (EM)-type iterative methods require the knowledge of the prior partition function, as well as of local posterior expectations relative to the current parameter fit [4,5]. Both ingredients are out of reach in general, and (Monte Carlo) approximations are once again necessary [6].

It turns out that for most energy-based models suitable for image analysis problems, one has to devise deterministic or stochastic iterative algorithms exploiting the locality of the model. While permitting tractable single-step computations, the locality results in a very slow propagation of information. As a consequence, these iterative procedures may converge very slowly. This is particularly unbearable for stochastic (sampling or minimization) algorithms. This motivates the search for specific models allowing noniterative or efficient handling of the different issues listed above. In this spirit, probabilistic causal models have already been thoroughly studied [7–15]. The class of causal autoregressive fields, unilateral MRFs, mesh MRFs, and
mutually compatible MRFs on bidimensional lattices has thus been introduced. As we shall recall later, these models rely on a probabilistic causality concept captured by the factorization of $P(x)$ in terms of causal transition kernels. We examine here causality from a more graphical point of view, in order to identify causal models at first sight, based on simple characteristics of the independence graph of the model. We then explain how noniterative two-sweep algorithms can be devised on these nice graph structures, whose simplest instances are trees. In particular, we present two algorithms for the exact computation of MAP and MPM estimates. They are respectively related to the Viterbi algorithm [16] and the Baum algorithm [17], which both stem from hidden Markov (chain) models (HMMs) [18]. On trees, the general setting we intend to develop here is very much related to the discrete classification models of Bouman et al. [19] and of Laferté et al. [20,21]. It is also formally similar to the Gaussian models on trees designed by Chou et al. in seminal papers [22,23], which have been applied to various image processing problems (optical flow estimation [24], texture analysis [25], remote sensing [26,27]).

The paper is organized as follows. In Section 2, we define the independence graph structure which can be associated to any energy-based interacting model, and show how the graph is transformed by three basic mechanisms: freezing of variables (i.e., partial conditioning), energy minimization with respect to some variables (i.e., conditional MAP estimation), and summing over all possible values of some variables (i.e., marginal computation). In Section 3, standard causality is first defined as a probabilistic concept which can be functionally characterized, and is then examined from an alternate graph-theoretic point of view. It is shown that certain constraints on the independence graph at once ensure that noniterative computations are reachable. In Section 4, we detail such efficient computations on trees. The relevance and efficiency of the different procedures are then illustrated in Section 5 for classification tasks.
2. Independence graph and Markovian properties

In this section and the following one, we choose a general notational setting in which the variables of interest (observed or not) are gathered into a vector $z = (z_i)_{i=1}^n$ associated with an energy function $U(z)$.

2.1. Definition and properties

As we said in the introduction, an important characteristic of the energy function is its usual decomposition as
a sum of local terms:
$$U(z) = \sum_{c \in C} v_c(z_c), \qquad (1)$$
where the elements of $C$ are "small" subsets of indices (usually one or two), and the interaction potential $v_c$ only depends on $z_c = (z_i)_{i \in c}$. Equivalently, the joint distribution of $z$ factorizes into a product of positive factor potentials:
$$P(z) \propto \prod_c g_c(z_c), \qquad (2)$$
where $g_c = \exp\{-v_c\}$. The interaction structure thus introduced is conveniently captured by a graph [2,28]:

Definition. The independence graph associated to the energy decomposition $U(z) = \sum_c v_c(z_c)$ is the simple undirected graph $G = [S, E]$ with vertex set $S = \{1, \ldots, n\}$ and edge set $E$ defined as $\{i, j\} \in E \Leftrightarrow \exists c \in C : \{i, j\} \subset c$.

As a consequence of this definition, the elements of $C$ are cliques of $G$ (i.e., subsets on which $G$ generates complete subgraphs). In the following, we will always assume that the energy function is such that its independence graph is connected. This graph structure is equivalently characterized by its neighborhood system $N = \{n(i)\}_i$ defined as: $i \in n(j) \Leftrightarrow j \in n(i) \Leftrightarrow \{i, j\} \in E$. The independence graph will be equivalently denoted $G = [S, N]$. The vertex set $n(i)$ contains the neighbors of $i$ in $G$. For practical convenience, the neighborhoods must be small, i.e., $G$ should be of reduced degree.

Since a same joint distribution can obviously be defined by different energy functions [1,28], and a same energy function can be decomposed in a number of different ways, different independence graphs can be assigned to $P(z)$. However, a unique minimal independence graph can be defined for this distribution [29]: the neighborhood of $i$ in this graph is the intersection of the neighborhoods of $i$ in all possible independence graphs of the joint distribution $P(z)$. As a consequence, the key probabilistic information conveyed by an independence graph $G$ about two vertices relies on the absence of an edge between them: this absence will remain in the minimal graph. One can easily show that in this case the random variables $z_i$ and $z_j$ are independent given all the remaining variables:
$$\{i, j\} \notin E \;\Rightarrow\; P(z_i, z_j \mid z_{S - \{i,j\}}) = P(z_i \mid z_{S - \{i,j\}}) \times P(z_j \mid z_{S - \{i,j\}}). \qquad (3)$$
For the minimal independence graph, the implication is replaced by an equivalence. This probabilistic statement constitutes the pairwise Markov property. To prove it, it suffices to note that the distribution of $z$ factorizes into a product of two functions, one of which does not depend on $z_i$, while the other does not depend on $z_j$. More generally, one can prove the following global Markov
property [30–32]: if a vertex subset $a$ separates two other disjoint subsets $b$ and $d$ in $G$ (i.e., all chains from $i \in b$ to $j \in d$ intersect $a$), then the random vectors $z_b$ and $z_d$ are independent given $z_a$: $P(z_b, z_d \mid z_a) = P(z_b \mid z_a) \times P(z_d \mid z_a)$ and $P(z_b \mid z_a, z_d) = P(z_b \mid z_a)$. The particular case where $b = \{i\}$ and $a = n(i)$ constitutes the local Markov property, according to which $P(z_i \mid z_j, j \ne i) = P(z_i \mid z_{n(i)}) \propto \prod_{c \ni i} g_c(z_c)$.

2.2. Graphical mechanisms

When handling an energy-based model, three mechanisms are extensively used: (i) freezing variables: a set of variables is fixed in a given state, either definitively (e.g., the observations) or momentarily for practical convenience (e.g., in case of alternate sampling); the frozen variables then become like parameters of the energy function; (ii) summing out: to compute probabilistic quantities or distributions, one has to sum $\exp\{-U(z)\}$ over all possible values of one or several variables, which then "disappear" from the model; (iii) maximizing out: when dealing with MAP estimation, the global maximization of $\exp\{-U\}$ is often performed through coordinate-wise maximizations (i.e., w.r.t. a few variables at a time). We now examine the structural transformations (if any) generated on independence graphs by these basic mechanisms.

Freezing. From the independence graph definition, it is straightforward to see that the subset of variables $z_a$ (with $a \subset S$), with energy function deduced from $U$ by freezing the other variables $z_{\bar a}$ in a given state (with $\bar a = S - a$), exhibits the subgraph generated by $G$ on $a$ as an independence graph. From the probabilistic point of view, it is a matter of conditioning: this means that an independence graph of $z_a$ given $z_{\bar a}$ is simply obtained from $G$ by deleting the edges with at least one endpoint in $\bar a$ (see Fig. 1a).

Summing and maximizing. It has been shown in [29] that the marginal $P(z_a)$, for some subset $a \subset S$, has an independence graph in which two sites are neighbors if they are neighbors in $G$, or if they belong to the neighborhood of a same connected component of $\bar a = S - a$. This results from the summation of $P(z) = P(z_a, z_{\bar a})$ with respect to $z_{\bar a}$, which provides the marginal distribution. It turns out that the same graphical property holds in the case of maximization of $P(z_a, z_{\bar a})$ with respect to $z_{\bar a}$. Let us briefly sketch the similar proofs of these two properties. Let $\{\bar a_k\}_k$ be the connected components of $\bar a$ in $G$. The neighborhood $n(\bar a_k) = \{i \in S - \bar a_k : n(i) \cap \bar a_k \ne \emptyset\}$ of $\bar a_k$ belongs to $a$ and separates $\bar a_k$ from the rest. Consequently, $P(z)$ factorizes into:
$$P(z) \propto g_a(z_a) \prod_k g_k(z_{\bar a_k}, z_{n(\bar a_k)}). \qquad (4)$$
It follows that:
$$\sum_{z_{\bar a}} P(z) \propto g_a(z_a) \times \prod_k \underbrace{\sum_{z_{\bar a_k}} g_k(z_{\bar a_k}, z_{n(\bar a_k)})}_{G_k(z_{n(\bar a_k)})}, \qquad (5)$$
$$\max_{z_{\bar a}} P(z) \propto g_a(z_a) \times \prod_k \underbrace{\max_{z_{\bar a_k}} g_k(z_{\bar a_k}, z_{n(\bar a_k)})}_{\bar G_k(z_{n(\bar a_k)})}. \qquad (6)$$
This means that, in both cases, the components of each $z_{n(\bar a_k)}$ become in general mutually dependent through the function $G_k$ or $\bar G_k$. In the case where $\bar a_k$ reduces to a single site ($\bar a_k = \{i\}$), the neighbors of $i$ become mutual neighbors through summation or maximization of the joint distribution w.r.t. $z_i$ (see Fig. 1b).

This of course remains a graphical viewpoint. Depending on the analytical form of the original distribution (even for a fixed graph structure), simplifications may occur either in the $G_k$'s or in the $\bar G_k$'s (factorization, or actual dependence of these functions on fewer variables), thus reducing the actual number of appearing edges (if any). Such simplifications occur with causal models, as we shall see. However, it is very unlikely that simplifications simultaneously occur in both the $G_k$'s and the $\bar G_k$'s.

Within the estimation issues with which this paper is concerned, the energy function is generally of the following form (or can be rewritten that way): $U(x, y) = \sum_c v_c(x_c) + \sum_i l_i(y_i, x_i)$. This corresponds to pointwise measurements, i.e., the components of $x$ and $y$ are in one-to-one correspondence within the independence graph of $(x, y)$ and, with a mild abuse of notation, they are indexed identically, even though associated to different vertices of the joint graph (Fig. 2a). From the above description of graphical mechanisms, one concludes that the a priori distribution $P(x)$ and the a posteriori distribution $P(x \mid y)$ have a common independence graph, namely the one deduced from the energy term $\sum_c v_c(x_c)$. As for the exact form of the prior distribution, it is associated to this energy term (i.e., $P(x) \propto \exp\{-\sum_c v_c(x_c)\}$) only if, for all $i$, $\sum_{y_i} \exp\{-l_i(y_i, x_i)\}$ is a constant.
Fig. 1. Graphical consequence of (a) freezing the variables on v sites and (b) summing or maximizing out the variables on v sites.
Fig. 2. (a) Example of $(x, y)$ independence graph for pointwise measurements, with the $x_i$'s on the square sites and the $y_i$'s on the round sites; (b) the independence graph of $x$ and $x \mid y$ is obtained by removing the round sites and the edges connected to them.
It is always possible to make this assumption hold. In this case $\sum_c v_c$ constitutes the so-called prior energy.
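To make the summing/maximizing-out mechanism of Section 2.2 concrete, the following small sketch (an illustration added here, not from the paper; data structures and names are assumptions) computes the independence graph induced on the remaining vertices when a subset of variables is summed or maximized out: the neighborhood of each connected component of the removed set becomes a clique.

```python
# Sketch: induced independence graph after summing or maximizing out a
# subset of variables, following the rule of Section 2.2 (illustrative only).
from collections import deque

def eliminate(graph, removed):
    """graph: dict {vertex: set(neighbours)}; removed: the set a-bar."""
    kept = set(graph) - set(removed)
    new = {i: {n for n in graph[i] if n in kept} for i in kept}
    seen = set()
    for r in removed:
        if r in seen:
            continue
        # connected component of 'removed' containing r
        comp, queue = set(), deque([r])
        while queue:
            u = queue.popleft()
            if u in comp:
                continue
            comp.add(u)
            queue.extend(v for v in graph[u] if v in removed)
        seen |= comp
        # its neighbourhood inside the kept vertices becomes a clique
        boundary = {v for u in comp for v in graph[u] if v in kept}
        for u in boundary:
            new[u] |= boundary - {u}
    return new

# usage: a 4-cycle 1-2-3-4; eliminating vertex 3 links its neighbours 2 and 4
g = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
print(eliminate(g, {3}))
```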
3. Causality and graphs

The concept of causality relies on an ordering of the sites, and expresses that the conditional distribution of a component $z_i$ given its "past" reduces to the conditional distribution given a "small" neighborhood in the past. To make this precise, one has to introduce a total ordering according to which the variables $z_i$'s are now (re)indexed from 1 to $n$, and to seek the following property [30,31,33]:
$$\forall i > 1, \quad P(z_i \mid z_{i-1}, \ldots, z_1) = P(z_i \mid z_{\tilde n(i)}), \qquad (7)$$
where $\tilde n(i)$ is a small subset of $i$'s past $pa(i) = \{1, \ldots, i-1\}$.3 If (7) holds, the distribution of $(z_1, \ldots, z_k)$, for any $k$, takes the factorized form:
$$P(z_1, \ldots, z_k) = \prod_{i=1}^{k} P(z_i \mid z_{\tilde n(i)}), \qquad (8)$$
where, for notational convenience, we let $\tilde n(1) = \emptyset$, which means that $z_{\tilde n(1)}$ has to be ignored. There are no unknown normalizing constants within the joint distribution (8), and a noniterative forward recursive sampling of this Markov chain-type distribution can easily be performed.

Given the nice properties offered by causality, it is worth addressing the following issue: the set of sites $S$ being ordered and $U(z) = \sum_c v_c(z_c)$ being an energy function with independence graph $G = [S, N]$, could the random vector $z$ with distribution $P(z) \propto \exp\{-U(z)\} \propto \prod_c g_c(z_c)$ be causal with small past neighborhoods? It is not the case in general. As explained in Section 2, for any node $i$, an independence graph $G_i$ for the marginal distribution

3 The "past" neighborhood system $\tilde N = \{\tilde n(i)\}_i$ defines an oriented independence graph $\tilde G = [S, \tilde E]$ on $S$ according to: $(i, j) \in \tilde E \Leftrightarrow i \in \tilde n(j)$. See [30,31,33] for more material about directed independence graphs and their semantics.
$P(z_1, \ldots, z_i)$ can easily be derived from $G$. In this graph, the neighborhood of $i$ is composed of $n(i) \cap pa(i)$ and of all sites of $pa(i)$ that are connected to $i$ in $G$ through $\{i+1, \ldots, n\}$. This neighborhood can be far larger than $n(i)$ (e.g., if $G$ is an $M \times N$ grid, lexicographically ordered and equipped with the first-order neighborhood system, the previous graphical technique predicts a neighborhood of $M - 1$ sites for $i$ in the marginal $P(z_1, \ldots, z_i)$, in case site $i$ is away from the border). In this case, independence relation (7) a priori holds only for a large set $\tilde n(i)$ of predecessors, which makes the causal representation hardly useful.

By successively considering the marginals of the vectors $z_{pa(n)}, z_{pa(n-1)}, \ldots, z_1$, one can however establish two cases where the causal representation turns out to be at least as local as the original non-causal one. Before we detail them, note that $\prod_c g_c$ can be rearranged as $\prod_i g_i$, where $g_i$ is the product of the $g_c$'s for all $c$ containing $i$ and no further site: $g_i = \prod_{c:\, \max c = i} g_c$. Then $g_i$ depends only on $z_i$ and on the variables attached to the subset $\bigcup_{c:\, \max c = i} c - \{i\}$ of $n(i) \cap pa(i)$. Also, by convention, $g_i \equiv 1$ if no clique $c$ of $C$ verifies $\max c = i$.

3.1. Functional characterization

The marginal independence graph $G_i$ previously considered is a sort of "upper bound". However, in certain cases, depending on the expression of the factors under concern, simplifications might occur in the marginal $P(z_1, \ldots, z_i)$, leading to a simpler independence graph. If
$$\forall i, \quad \sum_{z_i} \prod_{c:\, \max c = i} g_c(z_c) \equiv k_i \ (\text{constant}), \qquad (9)$$
it is easy to show, by successive marginalizations, that $P(z_1, \ldots, z_i) \propto \prod_{j \le i} g_j$, with $\tilde n(i) \subset n(i) \cap pa(i)$ as the in-past neighborhood of $i$, and:
$$P(z_i \mid z_{pa(i)}) = P(z_i \mid z_{\tilde n(i)}) = \frac{\prod_{c:\, \max c = i} g_c(z_c)}{k_i}.$$
The model then verifies Eq. (7) with past neighborhoods $\tilde n(i) \subset n(i) \cap pa(i)$, which are small since the $n(i)$'s are. This way of introducing causality is at the heart of the various bidimensional causal representations [7–15,34]. As we already said, this causal probabilistic decomposition allows one to recursively draw samples from $P(z)$, starting from node 1, and all marginals can be exactly computed. However, when Eq. (9) holds for the prior model ($z \equiv x$) (and therefore for the joint model $z \equiv (x, y)$ in case of pointwise measurements), it does not hold in general for the posterior model ($z \equiv x \mid y$), although the prior and posterior independence graphs are the same! This is particularly harmful since the posterior model is at the heart of inference and sampling procedures (at least in inverse problems).
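The forward recursive sampling permitted by a factorization like (8) can be sketched as follows (an added illustration; the data layout and names are assumptions, not the paper's code): each site is visited in the causal order and its label is drawn from the tabulated transition kernel given the already-sampled past neighborhood.

```python
# Sketch: ancestral (forward recursive) sampling from a causal factorization
# P(z) = prod_i P(z_i | z_{ntilde(i)}), assuming tabulated kernels.
import random

def ancestral_sample(n, past, kernel, labels):
    """past[i]   : tuple ntilde(i) of already-visited sites (empty for i == 1)
       kernel    : kernel(i, z_past) -> dict {label: probability}
       labels    : the finite label set Lambda"""
    z = {}
    for i in range(1, n + 1):
        z_past = tuple(z[j] for j in past[i])
        probs = kernel(i, z_past)
        u, acc = random.random(), 0.0
        for lab in labels:
            acc += probs[lab]
            if u <= acc:
                z[i] = lab
                break
    return z
```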
3.2. Graphical characterization

Graphical considerations allow us to point out an important class of interaction models for which the same conclusion (i.e., causality relative to some in-past parts of the original non-causal neighborhoods) systematically holds, whatever the actual factors are. To make the notations general and simpler to handle in the following, we now write $g_i(z_i, z_{n(i) \cap pa(i)})$ even though some of the components of $z_{n(i) \cap pa(i)}$ might be absent from the arguments of the function (i.e., if $\bigcup_{c:\, \max c = i} c - \{i\} \subsetneq n(i) \cap pa(i)$).

Return to the successive marginalizations from $z_n$ to $z_1$. As explained in Section 2, the summation of $\prod_i g_i$ w.r.t. $z_n$ makes all sites of $n(n) \cap pa(n) = n(n)$ mutual neighbors through the function $G_n(z_{n(n)}) = \sum_{z_n} g_n(z_n, z_{n(n)})$. A particular situation in which this structural change has no incidence is when $n(n)$ is already a clique. As a consequence, the random vector $z_{pa(n)}$ exhibits the subgraph generated by $G$ on $pa(n)$ as an independence graph. Its joint distribution is proportional to $\prod_{i < n} g_i(z_i, z_{n(i) \cap pa(i)})\, G_n(z_{n(n)})$.4 When each in-past neighborhood $n(i) \cap pa(i)$ is a clique of $G$, the argument can be repeated at each step of the marginalization, the functions involved being defined recursively as
$$G_i(z_{\tilde n(i)}) = \begin{cases} \sum_{z_i} g_i(z_i, z_{\tilde n(i)}) & \text{if } \underline{i} = \emptyset, \\[2pt] \sum_{z_i} g_i(z_i, z_{\tilde n(i)}) \prod_{j \in \underline{i}} G_j(z_{\tilde n(j)}) & \text{otherwise}, \end{cases} \qquad (11)$$
where $\underline{i} = \{j \in S : \max \tilde n(j) = i\}$. One can show that Eq. (10), along with the connectedness of $G$, ensures that $\tilde n(i) \ne \emptyset$, $\forall i > 1$ (therefore $\bar n = \max \tilde n(i)$ exists). In this case, it clearly turns out that one deals with a "leaves-to-root" recursion on a tree structure $T_G$ defined as follows: $\forall i > 1$, its parent is $\bar n$ and its child set is $\underline{i}$. The root is vertex 1. The nodes for which $\underline{i} = \emptyset$ (the first to be considered) are the "leaves" (Fig. 3e). The relevant recursive structure for defining algorithms is no longer the ordering of sites, but the underlying tree structure which can be defined if (10) holds. The root prior distribution and the transition kernels are then obtained:
$$P(z_1) = \frac{g_1(z_1)}{G_1} \prod_{j \in \underline{1}} G_j(z_1), \qquad P(z_i \mid z_{\tilde n(i)}) = \frac{g_i(z_i, z_{\tilde n(i)})}{G_i(z_{\tilde n(i)})} \prod_{j \in \underline{i}} G_j(z_{\tilde n(j)}),$$
from which the exact joint distribution is deduced. From Section 2, we know that the completeness of the $n(i) \cap pa(i)$'s ensures that the maximization counterpart of these derivations exists as well. It yields a noniterative way to find energy minimizers which generalizes the chain-based Viterbi minimization algorithm [16]. Also, the upward recursive definition (11) holds both for the prior model ($z \equiv x$) and for the posterior model ($z \equiv x \mid y$).5 In the latter case, it allows one either to compute (and sample from) the posterior distribution, or to compute (and maximize) the local posterior marginals, within a single downward sweep.

The functional characterization of causality is the most general, but it necessitates the prior definition of a site ordering and of all the transition probabilities (up to multiplicative constants). It therefore relies more on the form of the potentials than on structural information. In particular, causality of this type for the prior model is not inherited by the posterior model, due to the modification of the energy by the data-based terms. Besides, it is strongly related to the ordering originally defined: a new causal representation w.r.t. another ordering of the sites is not possible in general. By contrast, the graphical viewpoint allows one in some cases to identify at first glance (without need of any computational or probabilistic argument) interaction structures that always support causal models. Hence, noniterative sampling, energy minimization, marginal computation and normalization constant calculation are possible either for the prior model or for the posterior one if they are respectively defined on triangulated independence graphs. In particular, for any compatible ordering (there are several of them in general), one can get back to the classical causal representation (8) based on transition

4 In the course of the recursion, other $G_j$'s (if any) such that $\max n(j) \cap pa(j) = \bar n$ will similarly "aggregate" to $g_{\bar n}$.
5 In these cases, we shall change the notation $G_i$ into $F_i$ and $\tilde F_i$, respectively.
Fig. 3. (a) Example of a graph $G = [S, N]$ supporting causal energy-based models, since the in-past neighborhoods $n(2) \cap pa(2) = \{1\}$, $n(3) \cap pa(3) = \{1, 2\}$, $n(4) \cap pa(4) = \{2, 3\}$, $n(5) \cap pa(5) = \{3, 4\}$, and $n(6) \cap pa(6) = n(6) = \{2, 4\}$ are cliques of $G$; (b–d) successive subgraphs obtained by summing out (or maximizing out) $(z_5, z_6)$, $z_4$, and $z_3$ successively; (e) the associated algorithmic tree $T_G$.
kernels, by means of simple noniterative computations. This probabilistic representation, which can be derived afterward, is used for noniterative sampling and for marginal computation/maximization. In the following, we will focus on the particular case of trees. They are triangulated since, by definition, they have no cycles. For them, each $\tilde n(i)$ reduces to a singleton containing the parent of $i$, and $G \equiv T_G$ obviously.
4. Models on trees

Consider an energy-based joint model defined on a tree as
$$\exp\{-U(x, y)\} = \prod_i f_i(x_i, x_{\bar n_i})\, h_i(y_i, x_i),$$
with $\sum_{y_i} h_i(y_i, x_i) \equiv m_i$. This means that $P(y \mid x) = \prod_i P(y_i \mid x_i) = \prod_i \frac{h_i(y_i, x_i)}{m_i}$ and $P(x) \propto \prod_i f_i$. Recall that $\bar n_i$ denotes the unique parent of $i$ (with the convention $\bar n_1 = 0$, $x_{\bar n_1}$ having to be ignored) and that $\underline{i}$ is the set of $i$'s children. Also introduce the ancestor site set $i^{\uparrow}$, composed of the sites of the chain between $i$ and 1 (except $i$ itself), and the descendant site set $i^{\downarrow} = \{j : i \in j^{\uparrow}\}$ (see Fig. 4). As in Eq. (11) (with $z \equiv x$, $\tilde n(i) = \{\bar n_i\}$, and the appropriate changes of notation) we can recursively define the functions $F_i$'s. The causal probabilistic specification of the prior model is then obtained:
$$P(x_i \mid x_{\bar n_i}) = \frac{f_i(x_i, x_{\bar n_i})}{F_i(x_{\bar n_i})} \prod_{j \in \underline{i}} F_j(x_i), \qquad P(x) = \frac{1}{F_1} \prod_i f_i(x_i, x_{\bar n_i}).$$
This allows us to easily draw samples from the prior distribution according to a root-to-leaves recursive procedure. If, in addition, $\sum_{x_i} f_i(x_i, x_{\bar n_i}) \equiv k_i$, which is usually the case for the labeling priors used in detection, segmentation, and classification problems, we turn back to the setting of Section 3.1. This yields the simple causal description $P(x_i \mid x_{\bar n_i}) = f_i(x_i, x_{\bar n_i})/k_i$. Moreover, if the $f_i$'s also verify $f_1 \equiv k'_1$ and, $\forall i > 1$, $\sum_{x_{\bar n_i}} f_i(x_i, x_{\bar n_i}) \equiv k'_i$ (which is also often the case, with $k_i = k'_i$), then all prior marginals are uniform. In this case, the coming derivations are greatly simplified.

Fig. 4. Independence graph whose prior component is a (dyadic) tree: the square vertices are for the $x_i$'s while the round vertices are for the pointwise measurements $y_i$'s.

4.1. Leaves-to-root maximizations and the global energy minimizer

Using maximization instead of summation in the upward scheme provides a two-sweep Viterbi-like method that minimizes the energy $U(x, y)$ w.r.t. $x$ [35]. The maximization counterpart of (11) applied to the posterior model provides functions $\hat F_i$'s which, in this case, "collect" dependencies on more and more data as the recursion proceeds: $\hat F_i$ depends not only on $x_{\bar n_i}$, but also on $(y_i, y_{i^{\downarrow}}) = y^+_{i^{\downarrow}}$. The MAP estimate then has to be recovered component by component according to a downward recursion in which one simply reads look-up tables built during the previous sweep (Fig. 5a):

Two-sweep MAP computation on a tree

Upward sweep.
Leaves:
$$\hat F_i(x_{\bar n_i}, y_i) = \max_{x_i} h_i(y_i, x_i)\, f_i(x_i, x_{\bar n_i}), \qquad x_i^{*}(x_{\bar n_i}, y_i) = \arg\max_{x_i} h_i(y_i, x_i)\, f_i(x_i, x_{\bar n_i}).$$
Recursion:
$$\hat F_i(x_{\bar n_i}, y^+_{i^{\downarrow}}) = \max_{x_i} h_i(y_i, x_i)\, f_i(x_i, x_{\bar n_i}) \prod_{j \in \underline{i}} \hat F_j(x_i, y^+_{j^{\downarrow}}), \qquad x_i^{*}(x_{\bar n_i}, y^+_{i^{\downarrow}}) = \arg\max_{x_i} h_i(y_i, x_i)\, f_i(x_i, x_{\bar n_i}) \prod_{j \in \underline{i}} \hat F_j(x_i, y^+_{j^{\downarrow}}).$$
Root:
$$\hat F_1(y) = \max_{x_1} h_1(y_1, x_1)\, f_1(x_1) \prod_{j \in \underline{1}} \hat F_j(x_1, y^+_{j^{\downarrow}}), \qquad x_1^{*}(y) = \arg\max_{x_1} h_1(y_1, x_1)\, f_1(x_1) \prod_{j \in \underline{1}} \hat F_j(x_1, y^+_{j^{\downarrow}}).$$
Downward sweep:
$$\hat x_1 = x_1^{*}(y) \quad \text{and} \quad \forall i > 1,\ \hat x_i = x_i^{*}(\hat x_{\bar n_i}, y^+_{i^{\downarrow}}).$$

The procedure can equivalently be expressed in terms of the interaction potentials $v_i = -\log f_i$ and $l_i = -\log h_i$, products being replaced by sums. The function $-\ln \hat F_i(x_{\bar n_i}, y^+_{i^{\downarrow}})$ is the minimum value of the piece of energy $\sum_{k \in \{i\} \cup i^{\downarrow}} (l_k + v_k)$ for a fixed value of $x_{\bar n_i}$, and the resulting Viterbi-like algorithm provides a global minimizer of $U$ since
$$\hat x_i = \arg\min_{x_i} \Bigl[\min_{x_k,\, k \ne i} U(x, y)\Bigr] = \arg\min_{x_i} \Bigl[l_i(y_i, x_i) + v_i(x_i, \hat x_{\bar n_i}) + \min_{x_{i^{\downarrow}}} \sum_{k \in i^{\downarrow}} (l_k + v_k)\Bigr]$$
$$= \arg\min_{x_i} \Bigl[l_i + v_i(x_i, \hat x_{\bar n_i}) + \sum_{j \in \underline{i}} \min_{x_j} \Bigl(v_j + l_j + \min_{x_{j^{\downarrow}}} \sum_{k \in j^{\downarrow}} (l_k + v_k)\Bigr)\Bigr] = \arg\min_{x_i} \Bigl[l_i + v_i(x_i, \hat x_{\bar n_i}) + \sum_{j \in \underline{i}} \bigl(-\log \hat F_j\bigr)\Bigr].$$

4.2. Leaves-to-root summations given the data

In the same spirit as the "forward–backward" Baum algorithm on chains [17], a two-sweep procedure can be devised to compute exactly the site-wise posterior marginals (from which the MPM estimate can be deduced). Let us start by introducing $x_{\bar n_i}$ within the local posterior marginal $P(x_i \mid y)$:
$$P(x_i \mid y) = \sum_{x_{\bar n_i}} P(x_i \mid x_{\bar n_i}, y)\, P(x_{\bar n_i} \mid y), \qquad (12)$$
where $P(x_i \mid x_{\bar n_i}, y) = P(x_i \mid x_{\bar n_i}, y^+_{i^{\downarrow}})$ due to the separation property. This makes a downward recursion on the site-wise posterior marginals appear, provided that the posterior marginal at the root, $P(x_1 \mid y)$, and the posterior transition probabilities $P(x_i \mid x_{\bar n_i}, y^+_{i^{\downarrow}})$ are available. These quantities are provided by a previous upward sweep corresponding to successively summing out the $x_i$'s from the leaves to vertex 1, as in (11) for $z \equiv x \mid y$ (i.e., $\forall i$, $g_i \equiv h_i f_i$) and $\tilde n(i) = \{\bar n_i\}$, yielding functions $\tilde F_i$'s. The Markov chain-type representation is then obtained as with the prior model:
$$P(x_i \mid x_{\bar n_i}, y) = P(x_i \mid x_{\bar n_i}, y^+_{i^{\downarrow}}) = \frac{h_i(y_i, x_i)\, f_i(x_i, x_{\bar n_i})}{\tilde F_i(x_{\bar n_i}, y^+_{i^{\downarrow}})} \prod_{j \in \underline{i}} \tilde F_j(x_i, y^+_{j^{\downarrow}}),$$
and
$$P(x \mid y) = \frac{1}{\tilde F_1(y)} \prod_i h_i(y_i, x_i)\, f_i(x_i, x_{\bar n_i}). \qquad (13)$$
A noniterative sampling from the posterior distribution can be performed thanks to this probabilistic representation. Also, the joint likelihood of the data, $P(y)$, is accessible: from $P(y \mid x) = \prod_i \frac{h_i(y_i, x_i)}{m_i}$, $P(x) = \frac{1}{F_1} \prod_i f_i(x_i, x_{\bar n_i})$ and $P(x \mid y)$ given above, it comes that $P(y) = \frac{\tilde F_1(y)}{F_1 \prod_i m_i}$.

The upward sweep computing the $\tilde F_i$'s provides the necessary ingredients for the downward recursion (12). We end up with the following two-sweep procedure (Fig. 5b):

Two-sweep computation of local posterior marginals on a tree

Upward sweep.
Leaves: $\tilde F_i(x_{\bar n_i}, y_i) = \sum_{x_i} h_i(y_i, x_i)\, f_i(x_i, x_{\bar n_i})$.
Recursion: $\tilde F_i(x_{\bar n_i}, y^+_{i^{\downarrow}}) = \sum_{x_i} h_i(y_i, x_i)\, f_i(x_i, x_{\bar n_i}) \prod_{j \in \underline{i}} \tilde F_j(x_i, y^+_{j^{\downarrow}})$.
Root: $\tilde F_1(y) = \sum_{x_1} h_1(y_1, x_1)\, f_1(x_1) \prod_{j \in \underline{1}} \tilde F_j(x_1, y^+_{j^{\downarrow}})$.

Downward sweep.
Root: $P(x_1 \mid y) = \dfrac{h_1(y_1, x_1)\, f_1(x_1)}{\tilde F_1(y)} \prod_{j \in \underline{1}} \tilde F_j(x_1, y^+_{j^{\downarrow}})$.
Recursion: $P(x_i \mid y) = \sum_{x_{\bar n_i}} P(x_{\bar n_i} \mid y)\, \dfrac{h_i(y_i, x_i)\, f_i(x_i, x_{\bar n_i})}{\tilde F_i(x_{\bar n_i}, y^+_{i^{\downarrow}})} \prod_{j \in \underline{i}} \tilde F_j(x_i, y^+_{j^{\downarrow}})$.
Leaves: $P(x_i \mid y) = \sum_{x_{\bar n_i}} P(x_{\bar n_i} \mid y)\, \dfrac{h_i(y_i, x_i)\, f_i(x_i, x_{\bar n_i})}{\tilde F_i(x_{\bar n_i}, y_i)}$.

The MPM estimates are obtained within the top-down part by maximizing the site-wise posterior marginals.6

6 Note that a slightly different, but more complex, procedure can be devised based on the upward propagation of the partial posterior marginals $P(x_i \mid y^+_{i^{\downarrow}})$ [21]. It is the exact counterpart of Gaussian inference on a tree based on upward Kalman filtering [22]. Contrary to what we propose here, this method requires however an explicit knowledge of the prior marginals $P(x_i)$ and of the child-to-parent transition probabilities $P(x_{\bar n_i} \mid x_i)$.
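The MAP two-sweep procedure of Section 4.1 can be sketched in a few lines (an added illustration, not the authors' code; the data layout — nested lists indexed by labels, node 0 as root — and the function names are assumptions):

```python
# Sketch: upward max-products with look-up tables, then downward readout.
# children[i] : list of children of node i (node 0 is the root)
# f[i][xi][xp]: f_i(x_i, x_parent) (for the root, f[0][x0][0])
# h[i][xi]    : h_i(y_i, x_i) for the observed y_i
def map_on_tree(children, f, h, nlabels):
    F, best = {}, {}
    def upward(i):
        for j in children[i]:
            upward(j)
        npar = 1 if i == 0 else nlabels
        F[i], best[i] = [0.0] * npar, [0] * npar
        for xp in range(npar):
            vals = []
            for xi in range(nlabels):
                v = h[i][xi] * f[i][xi][xp]
                for j in children[i]:
                    v *= F[j][xi]
                vals.append(v)
            best[i][xp] = max(range(nlabels), key=vals.__getitem__)
            F[i][xp] = vals[best[i][xp]]
    upward(0)
    xhat = {0: best[0][0]}          # downward sweep: read the look-up tables
    stack = list(children[0])
    while stack:
        i = stack.pop()
        p = next(k for k, cs in children.items() if i in cs)  # parent of i
        xhat[i] = best[i][xhat[p]]  # parent estimate is already available
        stack.extend(children[i])
    return xhat
```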
It is interesting to note that Bouman et al. [19] define a very similar noniterative estimator on a tree structure. Starting from an original Bayesian estimator they call "sequential MAP", devised to improve MAP estimates, they obtain a downward recursive approximation of it which goes as follows:
$$\hat x_1 \approx \arg\max_{x_1} P(x_1 \mid y), \qquad \hat x_i \approx \arg\max_{x_i} P(x_i \mid \hat x_{\bar n_i}, y^+_{i^{\downarrow}}) \quad \forall i > 1. \qquad (14)$$
At the root, their estimator provides the MPM estimate. As for the estimates at the other sites, the influence of the observations which are not on descendants is simply replaced by the dependency with respect to the parent variable, set at its optimal value already computed. This inference scheme can be plugged into our two-sweep summation procedure to produce an alternate estimator close to the MPM, which we could refer to as "semi-MPM". Note that the corresponding (exact) top-down recursive estimation is formally very similar to that of the MAP estimation (see Fig. 5a): in both cases, the estimate at a site $i$ is obtained by maximizing a function of the estimated value
$\hat x_{\bar n_i}$ on the parent vertex (contrary to MPM estimation) and of the data $y^+_{i^{\downarrow}}$.

Table 1 gathers in a structured and synthetic way the different two-sweep procedures presented so far, but within the general setting of triangulated graphs (i.e., not necessarily with $\tilde n(i)$ reducing to a unique parent node): it concerns a discrete energy-based model with triangulated independence graph and $\exp\{-U(x, y)\} = \prod_i f_i(x_i, x_{\tilde n(i)})\, h_i(y_i, x_i)$, with $\sum_{y_i} h_i(y_i, x_i) = m_i$. Note however that, for the sake of simplicity, the downward recursions that require summations over the possible values of the past neighborhood, i.e., w.r.t. $x_{\tilde n(i)}$, are only written down for a tree, when $\tilde n(i)$ reduces to $\{\bar n_i\}$. Apart from providing a practical summary of the different noniterative computations on these models, this table allows us to emphasize the profound similarity of the procedures.
Table 1
Generic upward sweeps (summations on the prior model, summations on the posterior model, and maximizations on the posterior model) and downward sweeps (for computing various marginals, sampling from them, and inferring) for a discrete energy-based model with triangulated independence graph and $\exp\{-U(x,y)\} = \prod_i f_i(x_i, x_{\tilde n(i)})\, h_i(y_i, x_i)$, with $\sum_{y_i} h_i(y_i, x_i) = m_i$.

Upward sweeps:
- Prior summations: leaves $F_i(x_{\tilde n(i)}) = \sum_{x_i} f_i$; recursion $F_i(x_{\tilde n(i)}) = \sum_{x_i} f_i \prod_{j \in \underline{i}} F_j(x_{\tilde n(j)})$; root $F_1 = \sum_{x_1} f_1 \prod_{j \in \underline{1}} F_j(x_1)$.
- Posterior summations: leaves $\tilde F_i(x_{\tilde n(i)}, y_i) = \sum_{x_i} h_i f_i$; recursion $\tilde F_i(x_{\tilde n(i)}, y^+_{i^{\downarrow}}) = \sum_{x_i} h_i f_i \prod_{j \in \underline{i}} \tilde F_j(x_{\tilde n(j)}, y^+_{j^{\downarrow}})$; root $\tilde F_1(y) = \sum_{x_1} h_1 f_1 \prod_{j \in \underline{1}} \tilde F_j(x_1, y^+_{j^{\downarrow}})$.
- Posterior maximizations: leaves $\hat F_i(x_{\tilde n(i)}, y_i) = \max_{x_i} h_i f_i$; recursion $\hat F_i(x_{\tilde n(i)}, y^+_{i^{\downarrow}}) = \max_{x_i} h_i f_i \prod_{j \in \underline{i}} \hat F_j(x_{\tilde n(j)}, y^+_{j^{\downarrow}})$; root $\hat F_1(y) = \max_{x_1} h_1 f_1 \prod_{j \in \underline{1}} \hat F_j(x_1, y^+_{j^{\downarrow}})$.

Downward sweeps:
- Prior sampling: root $P(x_1) = \dfrac{f_1}{F_1} \prod_{j \in \underline{1}} F_j(x_1)$; recursion $P(x_i \mid x_{\tilde n(i)}) = \dfrac{f_i}{F_i} \prod_{j \in \underline{i}} F_j$.
- Prior marginals ($\tilde n(i) = \{\bar n_i\}$): root as above; recursion $P(x_i) = \sum_{x_{\bar n_i}} P(x_{\bar n_i})\, \dfrac{f_i}{F_i} \prod_{j \in \underline{i}} F_j$.
- Posterior sampling: root $P(x_1 \mid y) = \dfrac{h_1 f_1}{\tilde F_1(y)} \prod_{j \in \underline{1}} \tilde F_j(x_1, y^+_{j^{\downarrow}})$; recursion $P(x_i \mid x_{\tilde n(i)}, y^+_{i^{\downarrow}}) = \dfrac{h_i f_i}{\tilde F_i} \prod_{j \in \underline{i}} \tilde F_j$.
- Posterior marginals ($\tilde n(i) = \{\bar n_i\}$): root as above; recursion $P(x_i \mid y) = \sum_{x_{\bar n_i}} P(x_{\bar n_i} \mid y)\, \dfrac{h_i f_i}{\tilde F_i} \prod_{j \in \underline{i}} \tilde F_j$.
- Semi-MPM inference: root $\hat x_1 = \arg\max_{x_1} h_1 f_1 \prod_{j \in \underline{1}} \tilde F_j(x_1, y^+_{j^{\downarrow}})$; recursion $\hat x_i = \arg\max_{x_i} h_i f_i \prod_{j \in \underline{i}} \tilde F_j$ with $x_{\tilde n(i)} = \hat x_{\tilde n(i)}$.
- MPM inference ($\tilde n(i) = \{\bar n_i\}$): root $\hat x_1 = \arg\max_{x_1} h_1 f_1 \prod_{j \in \underline{1}} \tilde F_j(x_1, y^+_{j^{\downarrow}})$; recursion $\hat x_i = \arg\max_{x_i} \sum_{x_{\bar n_i}} P(x_{\bar n_i} \mid y)\, \dfrac{h_i f_i}{\tilde F_i} \prod_{j \in \underline{i}} \tilde F_j$.
- MAP inference: root $\hat x_1 = \arg\max_{x_1} h_1 f_1 \prod_{j \in \underline{1}} \hat F_j(x_1, y^+_{j^{\downarrow}})$; recursion $\hat x_i = \arg\max_{x_i} h_i f_i \prod_{j \in \underline{i}} \hat F_j$ with $x_{\tilde n(i)} = \hat x_{\tilde n(i)}$.
Fig. 5. Downward and backward steps for (a) MAP (resp. semi-MPM) and (b) MPM inference, on a (quad)tree.
5. Experimental results

To demonstrate the practicability and the relevance of the causal models we have presented for low-level image analysis problems, we report experimental results of classification with a model based on the standard quadtree as its prior independence graph, with the leaves fitting the pixels of the image $y$ to be classified. The energy function is composed of a Potts-like prior term encouraging likeness of children with parents, along with a Gaussian data term:
$$U(x, y) = \sum_{i > 1} \beta\, \bigl[1 - \delta(x_i, x_{\bar n_i})\bigr] + \sum_{i \in \text{leaves}} \Bigl[\frac{(y_i - \mu_{x_i})^2}{2\sigma_{x_i}^2} + \log \sigma_{x_i}\Bigr], \qquad (15)$$
where $x$ is a tree labeling with $x_i \in \{1, 2, \ldots, M\}$, $\beta$ is a positive parameter, and $\{(\mu_k, \sigma_k^2)\}_{k=1}^{M}$ are the means and variances of the $M$ classes.

First experiments were carried out on 256×256 synthetic images involving five classes with known means and variances (Fig. 6). The variances were set to a higher level in the second image (standard deviations range from 15 to 40 in the first image, and from 15 to 70 in the second one). We compared the three noniterative inference procedures on the quadtree with the iterative ICM algorithm running on the spatial counterpart of energy (15). The obtained classifications are shown in Fig. 7, while Table 2 indicates the corresponding misclassification rates and CPU times in seconds. On both images, the three noniterative estimators provided very close classifications, which are better than those obtained by iterative estimation with the grid-based model (and the noniterative estimations are less degraded than the iterative one for image #2), while taking two to three times less CPU time. Their noniterative nature results in a fixed computational complexity per site (e.g., they exhibit an O(n) complexity). We experimentally determined, using MATLAB implementations, that MAP, semi-MPM, and MPM inferences are achieved with respectively around 79, 94 and 107 floating point operations (flops) per site, when the $x_i$'s can take two possible values.
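Evaluating the quadtree energy of Eq. (15) for a candidate labeling is straightforward; the following sketch (illustrative only — the container names are assumptions, not from the paper) makes the two terms explicit:

```python
# Sketch: quadtree energy of Eq. (15) for a given labelling.
import math

def quadtree_energy(labels, parent, leaves, y, mu, sigma, beta):
    """labels : dict node -> class index in {0..M-1}
       parent : dict non-root node -> its parent node
       leaves : nodes in one-to-one correspondence with the pixels of y
       mu, sigma : per-class mean and standard deviation; beta > 0"""
    u = 0.0
    for i, p in parent.items():                 # Potts-like parent/child prior
        u += beta * (labels[i] != labels[p])
    for i in leaves:                            # Gaussian data term at leaves
        k = labels[i]
        u += (y[i] - mu[k]) ** 2 / (2.0 * sigma[k] ** 2) + math.log(sigma[k])
    return u
```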
Fig. 6. 5-class synthetic data.
Fig. 7. From left to right: MAP, semi-MPM, and MPM estimates on the quadtree, and the ICM iterative estimate on the pixel grid, for the classification of (a) synthetic image #1, (b) synthetic image #2.
Table 2
Comparative percentages of misclassification, and CPU times in seconds, on the synthetic images

             Quadtree                                        2D grid
             MAP             sMPM            MPM             ICM
  Image #1   4.79% (3.5 s)   4.73% (5.7 s)   4.73% (8.4 s)   5.30% (10.1 s)
  Image #2   8.01% (3.5 s)   7.96% (5.7 s)   7.97% (8.4 s)   9.65% (10.5 s)
With a similar implementation, standard ICM estimation on the bidimensional grid [1] costs around 52 flops/site, whereas the overall procedure is iterative with no guarantee on the required number of iterations. Among the three noniterative estimators, the MPM estimator is the most time consuming, due to the larger amount of calculation required in the downward sweep. However, this extra cost (for similar estimates) might be worth the pain, since the obtained knowledge of the posterior marginals $P(x_i \mid y)$ allows one to assess, for each site, the degree of confidence that can be associated to the estimated value, e.g., through the entropy $-\sum_{x_i} P(x_i \mid y) \log P(x_i \mid y)$ of the marginal. Fig. 8 shows such "confidence maps". These confidence measures, reminiscent of the error covariance matrices of Gaussian models on trees [24], can be useful for a better appreciation and use of the obtained estimates.

Visually, the classifications provided by the three noniterative estimators exhibit a "blocky" aspect, reminiscent of the underlying prior quadtree structure. The amount of such artifacts depends on the relative location of the spatial patterns with respect to the block partition induced on the pixel grid by the quadtree. Also, these artifacts are more apparent in the processing of noisier images, where the role of the quadtree-based prior has to be enforced to get rid of the noise. In the prospect of parameter estimation, this is not a serious problem, provided that the overall estimate is good (i.e., the percentage of misclassification is low). However, if the visual rendering of the estimate is at the heart of the concerned application, a single ICM smoothing sweep suffices to remove the "blockyness" at reasonable cost.

There is another source of concern, lying in the huge number of successive summations/multiplications usually involved in the functions computed through the upward sweeps. If no attention is paid to this aspect, one will often end up with quantities either too small or too large to be handled by computers. To prevent the algorithms from being trapped in these tricky situations, it might be necessary to devise a rescaling of the quantities of interest (namely the $F_i$'s, $\tilde F_i$'s, or $\hat F_i$'s).
Fig. 8. `Con"dence mapsa associated to noniterative MPM classi"cation of synthetic images d1 and d2 by the entropy of posterior marginals at leaves (the darker, the less entropy and the higher con"dence).
Fig. 9. (a) 512x512 Spot image (courtesy of Costel, University of Rennes 2, and GSTB); (b) direct maximum-likelihood classification; (c) ICM iterative classification on the pixel grid; (d) MAP noniterative classification on the quadtree.
Finally, we consider the supervised classification of a 512x512 Spot image (Fig. 9a), provided by the Costel laboratory (University of Rennes 2), into 8 classes with physical meanings (mainly types of culture). Maximum likelihood classification (often used in remote sensing applications) is poor (Fig. 9b), but provides a simple and sensible initial configuration for the iterative grid-based classification, whose final result is obtained after 65 s (Fig. 9c). In less time, the three tree-based noniterative estimators provided close results of good quality. See, for instance, the MAP classification in Fig. 9d, obtained within 40 s.
6. Conclusion

In this paper, we have aimed to provide a comprehensive and unified picture of models on "causal graphs". We presented in detail the manipulation of such discrete models, with emphasis on (a) the use of graph-theoretic concepts as tools to devise models and gain insight into algorithmic procedures; (b) the profound unity which underlies the different procedures, whether they compute probabilities, draw samples, or infer estimates. In particular, we presented three generic exact noniterative inference algorithms devoted to models exhibiting a triangulated independence graph. The first algorithm computes the MAP estimate (and can be considered, apart from any probabilistic framework, as performing global energy minimization). The second one, whose aim is intrinsically probabilistic, computes local posterior marginals, which can be used to obtain the MPM estimate or to estimate parameters within an EM-like algorithm [20]. The third one mixes, to some extent, the characteristics of the two others. On simple quadtrees, these two-sweep procedures provide a hierarchical framework suitable for discrete image analysis problems such as detection, segmentation or classification. Apart from providing a lower-cost alternative to iterative inference schemes, these tree-based models are good candidates for handling multiresolution data, as advocated in [21,27].
Acknowledgements

The authors gratefully acknowledge their debt to Eric Fabre for enlightening and stimulating discussions.
References
[1] J. Besag, Spatial interaction and the statistical analysis of lattice systems, J. Royal Statist. Soc. B 36 (1974) 192-236.
[2] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Machine Intell. 6 (6) (1984) 721-741.
[3] B. ter Haar Romeny (Ed.), Geometry-driven Diffusion in Computer Vision, Kluwer Academic Publishers, Dordrecht, 1995.
[4] G. Celeux, D. Chauveau, J. Diebolt, On stochastic versions of the EM algorithm, Technical Report 2514, INRIA, March 1995.
[5] A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Royal Statist. Soc. B 39 (1977) 1-38, with discussion.
[6] B. Chalmond, An iterative Gibbsian technique for reconstruction of M-ary images, Pattern Recognition 22 (6) (1989) 747-761.
[7] K. Abend, T.J. Harley, L.N. Kanal, Classification of binary random patterns, IEEE Trans. Inform. Theory 11 (1965) 538-544.
[8] H. Derin, P.A. Kelly, Discrete-index Markov-type random processes, Proc. IEEE 77 (10) (1989) 1485-1509.
[9] J. Goutsias, Mutually compatible Gibbs random fields, IEEE Trans. Inform. Theory 35 (6) (1989) 1233-1249.
[10] J. Goutsias, Unilateral approximation of Gibbs random field images, Graph. Mod. Image Proc. 53 (1991) 240-257.
[11] A. Habibi, Two-dimensional Bayesian estimate of images, Proc. IEEE 60 (1972) 878-883.
[12] J. Moura, N. Balram, Recursive structure of noncausal Gauss-Markov random fields, IEEE Trans. Inform. Theory 38 (2) (1992) 335-354.
[13] D. Pickard, A curious binary lattice, J. Appl. Probab. 14 (1977) 717-731.
[14] D. Pickard, Unilateral Markov fields, Adv. Appl. Probab. 12 (1980) 655-671.
[15] J. Woods, C. Radewan, Kalman filtering in two dimensions, IEEE Trans. Inform. Theory 23 (1977) 473-481.
[16] G.D. Forney, The Viterbi algorithm, Proc. IEEE 61 (3) (1973) 268-278.
[17] L. Baum, T. Petrie, G. Soules, N. Weiss, A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, Ann. Math. Statist. 41 (1970) 164-171.
[18] L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (2) (1989) 257-285.
[19] C. Bouman, M. Shapiro, A multiscale image model for Bayesian image segmentation, IEEE Trans. Image Process. 3 (2) (1994) 162-177.
[20] J.-M. Laferte, F. Heitz, P. Perez, A multiresolution EM algorithm for unsupervised image classification using a quadtree model, in: Proc. Int. Conf. on Pattern Recognition, Vienna, Austria, August 1996.
[21] J.-M. Laferte, F. Heitz, P. Perez, E. Fabre, Hierarchical statistical models for the fusion of multiresolution image data, in: Proc. Int. Conf. Computer Vision, Cambridge, June 1995.
[22] K. Chou, A. Willsky, A. Benveniste, Multiscale recursive estimation, data fusion, and regularization, IEEE Trans. Autom. Control 39 (3) (1994) 464-477.
[23] K. Chou, A. Willsky, R. Nikoukhah, Multiscale systems, Kalman filters, and Riccati equations, IEEE Trans. Autom. Control 39 (3) (1994) 479-491.
[24] M. Luettgen, W. Karl, A. Willsky, Efficient multiscale regularization with applications to the computation of optical flow, IEEE Trans. Image Process. 3 (1) (1994) 41-64.
[25] M. Luettgen, A. Willsky, Likelihood calculation for a class of multiscale stochastic models, with application to texture discrimination, IEEE Trans. Image Process. 4 (2) (1995) 194-207.
[26] P. Fieguth, Application of multiscale estimation to large scale multidimensional imaging and remote sensing problems, Ph.D. thesis, MIT Dept. of EECS, June 1995.
[27] M. Daniel, A.S. Willsky, A multiresolution methodology for signal-level fusion and data assimilation with applications to remote sensing, Proc. IEEE 85 (1) (1997) 164-180.
[28] R. Kindermann, J.L. Snell, Markov Random Fields and their Applications, vol. 1, Amer. Math. Soc., Providence, RI, 1980.
[29] P. Perez, F. Heitz, Restriction of a Markov random field on a graph and multiresolution statistical image modeling, IEEE Trans. Inform. Theory 42 (1) (1996) 180-190.
[30] S. Lauritzen, Graphical Models, Oxford Science Publications, Oxford, 1996.
[31] J. Whittaker, Graphical Models in Applied Multivariate Statistics, Wiley, Chichester, 1990.
[32] J. Woods, Two-dimensional discrete Markovian fields, IEEE Trans. Inform. Theory 18 (1972) 232-240.
[33] S. Lauritzen, A. Dawid, B. Larsen, H.-G. Leimer, Independence properties of directed Markov fields, Networks 20 (1990) 491-505.
[34] P.A. Devijver, Real-time modeling of image sequences based on hidden Markov mesh random field models, Technical Report M-307, Philips Research Lab., June 1989.
[35] J.-M. Laferte, F. Heitz, P. Perez, E. Fabre, Hierarchical statistical models for the fusion of multiresolution data, in: Proc. SPIE Conf. on Neural, Morphological, and Stochastic Methods in Image and Signal Processing, San Diego, USA, July 1995.
About the Author: PATRICK PEREZ was born in 1968. He graduated from Ecole Centrale Paris, France, in 1990. He received the Ph.D. degree in Signal Processing and Telecommunications from the University of Rennes, France, in 1993. He now holds a full-time research position at the INRIA centre in Rennes. His research interests include statistical and/or hierarchical models for large inverse problems in image analysis.

About the Author: ANNABELLE CHARDIN was born in 1973. She graduated from Ecole Nationale Superieure de Physique de Marseille. She is completing her Ph.D. degree in Signal Processing and Telecommunications at the University of Rennes, France.

About the Author: JEAN-MARC LAFERTE was born in 1968. He received the Ph.D. degree in Computer Science from the University of Rennes, France, in 1996. He now holds an assistant professor position in the computer science department of the University of Rennes.
Pattern Recognition 33 (2000) 587-602
Unsupervised image segmentation using Markov random field models
S.A. Barker*, P.J.W. Rayner
Signal Processing and Communications Group, Cambridge University Engineering Department, Cambridge CB2 1PZ, UK
Received 15 March 1999
Abstract

We present two unsupervised segmentation algorithms based on hierarchical Markov random field models for segmenting both noisy images and textured images. Each algorithm finds the most likely number of classes, their associated model parameters, and generates a corresponding segmentation of the image into these classes. This is achieved according to the maximum a posteriori criterion. To facilitate this, an MCMC algorithm is formulated to allow the direct sampling of all the above parameters from the posterior distribution of the image. To allow the number of classes to be sampled, a reversible jump is incorporated into the Markov chain. Experimental results are presented showing rapid convergence of the algorithm to accurate solutions. (c) 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Markov random field; Unsupervised segmentation; Reversible jump; Markov chain Monte Carlo; Simulated annealing
1. Introduction

The segmentation of noisy or textured images into a number of different regions comprises a difficult optimisation problem. This is compounded when the number of regions into which the image is to be segmented is also unknown. If each region within an image is described by a different model distribution, then the observed image may be viewed as a realisation from a map of these model distributions. This underlying map divides the image into regions which are labelled with different classes. Image segmentation can therefore be treated as an incomplete data problem [1], in which the intensity data is observed, the class map is missing, and the model parameters associated with each class need to be estimated. In the unsupervised case, the number of model classes is also unknown. The unsupervised segmentation problem has been approached by several authors. Of these, most propose
* Corresponding author. E-mail address: [email protected] (S.A. Barker)
algorithms comprising two steps [2-4]. The image is assumed to be composed of an unknown number of regions, each modelled as an individual Markov random field. The first of these steps is a coarse segmentation of the image into the most 'likely' number of regions. This is achieved by dividing the image into windows, calculating features or estimating model parameters, then using a measure to combine closely related windows. The resulting segmentation is then used to estimate model parameters for each of the classes, before a supervised high-resolution segmentation is carried out via some form of relaxation algorithm. A similar methodology is used by Geman et al. [5], but the measure used, the Kolmogorov-Smirnov distance, is a direct measure of the similarity of the distributions of grayscale values (in the form of histograms) in adjacent windows. Windows are then combined into a single region if the distance between their distributions is relatively small. A variant on this algorithm [6] is based on the same distance function, but the distribution of grayscales in each window is compared with the distribution functions of the samples comprising each class over the complete image. If the distribution of one class is
found to be close enough to that of the window, then it is designated as being a member of that class. Otherwise, a new outlier class is created. When the field stabilises, usually after several iterations, new classes are created from the outliers if they constitute more than one percent of the image. If not, the algorithm is re-run. A split-and-merge algorithm is proposed by Panjwani and Healey [7]. The image is initially split into large square windows, but these are then re-split to form four smaller windows if a uniformity test for each window is not met. The process ends when windows as small as 4x4 pixels are reached. The windows are then merged to form regions using a distance measure based on the pseudo-likelihood. Won and Derin [8] obtain segmentations and parameter estimates by alternately sampling the label field and calculating maximum pseudo-likelihood estimates of the parameter values. The process is repeated over differing numbers of label classes, and the resulting estimates are applied to a model fitting criterion to select the optimum number of classes and hence the image segmentation. The criterion used compensates the likelihood of the optimised model with a penalty term that offsets image size against the number of independent model parameters used. The penalty term and its associated parameter values were selected arbitrarily. This method of exhaustive search over a varying number of classes was developed further by Langan et al. [9]. Here, an EM algorithm is first used to estimate parameters while alternately segmenting the image. To select between the resulting optimisations of the differing models, the function of increasing likelihood against increasing model order is fitted to a rising exponential model. The exponential model parameters are selected in a least-squares sense, and the optimum model order is then found at a pre-specified knee point in the exponential curve. The approach to unsupervised segmentation presented here comprises a Markov chain Monte Carlo (MCMC) algorithm to sample from the posterior distribution, so that simulated annealing may be used to estimate the MAP solution. This methodology is similar to that used in [10] to segment an image using a known MRF model. Here the method is extended so that the sampling scheme, and hence the MAP estimate, is not just over the segmentation of the image into classes, but also over the number of classes and their respective model parameters. The algorithm differs from those reviewed in that no windowing is required to estimate region model parameters. The algorithm's MCMC methodology removes the necessity for an exhaustive search over a subsection of the parameter space. This ensures an improvement in efficiency over algorithms that require separate optimisations to be carried out for each model before a model order selection is made.
The remainder of this paper is divided as follows: Section 2 specifies the image models used throughout the paper. The posterior distributions for the noisy image and texture models are derived in Sections 2.1 and 2.2. Section 3 describes the algorithms employed to sample from these distributions. The segmentation process, or allocation of class labels to pixel sites, is given, as is the sampling scheme for noise and MRF model parameters from their conditional densities. The method by which reversible jumps are incorporated into Markov chains, to enable sampling of the number of classes into which an image might be segmented, is then described. This process is then detailed for both noisy and textured image models. Experimental results for the resulting algorithms are presented in Section 4, and the paper is concluded in Section 5.
2. Image models

Let $\Omega$ denote an $M\times N$ lattice indexed by $(i,j)$, so that $\Omega=\{(i,j);\,1\le i\le M,\,1\le j\le N\}$. Let $Y=\{Y_s=y_s;\,s\in\Omega\}$ be the observed grayscale image, where pixels take values from the interval $(0,1]$. Then let $X=\{X_s=x_s;\,s\in\Omega\}$ correspond to the labels of the underlying Markov random field, which take values from $\Lambda=\{0,1,\ldots,k-1\}$.
If $g_s$ defines a neighbourhood at site $s$, then let $x_{g_s}$ be the vector of labels comprising that neighbourhood. Similarly, let $\rho_s$ define a second neighbourhood structure at site $s$, but defined on the observed image $Y$, so that $y_{\rho_s}$ is the vector of pixel grayscale values over that neighbourhood. Finally, let all model parameters be collected in the parameter vector $\Psi$.
If a Gibbs distribution is used to model the likelihood of observing the image $Y$ given the label field $X$, and to model all a priori knowledge of spatial correlations within the image and label fields, then the conditional density for an observed pixel grayscale value and class label at site $s$, given the two neighbourhoods, is

$$p(Y_s=y_s, X_s=x_s \mid y_{\rho_s}, x_{g_s}, \Psi) = \frac{1}{Z(y_{\rho_s}, x_{g_s}, \Psi)}\, e^{-U(Y_s=y_s, X_s=x_s \mid y_{\rho_s}, x_{g_s}, \Psi)}, \quad (1)$$

where $U(\cdot)$ is the energy function and $Z(\cdot)$ is the normalising function of the conditional distribution. If we partition the parameter vector as $\Psi=[\phi_c;\,\{c\in\Lambda\},\,\gamma]$, where $\phi_c$ corresponds to a vector of model parameters defining the likelihood of observing the pixel value $y_s$ given its neighbourhood and label, and $\gamma$ corresponds to a vector of hyper-parameters defining the prior on the label field $X$, then the conditional distribution may be
factorised so that

$$p(Y_s=y_s, X_s=x_s \mid y_{\rho_s}, x_{g_s}, \Psi) = \frac{1}{Z(y_{\rho_s}, \phi_{x_s}, x_{g_s}, \gamma)}\, e^{-U_1(Y_s=y_s \mid y_{\rho_s},\, X_s=x_s,\, \phi_{x_s})\, -\, U_2(X_s=x_s \mid x_{g_s},\, \gamma)}. \quad (2)$$

When considering the complete Gibbs distribution for the entire image, the partition function (or normalising function) becomes far too complex to evaluate, making it unfeasible to compare the relative probabilities of two different MRF realisations. An approximation to the Gibbs distribution that allows an absolute probability for an MRF realisation to be calculated is the pseudo-likelihood, introduced by Besag [11]. The pseudo-likelihood is simply the product of the full conditionals, each given by Eq. (2), over the complete image $\Omega$:

$$\mathrm{PL}(Y=y, X=x \mid \Psi) = \prod_{s\in\Omega} \frac{1}{Z(\phi_{x_s})} \exp\{-U_1(Y_s=y_s \mid y_{\rho_s}, X_s=x_s, \phi_{x_s})\} \times \frac{\exp\{-\sum_{s\in\Omega} U_2(X_s=x_s \mid x_{g_s}, \gamma)\}}{\prod_{s\in\Omega} \sum_{c\in\Lambda} \exp\{-U_2(X_s=c \mid x_{g_s}, \gamma)\}}, \quad (3)$$

where $Z(\phi_c)$ is the normalising constant of the likelihood distribution for the observed pixel value given its neighbourhood and label.
By applying Bayes' law, an approximation to the posterior distribution for the MRF image model may be formed from the pseudo-likelihood. To make this a function of the model order (or number of label classes) $k$, proper priors must be defined for all the model parameters. The distribution is then given by

$$p(X=x, \Psi, k \mid Y=y) \propto \mathrm{PL}(Y=y, X=x \mid \Psi)\; p_r(k)\, p_r(\gamma) \prod_{c=0}^{k-1} p_r(\phi_c), \quad (4)$$

where the $p_r(\cdot)$ are the prior distributions. It is possible to incorporate various information criteria [8,9] into the posterior distribution by adding compensatory terms to the prior on the model order $k$.
The Isotropic and Gaussian MRF models used as the basis for the segmentation algorithms throughout the remainder of this paper both take Potts models as their prior distribution on the label field $X$. The differences between the two models occur in their modelling of the likelihood of observing pixel grayscale values given the label field; the principal difference is the lack of conditioning on neighbouring pixel grayscale values in the Isotropic model. The two models are described in more detail in the following two subsections.

2.1. The Isotropic Markov random field model

The Isotropic MRF model is used to model an image consisting of regions of constant but different grayscales, corrupted by an i.i.d. noise process. For each pixel, the likelihood of its grayscale value given its underlying label is a Gaussian distribution whose parameters depend on the label class. Hence, the grayscale values of the pixels comprising a region labelled as a single class $c$ may be considered a realisation of an i.i.d. Gaussian noise process whose parameter vector is $\phi_c=[\mu_c,\sigma_c]$.
The Potts model chosen to model a priori knowledge of spatial correlations within the label field incorporates potential functions defined using both single-site and pairwise cliques on a nearest-neighbour type neighbourhood. If the hyper-parameter vector is $\gamma=[\alpha_c;\,\{c\in\Lambda\},\,\beta]$, then the approximation to the posterior density given in Eq. (4) may be written

$$p(X=x, \Psi, k \mid Y=y) \propto \prod_{s\in\Omega} \frac{1}{\sqrt{2\pi\sigma_{x_s}^2}} \exp\Big\{-\frac{(y_s-\mu_{x_s})^2}{2\sigma_{x_s}^2}\Big\} \times \frac{\exp\{-\sum_{s\in\Omega}(\alpha_{x_s}+\beta V(x_s, x_{g_s}))\}}{\prod_{s\in\Omega}\sum_{c\in\Lambda}\exp\{-(\alpha_c+\beta V(c, x_{g_s}))\}} \times p_r(\beta)\, p_r(k) \prod_{c\in\Lambda} p_r(\mu_c)\, p_r(\sigma_c)\, p_r(\alpha_c), \quad (5)$$

where $V(c, x_{g_s})$ is the potential function at site $s$ when it is allocated to class $c$. Throughout this paper the potential function is defined as $V(c, x_{g_s})=\frac{1}{4}\sum_{t\in g_s}(c\ominus x_t)$, where $\ominus$ is an operator defined to take the value $-1$ if its arguments are equal and $+1$ otherwise.
Non-informative reference priors are chosen for the noise model parameters, to ensure that the observed intensity data dominates any prior information incorporated into the model. Uniform priors are selected for $\{\mu_c, \alpha_c;\,c\in\Lambda\}$, $\beta$ and $k$. For $\{\sigma_c,\,\forall c\in\Lambda\}$, the reference priors may be found using Jeffreys' formula for non-informative priors [12]. Normally these priors are improper, but here their range is restricted to allow normalisation and to ensure that the criterion for model selection, described later, is valid.

2.2. The Gaussian Markov random field model

The Gaussian MRF (GMRF) is commonly used to model images comprising textures characterised by their spatial correlation. The image model used in this paper is hierarchical, with individual textures modelled by GMRFs and the interaction between the regions comprising these textures modelled by a Potts model (as was used in the Isotropic case).
The $c$th GMRF may be parameterised by its covariance matrix $\Sigma_c$ and mean vector $\mu_c$. If $Y_c$ is the pixel grayscale vector and its dimension is $N_c$, then this leads to the joint distribution

$$p(Y_c=y_c \mid \mu_c, \Sigma_c) = \frac{\exp\{-\frac{1}{2}[y_c-\mu_c]^{T}\Sigma_c^{-1}[y_c-\mu_c]\}}{(2\pi)^{N_c/2}\,|\Sigma_c|^{1/2}}. \quad (6)$$

If we assign a neighbourhood to the pixel located at site $s$ and denote it $\rho_s$, then the above process may be written in terms of a non-causal AR representation with correlation coefficients $\theta_c^{(q)}$,

$$y_s = \sum_{q:\, s+q\in\rho_s} \theta_c^{(q)} y_{s+q} + e_s, \quad (7)$$

where $e_s$ is a zero-mean Gaussian noise process with autocorrelation given by

$$E[e_s e_{s+q}] = \begin{cases} \sigma_c^2, & q=0,\\ -\theta_c^{(q)}\sigma_c^2, & s+q\in\rho_s,\\ 0, & \text{otherwise.} \end{cases} \quad (8)$$

It is possible, with no loss of generality, to halve the number of correlation parameters by making $\theta_c^{(q)}=\theta_c^{(-q)}$. Letting $\Theta_c$ comprise the vector of correlation parameters for the GMRF labelled $c$, the conditional distribution for a single pixel may be written

$$p(y_s \mid \mu_c, \sigma_c, \Theta_c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\Big\{-\frac{1}{2\sigma_c^2}\Big((y_s-\mu_c) - \sum_{q:\, s+q\in\rho_s}\theta_c^{(q)}[(y_{s+q}-\mu_c)+(y_{s-q}-\mu_c)]\Big)^2\Big\}. \quad (9)$$

The interaction between regions is described by a Potts model. As with the Isotropic MRF, this models a priori knowledge of spatial correlations within the label field. The Potts model chosen here is identical to that used in the Isotropic case (given in the previous section), except that there are no single-clique parameters. These are omitted to reduce the complexity of the model order sampling step (see Section 3.2) and to weaken the prior on $X$. Hence, the hyper-parameter vector simply consists of one term, $\beta$. The posterior distribution for the complete hierarchical image model may now be approximated using the expression given in Eq. (4) and the conditional distributions given by Eq. (9):

$$p(X=x, \Psi, k \mid Y=y) \propto \prod_{s\in\Omega} p(y_s \mid \mu_{x_s}, \sigma_{x_s}, \Theta_{x_s}) \times \frac{\exp\{-\sum_{s\in\Omega}\beta V(x_s, x_{g_s})\}}{\prod_{s\in\Omega}\sum_{c\in\Lambda}\exp\{-\beta V(c, x_{g_s})\}} \times p_r(\beta)\, p_r(k) \prod_{c\in\Lambda} p_r(\mu_c)\, p_r(\sigma_c)\, p_r(\Theta_c), \quad (10)$$

where $V(c, x_{g_s})$ is the potential function as defined in the previous section. Priors for all parameters are as defined earlier, excepting the $\theta^{(q)}$ parameters. These are assigned conjugate priors, for reasons given later in Section 3.2. The priors therefore consist of a family of normal distributions, $N(0, \lambda^2)$, with $\lambda$ a common hyper-parameter.

3. MCMC sampling from the posterior distribution

The image segmentation, parameter estimates and model order are all estimated so as to maximise the a posteriori probability (the MAP criterion) of the model given the observed image. This comprises an optimisation problem over the MRF's label map, parameter space and model order. The approach adopted here was developed by Geman and Geman [10]: construct a Markov chain to sample from the target distribution and then perform a process of stochastic relaxation (i.e. simulated annealing). Various shortcuts, for example the iterated conditional modes algorithm [13], using the EM algorithm to estimate model parameters [14], or MAP-based parameter estimation algorithms [8], have not been adopted in this paper. This is because, when sampling the model order (using the reversible jump algorithms described in Sections 3.1 and 3.2), jumps to lower-probability areas of the model space occur frequently, and hence to adopt any deterministic elements in the algorithm would appear inconsistent.
The sampling scheme in this paper is based on the Gibbs sampler [10], but Metropolis-Hastings sub-chains [15] are incorporated to enable the model parameters and number of classes to be sampled. The sampling process follows a predetermined sequential scan, updating the pixel sites and the various model parameters in a specific order. The scan consists of the following sampling processes:
(i) re-segment the image,
(ii) sample noise model parameters,
(iii) sample MRF model parameters,
(iv) sample the number of classes.
The first three steps are relatively straightforward. All labels and parameters are sampled from their conditional distributions. The first step consists of Gibbs sampling the label field. The conditional distributions may be found by applying Bayes' law to Eq. (5). The resulting distribution for the Isotropic MRF described in Section 2 is given by

$$p(x_s=c \mid y_s, x_{g_s}, \mu_c, \sigma_c, \alpha_c, \beta) \propto \frac{1}{\sqrt{2\pi\sigma_c^2 T_t}} \exp\Big\{-\frac{1}{T_t}\Big[\frac{1}{2}\Big(\frac{y_s-\mu_c}{\sigma_c}\Big)^2 + (\alpha_c+\beta V(c, x_{g_s}))\Big]\Big\}, \quad (11)$$
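To make the tempered label update concrete, here is a minimal Python sketch of a single-site Gibbs draw from the Isotropic conditional of Eq. (11). The function and variable names are illustrative, not taken from the authors' implementation, and the Potts potential follows the definition given in Section 2.1.

```python
import numpy as np

def potts_potential(c, neighbour_labels):
    """V(c, x_g): -1 for each neighbour equal to c, +1 otherwise, scaled by 1/4."""
    n = np.asarray(neighbour_labels)
    return 0.25 * np.where(n == c, -1.0, 1.0).sum()

def gibbs_sample_label(y_s, neighbour_labels, mu, sigma, alpha, beta, T, rng):
    """Draw a label for one site from the tempered conditional of Eq. (11).

    mu, sigma, alpha are length-k arrays of per-class parameters; T is the
    current annealing temperature T_t.
    """
    k = len(mu)
    log_p = np.empty(k)
    for c in range(k):
        energy = 0.5 * ((y_s - mu[c]) / sigma[c]) ** 2 \
                 + alpha[c] + beta * potts_potential(c, neighbour_labels)
        log_p[c] = -np.log(np.sqrt(2.0 * np.pi * sigma[c] ** 2 * T)) - energy / T
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    return rng.choice(k, p=p)

# Illustrative call: two classes, a bright pixel surrounded by class-1 labels.
rng = np.random.default_rng(0)
label = gibbs_sample_label(0.8, [1, 1, 0, 1],
                           mu=np.array([0.2, 0.7]), sigma=np.array([0.1, 0.1]),
                           alpha=np.zeros(2), beta=1.5, T=1.0, rng=rng)
print(label)
```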
where $T_t$ is the annealing temperature at iteration $t$ of the algorithm. The distribution for the GMRF is similar,

$$p(x_s=c \mid y_s, y_{\rho_s}, x_{g_s}, \mu_c, \sigma_c, \Theta_c, \beta) \propto \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\Big\{-\frac{1}{T_t}\Big[\frac{1}{2\sigma_c^2}\Big((y_s-\mu_c) - \sum_{q:\,s+q\in\rho_s}\theta_c^{(q)}[(y_{s+q}-\mu_c)+(y_{s-q}-\mu_c)]\Big)^2 + \beta V(c, x_{g_s})\Big]\Big\}. \quad (12)$$

Metropolis-Hastings sampling is used to update the noise and MRF model parameters. The proposal densities used are zero-mean Gaussian, with variances dependent on the parameter being sampled. The conditional distributions for the noise model parameters are found by multiplying the appropriate pseudo-likelihood terms by the model parameter priors. As specified in Sections 2.1 and 2.2, for both models non-informative priors are used for the $\mu$ and $\sigma$ parameters, whereas conjugate priors are used for the $\theta^{(q)}$ parameters of the GMRF. The resulting conditional distribution for the Isotropic case is

$$p(\mu_c, \sigma_c \mid Y, X) \propto \prod_{s:\,x_s=c} p(y_s \mid \mu_c, \sigma_c)\, p_r(\mu_c)\, p_r(\sigma_c) = \frac{1}{\sigma_c (2\pi\sigma_c^2 T_t)^{N_c/2}} \exp\Big\{-\sum_{s:\,x_s=c} \frac{1}{2T_t}\Big(\frac{y_s-\mu_c}{\sigma_c}\Big)^2\Big\}, \quad (13)$$

where $N_c = \#\{s: x_s=c\}$. For the $c$th GMRF texture model, the model parameter conditional distribution is given by

$$p(\mu_c, \sigma_c, \Theta_c \mid Y, X) \propto \prod_{s:\,x_s=c} p(y_s \mid \mu_c, \sigma_c, \Theta_c)\, p_r(\mu_c)\, p_r(\sigma_c)\, p_r(\Theta_c) = \frac{1}{\sigma_c (2\pi\sigma_c^2 T_t)^{N_c/2}} \exp\Big\{-\frac{1}{2T_t\sigma_c^2}\sum_{s:\,x_s=c}\Big((y_s-\mu_c) - \sum_{q:\,s+q\in\rho_s}\theta_c^{(q)}[(y_{s+q}-\mu_c)+(y_{s-q}-\mu_c)]\Big)^2\Big\}. \quad (14)$$

To sample the hyper-parameters of the prior on the label field, the Metropolis-Hastings algorithm is used to sample from an approximation to the full conditional distribution; the approximation used is again the pseudo-likelihood. The conditional distributions for the external field parameters $\{\alpha_c, c\in\Lambda\}$ used in the Isotropic model take uniform priors and are given by

$$p(\alpha_c, c\in\Lambda \mid X) = p(X \mid \alpha_c, c\in\Lambda)\, p_r(\alpha_c, c\in\Lambda) = \prod_{(c\in\Lambda,\,\forall x_g)} \left[\frac{\exp(-\frac{1}{T_t}[\alpha_c+\beta V(c, x_g)])}{\sum_{i\in\Lambda}\exp(-\frac{1}{T_t}[\alpha_i+\beta V(i, x_g)])}\right]^{N_{(c,x_g)}}, \quad (15)$$

where $N_{(c,x_g)} = \#\{s: x_s=c,\, x_{g_s}=x_g\}$.
It is theoretically possible to sample the $\beta$ parameter from its conditional distribution

$$p(\beta \mid X) = p(X \mid \beta)\, p_r(\beta) = \prod_{(c\in\Lambda,\,\forall x_g)} \left[\frac{\exp(-\frac{1}{T_t}\beta V(c, x_g))}{\sum_{i\in\Lambda}\exp(-\frac{1}{T_t}[\alpha_i+\beta V(i, x_g)])}\right]^{N_{(c,x_g)}}. \quad (16)$$

Unfortunately, under particular underlying label map configurations the posterior density for $\beta$ will not be proper:

$$\lim_{\beta\to-\infty} p(X=x', \Psi \mid Y=y) = \delta, \quad (17)$$

where $\delta$ is a positive constant. This is a direct consequence of approximating the likelihood by the pseudo-likelihood, and results in estimates for $\beta$ following a random walk toward $-\infty$. It may be possible to limit this problem by choosing a suitable prior, for example a Normal distribution centred on a rough estimate of $\beta$. This will be discussed in more detail in Section 4.
To sample the model order, reversible jumps are incorporated into the Markov chain. Reversible jumps were developed by Green [16] to allow a Metropolis-Hastings based algorithm to sample the model order, i.e. the number of classes, from the posterior distribution in a typical incomplete data problem [1]. The algorithm is designed to preserve detailed balance, thus ensuring the ergodicity of the Markov chain. For a Metropolis-Hastings sampling algorithm to maintain detailed balance, the old and new model spaces must be identical. But here a change of model order occurs, making the dimension of the model's parameter vector either increase or decrease, depending on whether the number of classes has risen or fallen. Such a difference in dimensionality means the model spaces are different. This is overcome by padding the parameter vectors with random variables, thus equalising their dimensions. The parameter spaces are still different and so, to allow for this, the mapping functions between old and new parameters must be such that the Radon-Nikodym derivative [17] of one model's measure with respect to the other is non-zero.
For example, if two models are considered, labelled $m_1$ and $m_2$, and $\Psi^{(1)}$ and $\Psi^{(2)}$ are their associated parameter vectors of dimension $n_1$ and $n_2$ respectively, then to jump between their parameter spaces requires the generation of random vectors $u^{(1)}$ and $u^{(2)}$ such that

$$d(\Psi^{(1)}) + d(u^{(1)}) = d(\Psi^{(2)}) + d(u^{(2)}),$$

where $d(\cdot)$ indicates the dimension of the enclosed vector. A mapping must also be defined so that $[\Psi^{(2)}, u^{(2)}] = f([\Psi^{(1)}, u^{(1)}])$, and an inverse must exist. Once the new parameters have been generated, any label variables may be reallocated, segmenting the data into the new number of classes. Again, this allocation must be time reversible. The steps just described complete the proposal generation step of the reversible-jump algorithm. The acceptance ratio is now calculated in the traditional Metropolis-Hastings manner, to preserve detailed balance. However, the Radon-Nikodym derivative (taking the form of a Jacobian determinant) is included, which compensates for the difference in probability measures between the proposed and original models. If we proceed with the above example, with $Y$ the observed data and $x^{(1)}$ and $x^{(2)}$ the two allocations of the hidden data labels, then the acceptance ratio for the reversible jump transition from model $m_1$ to $m_2$ is given by

$$\min\left\{1,\; \frac{\pi_{m_2}(X^{(2)}, \Psi^{(2)} \mid Y)}{\pi_{m_1}(X^{(1)}, \Psi^{(1)} \mid Y)}\, \frac{q(X^{(1)} \mid X^{(2)}, \Psi^{(1)})}{q(X^{(2)} \mid X^{(1)}, \Psi^{(2)})}\, \frac{q(u^{(1)})}{q(u^{(2)})}\, \left|\frac{\partial(\Psi^{(2)}, u^{(2)})}{\partial(\Psi^{(1)}, u^{(1)})}\right|\right\}, \quad (18)$$

where $\pi_{m_2}(X^{(2)}, \Psi^{(2)} \mid Y)/\pi_{m_1}(X^{(1)}, \Psi^{(1)} \mid Y)$ is the ratio between the posterior probabilities of the two models under consideration, $q(X^{(1)} \mid X^{(2)}, \Psi^{(1)})/q(X^{(2)} \mid X^{(1)}, \Psi^{(2)})$ is the ratio of the reverse and forward label allocation probabilities, and $q(u^{(1)})/q(u^{(2)})$ the ratio of the probabilities of generating the random vectors corresponding to the reverse and forward mapping functions. The final term $|\partial(\Psi^{(2)}, u^{(2)})/\partial(\Psi^{(1)}, u^{(1)})|$ is the Radon-Nikodym derivative between the two model probability measures; it corresponds to the Jacobian determinant generated from the mapping function $[\Psi^{(1)}, u^{(1)}] \to [\Psi^{(2)}, u^{(2)}]$.
This methodology was applied by Richardson and Green [18] to the problem of fitting a Gaussian mixture model with an unknown number of components to observed data. There, the posterior distribution was sampled using Gibbs and Metropolis-Hastings algorithms which updated the label allocation and model parameters in a given sequence. At the end of this scan through the model space, a reversible jump was incorporated which updated the number of mixture components. At each sweep the algorithm would randomly choose between proposing to increment or decrement the number of classes.
If the number of classes was to be incremented, then a class was randomly chosen to be split in two. New parameters were generated for each of the two new classes. The observed data samples previously allocated to that class were then reallocated to one of the two new classes by Gibbs sampling from their conditional distributions. If the number of classes was to decrease, then two classes were randomly chosen to be merged. A new parameter vector for the merged class was then generated as a function of the two old parameter vectors. All data samples allocated to either of the two classes being merged were then automatically assigned to the new merged class with probability one.
To summarise, a reversible jump algorithm consists of four steps (sketched in code after this list):
(i) decide whether to increase or decrease the number of classes,
(ii) generate the new model parameter vectors,
(iii) reallocate the hidden data labels,
(iv) accept or reject this proposal based on an acceptance ratio generated to ensure detailed balance is preserved.
The following two subsections describe how reversible jump algorithms may be used to allow the number of different label states (model order) to be sampled for the Isotropic and Gaussian MRF models described in Section 2.
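The following Python fragment is a structural sketch of one such split/merge sweep following Eq. (18) and the four steps above. It is not the authors' implementation: the `model` object and its `propose_split`, `propose_merge` and `log_posterior` helpers are hypothetical placeholders standing in for the problem-specific proposal generation, re-segmentation and posterior approximation described in Sections 3.1 and 3.2.

```python
import math
import random

def reversible_jump_sweep(state, model):
    """One generic split/merge proposal and accept/reject decision.

    `state` holds the current labels, parameters and class count.
    The (illustrative) proposal helpers return the candidate state together
    with log q(forward), log q(reverse) and the log-Jacobian of the mapping,
    i.e. the terms appearing in the acceptance ratio of Eq. (18).
    """
    go_up = random.random() < 0.5                        # step (i)
    proposal = (model.propose_split(state) if go_up       # steps (ii)-(iii)
                else model.propose_merge(state))
    new_state, log_q_forward, log_q_reverse, log_jacobian = proposal

    # Step (iv): Metropolis-Hastings acceptance with the dimension-matching
    # Jacobian; all terms are combined in the log domain for stability.
    log_ratio = (model.log_posterior(new_state) - model.log_posterior(state)
                 + log_q_reverse - log_q_forward + log_jacobian)
    if math.log(random.random()) < min(0.0, log_ratio):
        return new_state
    return state
```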
3.1. Reversible jumps for the Isotropic MRF

The approach to sampling the model order of an Isotropic Markov random field adopted here is similar to that adopted by Richardson and Green [18] when sampling the number of components of a mixture distribution. As discussed in the previous section, the problem revolves around proposing to split a region labelled by a single class, $c$, into a region composed of two classes, $c_1$ and $c_2$. Each class is defined by a parameter vector $\Psi_c=\{\mu_c, \sigma_c, \alpha_c\}$. When splitting one class into two, three new parameters need to be created. To ensure the model spaces remain of equal dimension, the parameter space of the original model is supplemented by three random variables, $u_1, u_2, u_3$. Thus, when splitting state $c$ into $c_1$ and $c_2$, a transform between $[\mu_c, \sigma_c, \alpha_c]$ and $[\mu_{c_1}, \sigma_{c_1}, \alpha_{c_1}, \mu_{c_2}, \sigma_{c_2}, \alpha_{c_2}]$ may be defined with three degrees of freedom. Hence, the three parameters for each of the two new classes may be derived from the old values $\mu_c, \sigma_c, \alpha_c$ and the random variables $u_1, u_2, u_3$. The new parameters are calculated to preserve the 0th, 1st and 2nd order moments across the transformation. The resulting mapping
functions are given in the following set of equations:

$$\alpha_{c_1} = \alpha_c - T_t \ln(u_1), \qquad \alpha_{c_2} = \alpha_c - T_t \ln(1-u_1),$$
$$\mu_{c_1} = \mu_c - u_2\,\sigma_c\sqrt{\frac{1-u_1}{u_1}}, \qquad \mu_{c_2} = \mu_c + u_2\,\sigma_c\sqrt{\frac{u_1}{1-u_1}},$$
$$\sigma_{c_1}^2 = u_3(1-u_2^2)\,\sigma_c^2\,\frac{1}{u_1}, \qquad \sigma_{c_2}^2 = (1-u_3)(1-u_2^2)\,\sigma_c^2\,\frac{1}{1-u_1}. \quad (19)$$

The choice of the random variables $u_1, u_2, u_3$ must be such that $u_1, u_2, u_3 \in (0,1]$. For this reason, and to allow a bias towards splitting the data into roughly equal partitions, beta distributions are used to sample $u_1, u_2, u_3$. The Jacobian determinant of these mapping functions, needed in the calculation of the acceptance ratio, is

$$\left|\frac{\partial(\Psi_{c_1}, \Psi_{c_2})}{\partial(\Psi_c, u_1, u_2, u_3)}\right| = \frac{T_t\,\sigma_c^2}{u_1^2(1-u_1^2)\sqrt{u_3(1-u_3)}}. \quad (20)$$

The pixel sites labelled by the class or classes selected to be split or merged must be reallocated on the basis of the new parameters generated. If a merge is being proposed, then all sites allocated to the two old classes are relabelled with the new merged class label, with probability one. The difficulty occurs when splitting one class into two. If a reasonable probability of acceptance is to be maintained, the proposed reallocation of labels to sites needs to be completed in such a way as to ensure that the posterior probability of that particular segmentation is relatively high. To achieve this, it would be desirable to propose a reallocation of labels by Gibbs sampling from the conditional distributions given by Eq. (11). Unfortunately, this is not possible, since the allocation of classes in the neighbourhood $g_s$ on which the probabilities will be conditioned is not available. To overcome this problem, a deterministic 'guess' is made at the future allocation of each pixel labelled by the class to be split. The 'guess' is made at each pixel site using a distance measure between the model distribution functions of the new classes and the histogram of observed grayscale values of the surrounding region of pixels. The measure used here has a precedent in this type of algorithm: the Kolmogorov-Smirnov distance was used by Geman et al. [5] to allocate pixel sites between classes based on grayscale values, or on particular transformations of grayscale indicative of texture type.
The Kolmogorov-Smirnov distance is a measure of the closeness of two distribution functions. It may be applied to two samples of data to ascertain whether they have been drawn independently from the same distribution. If $\hat F_1(i)$ and $\hat F_2(i)$ are two independent sample distribution functions (i.e. histograms) defined by

$$\hat F(i) = \frac{1}{n}\,\#\{i : y_i \le i\}, \quad (21)$$

where $n$ is the number of data samples $y_i$, $1\le i\le n$, then the Kolmogorov-Smirnov distance is the maximum difference between the distributions over all $i$,

$$d(y^{(1)}, y^{(2)}) = \max_i\, |\hat F_1(i) - \hat F_2(i)|. \quad (22)$$
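A minimal Python sketch of the two-sample Kolmogorov-Smirnov distance of Eq. (22), computed from grayscale histograms as in the windowed comparison described above. The bin count and window size below are taken from the experiments in Section 4; the function and variable names are otherwise illustrative.

```python
import numpy as np

def ks_distance(sample1, sample2, n_bins=40, value_range=(0.0, 1.0)):
    """Maximum absolute difference between two empirical distribution functions."""
    f1, _ = np.histogram(sample1, bins=n_bins, range=value_range)
    f2, _ = np.histogram(sample2, bins=n_bins, range=value_range)
    cdf1 = np.cumsum(f1) / max(len(sample1), 1)
    cdf2 = np.cumsum(f2) / max(len(sample2), 1)
    return np.abs(cdf1 - cdf2).max()

# Two 9x9 windows drawn from well-separated Gaussians give a large distance.
rng = np.random.default_rng(1)
a = rng.normal(0.3, 0.05, size=81).clip(0, 1)
b = rng.normal(0.7, 0.05, size=81).clip(0, 1)
print(ks_distance(a, b))
```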
The Kolmogorov-Smirnov distance is useful for two reasons: its value is independent of the underlying distribution function, and it is unaffected by outlying data values. Hence, it may be expected to model the label field of an Isotropic MRF closely, as was found by Geman et al. [5]. The deterministic estimation of the new label allocation may now be used as the basis for the Gibbs sampling of pixels from their full conditional distributions.
The acceptance ratio for splitting region $c$ into $c_1$ and $c_2$, thus increasing the number of classes from $k$ to $k^+$, may now be given by

$$\frac{p(X=x^+, \Psi^+, k^+ \mid Y=y)}{p(X=x, \Psi, k \mid Y=y)}\; \frac{1}{q_b(u_1)\,q_b(u_2)\,q_b(u_3)}\; \frac{1}{q(\text{re-segmentation})}\; \left|\frac{\partial(\Psi_{c_1}, \Psi_{c_2})}{\partial(\Psi_c, u_1, u_2, u_3)}\right|, \quad (23)$$

where $p(X=x, \Psi, k \mid Y=y)$ is the approximation to the posterior density defined by Eq. (5), $q_b(u)$ is the probability of proposing the random variable $u$ drawn from a beta distribution, $q(\text{re-segmentation})$ is the probability of the reallocation of the pixels labelled by $c$ into regions labelled by $c_1$ and $c_2$, and the final term is the Jacobian determinant given by Eq. (20). The acceptance ratio for the jump combining two states into one is simply the inverse of that given in Eq. (23); in this case $q(\text{re-segmentation})$ is found in retrospect, and $u_1, u_2, u_3$ are calculated by back-substitution into Eq. (19).

3.2. Reversible jumps for the Gaussian MRF

Proposing a move when using a reversible jump to sample the model order of a hierarchical GMRF is more complex than generating a similar proposal when sampling the Isotropic MRF, for the simple reason that the model parameter vector $[\mu, \sigma, \Theta]$ is longer. When splitting a GMRF class, two new GMRF parameter vectors
need to be generated. The number of degrees of freedom for the generation of the new vectors is therefore dependent on the size of the old parameter vector and, in particular, upon the size of the GMRF neighbourhood. If the dimension of the correlation coefficient vector associated with this neighbourhood is denoted $N_\theta$, then the number of random variables needed to equalise the two model parameter vectors will be $N_\theta+2$, which is equivalent to the number of degrees of freedom in generating the new model parameters. If a large degree of freedom is allowed in generating a large number of parameters, then the likelihood of proposing a reversible jump that has a reasonable chance of being accepted will be small, and so convergence of the Markov chain will be slow. To alleviate this problem, the model order is sampled from a marginal density, and the method of composition sampling [1] is then employed to obtain the remaining parameters. Composition sampling requires the availability of factorisations of the posterior density. The marginal we propose using eliminates all but one correlation parameter from the posterior density,

$$p(\Theta, \mu, \sigma, X \mid Y) = p(\Theta^{(-i)} \mid \theta^{(i)}, \mu, \sigma, X, Y)\; p(\theta^{(i)}, \mu, \sigma, X \mid Y), \quad (24)$$

where $\Theta^{(-i)} = [\Theta_c^{(-i)};\, c\in\Lambda]$ and $\Theta_c^{(-i)} = [\theta_c^{(1)},\ldots,\theta_c^{(i-1)}, \theta_c^{(i+1)},\ldots,\theta_c^{(N_{AR})}]$. Similarly, $\theta^{(i)} = [\theta_c^{(i)};\, c\in\Lambda]$, $\mu = [\mu_c;\, c\in\Lambda]$ and $\sigma = [\sigma_c;\, c\in\Lambda]$.
The above factorisation of the posterior density can only be achieved by incorporating the pseudo-likelihood approximation into the marginal. Applying Bayes' law, the marginal may be expressed as

$$p(\theta^{(i)}, \mu, \sigma, X \mid Y) \propto \mathrm{PL}(Y \mid \lambda, \mu, \sigma, \theta^{(i)}, X) \Big(\prod_{s\in\Omega} p(x_s \mid x_{g_s})\Big)\, p_r(\theta^{(i)}, \mu, \sigma). \quad (25)$$

All the terms in the above equation are easily defined except the marginal of the pseudo-likelihood, $\mathrm{PL}(Y \mid \lambda, \mu, \sigma, \theta^{(i)}, X)$. This is found by first constructing the full pseudo-likelihood and multiplying it by priors on all the parameters to be eliminated in forming the marginal. These parameters are then integrated out of the resulting expression to form the marginal. To facilitate the analytical integration over the correlation parameter space, conjugate priors are used which take the form of Normal distributions, $N(0, \lambda^2)$. The pseudo-likelihood is therefore marginalised by evaluating the following product of integrals:

$$\mathrm{PL}(Y \mid \lambda, \mu, \sigma, \theta^{(i)}, X) = \prod_{c\in\Lambda} \int_{\Theta_c^{(-i)}} \Big(\prod_{s:\,x_s=c} p(y_s \mid \mu_c, \sigma_c, \Theta_c, y_{g_s})\Big)\, p(\Theta_c^{(-i)})\, d\Theta_c^{(-i)} = \prod_{c\in\Lambda} \frac{(2\pi\sigma_c^2 T_t)^{\frac{-N_c+N_\theta-1}{2}}\,(2\pi\lambda^2 T_t)^{\frac{-N_\theta+1}{2}}}{|D_c|^{\frac12}} \exp\Big\{-\frac{1}{2\sigma_c^2 T_t}\, h(\theta_c^{(i)} \mid \mu_c, \lambda, Y)\Big\}, \quad (26)$$

where

$$h(\theta_c^{(i)} \mid \mu_c, Y) = -v^T [D_c]^{-1} v + \sum_s \big(y'_s - \theta_c^{(i)}[y'_{s+q_i} + y'_{s-q_i}]\big)^2, \quad (27)$$

$$v = \Big[\sum_s \big(y'_s - \theta_c^{(i)}[y'_{s+q_i} + y'_{s-q_i}]\big)\, y_{(\rho-i)_s}\Big], \quad (28)$$

$$D_c = \Big[\lambda I + \sum_s y^T_{(\rho-i)_s}\, y_{(\rho-i)_s}\Big], \quad (29)$$

$$y^T_{(\rho-i)_s} = \big[(y'_{s+q_1}+y'_{s-q_1}), \ldots, (y'_{s+q_{i-1}}+y'_{s-q_{i-1}}), (y'_{s+q_{i+1}}+y'_{s-q_{i+1}}), \ldots, (y'_{s+q_{N_\theta}}+y'_{s-q_{N_\theta}})\big], \quad (30)$$

$$y'_s = y_s - \mu_c. \quad (31)$$

We now have an expression for the marginal likelihood which, when multiplied by the relevant priors, gives the basis for the reversible jump sampling of the model order. To complete the algorithm, methods to propose parameters for the new classes and to propose new segmentations for the marginalised GMRF are required.
To allow a one-to-one mapping between old and new parameter vectors, random variables are introduced. Transforms need to be defined both for the splitting of one state into two, and for the inverse, the combining of two into one. The parameter vector for a single state, including the additional random variables, is $[\mu_x, \sigma_x, \theta_x^{(i)}, u^m_m, u^e_m, u^m_s, u^e_s, u^m_t, u^e_t]$. This may be split into two, giving $[\mu_{x_1}, \mu_{x_2}, \sigma_{x_1}, \sigma_{x_2}, \theta_{x_1}^{(i)}, \theta_{x_2}^{(i)}, a_m, a_s, a_t]$.
In the above vectors, $u^m_m, u^e_m, u^m_s, u^e_s, u^m_t, u^e_t, a_m, a_s, a_t$ are zero-mean Gaussian random variables with differing variances. The split move is defined by the transformations

$$\mu_{x_1} = \mu_x - u^m_m, \qquad \mu_{x_2} = \mu_x + u^m_m + u^e_m,$$
$$\sigma_{x_1} = \sigma_x - u^m_s, \qquad \sigma_{x_2} = \sigma_x + u^m_s + u^e_s,$$
$$\theta_{x_1}^{(i)} = \theta_x^{(i)} - u^m_t, \qquad \theta_{x_2}^{(i)} = \theta_x^{(i)} + u^m_t + u^e_t. \quad (32)$$

The combine move is designed to preserve the 1st and 2nd order central moments, in a similar way to that used in the Isotropic case and by Richardson and Green [18] when sampling mixture distributions. In addition, a small
perturbation of the parameters is allowed, given by the random variables $a_m, a_s, a_t$. The resulting transforms are

$$\mu_x = \frac{1}{N_x}\,[\mu_{x_1} N_{x_1} + \mu_{x_2} N_{x_2}] + a_m, \quad (33)$$

$$\sigma_x^2 = \frac{1}{N_x}\,[(\sigma_{x_1}^2+\mu_{x_1}^2) N_{x_1} + (\sigma_{x_2}^2+\mu_{x_2}^2) N_{x_2}] - \mu_x^2 - a_s, \quad (34)$$

$$\theta_x^{(i)} = \frac{1}{N_x}\,[\theta_{x_1}^{(i)} N_{x_1} + \theta_{x_2}^{(i)} N_{x_2}] + a_t, \quad (35)$$

where $N_x$ is the number of pixels assigned to class $x$, $N_{x_1}$ to class $x_1$ and $N_{x_2}$ to class $x_2$.
The accepted reversible jump framework [16,18] for proposing a split is to begin by generating new parameters for the two new states, and then to Gibbs sample the data allocated to the split state into each of the two new states, based on their full conditionals. In our case we may generate new parameters in the traditional manner. However, we cannot Gibbs sample the data labels, because of the non-causal nature of the GMRF: the conditionals depend on data labels yet to be allocated. To overcome this problem we use a refinement of the methodology described in the previous section for sampling the model order of the Isotropic MRF. A likely reallocation of the label field is estimated in a deterministic fashion. This is then used to condition the full conditional distributions for the label parameters, to allow a Gibbs sampling sweep of all sites labelled with the class being split.
To estimate a likely label field configuration for the GMRF, a square window, denoted $W_s$, is considered around each pixel in the image. The pseudo-likelihoods of allocating all the pixels in the window to each of the two new classes are calculated, and the pixel is allocated to the class with the larger pseudo-likelihood. The pseudo-likelihood for each window is calculated as in Eq. (26), except that the product is over all pixels in the window rather than over all pixels labelled by the state. Hence, the pseudo-likelihood of the pixel at site $s$ being labelled by class $c$, given its surrounding window, is

$$\mathrm{PL}_s(\{y_r;\, r\in W_s\} \mid \lambda, \mu_c, \sigma_c, \theta_c^{(i)}, \{y_{g_r};\, r\in W_s\}) = \Big(\frac{1}{2\pi\sigma_c^2 T_t}\Big)^{\frac{N_{W_s}-(N_\theta-1)}{2}} \frac{1}{|D_{W_s}|^{\frac12}} \exp\Big\{-\frac{1}{2\sigma_c^2 T_t}\, h_{W_s}(\theta_c^{(i)} \mid \mu_c, Y)\Big\}, \quad (36)$$

where

$$h_{W_s}(\theta_c^{(i)} \mid \mu_c, Y) = -v^T [D_{W_s}]^{-1} v + \sum_{r\in W_s}\big(y'_r - \theta_c^{(i)}[y'_{r+q_i}+y'_{r-q_i}]\big)^2,$$
$$v = \Big[\sum_{r\in W_s}\big(y'_r - \theta_c^{(i)}[y'_{r+q_i}+y'_{r-q_i}]\big)\, y_{(\rho-i)_r}\Big], \qquad D_{W_s} = \Big[\lambda I + \sum_{r\in W_s} y^T_{(\rho-i)_r}\, y_{(\rho-i)_r}\Big],$$

$N_{W_s}$ is the size of the window surrounding pixel $s$, and $y_{(\rho-i)_r}$ and $y'_r$ are defined as in Eqs. (30) and (31).
Using this estimation of the new label field, $X^{(e)}$, the proposal label field can be Gibbs sampled using the marginalised conditional distributions. The conditional distributions are given by

$$p(x_s=c \mid \lambda, \mu_c, \sigma_c, y_s, y_{\rho_s}, x^{(e)}_{g_s}) \propto p(y_s \mid \lambda, \mu_c, \sigma_c, \theta_c^{(i)}, y_{\rho_s})\; p(x_s=c \mid \beta, x^{(e)}_{g_s}), \quad (37)$$

where $g_s$ is the neighbourhood of interactions within the label field. The marginalised likelihood term in this equation may be evaluated as

$$p(y_s \mid \lambda, \mu_c, \sigma_c, \theta_c^{(i)}, y_{\rho_s}) = \int p(y_s \mid \mu_c, \sigma_c, \Theta_c, y_{\rho_s})\, p(\Theta_c^{(-i)})\, d\Theta_c^{(-i)} = \frac{(2\pi\sigma_c^2)^{\frac{N_\theta-1}{2}}}{(2\pi\sigma_c^2)^{\frac12}\,|D_s|^{\frac12}}\, \exp\Big\{-\frac{1}{2\sigma_c^2}\, h_s(\theta_c^{(i)} \mid \mu_c, y_{\rho_s})\Big\}, \quad (38)$$

where

$$h_s(\theta_c^{(i)} \mid \mu_c, y_{\rho_s}) = \big[y'_s - \theta_c^{(i)}[y'_{s+q_i}+y'_{s-q_i}]\big]^2\,\big[1 - y^T_{(\rho-i)_s}[D_s]^{-1} y_{(\rho-i)_s}\big], \qquad D_s = \big[y^T_{(\rho-i)_s}\, y_{(\rho-i)_s} + \lambda I\big], \quad (39)$$

and $y_{(\rho-i)_s}$ and $y'_s$ are as defined in Eqs. (30) and (31).
The prior on $X$ comprises an Ising model; hence the conditional for $x_s$, given its vector of neighbourhood labels $x^{(e)}_{g_s}$, is a conditional Gibbs distribution,

$$p(x_s=c \mid \beta, x^{(e)}_{g_s}) = \frac{\exp\{-\beta[\delta V_f(x_s=c \mid x^{(e)}_{g_s}) + V_b(x_s=c \mid x^{(e)}_{g_s})]\}}{\sum_{x\in\Lambda}\exp\{-\beta[\delta V_f(x_s=x \mid x^{(e)}_{g_s}) + V_b(x_s=x \mid x^{(e)}_{g_s})]\}}. \quad (40)$$
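To make the two-stage proposal concrete, here is a hedged Python sketch of the deterministic 'guess' step: each pixel of the class being split is assigned to whichever new class gives its surrounding window the higher score. The `window_log_score` helper below is only an illustrative stand-in for the window pseudo-likelihood of Eq. (36), and all names are hypothetical rather than the authors' own.

```python
import numpy as np

def window_log_score(window, mu, sigma):
    """Illustrative stand-in for Eq. (36): an i.i.d. Gaussian log-likelihood
    of the window's grayscale values under the candidate class."""
    return -0.5 * np.sum(((window - mu) / sigma) ** 2) \
           - window.size * np.log(sigma)

def deterministic_guess(image, labels, split_class, params1, params2, half=8):
    """Assign each pixel of `split_class` to new class 1 or 2 by comparing the
    scores of its surrounding window under each candidate class."""
    guess = np.zeros(image.shape, dtype=int)
    for (i, j) in zip(*np.nonzero(labels == split_class)):
        win = image[max(i - half, 0):i + half, max(j - half, 0):j + half]
        s1 = window_log_score(win, *params1)
        s2 = window_log_score(win, *params2)
        guess[i, j] = 1 if s1 >= s2 else 2
    return guess

# Illustrative call on a toy image, with (mu, sigma) pairs for the two classes.
img = np.random.default_rng(2).random((32, 32))
lab = np.zeros((32, 32), dtype=int)
print(np.unique(deterministic_guess(img, lab, 0, (0.3, 0.1), (0.7, 0.1))))
```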
The acceptance ratio for the resulting segmentation and parameter estimates, for splitting region $c$ into $c_1$ and $c_2$, is given by $\min[1, R]$, where

$$R = \frac{p(X=x^+, \Psi^+, k^+ \mid Y=y)}{p(X=x, \Psi, k \mid Y=y)}\; \frac{q(a_m)\,q(a_s)\,q(a_t)}{q(u^m_m)\,q(u^e_m)\,q(u^m_s)\,q(u^e_s)\,q(u^m_t)\,q(u^e_t)}\; \frac{1}{q(\text{re-segmentation})}\; \left|\frac{\partial(\Psi_{c_1}, \Psi_{c_2}, a_m, a_s, a_t)}{\partial(\Psi_c, u^m_m, u^e_m, u^m_s, u^e_s, u^m_t, u^e_t)}\right|, \quad (41)$$

where $p(X=x, \Psi, k \mid Y=y)$ is the approximation to the posterior density defined by Eq. (10), $q(u)$ is the probability of proposing the random variable $u$ drawn from a Normal distribution, $q(a)$ is the retrospective probability of drawing the random variable $a$ from a Gaussian distribution, $q(\text{re-segmentation})$ is the probability of the reallocation of the pixels labelled by $c$ into regions labelled by $c_1$ and $c_2$, and the final term is the Jacobian determinant given by Eq. (20). If the proposed split (or merge) is accepted, then the remaining $\theta$ parameters are repeatedly sampled from their full conditional distributions given by Eq. (14).
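Before turning to the experiments, the overall algorithm can be summarised as one annealed sweep per iteration over the four sampling steps of Section 3. The Python fragment below is a structural sketch only; the `model` helpers are hypothetical placeholders for the samplers described above, and the schedule is supplied externally (see the next section).

```python
def annealed_mcmc_run(state, model, schedule):
    """Sketch of the overall scan: one sweep per iteration at temperature T_t.

    `model` bundles the samplers of Section 3 (names illustrative);
    `schedule` is an iterable of temperatures, one per iteration.
    """
    for temperature in schedule:
        state = model.gibbs_resegment(state, temperature)         # (i)  label field
        state = model.mh_update_noise_params(state, temperature)  # (ii) mu_c, sigma_c (and theta for GMRFs)
        state = model.mh_update_mrf_params(state, temperature)    # (iii) alpha_c, beta if sampled
        state = model.reversible_jump_sweep(state, temperature)   # (iv) number of classes k
    return state
```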
4. Experimental results

There has been much debate on how convergence relates to the annealing schedule [19,20]. Although slowly cooling logarithmic schedules are generally considered more robust, we have adopted a geometric schedule. The principal reason for this choice is simple: the complexity of the algorithms makes only a relatively small number of iterations possible within a reasonable time-span, hence a fast annealing schedule is
Fig. 1. A 200-iteration experiment on an image with four classes, beginning the algorithm with an arbitrary initial guess of two classes.
necessary. The schedule is given by

$$T_t = (1+a_1)\,a_2^{\,(1 - t/N_t)}, \quad (42)$$
where $t$ is the iteration number, $N_t$ is the total number of iterations, and $a_1$ and $a_2$ are constants (arbitrarily set to 0.1 and 10.0, respectively).
The isotropic segmentation algorithm described in Section 3.1 has been applied to various computer-synthesised mosaics. For the purposes of these experiments $\beta$, the double-pixel clique coefficient, which prescribes the interaction strength between pixels within the image, was set a priori to a value of 1.5. In proposing a reversible jump, the Kolmogorov-Smirnov distances were calculated using 9x9 windows to generate 40-bin histograms of the pixel grayscale distribution functions. Fig. 1 demonstrates the convergence of a 200-iteration experiment on a four-class grayscale image with additive Gaussian noise.
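For reference, the two temperature schedules used in these experiments can be written in a few lines of Python. The exponent of the geometric schedule, Eq. (42), is an assumption based on the reconstructed formula above; the linear schedule corresponds to Eq. (43) used later for the GMRF experiments.

```python
def geometric_schedule(t, n_iter, a1=0.1, a2=10.0):
    """Geometric annealing temperature, Eq. (42) (reconstructed form)."""
    return (1.0 + a1) * a2 ** (1.0 - t / n_iter)

def linear_schedule(t, n_iter, t_max=2.0, t_min=0.1):
    """Linear schedule, Eq. (43), from t_max down to t_min."""
    return (n_iter - t) / n_iter * (t_max - t_min) + t_min

# Temperatures at the start, middle and end of a 200-iteration run.
print([round(geometric_schedule(t, 200), 2) for t in (0, 100, 200)])
```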
The classes are well separated, and the algorithm can be seen to converge quickly to the correct segmentation. Fig. 2 shows the convergence of a 200-iteration run on a six-class image whose within-class densities overlap more strongly. The converged segmentation gives a good representation of the image after so few iterations, and the number of classes has been correctly diagnosed. To demonstrate the stability of this algorithm with respect to starting parameter values, Fig. 3 shows graphically the convergence of the algorithm, in terms of the number of states, from starting values of between 2 and 10. For reference, the plot is overlaid with the annealing schedule, and it can be observed that the algorithm converges within 150 iterations from all starting points. As described in Section 3, it is possible to estimate the MRF hyper-parameter $\beta$ from its posterior distribution. Unfortunately, the use of the pseudo-likelihood approximation for the MRF partition function makes this distribution improper under certain MRF configurations.
Fig. 2. A 200-iteration experiment on an image with six classes, beginning the algorithm with an arbitrary initial guess of two classes.
Fig. 3. Graph demonstrating the convergence of the algorithm from various starting configurations.
Fig. 4. (a) Graph showing convergence or otherwise of the beta parameter; (b) converged segmentation using the constrained algorithm.
This is shown in Fig. 4a, where the unconstrained graph shows the drift of the $\beta$ parameter as its posterior distribution becomes improper. Since this only occurs after around 300 iterations, once the MRF has converged to the correct number of states and the segmentation has become largely stable, it is only necessary to limit this drift to ensure that the MAP estimate remains stable as the temperature tends to zero. The drift may be constrained simply by imposing a prior on the $\beta$ parameter: in this case a Normal distribution was chosen,
$N[1.0, 0.3]$. Although the eventual estimate for $\beta$ will be inaccurate, due to the use of the pseudo-likelihood estimate, the segmentation will remain stable until it has converged or a minimum temperature is reached. The converged segmentation for the constrained algorithm is shown in Fig. 4b. The results of a 500-iteration run on a more complex grayscale image are shown in Fig. 5. The original grayscale distributions of the five classes are far closer than in the previous example (as shown by the grayscale
Fig. 5. A 500-iteration experiment on an image with five classes, beginning the algorithm with an arbitrary initial guess of six classes.
histogram), but the algorithm has still converged to a good estimate of the underlying image. A greyscale image of a house is segmented in Fig. 6. The algorithm was repeated from various starting conditions, and the results reached remained consistent. It could be argued that a better segmentation was arrived at in the intermediate steps, but the algorithm then fits to a statistically, if not visually, better model in the final
iterations. The possibility of using an information criterion to arrive at a more useful segmentation could be considered. The hierarchical GMRF segmentation algorithm described in Section 3.2 has been tested on various computer-generated mosaics of GMRFs. The neighbourhood size of each GMRF required ten $\theta$ coefficients and so was of type $n=4$ [21].
Fig. 6. A 400-iteration experiment on a greyscale image of a house. The algorithm begins with an arbitrary initial guess of six classes and consistently converges to five.
An example of such a test is shown in Fig. 7. Again the $\beta$ parameter was set a priori to 1.5. The algorithm was run for 600 iterations, from a starting temperature of 2 to a finishing temperature of 0.1. The temperature schedule for the experiment was simply linear,

$$T_t = \frac{(N_t - t)}{N_t}\,(T_{\max} - T_{\min}) + T_{\min}, \quad (43)$$

where $t$ is the iteration number, $N_t$ is the total number of iterations, and $T_{\min}$ and $T_{\max}$ are the minimum and maximum temperatures of the schedule. When estimating a likely reallocation of the label field during reversible jump proposal generation, windows of 16x16 pixels were used.
For all these simulations, it is interesting to observe how, at high temperatures, the algorithms fit the model to
Fig. 7. A 600-iteration experiment on a four-state synthetic GMRF image. The algorithm begins with an arbitrary initial guess of three classes and converges to four.
the gray-level histogram of the images. Only as the temperature falls to around the critical temperature [22] of the system do the spatial correlations in the image
begin to be modelled. This is a direct consequence of using a Potts model for the configuration of the label field.
5. Conclusion

In this paper we have presented unsupervised segmentation algorithms for both noisy and textured images. The model parameters, number of classes and pixel labels are all directly sampled from the posterior distribution using an MCMC algorithm. A single parameter, which defines the overall strength of neighbouring pixel interactions within the image, is defined prior to seeing the data (or, at least, an informative prior distribution for it needs to be defined). Excepting this, the algorithms are completely unsupervised. Experimental results have been presented in which synthesised images are rapidly segmented to give accurate estimates of the original underlying images.
References
[1] M.A. Tanner, Tools for Statistical Inference, Springer, Berlin, 1993.
[2] B.S. Manjunath, R. Chellappa, Unsupervised texture segmentation using Markov random fields, IEEE Trans. Pattern Anal. Machine Intell. 13 (5) (1991) 478-482.
[3] F.S. Cohen, Z. Fan, Maximum likelihood unsupervised texture segmentation, CVGIP: Graphical Models and Image Processing 54 (3) (1992) 239-251.
[4] H.H. Nguyen, P. Cohen, Gibbs random fields, fuzzy clustering, and the unsupervised segmentation of images, CVGIP: Graphical Models and Image Processing 55 (1) (1993) 1-19.
[5] C. Graffigne, D. Geman, S. Geman, P. Dong, Boundary detection by constrained optimization, IEEE Trans. Pattern Anal. Machine Intell. 12 (7) (1990) 609-628.
[6] C. Kervrann, F. Heitz, A Markov random field model-based approach to unsupervised texture segmentation using local and global statistics, IEEE Trans. Image Processing 4 (6) (1995) 856-862.
[7] D.K. Panjwani, G. Healey, Markov random field models for unsupervised segmentation of textured color images, IEEE Trans. Pattern Anal. Machine Intell. 17 (10) (1995) 939-954.
[8] C.S. Won, H. Derin, Unsupervised segmentation of noisy and textured images using Markov random fields, CVGIP: Graphical Models and Image Processing 54 (4) (1992) 308-328.
[9] J.W. Modestino, D.A. Langan, J. Zhang, Cluster validation for unsupervised stochastic model-based image segmentation, Proc. ICIP94, 1994, pp. 197-201.
[10] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Machine Intell. 6 (6) (1984) 721-741.
[11] J. Besag, Statistical analysis of non-lattice data, The Statistician 24 (3) (1975) 179-195.
[12] J.M. Bernardo, A.F.M. Smith, Bayesian Theory, Wiley, New York, 1994.
[13] J. Besag, On the statistical analysis of dirty pictures, J. Roy. Statist. Soc. Ser. B 48 (1986) 259-302.
[14] B. Chalmond, An iterative Gibbsian technique for reconstruction of M-ary images, Pattern Recognition 22 (6) (1989) 747-761.
[15] L. Tierney, Markov chains for exploring posterior distributions, Ann. Statist. 22 (5) (1994) 1701-1762.
[16] P.J. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82 (4) (1995) 711-732.
[17] P. Billingsley, Measure Theory, Addison-Wesley, Reading, MA, 1960.
[18] S. Richardson, P.J. Green, On Bayesian analysis of mixtures with an unknown number of components, J. Roy. Statist. Soc. Ser. B 59 (4) (1997) 731-792.
[19] R. Srichander, Efficient schedules for simulated annealing, Eng. Optim. 24 (1995) 161-176.
[20] P.N. Strenski, S. Kirkpatrick, Analysis of finite length annealing schedules, Algorithmica 6 (1991) 346-366.
[21] R. Chellappa, Two-dimensional discrete Gaussian Markov random field models for image processing, in: L.N. Kanal, A. Rosenfeld (Eds.), Progress in Pattern Recognition, vol. 2, Elsevier, Amsterdam, 1985, pp. 79-112.
[22] R.J. Baxter, Exactly Solved Models in Statistical Mechanics, Academic Press, London, 1982.
About the Author: SIMON A. BARKER received his M.Eng. degree from the University of Surrey, UK, in 1989 and recently completed studying towards his Ph.D. degree at the Cambridge University Engineering Department. He is now a member of the Core Research Group of Lernout & Hauspie Speech Products, Burlington, Massachusetts, US. His current research interests include image segmentation, Markov random fields, model selection techniques and acoustic modelling.

About the Author: PETER J.W. RAYNER received the M.A. degree from Cambridge University, UK, in 1968 and the Ph.D. degree from Aston University in 1969. Since 1968 he has been with the Department of Engineering at Cambridge University and is Head of the Signal Processing Research Group. In 1998 he was appointed to an ad hominem Professorship in Signal Processing. He teaches courses in random signal theory, digital signal processing, and image processing. His current research interests include image sequence restoration, audio restoration, non-linear estimation and detection, and time series modelling and classification.
Pattern Recognition 33 (2000) 603–616
An A* perspective on deterministic optimization for deformable templates
A.L. Yuille*, James M. Coughlan
Smith-Kettlewell Eye Research Institute, San Francisco, CA 94115, USA
Received 15 March 1999
Abstract

Many vision problems involve the detection of the boundary of an object, like a hand, or the tracking of a one-dimensional structure, such as a road in an aerial photograph. These problems can be formulated in terms of Bayesian probability theory and hence expressed as optimization problems on trees or graphs. These optimization problems are difficult because they involve search through a high-dimensional space corresponding to the possible deformations of the object. In this paper, we use the theory of A* heuristic algorithms (Pearl, Heuristics, Addison-Wesley, Reading, MA, 1984) to compare three deterministic algorithms (Dijkstra, Dynamic Programming, and Twenty Questions) which have been applied to these problems. We point out that the first two algorithms can be thought of as special cases of A* with implicit choices of heuristics. We then analyze the twenty questions, or minimum entropy, algorithm which has recently been developed by Geman and Jedynak (Geman and Jedynak, IEEE Trans. Pattern Anal. Mach. Intell. 18 (1) (1996) 1–14) as a highly effective, and intuitive, tree search algorithm for road tracking. First we discuss its relationship to the focus of attention planning strategy used on causal graphs, or Bayes nets (Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Mateo, CA, 1988). Then we prove that, in many cases, twenty questions is equivalent to an algorithm, which we call A+, which is a variant of A*. Thus all three deterministic algorithms are either exact, or approximate, versions of A* with implicit heuristics. We then discuss choices of heuristics for optimization problems and contrast the relative effectiveness of different heuristics. Finally, we briefly summarize some recent work (Yuille and Coughlan, Proceedings NIPS'98, 1998; Yuille and Coughlan, Trans. Pattern Anal. Mach. Intell. (1999), submitted for publication) which shows that the Bayesian formulation can lead to a natural choice of heuristics which will be very effective with high probability. © 2000 Published by Elsevier Science Ltd. All rights reserved.

Keywords: Visual search; Deformable templates; Bayesian models; A* algorithms; Twenty questions; Dijkstra; Dynamic programming
1. Introduction

A promising approach to object detection and recognition assumes that we represent the objects by deformable templates [1–8]. These deformable templates have been successfully applied in such special purpose domains as, for example, medical images [6], face recognition
* Corresponding author. Tel.: +415-561-1620; fax: +415-561-1610.
[1,4,7,8], and galaxy detection [5]. Deformable templates specify the shape and intensity properties of the objects and are defined probabilistically so as to take into account the variability of the shapes and the intensity properties. Objects are typically represented in terms of tree or graph structures with many nodes. There is a formidable computational problem, however, in using such models for the detection and recognition of complex objects from real images with background clutter. The very flexibility of the models means that we have to do tree/graph search over a large space of possible object configurations in order to determine if
the object is present in the image. This is possible if the objects are relatively simple, there is little background clutter, and/or if prior knowledge is available to determine likely configurations of the objects. But it becomes a very serious problem for the important general case where the objects are complex and occur in natural scenes. Statistical sampling algorithms [2,5], and in particular the jump diffusion algorithm [3], have been very successful but do not, as yet, work in real time except on a restricted, though important, class of problems. Moreover, the only time convergence bounds currently known for such algorithms are extremely large. This motivates the need to investigate algorithms which perform fast intelligent search over trees and graphs.

Several deterministic algorithms of this type have recently been adapted from the computer science literature. Standard Dynamic Programming [9] has been applied to find the optimal solutions for a limited, though important, class of problems [10–14]. The time complexity is typically a low-order polynomial in the parameters of the problem but this is still too slow for many applications. Dijkstra's algorithm [15] is a related greedy algorithm, which also uses the Dynamic Programming principle. It has been successfully adapted [16] to some vision applications. More recently, Geman and Jedynak have proposed the novel twenty questions, or minimal entropy, algorithm for tree search [17]. (Note that Geman and Jedynak refer to their algorithm as "active search" and use "twenty questions" in a more restricted sense. But we feel that twenty questions is more descriptive.) The twenty questions algorithm is intuitively very attractive because it uses information theory [18] to guide the search through the representation tree/graph. More recently, it has been rediscovered by Kontsevich and applied to analyzing psychophysical data [19]. The basic intuition of twenty questions has also been proposed as a general approach to vision in order to explain how the human visual system might be able to solve the enormous computational problems needed to interpret visual images and scenes [20].

We examine these algorithms from the perspective of heuristic A* algorithms which have been studied extensively by Pearl [21]. These algorithms include a heuristic function which guides the search through the graph, and the choice of this function can significantly affect the convergence speed and the optimality. Both Dynamic Programming and Dijkstra are known to be special cases of A* with implicit choice of heuristics. Our analysis of twenty questions shows that it is closely related to a new algorithm, which we call A+, and which can be shown to be a variant of A*. (A+ is formulated almost identically to A* but differs from it by making more use of prior expectations during search.)
The main difference between these algorithms, therefore, reduces to their implicit choice of heuristic. We analyze the properties of heuristics which are important for optimization problems involving Bayesian models. We show that these models motivate certain heuristics which, in related work [22,23], can be proven (mathematically) to be very effective.

In Section 2 we set up the framework by describing A* algorithms, the way they can represent problems in terms of trees and graphs, and their applications to deformable templates. In Section 3 we describe Dynamic Programming and Dijkstra from an A* perspective. In Section 4 we introduce the twenty questions algorithm and show its relations to the focus of attention strategy used in causal, or Bayes, nets [24]. In Section 5 we demonstrate a close relationship between twenty questions and A*. Finally, in Section 6 we argue that better heuristics can be derived by exploiting the Bayesian formulation and briefly summarize our recent proofs of this claim [22,23].
2. Framework: deformable templates and A*

The deformable templates we consider in this paper will only specify intensity properties at the boundaries of objects (i.e. they will ignore all regional intensity information). They will represent the template configuration X by a tree or a graph, see Figs. 2 and 3. There will be a probabilistic model P(I|X) for generating an image I given the configuration X ∈ s of the deformable template. This model will specify an intensity distribution at the template boundary and a background distribution elsewhere. There will also be a prior probability P(X) on the set of possible configurations of the template. The goal is to find the most probable configuration of the template. In mathematical terms, we wish to find the maximum a posteriori estimate X* such that

X^* = \arg\max_{X \in s} P(I|X) P(X).    (1)

The distributions P(X) and P(I|X) are typically chosen to be Markov. This means that they have local interactions only occurring within finite, and usually small, neighborhoods. These neighborhood connections can be mapped directly onto connections in the graph [24]. For example, X might correspond to a configuration x_1, x_2, …, x_N, where the x_i denote points in the image lattice corresponding to positions of the deformable template elements (see Fig. 3), and P(X) = P(x_1) \prod_{i=2}^{N} P(x_i | x_{i-1}), where the transition distributions P(x_i | x_{i-1}) are independent
of i. Such distributions are suitable for objects like roads where the statistical properties tend to be similar everywhere on the road and can, for example, be the typical generic smoothness priors used in vision. They are less suitable, however, for modelling objects like hands where the shape variations along the side of a finger are quite different from those at a fingertip. The imaging model P(I|X) typically assumes that the image measurements are conditionally independent and hence can be expressed in the form P(I|X) =
Fig. 1. The A* algorithm tries to find the most expensive path from A to B. For a partial path AC the algorithm stores g(C), the best cost to go from A to C, and an overestimate h(C) of the cost to go from C to B.
and computes a measure f of the "promise" of each partial path (i.e. leaf in the search tree). When computing these partial paths it makes use of the dynamic programming principle [19], illustrated in the next paragraph, to reduce the computation. The measure f for any node C is defined as f(C) = g(C) + h(C), where g(C) is the best cumulative cost found so far from A to C and h(C) is an overestimate of the remaining cost from C to B (Fig. 1). The closer this overestimate is to the true cost, the faster the algorithm will run. (A* can also be expressed in terms of cost minimization, in which case the overestimate must become an underestimate. For example, the arcs might represent lengths of roads connecting various "cities" (nodes), and h(C) could be estimated as the straight-line distance from city C to B.) New paths are considered by extending the most promising node one step. The Dynamic Programming principle says that if at any time two or more paths reach a common node, then all the paths but the best (in terms of f) are deleted. (This provision is unnecessary in searching a tree, in which case there is only one path to each leaf.) It is straightforward to prove that A* is guaranteed to converge to the correct result provided the heuristic h(·) is an upper bound for the true cost from all nodes C to the goal node B. A heuristic satisfying these conditions is called admissible. Conversely, a heuristic which does not satisfy them is called inadmissible. For certain problems it can be shown that admissible A* algorithms have worst case time bounds which are exponential in the size of the goal path. Alternatively, for some of these problems it can be shown that there are inadmissible
algorithms which converge with high probability close to the true solution in polynomial expected time [21]. Inadmissible heuristics may therefore lead to faster convergence in many practical applications.
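To make the cost decomposition f(C) = g(C) + h(C) concrete, here is a minimal best-first search sketch in the cost-minimization form mentioned above (the heuristic is an underestimate of the cost to go). The graph encoding and function names are illustrative assumptions, not the paper's notation.

```python
import heapq
import itertools

def a_star(neighbors, cost, h, start, goal):
    """Minimal A* in cost-minimization form: f(n) = g(n) + h(n).

    neighbors(n) -> iterable of successor nodes
    cost(n, m)   -> non-negative arc cost
    h(n)         -> heuristic underestimate of cost-to-goal (h = 0 gives Dijkstra)
    """
    counter = itertools.count()  # tie-breaker so the heap never compares nodes
    frontier = [(h(start), next(counter), 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        f, _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for m in neighbors(node):
            g2 = g + cost(node, m)
            # Dynamic programming principle: keep only the best path to each node.
            if g2 < best_g.get(m, float("inf")):
                best_g[m] = g2
                heapq.heappush(frontier, (g2 + h(m), next(counter), g2, m, path + [m]))
    return None, float("inf")
```

Setting h identically to zero recovers Dijkstra's algorithm, as discussed in Section 3.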
3. Dijkstra and Dynamic Programming

Dijkstra and Dynamic Programming are two algorithms which have been successfully applied to detecting deformable templates in computer vision. See [16] for an example of Dijkstra. Dynamic Programming has been used by many people, see for example [13,14]. We now discuss them from an A* perspective.

Dijkstra's algorithm [15] is a general purpose graph search algorithm which searches for the minimum cost path. It requires that the cost of going from node to node is always positive or zero. It can be converted into a maximization problem by changing the signs of the costs. If so, it becomes exactly like A* with heuristic h(·) = 0, see [21]. Because the node-to-node costs in the Dijkstra formulation are constrained to be positive, this means that h(·) = 0 is an admissible heuristic (recall we change the signs of the costs when we convert to A*). Therefore, we can consider Dijkstra to be an admissible version of A*.¹ Moreover, it should be recalled that admissible A* algorithms, and hence Dijkstra, can be slow. Moreover, there is no guarantee that Dijkstra has chosen the best admissible heuristic. Dijkstra can be shown to have polynomial complexity in terms of the graph size, which can be slow on a large graph (note that there is no contradiction with the fact that admissible A* can be exponential in the solution size). In fact, experimentally Dijkstra can slow down and require large memory resources when the target path is hard to find, due to noise (Geiger, personal communication), and in such a case an inadmissible heuristic, or even an alternative admissible heuristic, may be preferable, see Section 6.

We now turn to Dynamic Programming (DP). This algorithm acts on a representation of the type given in Fig. 2(A). The algorithm starts by finding the smallest cost path from any point in the first column to each point in the second column. At each point in the second column we store the cheapest cost (for the minimization formulation of the problem) to get there. DP then finds
¹ Observe that any admissible A* algorithm can be converted into Dijkstra by simply absorbing the heuristic into the cost of getting to the node (and changing the signs), i.e. by setting ḡ(·) = g(·) + h(·). In this sense admissible A* and Dijkstra are equivalent. But this equivalence is misleading. By making the heuristic explicit, A* can determine design principles for choosing it and allows analysis of the effectiveness of different choices [21]. We will return to this issue in Section 6.
Fig. 2. Examples of representation trees or graphs. On the left, (A), each column represents the n lattice points x_1, …, x_n in the image and a trajectory from left to right through the columns is a possible configuration of the deformable template. The arrows show the fan out from each lattice point and put restrictions on the set of paths the algorithm considers. (B) The ternary representation tree used in Geman and Jedynak. The arcs represent segments in the image, some of which correspond to the road being tracked. The start and end point of each arc will be points in the image lattice. The root arc, at bottom, is specified by the user. To track the road the algorithm must search through subsequent arcs. In (C) we represent the lattice points in the image directly. A boundary segment of a person's silhouette can be represented by points on a lattice linking salient points P and Q (see Ref. [16]). To determine the boundary involves searching for the best path through the graph.
the smallest cost path from the first column to each point in the third column. The dynamic programming principle means that these costs can be computed from the stored cheapest costs in the second column plus the costs of getting from the second column to the third. This procedure iterates as we go from column to column. It means that the complexity of DP, with this representation, will be of order NQM, where N is the number of lattice points in the relevant region of the image, Q is the fan out factor, and M is the (quantized) length of the deformable template.

Using DP in this fashion corresponds to a special case of A* acting on the graph made up of all the columns, see [26]. A heuristic is picked so that it is the same for all points in a column and is so large that A* always finds the best cost to all elements of a column before proceeding to investigate the next column. This requires that the heuristic cost must be greater than any possible future rewards and hence this is an admissible A* algorithm. We can contrast DP with a more general A* algorithm which, by using alternative heuristics, does not search in this rigid column-by-column fashion. A* will have to keep a record of all nodes which have been explored and maintain an ordered search tree of promising nodes to explore. By contrast, DP only needs to store the state of the last column and has no need to store a search list, because the form of the heuristic implies that the nodes to be explored next are precisely those in the next column. This advantage is reduced, however, because DP is
Fig. 3. Examples of a deformable template using Dynamic Programming, from Coughlan [13]. The input images are shown on the left and the outputs on the right. The deformable templates, marked by the positions of the crosses, find the true maximum a posteriori estimate, hence locating the correct positions of the hands, without knowing the positions of the start and finish points.
conservative and will continue to explore all elements of the columns even though many may have very low costs. This can be summarized by saying that DP does a breadth-first search while more general A* can perform a mixed depth-first and breadth-first strategy (depending on the heuristic and on the data). Another significant difference between the two algorithms is that A* requires sorting to determine which node to expand next and the cost of sorting can sometimes be prohibitive (by contrast DP needs no sorting).

One interesting aspect of Dynamic Programming is that for these types of problems its complexity may not increase significantly even if the start and finish points of the deformable template are unknown, see Fig. 3 from [13]. If the branching factor Q is large then the advantage of knowing the initial and final points rapidly becomes lost and the complexity is still O(NQM).
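A column-by-column sweep of the kind just described can be sketched as follows; the cost function and fan-out structure are placeholders, and the O(NQM) behaviour comes from the three nested loops. This is an illustrative sketch, not the authors' implementation.

```python
def dp_min_cost(columns, fan_out, arc_cost):
    """Column-wise dynamic programming over a trellis of M columns.

    columns[t]        -> list of lattice points (states) in column t
    fan_out(t, i)     -> indices in column t+1 reachable from state i of column t
    arc_cost(t, i, j) -> cost of moving from state i (column t) to state j (column t+1)

    Returns the minimum total cost over all start/end states (roughly N*Q*M work).
    """
    INF = float("inf")
    best = [0.0] * len(columns[0])                 # free choice of starting state
    for t in range(len(columns) - 1):
        nxt = [INF] * len(columns[t + 1])
        for i, g in enumerate(best):               # every state in column t ...
            for j in fan_out(t, i):                # ... fans out to Q successors
                c = g + arc_cost(t, i, j)
                if c < nxt[j]:                     # DP principle: keep the best path only
                    nxt[j] = c
        best = nxt
    return min(best)
```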
4. Twenty questions and focus of attention

Geman and Jedynak [17] design a deformable template model for describing roads in satellite images and a novel algorithm, twenty questions, for tracking them given an initial position and starting direction. The twenty questions algorithm, using the representation (B), has an empirical complexity of order L [17]. They represent roads in terms of a tree of straight-line segments, called "arcs", each approximately twelve pixels long, with the root arc being specified by the user. Each arc has three "children" which can be in the same direction as its "parent" or may turn slightly to the left or the right by a small, fixed angle. In this way the arcs form a discretization of a smooth, planar curve with bounded curvature, see Fig. 2. A road is defined to be a sequence of L arcs (not counting the root arc). The set s of all possible
roads has 3^L members. The set A of all arcs has (1/2)(3^{L+1} − 1) members. Many of the following results will depend only on the representation being a tree (i.e. not including closed loops) and so other branching patterns are possible, see Fig. 4. Geman and Jedynak assume a priori that all roads are equally likely and hence P(X) = 3^{−L} for each X. This is equivalent to assuming that at each branch the road has an equal probability of going straight, left or right. Much of the analysis below is independent of the specific prior probabilities P(X), see Fig. 4.

They assume that there is a set of test observations which can be made on the image using non-linear filters. For each arc a ∈ A, a test Y_a consists of applying a filter whose response y_a ∈ {1, …, 10} is large when the image near arc a is road-like. (An arc a is considered road-like if the intensity variation of pixels along the arc is smaller than the intensity differences between pixels perpendicular to the arc.) The distribution of the tests {Y_a} (regarded as random variables) depends only on whether or not
Fig. 4. A variation of Geman and Jedynak's tree structure with a different branching pattern. The prior probabilities may express a preference for certain paths, such as those which are straight.
the arc a lies on the road candidate X and the tests are assumed to be conditionally independent given X. Thus the probabilities can be specified by

P(Y_a | X) = \begin{cases} p_1(Y_a) & \text{if } a \text{ lies on } X, \\ p_0(Y_a) & \text{otherwise.} \end{cases}    (2)
The probability distributions p_1(·) and p_0(·) are determined by experiment (i.e. by running the tests on and off the road to gain statistics). These distributions overlap, otherwise the tests would give unambiguous results (i.e. "road" or "not-road") and the road could be found directly. The theoretical results we obtain are independent of the precise nature of the tests and indeed the algorithm can be generalized to consider a larger class of tests, but this will not be done in this paper.

The true road may be determined by finding the MAP estimate of P(X | all tests). However, there is an important practical difficulty in finding the MAP: the number of possible candidates to search over is 3^L, an enormous number, and the number of possible tests is even larger (of course, these numbers ignore the fact that some of the paths will extend outside the domain of the image and hence can be ignored; but, even so, the number of possible paths is exorbitant). To circumvent this problem, Geman and Jedynak propose the twenty questions algorithm that uses an intelligent testing rule to select the most informative test at each iteration. They introduce the concept of partial paths and show that it is only necessary to calculate the probabilities of these partial paths rather than those of all possible road hypotheses. They define the set C_a to consist of all paths which pass through arc a. Observe, see Fig. 8, that this condition specifies a unique path from the root arc to a. Thus {X ∈ C_a} can be thought of as the set of all possible extensions of this partial path. Their algorithm only needs to store the probabilities of certain partial paths, z_a = P(X ∈ C_a | test results), rather than the probabilities of all the 3^L possible road paths. Geman and Jedynak describe rules for updating these probabilities z_a but, in fact, the relevant probabilities can be calculated directly (see next section). It should be emphasized that calculating these probabilities would be significantly more difficult for general graph structures, where the presence of closed loops introduces difficulties which require algorithms like dynamic programming to overcome [28].

The testing rule is the following: after having performed tests Y_{n_1} through Y_{n_k}, choose the next test Y_{n_{k+1}} = Y_c so as to minimize the conditional entropy H(X | b_k, Y_c) given by

H(X | b_k, Y_c) = -\sum_{y_c} P(Y_c = y_c | b_k) \sum_{X} P(X | b_k, Y_c = y_c) \log P(X | b_k, Y_c = y_c),    (3)
where b "My 1,2, y kN is the set of test results from steps k n n 1 through k (we use capitals to denote random variables and lower case to denote numbers such as the results of tests). The conditional entropy criterion causes tests to be chosen which will be expected to maximally decrease the uncertainty of the distribution P(XDb ). k`1 We also point out that their strategy for choosing tests has already been used in Bayes Nets [24]. Geman and Jedynak state that there is a relationship to Bayes Nets [17] but they do not make it explicit. This relationship can be seen from the following theorem. Theorem 1. The test which minimizes the conditional entropy is the same test that maximizes the mutual information between the test and the road conditioned on the results of the proceeding tests. More precisely, arg min H(XDb , c k > )"arg max I(> ; XDb ). c c c k Proof. This result follows directly from standard identities in information theory [18]: I(> ; XDb )"H(XDb )!H(XDb , > ) a k k k a "H(> Db )!H(> DX, b ). h (4) a k a k This maximizing mutual information approach is precisely the focus of attention strategy used in Bayes Nets [24], see Fig. 5. It has proven an e!ective strategy in medical probabilistic expert systems, for example, where it can be used to determine which diagnostic test a doctor should perform in order to gain most information about a possible disease [28]. Therefore, the twenty questions algorithm can be considered as a special case of this strategy. Focus of attention, however, is typically applied
to problems involving graphs with closed loops and hence it is difficult to update probabilities after a question has been asked (a test has been performed). Moreover, on graphs it is both difficult to evaluate the mutual information and to determine which, of many possible, tests will maximize the mutual information with the desired hypothesis state X. By contrast, Geman and Jedynak are able to specify simple rules for deciding which tests to perform. This is because: (i) their tests, Eq. (2), are simpler than those typically used in Bayes nets and (ii) their tree structure (i.e. no closed loops) makes it easy to perform certain computations.

The following theorem, which is stated and proven in their paper, simplifies the problem of selecting which test to perform. As we will show later, this result is also important for showing the relationship of twenty questions to A+. The theorem is valid for any graph (even with closed loops) and for arbitrary prior probabilities. It relies only on the form of the tests specified in Eq. (2). The key point is the assumption that roads either contain the arc which is being tested or they do not.

Theorem 2. The test Y_c which minimizes the conditional entropy is the test which minimizes a convex function φ(z_c), where φ(z) = H(p_1)z + H(p_0)(1 − z) − H(z p_1 + (1 − z) p_0).

Proof. From the information theory identities given in Eq. (4) it follows that minimizing H(X | b_k, Y_c) with respect to c is equivalent to minimizing H(Y_c | X, b_k) − H(Y_c | b_k). Using the facts that P(Y_c | X, b_k) = P(Y_c | X), z_c = P(X ∈ C_c | b_k), and

P(Y_c | b_k) = \sum_X P(Y_c | X) P(X | b_k) = p_1(Y_c) z_c + p_0(Y_c)(1 − z_c),

where P(Y_c | X) = p_1(Y_c) if arc c lies on X and
Fig. 5. A Bayes net is a directed graph with probabilities. This can be illustrated by a game show where the goal is to discover the job of a participant. In this case the jobs are "unemployed", "Harvard professor" and "Mafia Boss". The players are not allowed direct questions but they can ask about causal factors (e.g. "bad luck" or "ambition") or about symptoms ("heart attack", "eating disorder", "big ego"). The focus of attention strategy is to ask the questions that convey the most information. Determining such questions is straightforward in principle, if the structure of the graph and all the probabilities are known, but may require exorbitant computational time if the network is large.
P(> DX)"p (> ) otherwise, we "nd that c 0 c H(> DX, b )"+ P(XDb )M!+ P(> DX) log P(> DX)N c k k c c X Yc "z H(p )#(1!z )H(p ), H(> Db ) c 1 c 0 c k "H(z p #(1!z )p ). c 1 c 0
(5)
The main result follows directly. The convexity can be verified directly by showing that the second-order derivative is positive. □

For the tests chosen by Geman and Jedynak it can be determined that φ(z) has a unique minimum at z̄ ≈ 0.51. For the game of twenty questions, where the tests give unambiguous results, it can be shown that the minimum occurs at z̄ = 0.5. (In this case the tests will obey p_1(Y_c = y_c) p_0(Y_c = y_c) = 0 for all y_c, and this enforces that H(z_c p_1 + (1 − z_c) p_0) = z_c H(p_1) + (1 − z_c) H(p_0) − z_c log z_c − (1 − z_c) log(1 − z_c), and so φ(z) = z log z + (1 − z) log(1 − z), which is convex with minimum at z = 0.5.) Thus the minimal entropy criterion says that we should test the next untested arc which minimizes φ(z_c). By the nature of the tree structure and the prior there can be very few (and typically no) untested arcs with z_c > z̄, and most untested arcs will satisfy z_c ≤ z̄. Restricting ourselves to this subset, we see that the convexity of φ(·), see Fig. 6, means that we need only find an arc c for which z_c is as close to z̄ as possible. It is straightforward to show that most untested arcs, particularly distant descendants of the tested arcs, will have probabilities far less than z̄ and so do not even need to be tested (each three-way split in the tree will introduce a prior factor 1/3 which
Fig. 6. Test selection for twenty questions is determined by the φ(z) function. This is convex with a minimum at z̄. Most untested arcs a will have probabilities z_a less than z̄ and twenty questions will prefer to explore the most probable of these paths. It is conceivable that a few untested arcs have probability greater than z̄. In this case they may or may not be tested. The exact form of the φ(·) function depends on specific details of the problem.
multiplies the probabilities of the descendant arcs, so the probabilities of descendants will decay exponentially with the distance from a tested arc). It is therefore simple to minimize φ(z_c) over all arcs such that z_c ≤ z̄, and then we need simply compare this minimum to the values for the few, if any, special arcs for which z_c > z̄. This, see [17], allows one to quickly determine the best test to perform. Observe that, because the prior is uniform, there may often be three or two arcs which have the same probability. To see this, consider deciding which arc to test when starting from the root node: all three arcs will be equally likely. It is not stated in [17] what their algorithm does in this case but we assume, in the event of a tie, that the algorithm picks one winner at random.
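The test-selection criterion of Theorem 2 is easy to state in code. The sketch below evaluates φ(z) for given response histograms p_1, p_0 and picks the candidate arc whose partial-path probability minimizes it; the histograms and candidate list are illustrative placeholders, not values from [17].

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    return -sum(q * math.log(q) for q in p if q > 0)

def phi(z, p1, p0):
    """phi(z) = H(p1) z + H(p0)(1 - z) - H(z p1 + (1 - z) p0), as in Theorem 2."""
    mix = [z * a + (1 - z) * b for a, b in zip(p1, p0)]
    return entropy(p1) * z + entropy(p0) * (1 - z) - entropy(mix)

def select_test(candidates, p1, p0):
    """candidates: dict mapping arc -> z_a (partial-path probability); return arg min phi."""
    return min(candidates, key=lambda a: phi(candidates[a], p1, p0))
```

For unambiguous tests (p_1 and p_0 with disjoint support) φ reduces to z log z + (1 − z) log(1 − z), whose minimum at z = 0.5 recovers the classical twenty-questions rule mentioned above.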
5. Twenty questions, A+ and A*

In this section we define an algorithm, which we call A+, which simply consists of testing the most probable untested arc. We show that this is usually equivalent to twenty questions. Then we show that A+ can be re-expressed as a variant of A*. The only difference between A* and A+ is that A+ (and twenty questions) makes use of prior expectations in an attempt to speed up the search. (Both A+ and twenty questions are formulated with prior probabilities which can be used to make these predictions.)

The difference in search strategies can be thought of, metaphorically, as the distinction between eugenics and breeding like rabbits. A* proceeds by selecting the graph node which has greatest total cost (cumulative and heuristic) and then expands all the children of this node. This is the rabbit strategy. By contrast, A+ selects the best graph node and then expands only the best predicted child node. This is reminiscent of eugenics. The twenty questions algorithm occasionally goes one stage further and expands a grandchild of the best node (i.e. completely skipping the child nodes). In general, if prior probabilities for the problem are known to be highly non-uniform, then the eugenic strategy will on average be more efficient than the rabbit strategy.

The algorithm A+ is based on the same model and the same array of tests used in Geman and Jedynak's work. What is different is the rule for selecting the most promising arc c on which to perform the next test Y_c. The arc c that is chosen is the arc with the highest probability z_c that satisfies two requirements: test Y_c must not have been performed previously and c must be the child of a previously tested arc. For twenty questions the best test will typically be the child of a tested arc though occasionally, as we will describe later, it might be a grandchild or some other descendant.

Theorem 3. A+ and twenty questions will test the same arc provided z_c ≤ z̄ for all untested arcs c. Moreover, the
only cases when the algorithms will differ is when A+ chooses to test an arc both siblings of which have already been tested.

Proof. The first part of this result follows directly from Theorem 2: φ(z) is convex with minimum at z̄ so, provided z_c ≤ z̄ for all untested c, the most probable untested arc is the one that minimizes the conditional entropy, see Fig. 6. The second part is illustrated in Fig. 7. Let c be the arc that A+ prefers to test. Since A+ only considers an arc c that is the child of previously tested arcs, there are only three cases to consider: when none, one, or two of c's siblings have been previously tested. In the first two cases, when none or one of c's siblings has been tested, the probability z_c is bounded by z_c < z̄. Clearly, since c is the arc with the maximum probability, no other arc can have a probability closer to z̄; thus arc c minimizes φ(z_c) and both algorithms are consistent. In the third case, however, when both of c's siblings have been tested, it is possible for z_c to be larger than z̄. In this case it is possible that other arcs with smaller probabilities would lower φ more than φ(z_c). For example, if φ(z_c/3) < φ(z_c), then the twenty questions al-
Fig. 7. The three possibilities for A+'s preferred arc a, where dashed lines represent tested arcs. In A, both a's siblings have been tested. In this case the twenty questions algorithm might prefer testing one of a's three children or some other arc elsewhere on the tree. In cases B and C, at most one of a's siblings has been tested and so both twenty questions and A+ agree.
gorithm would prefer any of c's (untested) children, having probability z_c/3, to c itself. But conceivably there may be another untested arc elsewhere with probability higher than z_c/3, and lower than z̄, which twenty questions might prefer. □

Thus the only difference between the algorithms may occur when the previous tests have established c's membership on the road with such high certainty that the conditional entropy principle considers it unnecessary to test c itself. In this case twenty questions may perform a "leap of faith" and test c's children or it may test another arc elsewhere. If twenty questions chooses to test c's children then this would make it potentially more efficient than A+, which would waste one test by testing c. But from the backtracking histogram in [17] it seems that testing children in this way never happened in their experiments. There may, however, have been cases when untested arcs are more probable than z̄ and the twenty questions algorithm tested other unrelated arcs. If this did indeed happen, and the structure of the problem might make this impossible, then it seems that twenty questions might be performing an irrelevant test. We expect therefore that A+ and twenty questions will usually pick the same test and so should have almost identical performance on the road tracking problem.

This analysis can be generalized to alternative branching structures and prior probabilities. For example, for a binary tree we would expect that the twenty questions algorithm might often make leaps of faith and test grandchildren. Conversely, the larger the branching factor, the more similar A+ and twenty questions will become. In addition, a non-uniform prior might also make it advisable to test other descendants. Of course we can generalize A+ to allow it to skip children too if the children have high probability of being on the path. But we will not do this here because, as we will see, such a generalization would reduce the similarity of A+ with A*.

Our next theorem shows that we can give an analytic expression for the probabilities of the partial paths. Recall that these are the probabilities z_a that the road goes through a particular tested arc a, see Fig. 8. (Geman and Jedynak give an iterative algorithm for calculating these probabilities.) This leads to a formulation of the A+ algorithm which makes it easy to relate to A*. The result holds for arbitrary branching and priors.

Theorem 4. The probabilities z_a = P(X ∈ C_a | y_1, …, y_M) of partial paths to an untested arc a, whose parent arc has been tested, can be expressed as

P(X \in C_a | y_1, \ldots, y_M) = \frac{1}{Z_M} \prod_{j=1}^{M_a} \frac{p_1(y_{a_j})}{p_0(y_{a_j})} \, t(a_j, a_{j-1}),    (6)

where A_a = {a_j : j = 1, …, M_a} is the set of (tested) arcs lying on the path to a, see Fig. 8, and t(a_i, a_{i−1}) is the prior probability of arc a_i following arc a_{i−1} (a_0 is the initialization arc).

Fig. 8. For any untested arc a, there is a unique path a_1, a_2, … linking it to the root arc. As before, dashed lines indicate arcs that have been tested.

Proof. Suppose a is an arc which has not yet been tested but which is a child of one that has. Assume we have test results (y_1, …, y_M); then there must be a unique subset A_a = a_1, …, a_{M_a} of tests which explore all the arcs from the starting point to arc a, see Fig. 8. The probability that the path goes through arc a is given by

P(X \in C_a | y_1, \ldots, y_M) = \sum_{X \in C_a} P(X | y_1, \ldots, y_M) = \sum_{X \in C_a} \frac{P(y_1, \ldots, y_M | X) P(X)}{P(y_1, \ldots, y_M)}.    (7)

The factor P(y_1, …, y_M) is independent of a and so we can remove it (we will only be concerned with the relative values of different probabilities and not their absolute values). Recall that the tests are independent and if arc i lies on, or off, the road then a test result y_i is produced with probability p_1(y_i) or p_0(y_i), respectively. We obtain:

P(X \in C_a | y_1, \ldots, y_M) \propto \sum_{X \in C_a} P(X) \Big\{ \prod_{i=1,\ldots,M : X \in C_i \cap C_a} p_1(y_i) \Big\} \Big\{ \prod_{i=1,\ldots,M : X \notin C_i \cap C_a} p_0(y_i) \Big\}
= \sum_{X \in C_a} P(X) \Big\{ \prod_{i=1,\ldots,M : X \in C_i \cap C_a} \frac{p_1(y_i)}{p_0(y_i)} \Big\} \Big\{ \prod_{i=1,\ldots,M} p_0(y_i) \Big\},    (8)

where the notation X ∈ C_i ∩ C_a means the set of all roads which contain the (tested) arc i and arc a. The final factor \prod_i p_0(y_i) can be ignored since it is also independent of a. Now suppose none of arc a's children have been tested. Then, since the sum in Eq. (8) is over all paths which go through arc a, the set of arcs i : X ∈ C_i on the road X for which tests are performed must be precisely those in the unique subset A_a going from the starting point to arc a. More precisely, {i = 1, …, M : X ∈ C_i ∩ C_a} = A_a. Therefore:

\prod_{i=1,\ldots,M : X \in C_i \cap C_a} \frac{p_1(y_i)}{p_0(y_i)} = \prod_{i \in A_a} \frac{p_1(y_i)}{p_0(y_i)} = \prod_{j=1}^{M_a} \frac{p_1(y_{a_j})}{p_0(y_{a_j})}.    (9)

Now \sum_{X \in C_a} P(X) is simply the prior probability that the path goes through arc a. We can denote it by P_a. Because of the tree structure, it can be written as P_a = \prod_{i=1}^{M_a} t(a_i, a_{i-1}), where t(a_i, a_{i−1}) is the prior probability that the road takes the child arc a_i given that it has reached its parent a_{i−1}. If all paths passing through a are equally likely (using Geman and Jedynak's prior on the ternary graph) then t(a_i, a_{i−1}) = 1/3 for all a_i and we have:

P_a \equiv \sum_{X \in C_a} P(X) = \frac{3^{L - |A_a| - 1}}{3^L},    (10)

where L is the total length of the road and |A_a| is the length of the partial path. Therefore in the general case:

P(X \in C_a | y_1, \ldots, y_M) = \frac{1}{Z_M} \prod_{j=1}^{M_a} \frac{p_1(y_{a_j})}{p_0(y_{a_j})} \, t(a_j, a_{j-1}),    (11)

where Z_M is a normalization factor. □
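Eq. (11) is simple to evaluate once the tests along the unique path to an arc are known; the sketch below accumulates the log-likelihood ratios and log-priors (up to the common normalization Z_M). The container names are hypothetical, not the paper's notation.

```python
import math

def log_partial_path_score(tests_on_path, prior_transitions):
    """Unnormalized log of Eq. (11): sum of log-likelihood ratios plus log-priors.

    tests_on_path     -> list of (p1_of_y, p0_of_y) for the tested arcs a_1..a_{M_a}
    prior_transitions -> list of t(a_j, a_{j-1}) values (e.g. 1/3 on the ternary tree)
    """
    score = 0.0
    for (p1_y, p0_y), t in zip(tests_on_path, prior_transitions):
        score += math.log(p1_y / p0_y) + math.log(t)
    return score   # z_a is proportional to exp(score)
```

This quantity is exactly the cumulative cost g(a) of Eq. (13) below, which is why A+ (test the most probable untested arc) can be read as a search on g alone.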
We now show that A+ is just a simple variant of A*. We first design an admissible A* algorithm using the smallest heuristic which guarantees admissibility. Then we will show that A+ is a variant of A* with a smaller heuristic and hence is inadmissible.

To adapt A* to apply to the road tracking problem we must convert the ternary road representation tree into a graph by introducing a terminal node to which all the
leaves of the tree are connected. We set the cost of getting to this terminal node from any of the leaves of the tree to be constant. Then deciding to go from one node to an adjacent node, and evaluating the cost, is equivalent to deciding to test the arc between these nodes and evaluating the test result. It follows directly from Theorem 4, or see [17], that the best road is the one which maximizes the log of Eq. (11):

E(X) = \sum_i \Big\{ \log \frac{p_1(y_{a_i})}{p_0(y_{a_i})} + \log t(a_i, a_{i-1}) \Big\},    (12)

where X = {a_1, …, a_L} and t(a_i, a_{i−1}) is the prior probability that arc a_i follows a_{i−1}. Observe that the Z_M factor from Eq. (11) is the same for all paths so we have dropped it from the right-hand side of Eq. (12). By Theorem 4 again, the cost for a partial path of length M_a which terminates at arc a is given by

g(a) = \sum_{j=1}^{M_a} \Big\{ \log \frac{p_1(y_{a_j})}{p_0(y_{a_j})} + \log t(a_j, a_{j-1}) \Big\}.    (13)

To determine the smallest admissible heuristic for A*, we observe that the cost to the end of the road has a least upper bound of

h(a) = (L - M_a) \{ \lambda_0 + \lambda_p \},    (14)

where \lambda_0 = \max_y \log\{p_1(y)/p_0(y)\} and \lambda_p = \max \log t(\cdot, \cdot) over all possible prior branching factors in the tree. It is clearly possible to have paths which attain this upper bound though, of course, they are highly unlikely because they require all the future path segments to have the maximal possible log likelihood ratio; we will return to this issue in Section 6. The heuristic given by Eq. (14) is therefore the smallest possible admissible heuristic. We can therefore define the admissible A* algorithm with smallest heuristic to have a cost f given by

f(a) = g(a) + h(a) = \sum_{j=1}^{M_a} \Big\{ \log \frac{p_1(y_{a_j})}{p_0(y_{a_j})} + \log t(a_j, a_{j-1}) \Big\} + (L - M_a) \{ \lambda_0 + \lambda_p \}.    (15)
Observe that we can obtain Dijkstra by rewriting Eq. (15) as

f = \sum_{j=1}^{M_a} \Big( \log \frac{p_1(y_{a_j})}{p_0(y_{a_j})} + \log t(a_j, a_{j-1}) - \lambda_0 - \lambda_p \Big) + L \{ \lambda_0 + \lambda_p \}    (16)

and, because the lengths of all roads are assumed to be the same, the final term L{λ_0 + λ_p} is a constant and can be ignored. This can be directly reformulated as Dijkstra with g = \sum_{j=1}^{M_a} (\log[p_1(y_{a_j})/p_0(y_{a_j})] + \log t(a_j, a_{j-1}) - \lambda_0 - \lambda_p) and h = 0. (Though, strictly speaking, the term Dijkstra
only applies if the terms in the sum are all guaranteed to be non-negative. Only with this additional condition is Dijkstra guaranteed to converge.)

The size of λ_0 + λ_p has a big influence in determining the order in which paths get searched by the admissible A* algorithm. The bigger λ_0 + λ_p, the bigger the overestimated cost, see Eq. (14). The larger the overestimate, the more the admissible A* will prefer to explore paths with a small number of tested arcs (because these paths will have overgenerous estimates of their future costs). This induces a breadth-first search strategy [26] and may slow down the search.

We now compare A+ to A*. The result is summarized in Theorem 5.

Theorem 5. A+ is an inadmissible variant of A*.

Proof. From Theorem 4 we see that A+ picks the path which maximizes Eq. (13). In other words, it maximizes the g(a) part of A* but has no overestimate term; it sets h(a) = 0 by default. There is no reason to believe that this is an overestimate for the remaining part of the path. Of course, if λ_0 + λ_p ≤ 0 then h(a) = 0 would be an acceptable overestimate. For the special case considered by Geman and Jedynak this would require that max_y log{p_1(y)/p_0(y)} − log 3 ≤ 0. For their probability distributions it seems that this condition is not satisfied, see Fig. 5 in Ref. [17]. □

This means that A+, and hence twenty questions, are typically suboptimal and are not guaranteed to converge to the optimal solution. On the other hand, the fact that admissible A* uses upper bounds means that it prefers paths with few arcs to those with many, so it may waste time by exploring in breadth. We will return to these issues in the next section.
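The contrast between the admissible A* score of Eq. (15) and the A+ score (g alone) can be made explicit as follows; the argument names, and the way λ_0 and λ_p are supplied, are illustrative assumptions rather than the authors' code.

```python
def g_reward(log_ratios, log_priors):
    """Cumulative reward g(a) of Eq. (13) for a partial path of tested arcs."""
    return sum(log_ratios) + sum(log_priors)

def h_admissible(L, M_a, lam0, lamp):
    """Smallest admissible heuristic of Eq. (14): upper bound on the reward to come.

    lam0 = max_y log(p1(y)/p0(y)); lamp = max log t(., .) over branchings.
    """
    return (L - M_a) * (lam0 + lamp)

def f_admissible(log_ratios, log_priors, L, lam0, lamp):
    """Admissible A* score of Eq. (15)."""
    return g_reward(log_ratios, log_priors) + h_admissible(L, len(log_ratios), lam0, lamp)

def f_a_plus(log_ratios, log_priors):
    """A+ scores a partial path by g(a) alone, i.e. it implicitly takes h(a) = 0."""
    return g_reward(log_ratios, log_priors)
```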
6. Heuristics

So far we have shown that Dynamic Programming, Dijkstra and Twenty Questions are either exact, or approximate, versions of A*. The only difference lies in their choice of heuristic and their breeding strategies (rabbits or eugenics). Both DP and Dijkstra choose admissible heuristics and are therefore guaranteed to find the optimal solution. The downside of this is that they are rather conservative and hence may be slower than necessary. By contrast, Twenty Questions is closely related to A+, which uses an inadmissible heuristic. It is not necessarily guaranteed to converge but empirically is very fast and finds the optimal solution in linear time (in terms of the solution length). The choice of heuristics is clearly very important. What principles can be used to determine good heuristics for a specific problem domain?
Fig. 9. In the example from Pearl, the goal is to find the shortest path in the lattice from (0, 0) to (m, n). The choice of heuristic will greatly influence the search strategy.
An example from Pearl [21] illustrates how different admissible heuristics can affect the speed of search. Pearl's example is formulated in terms of finding the shortest path in a two-dimensional lattice from position (0, 0) to the point (m, n), see Fig. 9. The first algorithm Pearl considers is Dijkstra, so it has heuristic h(·, ·) = 0, which is admissible since the cost of all path segments is positive. It is straightforward to show that this requires us to expand Z_a(m, n) = 2(m + n)^2 nodes before we reach the target (by the nature of A* we must expand all nodes whose cost is less than, or equal to, the cost of the goal node). The second algorithm uses a shortest distance heuristic h(x, y) = \sqrt{(x - m)^2 + (y - n)^2}, which is also admissible. In this case the number of nodes expanded, Z_b(m, n), can be calculated. The expression is complex so we do not write it down. The bottom line, however, is that the ratio Z_b(m, n)/Z_a(m, n) is always less than 1 and so the shortest distance heuristic is preferable to Dijkstra for this problem. The maximum value of the ratio, approximately 0.18, is obtained when n = m. Its minimum value occurs when n = 0 (or equivalently when m = 0) and is given by 1/2m, which tends to zero for large m. Thus choosing one admissible heuristic in preference to another can yield significant speed-ups without sacrificing optimality.

It has long been known [21] that inadmissible algorithms, which use probabilistic knowledge of the domain, can converge in linear expected time to close to the optimal solution even when admissible algorithms will take provably exponential time. The problems we are interested in solving are already formulated as Bayesian estimation problems, which means that probabilistic knowledge of the domain is already available. How can it be exploited?
Consider, for example, the admissible A* algorithm which we defined for the road tracking problem, see Section 5. Let us consider the special case when the branching factor is always three and all paths are equally likely. Then we demonstrated that the smallest admissible heuristic is h(a) = (L − M_a) max_y log{p_1(y)/p_0(y)}. This heuristic, however, is really a worst case bound because the chances of getting such a result if we measure the response on the true path are very small. In fact, if we assume that the image measurements are independent (as our Bayesian model does, see Section 2), then the law of large numbers says the total response of all test results along the true path should be close to h_p(a) = (L − M_a) \sum_y p_1(y) \log\{p_1(y)/p_0(y)\}. This average bound will typically be considerably smaller than the worst case bound used to determine h(a) above. We would therefore expect that, in general, the average case heuristic h_p(a) will be far quicker than the worst case heuristic h(a) and should usually lead to equally good results. (Of course, the average case heuristic will become poor as we approach the end point, where L − M_a is small, and so we will have to replace it by the worst case heuristic in such situations.)

Our recent work [22,23] has made these intuitions precise by exploiting results from the theory of types, see [18], which quantify how rapidly the law of large numbers starts being effective. For example, Sanov's theorem can be used to determine the probability that the average cost for a set of n samples from the true road differs from the expected average cost \sum_y p_1(y) \log\{p_1(y)/p_0(y)\}. The theorem shows that the probability of any difference will decrease exponentially with the number of samples n. Conversely, we can ask with what probability we will get an average cost close to \sum_y p_1(y) \log\{p_1(y)/p_0(y)\} from a set of n samples not from the true path (i.e. the probability that the algorithm will be fooled into following a false path). Again, it follows from Sanov's theorem that the probability of this happening decreases exponentially with n, where the coefficient in the exponent is the Kullback–Leibler distance between p_1(y) and p_0(y).

Our papers [22,23] prove expected convergence time bounds for optimization problems of the type we have analyzed in this paper. For example, in [23] we prove that inadmissible heuristic algorithms can be expected to solve these optimization problems (with a quantified expected error) while examining a number of nodes which varies linearly with the size of the problem. (The expected sorting time per node is shown to be constant.) Moreover, the difficulty of the problem is determined by an order parameter which is specified in terms of the characteristics of the domain (the distributions P_on, P_off, P(X) and the branching factor of the search tree). At critical values of the order parameter the problem undergoes a phase transition and becomes impossible to solve by any algorithm.
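The gap between the worst-case and average-case bounds above is simply the gap between the maximum log-likelihood ratio and its expectation under p_1 (the Kullback–Leibler divergence of p_1 from p_0). The snippet below computes both per-arc quantities for an arbitrary pair of response histograms; the numerical values are hypothetical, purely for illustration.

```python
import math

def per_arc_bounds(p1, p0):
    """Worst-case and average-case per-arc rewards used in h(a) and h_p(a)."""
    worst = max(math.log(a / b) for a, b in zip(p1, p0) if a > 0)
    average = sum(a * math.log(a / b) for a, b in zip(p1, p0) if a > 0)  # D(p1 || p0)
    return worst, average

# Hypothetical response histograms over four quantized filter outputs.
p1 = [0.1, 0.2, 0.3, 0.4]
p0 = [0.4, 0.3, 0.2, 0.1]
print(per_arc_bounds(p1, p0))   # the worst-case bound exceeds the KL-divergence average
```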
7. Summary

In summary, we argue that A*, and heuristic algorithms [21], give a good framework to compare and evaluate different optimization algorithms for deformable templates. We describe how both Dijkstra and Dynamic Programming can be expressed in terms of A* [21,26]. We then prove a close relationship between the twenty questions algorithm [17] and a novel algorithm which we call A+. In turn, we prove that A+ is an inadmissible variant of A*. We note that both A+ and twenty questions, unlike A* and Dijkstra, maintain explicit probabilities of partial solutions, which allows them to keep track of how well the algorithm is doing and to warn of faulty convergence. In addition, their explicit use of prior knowledge allows them to improve their search strategy (in general) by making predictions. However, both A* and Dijkstra are designed to work on graphs, which include closed loops, and it may be difficult to extend twenty questions and A+ to such representations.

From the A* perspective, the role of heuristics is very important. Most algorithms, implicitly or explicitly, make use of heuristics, the choice of which can have a big effect on the speed and effectiveness of the algorithm. It is therefore important to specify them explicitly and analyze their effectiveness. For example, it appears that probabilistic knowledge of the problem domain can lead to heuristics, adapted to the domain, which are provably very effective [21–23]. By contrast, algorithms such as Dijkstra, which have no explicit heuristics, have no mechanisms for adapting the algorithm to a specific domain. For example, Dijkstra's algorithm applied to detecting visual shapes is very effective at low noise levels but can break down (in terms of memory and time) at high noise levels (Geiger, private communication). Characterizing the noise probabilistically and using this to guide a choice of heuristic can lead to more efficient algorithms, see [25].
Acknowledgements

This report was inspired by an interesting talk by Donald Geman and by David Mumford wondering what relation there was between the twenty questions algorithm and A*. One of us (ALY) is grateful to Davi Geiger for discussions of the Dijkstra algorithm and to M.I. Miller and N. Khaneja for useful discussions about the results in this paper and their work on Dynamic Programming. This work was partially supported by NSF Grant IRI 92-23676 and the Center for Imaging Science funded by ARO DAAH049510494. ALY was employed by Harvard while some of this work was being done and would like to thank Harvard University for many character building experiences. He would also like to thank Professors Irwin
King and Lei Xu for their hospitality at the Engineering Department of the Chinese University of Hong Kong and, in particular, Lei Xu for mentioning Pearl's book on heuristics and for the use of his copy.
References

[1] M.A. Fischler, R.A. Elschlager, The representation and matching of pictorial structures, IEEE Trans. Comput. C-22 (1973) 67–92.
[2] U. Grenander, Y. Chow, D.M. Keenan, Hands: a Pattern Theoretic Study of Biological Shapes, Springer, Berlin, 1991.
[3] U. Grenander, M.I. Miller, Representations of knowledge in complex systems, J. Roy. Statist. Soc. 56 (4) (1994) 569–603.
[4] T.F. Cootes, C.J. Taylor, Active shape models: "Smart Snakes", British Machine Vision Conference, Leeds, UK, September 1992, pp. 266–275.
[5] B.D. Ripley, Classification and clustering in spatial and image data, in: M. Schader (Ed.), Analyzing and Modeling Data and Knowledge, Springer, Berlin, 1992.
[6] L.H. Staib, J.S. Duncan, Parametrically deformable contour models, Proceedings of Computer Vision and Pattern Recognition, San Diego, CA, 1989, pp. 98–103.
[7] L. Wiskott, C. von der Malsburg, A neural system for the recognition of partially occluded objects in cluttered scenes, Neural Comput. 7 (4) (1993) 935–948.
[8] A.L. Yuille, Deformable templates for face recognition, J. Cognitive Neurosci. 3 (1) (1991) 59–70.
[9] R.E. Bellman, Applied Dynamic Programming, Princeton University Press, Princeton, NJ, 1962.
[10] U. Montanari, On the optimal detection of curves in noisy pictures, Comm. ACM 14 (5) (1971) 335–345.
[11] M. Barzohar, D.B. Cooper, Automatic finding of main roads in aerial images by using geometric–stochastic models and estimation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1993, pp. 459–464.
[12] D. Geiger, A. Gupta, L.A. Costa, J. Vlontzos, Dynamic programming for detecting, tracking and matching elastic contours, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-17 (1995) 294–302.
[13] J. Coughlan, Global optimization of a deformable hand template using dynamic programming, Harvard Robotics Laboratory, Technical Report 95-1, 1995.
[14] N. Khaneja, M.I. Miller, U. Grenander, Dynamic programming generation of geodesics and sulci on brain surfaces, PAMI 20 (11) (1998) 1260–1265.
[15] D. Bertsekas, Dynamic Programming and Optimal Control, vol. 1, Second Ed., Athena Scientific Press, 1995.
[16] D. Geiger, T.-L. Liu, Top-Down Recognition and Bottom-Up Integration for Recognizing Articulated Objects, Preprint, Courant Institute, New York University, 1996.
[17] D. Geman, B. Jedynak, An active testing model for tracking roads in satellite images, IEEE Trans. Pattern Anal. Mach. Intell. 18 (1) (1996) 1–14.
[18] T.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley Interscience Press, New York, 1991.
[19] L. Kontsevich, Private Communication, 1996.
[20] W. Richards, A. Bobick, Playing twenty questions with nature, in: Z. Pylyshyn (Ed.), Computational Processes in Human Vision: An Interdisciplinary Perspective, Ablex, Norwood, NJ, 1988.
[21] J. Pearl, Heuristics, Addison-Wesley, Reading, MA, 1984.
[22] A.L. Yuille, J.M. Coughlan, Convergence rates of algorithms for visual search: detecting visual contours, Proceedings NIPS'98, 1998.
[23] A.L. Yuille, J.M. Coughlan, Visual search: fundamental bounds, order parameters, phase transitions, and convergence rates, IEEE Trans. Pattern Anal. Mach. Intell. (1999), submitted for publication.
[24] J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Mateo, CA, 1988.
[25] J.M. Coughlan, A.L. Yuille, C. English, D. Snow, Efficient deformable template detection and localization without user initialization, CVIU (1998), submitted for publication.
[26] P.H. Winston, Artificial Intelligence, Addison-Wesley, Reading, MA, 1984.
[27] S. Russell, P. Norvig, Artificial Intelligence: a Modern Approach, Prentice-Hall, New York, 1995.
[28] S.L. Lauritzen, D.J. Spiegelhalter, Local computations with probabilities on graphical structures and their application to expert systems, J. Roy. Statist. Soc. B 50 (2) (1988) 157-224.
About the Author - ALAN YUILLE received his BA in Mathematics at the University of Cambridge in 1976. He completed his Ph.D. in Theoretical Physics at Cambridge in 1980 and worked as a postdoc in Physics at the University of Texas at Austin and the Institute for Theoretical Physics at Santa Barbara. From 1982 to 1986 he worked at the Artificial Intelligence Laboratory at MIT before joining the Division of Applied Sciences at Harvard from 1986 to 1995, rising to the rank of Associate Professor. In 1995 he joined the Smith-Kettlewell Eye Research Institute in San Francisco. His research interests are in mathematical modelling of artificial and biological vision. He has over one hundred peer-reviewed publications in vision, neural networks, and physics. He has co-authored a book with J.J. Clark ("Data Fusion for Sensory Information Processing Systems"), and edited a book "Active Vision" with A. Blake.
About the Author - JAMES COUGHLAN received his BA in physics at Harvard in 1998. He is currently working as a postdoc with Alan Yuille at the Smith-Kettlewell Eye Research Institute in San Francisco. His research interests are in computer vision and the applications of Bayesian probability theory to artificial intelligence.
Pattern Recognition 33 (2000) 617-634
A theory of proximity based clustering: structure detection by optimization

Jan Puzicha (a,*), Thomas Hofmann (b), Joachim M. Buhmann (a)

(a) Institut für Informatik III, University of Bonn, Bonn, Germany
(b) Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA

Received 15 March 1999
Abstract

In this paper, a systematic optimization approach for clustering proximity or similarity data is developed. Starting from fundamental invariance and robustness properties, a set of axioms is proposed and discussed to distinguish different cluster compactness and separation criteria. The approach covers the case of sparse proximity matrices, and is extended to nested partitionings for hierarchical data clustering. To solve the associated optimization problems, a rigorous mathematical framework for deterministic annealing and mean-field approximation is presented. Efficient optimization heuristics are derived in a canonical way, which also clarifies the relation to stochastic optimization by Gibbs sampling. Similarity-based clustering techniques have a broad range of possible applications in computer vision, pattern recognition, and data analysis. As a major practical application we present a novel approach to the problem of unsupervised texture segmentation, which relies on statistical tests as a measure of homogeneity. The quality of the algorithms is empirically evaluated on a large collection of Brodatz-like micro-texture Mondrians and on a set of real-world images. To demonstrate the broad usefulness of the theory of proximity based clustering, the performances of different criteria and algorithms are compared on an information retrieval task for a document database. The superiority of optimization algorithms for clustering is supported by extensive experiments. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Clustering; Proximity data; Similarity; Deterministic annealing; Texture segmentation; Document retrieval
1. Introduction

Data clustering is one of the core methods for numerous tasks in pattern recognition, exploratory data analysis, computer vision, machine learning, data mining, and in many other related fields. In a rather informal sense, the goal of clustering is to partition a given set of data into homogeneous groups. Cluster homogeneity is thus the central notion which needs to be formalized in order to give data clustering a precise meaning. In this paper, we focus on homogeneity measures which are defined in
* Corresponding author.
E-mail addresses: [email protected] (J. Puzicha), [email protected] (T. Hofmann)
terms of pairwise similarities or dissimilarities between data entities or objects. The underlying data is usually called similarity or proximity data [1]. In proximity data, the elementary measurements are comparisons between two objects of a given data set. This data format differs from vectorial data, where each measurement corresponds to a certain 'feature' evaluated at an external scale. Notice, however, that pairwise dissimilarities can be canonically generated from vectorial data whenever a distance function or metric is available. There exist numerous approaches to the data clustering problem, only to mention some of the most important: central clustering or vector quantization (e.g., the K-means algorithm [2]), linkage or agglomerative methods [3], mixture models [4], fuzzy clustering [5], and competitive learning [6,7]. These and other approaches offer a variety of clustering algorithms, and,
0031-3203/00/$20.00 © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S0031-3203(99)00076-X
more fundamentally, differ with respect to the framework in which data clustering is formulated. Typically, there are limitations to which kind of data a method can be applied. It is a commonly expressed opinion [1] that partitional clustering methods which are based on explicit objective functions are only suitable for vectorial data. In addition, optimization methods are supposed to be inherently non-hierarchical. In this paper, a theory is outlined which advocates a formulation of proximity-based data clustering as a combinatorial optimization problem. The theory proposes solutions to the two fundamental problems: (i) the specification of suitable objective functions, and (ii) the derivation of efficient optimization algorithms. To address the modeling problem, an axiomatization of clustering objective functions based on fundamental invariance and robustness properties is presented (Section 2). As will be rigorously shown, the proposed axioms imply restrictions both on the way cluster homogeneities are calculated from pairwise dissimilarities and on the final combination of contributions from different clusters. These ideas are extended to cover two generalizations: clustering with sparse or incomplete proximity data and hierarchical clustering. These extensions are indispensable for the analysis of large data sets, since it is in general prohibitive (and also unnecessary, due to redundancy) to exhaustively perform all N^2 pairwise comparisons for N objects. Moreover, in large-scale applications group structure typically occurs at different resolution levels, which strongly favors hierarchical partitioning schemes. Once a suitable objective function has been identified, in principle any known optimization technique could be applied to find optimal solutions. Yet, the NP-hardness of most data partitioning problems renders the application of exact methods for large-scale problems intractable [8]. Therefore, heuristic optimization techniques are promising candidates to find at least approximate clustering solutions. In particular, stochastic optimization (Section 3) offers robustness and matches the peculiarities of data analysis problems [9]. Two closely related methods will be derived: a Monte Carlo algorithm known as the Gibbs sampler [10] and a deterministic variant known as mean-field annealing [11]. Both approaches rely on the introduction of a computational temperature, and offer a number of advantages: (i) they are general enough to cover all clustering objective functions, (ii) they yield scalable algorithms (in terms of the complexity-quality tradeoff), and (iii) the temperature defines a 'natural' resolution scale for hierarchical clustering problems [12]. The theory and the derived algorithms are tested and validated in two application areas: unsupervised segmentation of textured images [13,14] and information retrieval in document databases.
2. Axiomatization for clustering objective functions

Assume the data is given in the form of a proximity matrix D \in R^{N^2} with entries D_{ij} quantifying the dissimilarity between objects o_i and o_j from a given domain of N objects. Furthermore, assume the number of clusters K to be fixed. A Boolean representation of data partitionings is introduced in terms of assignment matrices M \in M_{N,K}, where

M_{N,K} = \{ M \in \{0, 1\}^{N \times K} : \sum_{l=1}^{K} M_{il} = 1, \; 1 \le i \le N \}.   (1)
M_{il} is an indicator variable for an assignment of object o_i to cluster C_l, hence M_{il} = 1 if and only if object o_i belongs to cluster C_l. The assignment constraints in the definition of M_{N,K} assure that the data assignment is complete and unique.

2.1. Elementary axioms

General principles of proximity-based clustering are expressed by the following axioms:

Definition 1 (Clustering criterion). A cost function H_{N,K} : M_{N,K} \times R^{N^2} \to R is a clustering criterion if the following set of axioms is fulfilled:

Axiom 1 (Permutation invariance). H_{N,K} is invariant with respect to permutations of (a) object indices and (b) label indices. More precisely, for all D \in R^{N^2}, M \in M_{N,K}, permutations \pi over \{1, ..., N\}, and \bar\pi over \{1, ..., K\}: H_{N,K}(M; D) = H_{N,K}(M_\pi^{\bar\pi}; D_\pi), where A_\pi^{\bar\pi} is obtained from A by row permutation with \pi and column permutation with \bar\pi.

Axiom 2 (Monotonicity). For all D \in R^{N^2}, M \in M_{N,K}, \Delta d \in R_+, and pairs of data indices (i, j):

\sum_{l=1}^{K} M_{il} M_{jl} = 1 \;\Rightarrow\; H_{N,K}(M; D) \le H_{N,K}(M; D^{ij}),
\sum_{l=1}^{K} M_{il} M_{jl} = 0 \;\Rightarrow\; H_{N,K}(M; D) \ge H_{N,K}(M; D^{ij}),   (2)

where D^{ij} is obtained from D by the local modification D_{ij} \to D_{ij} + \Delta d. H_{N,K} is a strict clustering criterion if for each (i, j) at least one of the inequalities in (2) is strict.

Axiom 1 prevents the quality of a data partitioning from depending on additional information hidden in the data labels or cluster labels. Axiom 2 states that increasing the dissimilarity between objects in the same cluster can never be advantageous. The same is true for decreasing the dissimilarity between objects belonging to different clusters.
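The assignment-matrix representation and the permutation axiom are easy to exercise numerically. The following sketch is illustrative only: the cost function it uses is a toy stand-in (the sum of within-cluster dissimilarities), not one of the criteria derived below, and all names are our own. It builds a valid M, checks the constraint of Eq. (1), and verifies that a joint permutation of object indices and cluster labels leaves the cost unchanged.

```python
# Illustrative sketch of the assignment-matrix representation and Axiom 1.
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 2
D = rng.random((N, N))
D = (D + D.T) / 2.0                                    # symmetric proximity matrix
np.fill_diagonal(D, 0.0)                               # no self-dissimilarities

M = np.zeros((N, K), dtype=int)
M[np.arange(N), rng.integers(0, K, size=N)] = 1        # one cluster per object
assert (M.sum(axis=1) == 1).all()                      # assignment constraint of Eq. (1)

def toy_cost(M, D):
    # toy criterion: total within-cluster dissimilarity
    return sum(D[np.ix_(M[:, l] == 1, M[:, l] == 1)].sum() for l in range(M.shape[1]))

pi = rng.permutation(N)                                # permute object indices
pi_bar = rng.permutation(K)                            # permute cluster labels
cost_orig = toy_cost(M, D)
cost_perm = toy_cost(M[pi][:, pi_bar], D[np.ix_(pi, pi)])
assert np.isclose(cost_orig, cost_perm)                # Axiom 1: value unchanged
```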
To further limit the functional dependency on proximities, an additivity axiom is introduced. Additivity reduces the noise sensitivity of a clustering criterion by averaging, as opposed to other approaches which completely discard the dissimilarity values and only keep their order relation (e.g., single linkage).

Definition 2 (Additivity). A clustering criterion H_{N,K} is additive if it has the following functional form:¹

H_{N,K}(M; D) = \sum_{i=1}^{N} \sum_{j=1, j \ne i}^{N} t_{N,K}(i, j, D_{ij}, M).

t_{N,K} : \{1, ..., N\}^2 \times R \times M_{N,K} \to R is called the contribution function. Furthermore, for M_{ia} = 1 and M_{jl} = 1, t_{N,K}(i, j, D_{ij}, M) does not depend on assignments to clusters k \ne a, l.

The contribution function t_{N,K} of an additive clustering criterion is in fact further restricted by Axioms 1 and 2.

Proposition 1. Every additive clustering criterion can be rewritten as a combination of (D-monotone) intra- and inter-cluster contribution functions, t^{(1)} and t^{(2)} respectively,

H_{N,K}(M; D) = \sum_{l=1}^{K} \sum_{i,j=1, i \ne j}^{N} M_{il} \Big[ M_{jl} t^{(1)}_{N,K}(D_{ij}, n_l) - \sum_{k=1, k \ne l}^{K} M_{jk} t^{(2)}_{N,K}(D_{ij}, n_l, n_k) \Big].

Here n_l = \sum_{i=1}^{N} M_{il} denotes the size of cluster l.

¹ Self-dissimilarities D_{ii} are excluded for simplicity. All objective functions can, however, easily be modified if diagonal contributions should be included.

Each additive clustering criterion can thus be linearly decomposed into one part measuring intra-cluster compactness (to be minimized) and a second part measuring inter-cluster separation (to be maximized). A proof of Proposition 1 is given in the Appendix.

2.2. Invariance axioms

While Axioms 1 and 2 ensure elementary requirements for a clustering criterion, and additivity narrows the focus to a particularly simple class of objective functions, the following invariance and robustness properties have to be considered as the core of the proposed axiomatization. Assuming N and K to be fixed, the explicit dependency in our notation is dropped whenever possible.

Definition 3 (Invariance). An objective function H is invariant with respect to linear data transformations if the following set of axioms is fulfilled:

Axiom 3 (Scale invariance). For all D \in R^{N^2}, M \in M, c \in R_+: H(M; cD) = c H(M; D).

Axiom 4 (Shift invariance). For all D \in R^{N^2}, M \in M, \Delta d \in R: H(M; D + \Delta d) = H(M; D) + N \Delta d.

Axiom 3 ensures that rescaling the data solely rescales the cost function. A scale-invariant criterion has the advantage of not introducing an implicit bias towards a particular data scale. Axiom 4 is crucial for data that is only meaningful on an interval scale and not on an absolute or ratio scale. Invariant clustering criteria are thus non-committal with respect to scale and origin of the data, a property which is especially useful in applications where these quantities are not known a priori.²

² Both invariance axioms can in fact be weakened with respect to additional multiplicative or additive constants, but this does not result in qualitatively different criteria.

For additive clustering criteria, scale invariance restricts t^{(1)} and t^{(2)} in Proposition 1 to a linear data dependency. The first argument of t^{(1,2)} is therefore dropped with the understanding t^{(1,2)}(D_{ij}, \cdot) = D_{ij} t^{(1,2)}(\cdot). The number of additive clustering criteria is further reduced by the shift invariance axiom. For cluster compactness measures the following result is obtained.

Proposition 2. For every invariant additive clustering criterion with t^{(2)} = 0 (intra-cluster compactness measure), the contribution function t^{(1)} can be written as

t^{(1)}(n_l) = \lambda \frac{1}{n_l - 1} + (1 - \lambda) \frac{1}{n_l (n_l - 1)}, \quad \lambda \in R.

The number of admissible weighting functions t^{(2)} is less significantly reduced by the invariance property. Therefore, a special class of contribution functions is considered which possess a natural decomposition.

Definition 4 (Decomposability). An inter-cluster contribution function t^{(2)} is decomposable if there exists a function f(n_l) such that either

f(n_l) = \sum_{k \ne l} n_k t^{(2)}(n_l, n_k) \quad \text{or} \quad f(n_l) = \sum_{k \ne l} n_k t^{(2)}(n_k, n_l).

Proposition 3. For every invariant additive clustering criterion with t^{(1)} \equiv C and decomposable t^{(2)}, the contribution
function can be written as an affine combination

t^{(2)}(n_l, n_k) = \sum_{r=1}^{7} \lambda_r t_r(n_l, n_k) - C,

with an additive offset³ C = 1/(N-1) and the following elementary functions and their (l, k)-symmetric counterparts (t_5, t_6, t_7):

t_1 = \frac{1}{N - n_l}, \quad t_2 = \frac{1}{(K-1) n_l}, \quad t_3 = \frac{N}{K(K-1)(N - n_l) n_l}, \quad t_4 = \frac{N}{K(K-1) n_l n_k}.

³ The constant C is a technical requirement to obtain the correct sign for the additive offset in Axiom 4; it will be dropped in the sequel.

Proofs of both propositions are given in the Appendix. For simplicity, we focus on the special case of symmetric proximity matrices in the sequel, for which one may set \lambda_5 = \lambda_6 = \lambda_7 = 0 without loss of generality. In summary, we have derived two elementary intra-cluster compactness criteria

H^{pc1}(M; D) = \sum_{l=1}^{K} n_l \bar D_l, \qquad H^{pc2}(M; D) = \frac{N}{K} \sum_{l=1}^{K} \bar D_l,   (3)

where

\bar D_l = \frac{\sum_{i=1}^{N} \sum_{j=1, j \ne i}^{N} M_{il} M_{jl} D_{ij}}{\sum_{i=1}^{N} \sum_{j=1, j \ne i}^{N} M_{il} M_{jl}}   (4)

is the average dissimilarity in cluster C_l. Moreover, by restriction to decomposable contribution functions and symmetric data we have derived four elementary inter-cluster separation criteria:

H^{ps1a}(M; D) = - \sum_{l=1}^{K} n_l \frac{\sum_{k=1, k \ne l}^{K} \bar D_{lk}}{K - 1}, \qquad H^{ps1b}(M; D) = - \sum_{l=1}^{K} n_l \frac{\sum_{k=1, k \ne l}^{K} n_k \bar D_{lk}}{N - n_l},   (5)

H^{ps2a}(M; D) = - \frac{N}{K} \sum_{l=1}^{K} \frac{\sum_{k=1, k \ne l}^{K} \bar D_{lk}}{K - 1}, \qquad H^{ps2b}(M; D) = - \frac{N}{K} \sum_{l=1}^{K} \frac{\sum_{k=1, k \ne l}^{K} n_k \bar D_{lk}}{N - n_l},   (6)

where

\bar D_{lk} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} M_{il} M_{jk} D_{ij}}{\sum_{i=1}^{N} \sum_{j=1}^{N} M_{il} M_{jk}}   (7)

denotes the average inter-cluster dissimilarity. The most fundamental distinction between the different criteria concerns the weighting of the average cluster compactness (\bar D_l) or separation (\sum_{k \ne l} n_k \bar D_{lk}) with either the cluster size (H^{pc1}, H^{ps1a}, H^{ps1b}) or a constant (H^{pc2}, H^{ps2a}, H^{ps2b}). The second distinction concerns the way the average cluster separation is computed for separation measures: either these averages are performed by pooling all dissimilarities together (H^{ps1b}, H^{ps2b}), or by a two-stage procedure which first calculates averages for every pair of clusters (\bar D_{lk}) and combines those with a constant weight (H^{ps1a}, H^{ps2a}). These differences are crucial for the robustness properties, formalized in the following axioms.

2.3. Robustness axioms

The following set of axioms is concerned with the sensitivity of the quality criterion with respect to perturbations of the proximities. Since inaccurate (e.g., quantized) or noisy measurements are common, robustness is a key issue. Two different notions of robustness are distinguished:

Axiom 5 (Weak robustness). A family of objective functions H = (H_{N,K})_{N \in \mathbb{N}} is robust in the weak sense if for all \Delta d \in R, \epsilon \in R_+ there exists N_0 \in \mathbb{N} such that for all N \ge N_0, M \in M_{N,K}, D \in R^{N^2}, and pairs of data indices (i, j):

\frac{1}{N} \left| H_{N,K}(M \mid D) - H_{N,K}(M \mid D^{ij}) \right| < \epsilon,   (8)

where D^{ij} is defined as in Axiom 2.

Axiom 6 (Strong robustness). H is robust in the strong sense if condition (8) holds for all D^i \in R^{N^2} defined by D^i = D + \Delta d (\delta_{ik} + \delta_{il} - \delta_{ik}\delta_{il})_{k,l}.

More intuitively, robustness assures that single measurements (weak robustness) or measurements belonging to a single object (strong robustness) do not have a macroscopic influence on the costs of a configuration. The robustness properties of the invariant criteria are summarized without proof in the following table:

                    H^{pc1}  H^{pc2}  H^{ps1a}  H^{ps1b}  H^{ps2a}  H^{ps2b}
Weak robustness     Yes      No       Yes       Yes       No        Yes
Strong robustness   Yes      No       No        No        No        No

We emphasize the most remarkable facts: (i) H^{pc1} is the only criterion which fulfills the strong robustness axiom, (ii) no invariant inter-cluster separation criterion is robust in the strong sense, (iii) H^{ps2b} is the only criterion
with constant cluster weights that is robust in the weak sense. All cluster separation criteria lack strong robustness. The reason is that there exist configurations with only one large (macroscopic) cluster and (K-1) small (microscopic) clusters, where the number of inter-cluster dissimilarities scales only with O(N), compared to the total number scaling with O(N^2). Strong robustness can be obtained by restricting M to the following sets of admissible solutions: (1) at least two macroscopic clusters for H^{ps1b}, and (2) K macroscopic clusters for H^{ps1a} and H^{ps2b}. These considerations yield the following ranking of invariant clustering criteria with respect to their asymptotic robustness properties:

H^{pc1} \succ H^{ps1b} \succ \{H^{ps1a}, H^{ps2b}\} \succ \{H^{ps2a}, H^{pc2}\}.   (9)
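The elementary criteria above are straightforward to evaluate numerically. The following Python sketch is a minimal illustration of the reconstructed Eqs. (3)-(7), not the authors' implementation; it assumes a full symmetric dissimilarity matrix with zero diagonal, non-empty clusters, and no cluster containing all objects.

```python
# Illustrative evaluation of H^pc1 (Eq. 3) and H^ps1b (Eq. 5) from D and M.
import numpy as np

def average_dissimilarities(M, D):
    """Cluster sizes n_l, intra-cluster averages Dbar_l (Eq. 4),
    inter-cluster averages Dbar_lk (Eq. 7)."""
    n = M.sum(axis=0).astype(float)                       # n_l
    S = M.T @ D @ M                                       # S[l, k] = sum_{i,j} M_il M_jk D_ij
    Dbar_l = np.diag(S) / np.maximum(n * (n - 1.0), 1e-12)   # D_ii = 0, so i = j adds nothing
    Dbar_lk = S / np.maximum(np.outer(n, n), 1e-12)
    return n, Dbar_l, Dbar_lk

def H_pc1(M, D):                                          # Eq. (3), size-weighted compactness
    n, Dbar_l, _ = average_dissimilarities(M, D)
    return float(np.sum(n * Dbar_l))

def H_ps1b(M, D):                                         # Eq. (5), pooled separation
    n, _, Dbar_lk = average_dissimilarities(M, D)
    N = M.shape[0]
    weighted = Dbar_lk * n[None, :]                       # n_k * Dbar_lk
    np.fill_diagonal(weighted, 0.0)                       # restrict to k != l
    return float(-np.sum(n * weighted.sum(axis=1) / (N - n)))
```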
By the axiomatization, the criterion H^{pc1} is thus clearly distinguished from all other additive criteria. Among the cluster separation measures, H^{ps1b} has been identified as the most promising candidate due to its robustness properties. Interestingly, there is an intrinsic connection to the K-means objective function for central clustering. Assume that the data were generated from a vector space representation v_i \in R^d by D_{ij} = (v_i - v_j)^2; then H^{pc1}(M, D) \approx H^{km}(M, D), with

H^{km}(M, D) = \sum_{l=1}^{K} \sum_{i=1}^{N} M_{il} (v_i - y_l)^2   (10)

and the usual centroid definition

y_l = \frac{\sum_{j=1}^{N} M_{jl} v_j}{\sum_{j=1}^{N} M_{jl}}.⁴

⁴ The almost-equal relation '\approx' refers to the additional diagonal contributions D_{ii}, which are negligible for large N. Alternatively, the definition of additivity could be extended to cover the reflexive case to get a true identity.

Moreover, there exists an intimate relation to Ward's agglomerative clustering method [15]. If the distance between a pair of clusters is defined by the cost increase after merging both clusters, any objective function H is heuristically minimized by greedy merging starting from singleton clusters. In the case of H^{pc1} this procedure exactly yields Ward's method. It was often conjectured that Ward's method depends on the definition of centroids and the usage of a squared error measure; however, as demonstrated, Ward's method can be understood as a greedy algorithm to minimize H^{pc1}, which does not involve centroids or any other geometrical concepts. It is worth mentioning that the graph partitioning objective function defined by

H^{gp}(M; D) = \sum_{l=1}^{K} \sum_{i=1}^{N} \sum_{j=1, j \ne i}^{N} M_{il} M_{jl} D_{ij}   (11)
is scale invariant and robust in the strong sense, but not shift invariant. H^{gp} has been utilized for data analysis in the operations research context [16] and also for texture segmentation [17]. The missing shift invariance is obvious: in the limit \Delta d \to \infty the optimal solution is an equipartitioning, while in the limit \Delta d \to -\infty the configuration inevitably collapses into one cluster. A 'good' balance between positive and negative contributions is necessary to avoid this type of degeneration [17] (see Section 4). This is a consequence of the ratio scale interpretation of dissimilarities, which requires the specification of a scale origin.

The recently proposed normalized cut approach [18] provides interesting normalized cluster criteria which are not additive. The normalized cut cost function has been introduced only for two clusters. But since it is equal to the minimization of the normalized association

H^{nc}(M; D) = \sum_{l=1}^{K} \frac{\sum_{i=1}^{N} \sum_{j=1, j \ne i}^{N} M_{il} M_{jl} D_{ij}}{\sum_{i=1}^{N} \sum_{j=1, j \ne i}^{N} (M_{il} + M_{jl} - M_{il} M_{jl}) D_{ij}},   (12)

it is naturally extended to multiple clusters. It should be noted that by using similarities and maximizing (12) a qualitatively different criterion is obtained, which is well-defined even for highly sparse dissimilarity data. Both criteria are scale invariant and robust in the strong sense, but they are not shift invariant. The normalized cut is well-defined only for positive proximities and is thus only defined on a ratio scale.

2.4. Sparse proximity data

As discussed in the introduction, it is important for large-scale applications of proximity-based clustering to develop methods which apply to arbitrary sparse proximity matrices. In order to distinguish between known and unknown dissimilarities, an irreflexive graph (V, E) with V = \{1, ..., N\}, E \subset V \times V is introduced. For notational convenience, denote by N_i \subset V the set of graph neighbors of node or object i, i.e., N_i = \{j \in V : (i, j) \in E\}. The above axiom system can be extended to cover this case [19], but we refrain from a formal treatment, which involves a lot of technical details. The essential result is that there are two ways of averaging which result in invariant criteria for sparse proximity matrices. For intra-cluster compactness objective functions, the average intra-cluster dissimilarity is calculated either by the one-step averages \bar D_l^{(I)} or by the cascaded averaging \bar D_l^{(II)}, where

\bar D_l^{(I)} = \frac{\sum_{(i,j) \in E} M_{il} M_{jl} D_{ij}}{\sum_{(i,j) \in E} M_{il} M_{jl}}, \qquad \bar D_l^{(II)} = \frac{\sum_{i=1}^{N} M_{il} \left( \sum_{j \in N_i} M_{jl} D_{ij} / \sum_{j \in N_i} M_{jl} \right)}{\sum_{i=1}^{N} M_{il}}.   (13)
In a similar way, the average inter-cluster dissimilarities \bar D_{lk} are generalized. Since the different possibilities of weighting clusters and averaging are independent, they can be freely combined. The two sparse data variants obtained from H^{pc1} are given by

H^{pc1}_I = \sum_{l=1}^{K} \sum_{i=1}^{N} M_{il} \frac{\sum_{(i,j) \in E} M_{il} M_{jl} D_{ij}}{\sum_{(i,j) \in E} M_{il} M_{jl}}, \qquad H^{pc1}_{II} = \sum_{l=1}^{K} \sum_{i=1}^{N} M_{il} \frac{\sum_{j \in N_i} M_{jl} D_{ij}}{\sum_{j \in N_i} M_{jl}}.   (14)
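As a concrete illustration, the first sparse variant in Eq. (14) only needs the list of pairs that were actually evaluated. The sketch below is an assumption-laden toy implementation (edge list plus dissimilarity values, each unordered pair listed once), not the authors' code.

```python
# Sketch of H^pc1_I from Eq. (14) for a sparse neighborhood system.
import numpy as np

def H_pc1_sparse(M, edges, d):
    """M: (N, K) Boolean assignment matrix; edges: (E, 2) int array of pairs (i, j);
    d: (E,) dissimilarity values D_ij for those pairs."""
    K = M.shape[1]
    total = 0.0
    for l in range(K):
        same = (M[edges[:, 0], l] == 1) & (M[edges[:, 1], l] == 1)
        if same.any():
            n_l = M[:, l].sum()
            total += n_l * d[same].mean()                 # n_l * Dbar_l^(I), Eq. (13)
        # clusters without any observed intra-cluster edge contribute zero here
    return float(total)
```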
2.5. Hierarchical clustering

The second extension of data clustering objective functions concerns the problem of hierarchical clustering [20]. This is particularly important, as it has often been claimed that partitional methods are inherently non-hierarchical [1]. A hierarchical clustering solution specifies data partitionings M^K on all levels K = 2, ..., K_max. In order for the different data partitionings to be consistent, it is required that between consecutive solutions always one cluster is split into two sub-clusters. For simplicity a consistent numbering of clusters is enforced.

Definition 5. A sequence of data partitionings M^K \in M_{N,K} with K = 2, ..., K_max is hierarchical if cluster indices a(K) exist such that

M^K_{il} = \begin{cases} M^{K+1}_{il} & \text{if } l \ne a(K), \\ M^{K+1}_{il} + M^{K+1}_{i(K+1)} & \text{if } l = a(K). \end{cases}

Definition 5 implies that a hierarchical clustering solution is completely described by the finest data partitioning M^{K_max} and the sequence of splits a(K), which implicitly encode the topology of a complete binary tree. Following the same underlying principle as in Definition 2, hierarchical clustering criteria are defined by additive composition of the single-level solutions.

Definition 6 (Hierarchical clustering). A clustering criterion H_N is hierarchical if it has the following functional form:

H_N(M^{K_max}; a(K), D) = \sum_{K=2}^{K_max} w_K H_{N,K}(M^K; D),

where the H_{N,K} are clustering criteria and w_K \in R_+ with \sum_{K=2}^{K_max} w_K = 1. H_N is additive if all H_{N,K} are additive. H_N is invariant if all H_{N,K} are invariant.

The least biased choice is a constant weighting w_K = 1/(K_max - 1) of all data partitionings. But if prior knowledge about the data 'granularity' is available, it can be incorporated by an appropriate non-constant weighting. In the limiting case w_{K+1}/w_K \to 0 an objective function is obtained which corresponds to greedy splitting, while the inverse limit w_K/w_{K+1} \to 0 corresponds to greedy cluster merging. In this limit Ward's method is guaranteed to find the minimum if H_{N,K} = H^{pc1} is utilized, and hence the definition naturally includes this agglomerative technique as a special case. The presented optimization approach has the advantage of guaranteeing a consistent and strictly nested hierarchy of clusters.

Of course, not all K-partitionings have to be considered as defining 'natural' solutions. Therefore the following validation criterion is proposed in order to eliminate intermediate levels of the hierarchy. To all K-partition costs H_{N,K}(M^{(K)}), complexity costs H^{cmx} = \lambda N \log(K), \lambda \in R_+, are added. Only those K-partitions are kept which possess a range of \lambda where they have the lowest total costs; a sketch of this pruning rule is given below. The motivation of complexity costs proportional to \log K stems from the expected cost decay for a random instance in the N \to \infty limit. Notice that this index does not necessarily identify a single 'true' partition, but only eliminates implausible partitions, which are sub-optimal for all choices of \lambda. In particular, it is not a model selection criterion in the Bayesian sense.
3. Optimization by annealing

3.1. Simulated annealing and Gibbs sampling

Simulated annealing [21] is a popular stochastic optimization heuristic which has successfully been applied to large-scale problems, e.g., in computer vision and in operations research. Simulated annealing performs a random search which can be modeled by an inhomogeneous discrete-time Markov chain (M(t))_{t \in \mathbb{N}} converging towards its equilibrium distribution, the Gibbs distribution

P^H(M) = \frac{1}{Z_T} \exp(-H(M)/T), \qquad Z_T = \sum_{M \in \mathcal{M}} \exp(-H(M)/T).   (15)

Formally, denote by P = \{P : \mathcal{M} \to [0, 1] : \sum_{M \in \mathcal{M}} P(M) = 1\} the space of probability distributions on \mathcal{M}, by S(P) the entropy of P, and by

F_T(P) = \langle H \rangle_P - T S(P) = \sum_{M \in \mathcal{M}} P(M) H(M) + T \sum_{M \in \mathcal{M}} P(M) \log P(M)   (16)

the generalized free energy, which plays the role of an objective function over P. For arbitrary assignment problems H, the Gibbs distribution P^H minimizes the generalized free energy, i.e., P^H = \arg\min_{P \in \mathcal{P}} F_T(P). For the data clustering problem we focus on local algorithms
with state transitions restricted to pairs of configurations which differ in the assignment of at most one object or site. Denote by s_i(M, e_a) the matrix obtained by substituting the i-th row of M by the unit vector e_a. For convenience, introduce a site visitation schedule as a map v : \mathbb{N} \to \{1, ..., N\} such that, in the limit, every site is visited infinitely often. A sampling scheme known as the Gibbs sampler [10] is advantageous if it is possible to efficiently sample from the conditional distribution of P^H at site v(t), given the assignments at all other sites \{j \ne v(t)\}. For a given schedule v the Gibbs sampler is defined by the non-zero transition probabilities

S_t(s_{v(t)}(M, e_a), M) = \frac{\exp[-H(s_{v(t)}(M, e_a))/T(t)]}{\sum_{l=1}^{K} \exp[-H(s_{v(t)}(M, e_l))/T(t)]}.   (17)

The basic idea of annealing is to gradually lower the temperature T(t), on which the transition probabilities depend. In the zero temperature limit a local optimization algorithm known as Iterated Conditional Modes [22] (ICM) is obtained.

3.2. Deterministic and mean-field annealing

An approach known as deterministic annealing (DA) combines the advantages of a temperature-controlled continuation method with a fast, purely deterministic computational scheme. The key idea of DA is to calculate the relevant expectation values of system parameters, e.g., the variables of the optimization problem, with analytical techniques. In DA a combinatorial optimization problem with objective function H over \mathcal{M} is relaxed to a family of stochastic optimization problems with objective functions F_T over a subspace Q \subseteq P. The subspace Q discussed in the context of assignment problems is the space of all factorial distributions given by

Q = \{ Q \in P : Q(M) = \prod_{i=1}^{N} \sum_{l=1}^{K} M_{il} q_{il}, \; \forall M \in \mathcal{M} \}.   (18)

With this specific choice of Q, DA is more specifically called mean-field annealing (MFA) whenever P^H \notin Q. Q is distinguished from other subspaces of P in many respects: (i) the dimensionality of Q increases only linearly with N, (ii) an efficient alternation algorithm exists for a very general class of objective functions, and (iii) in the limit T \to 0 locally optimal solutions of the combinatorial optimization problem can be recovered. F_T is an entropy-smoothed version of the original optimization problem which becomes convex over Q for sufficiently large T. DA tracks solutions from high to low temperatures, where gradually more and more details of the original objective function appear. The most important properties of factorial distributions are:

1. All correlations w.r.t. Q vanish for assignment variables at different sites.
2. The parameters q_{il} can be identified with the Q-averages \langle M_{il} \rangle.

Calculating stationary conditions by differentiation of the free energy (16) with respect to the parameters q_{il}, a system of coupled transcendental, so-called mean-field equations is obtained, which is solved by a convergent asynchronous update scheme.

Theorem 1. Let H be an arbitrary clustering criterion and v a site visitation schedule. Then, for arbitrary initial conditions, the following asynchronous update scheme converges to a local minimum of the generalized free energy F_T (16) over Q:

q_{il}^{new} = \frac{\exp[-h_{il}/T]}{\sum_{k=1}^{K} \exp[-h_{ik}/T]}, \qquad \text{where } h_{il} = \left. \frac{\partial \langle H \rangle_Q}{\partial q_{il}} \right|_{Q^{old}} \text{ and } i = v(t).   (19)

Notice that the variables h_{il}, called mean-fields by analogy to the naming convention in statistical physics, are only auxiliary parameters to compactify the notation. The update scheme is essentially a non-linear Gauss-Seidel relaxation to iteratively solve the coupled transcendental equations. Combining the convergent update scheme with an annealing schedule yields a GNC-like [23] algorithm, because F_T is convex over Q for T sufficiently large. For a derivation and more mathematical details please see Refs. [9,14].

There is a tight relationship between the quantities g_{il} = H(s_i(M, e_l)) involved in implementing the Gibbs sampler in Eq. (17) and the mean-field equations. Rewriting Eq. (19) we arrive at

h_{il} = \sum_{M \in \mathcal{M}} \frac{M_{il} H(M) Q(M)}{q_{il}} = \sum_{M \in \mathcal{M}} H(s_i(M, e_l)) Q(M) = \langle g_{il} \rangle_Q.   (20)
Thus the mean-field h_{il} is a Q-averaged version of the local costs g_{il}.

3.3. Gibbs sampling for clustering of proximity data

To efficiently implement the Gibbs sampler one has to optimize the evaluation of H for a sequence of locally modified assignment matrices. It is an important observation that the quantities g_{il} only have to be computed up to an additive shift, which may depend on the site index i but not on l. The choice of g_{il}(M) = H(s_i(M, e_l)) - H(s_i(M, 0)) leads to compact analytical expressions, because the contributions of the reduced system without site i are subtracted. For the functional
representation of additive clustering criteria in Proposition 1, the general formula for the Gibbs fields is given by

g_{il} = 2 \sum_{j \ne i} M_{jl} t^{(1)}(D_{ij}, n_l^{+i})
      - 2 \sum_{j \ne i} \sum_{k \ne l} M_{jk} t^{(2)}(D_{ij}, n_l^{+i}, n_k)
      + \sum_{j \ne i} \sum_{k \ne i, j} M_{jl} M_{kl} [ t^{(1)}(D_{jk}, n_l^{+i}) - t^{(1)}(D_{jk}, n_l^{-i}) ]
      - \sum_{j \ne i} \sum_{k \ne i, j} M_{jl} \sum_{m \ne l} M_{km} [ t^{(2)}(D_{jk}, n_l^{+i}, n_m) - t^{(2)}(D_{jk}, n_l^{-i}, n_m) ].   (21)

Here, n_l^{+i} = n_l + 1 - M_{il} is the cluster size after adding object o_i to cluster C_l, and n_l^{-i} = n_l - M_{il} the cluster size after eliminating the object from that cluster. In order to derive the Gibbs fields for specific cost functions, the occurring differences between the contribution functions are calculated. As an example, we display the Gibbs field equation for H^{pc1} explicitly:

g^{pc1}_{il} = \sum_{j \ne i} \left[ \frac{1}{n_l^{+i}} + \frac{1}{n_l^{-i}} \right] M_{jl} D_{ij} - \frac{1}{n_l^{+i} n_l^{-i}} \sum_{j \ne i} \sum_{k \ne i, j} M_{jl} M_{kl} D_{jk}.   (22)

In order to obtain an efficient implementation of the Gibbs sampler, we propose to utilize book-keeping quantities, e.g., for intra-cluster compactness criteria the cluster sizes n_l, a_{il} = \sum_{j \ne i} M_{jl} D_{ij}, and A_l = \sum_{i=1}^{N} M_{il} a_{il}. The computation of all Gibbs fields based on the book-keeping quantities then requires O(NK) arithmetical operations. After locally changing the assignment of a single object, the update of all book-keeping quantities requires O(N) operations, and thus O(N^2) for a complete sweep. The generalization to the case of sparse proximity matrices is straightforward. However, the average \bar D_l^{(II)} yields more complex equations, since it involves object-specific normalization functions, while \bar D_l^{(I)} has a cluster-specific normalization. In the case of hierarchical clustering, the additive combination of contributions from different data partitionings allows us to reduce the calculation of Gibbs fields to the case of flat clustering, g_{il} = \sum_{K=2}^{K_max} g^K_{i a(l,K)}, where g^K_{ia} denotes the Gibbs field corresponding to cluster C_a in the K-cluster partitioning, and a(l, K) denotes the index of the (super-)cluster with which the leaf cluster C_l is associated in the coarse-level partitioning with K clusters.

3.4. Mean-field annealing for clustering of proximity data

According to Theorem 1 and exploiting Eq. (20), the problem of calculating the mean-fields h_{il} is reduced to the problem of Q-averaging the quantities g_{il}. The main technical difficulty in calculating the mean-field equations are the averages of the normalization constants. Although every Boolean function has a polynomial normal form, which would in principle eliminate the denominator, some approximations have to be introduced to avoid exponential order in the number of conjunctions. This is done by independently averaging the numerator and the normalization in the denominator. Using the correlation properties of factorial distributions, this leads to h_{il} = g_{il}(\langle M \rangle). Thus approximate mean-field equations are obtained by simply replacing the Boolean variables in the Gibbs field equations by their probabilistic counterparts. MFA offers the possibility to track the phase transitions (cluster splits) in order to obtain a meaningful tree topology. This strategy was first pursued by Rose et al. for vector quantization [12,24] and can be implemented by tentatively splitting clusters at each temperature level and fusing degenerate clusters after convergence of the mean-field equations, until the maximal number of clusters K_max is reached.
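The following Python sketch puts Eqs. (19) and (22) together in the spirit of this section: the Gibbs field of H^{pc1} is evaluated with the Boolean assignments replaced by mean-field probabilities q_il, and the temperature is lowered geometrically. Initialization, cooling schedule, numerical guards and function names are illustrative assumptions, not the authors' implementation, and the inner loop is deliberately unoptimized (the book-keeping quantities of Section 3.3 would avoid recomputing the pairwise sums).

```python
# Mean-field annealing sketch for H^pc1 on a full dissimilarity matrix D (zero diagonal).
import numpy as np

def mfa_pc1(D, K, T0=1.0, T_min=1e-3, cooling=0.9, sweeps=5, seed=0):
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    q = rng.dirichlet(np.ones(K), size=N)                 # soft assignments q_il
    T = T0
    while T > T_min:
        for _ in range(sweeps):
            for i in rng.permutation(N):                  # site visitation schedule
                n = q.sum(axis=0)
                n_minus = np.maximum(n - q[i], 1e-9)      # n_l^{-i}
                n_plus = n_minus + 1.0                    # n_l^{+i}
                a_i = D[i] @ q                            # sum_j q_jl D_ij
                t_l = np.einsum('jl,jk,kl->l', q, D, q)   # sum_{j,k} q_jl q_kl D_jk
                rest = t_l - 2.0 * q[i] * a_i             # pairs not involving site i
                h = (1.0 / n_plus + 1.0 / n_minus) * a_i - rest / (n_plus * n_minus)
                h = h - h.min()                           # softmax is shift-invariant
                q[i] = np.exp(-h / T)                     # Eq. (19)
                q[i] /= q[i].sum()
        T *= cooling
    return q.argmax(axis=1)                               # hard assignments as T -> 0
```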
4. Applications and experimental results

Performance comparisons of different optimization algorithms are straightforward, because the obtained solution quality is assessed by the objective function itself. In contrast, empirical verification is much more difficult for the modeling problem. A decisive evaluation of the quality of a clustering solution is only possible if information about ground truth is available. This is the case in both applications discussed in the sequel: texture segmentation and document retrieval. These problems are interesting in their own right, but the obtained results promise that the developed methods are of relevance in many other application domains as well.

4.1. Unsupervised segmentation of textured images

4.1.1. Proximity-based texture segmentation

The unsupervised segmentation of textured images is widely recognized as a challenging computer vision problem. The main conceptual difficulty is the definition of an appropriate homogeneity measure in mathematical terms. Many explicit texture models have been considered in the last three decades. Textures are often represented by feature vectors, e.g., by the means and variances of a filter bank output [25], wavelet coefficients [26], or as parameters of an explicit Markov random field model [27]. Feature-based approaches, however, often suffer from the inadequacy of the metric utilized in parameter space to appropriately represent visual dissimilarities between different textures, a problem which is severe for unsupervised segmentation. It is an important observation due to Geman et al. [17] that the segmentation
problem can be defined in terms of pairwise dissimilarities between textures, without the need for explicit texture features. Our approach to unsupervised texture segmentation is based on a Gabor wavelet scale-space image representation with frequency-tuned filters [13,14,25,28]. The similarity between pairs of texture patches is then measured by a statistical test applied to the empirical feature distribution functions of locally sampled Gabor coefficients. We have intensively investigated several tests with respect to their texture discrimination ability [29]. As a result, a chi-square statistic will be used throughout this work.

Denote by I(x) the vector of Gabor coefficients at a position x. I(x) contains information about the spatial relationship between pixels in the neighborhood of x, but may not capture the complete characteristics of the texture. Therefore, the (weighted) empirical distribution of Gabor coefficients in a window around x_i is considered,

f_i^{(r)}(t) = \sum_{y : t_{k-1} \le I_r(y) < t_k} W_r(\|x_i - y\|) \Big/ \sum_{y} W_r(\|x_i - y\|) \quad \text{for } t \in [t_{k-1}; t_k).   (23)

W_r denotes a non-negative, monotonically decreasing window function centered at the origin. t_0 = 0 < t_1 < ... < t_L is a suitable binning. Here we consider only the computationally simplest choice of square windows with a constant weight inside and zero weight outside. The window size is chosen proportional to the standard deviation \sigma_r of the Gabor filter [25]. The dissimilarity between textures at two positions x_i and x_j is evaluated independently for each Gabor filter according to

D_{ij}^{(r)} = \chi^2 = \sum_{k=1}^{L} \frac{(f_i^{(r)}(t_k) - \hat f(t_k))^2}{\hat f(t_k)}, \qquad \text{where } \hat f(t_k) = [f_i^{(r)}(t_k) + f_j^{(r)}(t_k)]/2.   (24)

The dissimilarities D_{ij}^{(r)} are finally combined by the L_1 norm, D_{ij} = \sum_r D_{ij}^{(r)}. The L_1-norm is less sensitive to differences in single channels than the L_\infty-norm proposed in Ref. [17], and empirically showed the best performance within the L_p-norm family [29].

4.1.2. Sparse pairwise clustering

We selected a representative set of 86 micro-patterns from the Brodatz texture album [30] to empirically test the segmentation algorithms on a wide range of textures.⁵

⁵ We a priori excluded the textures d25-d26, d30-d31, d39-d48, d58-d59, d61-d62, d88-d89, d91, d96-d97, d99, d107-d108 by visual inspection due to missing micro-pattern properties, i.e., all textures are excluded where the texture property is lost when considering small image parts.

A database of random mixtures (512 x 512 pixels each) containing 100 entities of K = 5 textures each (as
depicted in Fig. 1) was constructed from this collection. All segmentations are based on a filter bank of twelve Gabor filters with four orientations and three scales. For each image a subset of 64 x 64 sites was considered. For each site we used a square window of size 8 x 8 for the smallest scale. A sparse neighborhood including the 4 nearest neighbors and (on average) 80 randomly selected neighbors was chosen.

As an independent reference algorithm using a frequency-based feature extraction we selected the approach of Jain and Farrokhnia [25], which we refer to as Gabor Feature Clustering (GFC). The vector of Gabor coefficients at a position x_i is non-linearly transformed by taking the absolute value of the hyperbolic tangent of the real part. Then a Gaussian smoothing filter is applied and the resulting feature vectors are rescaled to zero mean and unit variance. The texture segmentation problem is formulated as a clustering problem using the K-means clustering criterion with a Euclidean norm (10). We have chosen a deterministic annealing algorithm for clustering of vectorial data due to Rose et al. [31], which was empirically found to yield slightly better results than the clustering technique proposed in Ref. [25]. In order to obtain comparable results we used the same 12 Gabor filters and extracted feature vectors on the 64 x 64 regular sub-lattice of sites. As an example of an agglomerative clustering method we selected Ward's method [1], which experimentally achieved substantially better results than single and complete linkage. For all methods, small and narrow regions were removed in a simple postprocessing step to avoid the typical speckle-like noise inherent to all clustering methods under consideration [17].

Table 1 summarizes the obtained mean and median values for all cost functions under consideration, evaluated on the database of mixture images with K = 5 textures each. In addition, we report the percentage of outliers with more than 20% segmentation error, which we define as structural segmentation errors, since typically complete textures are missed. For H^{pc1}_I (H^{pc1}_II) a median segmentation error rate as low as 3.7% (3.6%) was obtained. Both cost functions yield very similar results, as expected, and exhibit only few outliers. We recommend the use of H^{pc1}_I, because it can be implemented more efficiently. For H^{pc2}_I both mean and median error are larger.⁶ We conclude that in most cases the invariant

⁶ The missing robustness properties render H^{pc2}_II inapplicable. As a compensation one may add prior costs penalizing the generation of extremely small clusters, e.g., by adding \lambda_s(D) \sum_l n_l^{-1}. Yet, such prior costs violate the invariance axioms, as \lambda_s(D) \sim \sum_{i \ne j} D_{ij}/(N-1) in order to fulfill scale invariance, but on the other hand \lambda_s(D) = \lambda_s(D + \Delta d) by the shift invariance requirement. The robustness deficiency is partially compensated by choosing an appropriate prior, but at the cost of empirically fixing an additional, data-dependent algorithmic parameter, which has to be considered a major deficiency in the context of unsupervised texture segmentation.
Fig. 1. Typical segmentation results using 5 clusters for different cost functions before postprocessing. Misclassified sites are depicted in black.
Table 1
Mean and median error compared to ground truth for segmenting 100 randomly generated images with K = 5 textures each using MFA

                   H^{pc1}_I   H^{pc1}_II   H^{pc2}_I   H^{gp}   H^{nc}   Ward    GFC
Median (%)            3.7         3.6          5.0        4.0      4.0     7.7     6.7
Mean (%)              5.8         6.0          7.7        7.7      6.6    11.5    10.6
20%-quantile (%)      6           5            9         11        9      18      18

Note: The columns correspond to different cost functions H. For H^{pc2}_I a prior with \lambda_s = (150/N) E[D_{ij}] was used, while the data were shifted by \Delta d = 0.1 - E[D_{ij}] for H^{gp}.
Fig. 2. Segmentations obtained by H^{gp} for several data shifts: the original image and segmentations with a mean dissimilarity of -0.05, 0, 0.05, 0.1, 0.15, 0.2 and 0.25 are depicted. Segments collapse for negative shifts. For large positive shifts the sampling noise induced by the random neighborhood system dominates the data contributions.
cost functions based on a pairwise data clustering formalization capture the true structure of the image. Furthermore, the robustness property of H^{pc1} has proven to be advantageous. The feature-based GFC as well as Ward's method are clearly outperformed.

The unnormalized cost function H^{gp} severely suffers from the missing shift-invariance property, as shown in Fig. 2. Depending on the shift, the unnormalized cost function often completely misses several texture classes. There may not even exist a parameter value that finds all five textures. Even worse, the optimal value depends on the data at hand and varies for different images. With H^{gp} a median error rate of 4.0%, with substantially more outliers, was achieved. The data were shifted to a mean dissimilarity of E[D_{ij}] = 0.1, a value which was obtained after extensive experimentation. For the normalized cut H^{nc} a median error rate of 4.0% and 9% outliers were achieved, which is better than the unnormalized graph partitioning cost function, but worse than the invariant normalized criterion H^{pc1}. The dissimilarity data has been scaled to a maximal value of 1 and has then been transformed by D^{new}_{ij} = \exp(-D_{ij}/c), as suggested by Shi et al. [18], with a parameter c = 0.25 determined by extensive benchmarking.

We thus conclude that shift and scale invariance are important properties to avoid parameter fine-tuning of
Fig. 3. Segmentation error for different neighborhood sizes for K = 5, K = 10 and K = 16 before postprocessing.
sensitive parameters, and that the increased computational complexity for the additional normalizations in H^{pc1} is well spent. In Fig. 3 the effect of data sparseness is examined. The asymptotic segmentation error is already obtained for highly sparse data matrices. The neighborhood size needed grows moderately with the number K of segments. Clustering of sparse data is therefore a successful concept for large-scale applications.

4.1.3. Gibbs sampling and MFA

Another important question concerns the quality of the MFA algorithm as opposed to stochastic procedures. The quality of the proposed clustering algorithm was evaluated by comparing the costs and the errors of the achieved segmentation with the local ICM algorithm and with the stochastic Gibbs sampling method. The error results are summarized in Table 2. For the graphical representation the distribution of the differences of costs was chosen. As an example, the cost and error differences for H^{pc1}_I using MFA versus ICM and MFA versus Gibbs sampling are depicted in Fig. 4. Compared with the ICM algorithm, a substantial improvement both in terms of energy and segmentation quality has to be noted. As expected, the ICM algorithm frequently gets stuck in inferior local minima. On the other hand, the ICM algorithm runs notably faster than the other ones. The comparison with the Gibbs sampler is more difficult, as the performance of the Gibbs sampler improves with slower cooling rates. We decided to use an approximately similar running time for both MFA and the Gibbs
Table 2
Mean and median error compared to ground truth for segmenting 100 randomly generated images with K = 5 textures each for ICM, MFA and the Gibbs sampler before postprocessing

                   ICM    MFA    Gibbs sampling
Median (%)          6.5    5.4       5.4
Mean (%)           11.6    7.6       7.6
20%-quantile (%)   23      7         7

Note: The results have been obtained for the cost function H^{pc1}_I.
sampler in our current implementation.⁷ MFA and Gibbs sampling yield similar results. In all cases the differences are small. A rather conservative annealing schedule has been used. Empirically, little improvement has been observed for the Gibbs sampler with slower annealing, although it is well established that for logarithmic annealing schedules the Gibbs sampling scheme converges to the global minimum in probability [33]. Because of this global optimization property of Gibbs sampling, we conclude that MFA yields near-optimal solutions in most runs. Since the loss in segmentation quality for MFA with a faster annealing schedule is substantially lower than for Gibbs sampling, the MFA
⁷ About 300 s on a SUN Ultra-Sparc. For MFA this can be improved to 3-5 s using multiscale annealing techniques [32].
Fig. 4. The empirical density of (a) the cost differences and (b) the segmentation errors of MFA versus ICM and versus the Gibbs sampler, evaluated over 100 images.
algorithm is a good choice within a certain window of the speed-quality trade-off.

4.1.4. Hierarchical clustering

The result of a hierarchical segmentation of a test image containing K = 16 Brodatz textures is depicted in Fig. 5. All textures have been correctly identified and the borders are localized precisely. Stable solutions according to our criterion have been detected for K = 11 and K = 16. The hierarchical structure detected is in accordance with the psychophysical expectation. A segmentation example for an aerial image of San Francisco with the same set of parameters is shown in Fig. 6. Applying the proposed validation criterion, the segmentations with K = 3, 4 and 9 are selected. K = 6 possesses significant local stability. The hierarchical organization is very intuitive: the first split separates land and ocean. At later stages homogeneously tilled areas are distinguished from vegetation. The results for Ward's method and for the complete linkage algorithm are less satisfying. In the segmentation obtained by Ward's method land and ocean are mixed, while for complete linkage several spurious segments occur. We conclude that the optimization approach to hierarchical clustering yields semantically consistent segmentation hierarchies. These methods therefore offer an attractive alternative to the widely used family of agglomerative clustering algorithms.

4.2. Clustering for information retrieval

4.2.1. Proximity-based clustering of document databases

Information retrieval in large databases is one of the key topics in data mining. The problem is most severe in cases where the query cannot be formulated precisely, e.g., in natural language interfaces for documents or in
image databases. Typically, one would like to obtain those entries which best match a given query according to some similarity measure. Yet, it is often difficult to reliably estimate similarities, because the query may not contain enough information, e.g., not all possibly relevant keywords might occur in a query for documents. Therefore, one often applies the cluster hypothesis [34]: if an entry is relevant to a query, similar entries may also be relevant to the query although they may not possess a high similarity to the query itself. Clustering thus provides a way of pre-structuring a database for the purpose of improved information retrieval [35].

Following state-of-the-art techniques, we utilized a word stemmer and a stop word list to automatically generate index terms. A document is represented by a (sparse) binary vector B, where each entry corresponds to the occurrence of a certain index term. As a measure of association between two documents we utilized the cosine measure, which normalizes the intersection with the geometric mean of the number of index terms,

D_{ij} = 1 - \frac{B_i^t B_j}{\sqrt{|B_i| |B_j|}}.   (25)
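A small sketch of Eq. (25) for binary term-occurrence vectors follows; the matrix layout and function name are illustrative assumptions, not taken from the paper.

```python
# Cosine dissimilarity (Eq. 25): one minus the intersection size normalized by
# the geometric mean of the number of index terms in the two documents.
import numpy as np

def cosine_dissimilarity(B):
    """B: (N, V) binary document-term matrix. Returns the (N, N) matrix D."""
    B = B.astype(float)
    overlap = B @ B.T                                    # B_i^t B_j
    counts = B.sum(axis=1)                               # |B_i|, number of index terms
    norm = np.sqrt(np.outer(counts, counts))
    D = 1.0 - overlap / np.maximum(norm, 1e-12)
    np.fill_diagonal(D, 0.0)
    return D
```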
Other commonly applied measures are the Jaccard coefficient and the Dice coefficient [36]. We have tested the different clustering criteria and algorithms on the Medline (MED) database consisting of N = 1033 abstracts of medical articles. The MED collection has 30 queries with known relevance assessments. Based on this ground truth data, we evaluate the two most important quantities for retrieval performance: the precision P (the percentage of returned documents which are relevant) and the recall R (the percentage of relevant documents which are actually returned). There is obviously a tradeoff between these quantities, and we plot retrievals as
Fig. 5. Mixture image with 16 Brodatz micro-textures. For the segmentation, K = 24 and |N|/2 = 150N evaluated dissimilarities were used.
points in the precision/recall plane. The measures are combined in terms of the so-called effectiveness [34], E(\beta) = 1 - (1 + \beta^2) P R / (\beta^2 P + R), where \beta weights precision versus recall. Since we are mainly interested in a comparison between different clustering solutions, we assume a simplified retrieval model, where the user interactively specifies the most promising cluster. Then, all documents in that cluster are returned. In a more realistic application this can be based on information about documents which are already known to be relevant, or on cluster summaries presented to the user (e.g., as shown in Fig. 8).

Fig. 7 shows plots of (P, R) pairs for solutions obtained by different clustering algorithms and different K on the MED collection with ideal cluster search. We summarize the most remarkable facts: (i) Among the linkage algorithms, Ward's method shows consistently the best performance. (ii) Among the optimization methods, H^{pc1} and H^{ps1b} perform consistently better than the graph partitioning objective function H^{gp}, although the additive data shift has been empirically optimized. On coarser levels with small K, Ward's method performs better, while a global optimization of H^{pc1} shows significant improvement for larger K. The reason for this behavior is the violation of the 'natural' data granularity for small K. In that regime the global maximization of cluster compactness leads to an unfavorable division of meaningful smaller clusters. If more documents should be returned at a lower reliability, it might thus be advantageous to take a finer data partitioning and to additionally return documents of more than one cluster. Table 3 summarizes the effectiveness maximized over different K for perfect retrieval (E*) and best cluster match based on the query (E).
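The retrieval evaluation described above is simple to restate in code. The sketch below assumes the simplified model in which one cluster is returned as a whole; the function name and the handling of the degenerate P = R = 0 case are our own choices.

```python
# Precision P, recall R and effectiveness E(beta) = 1 - (1 + beta^2) P R / (beta^2 P + R).
def evaluate_cluster(returned, relevant, beta=1.0):
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    P = hits / len(returned) if returned else 0.0
    R = hits / len(relevant) if relevant else 0.0
    if P == 0.0 and R == 0.0:
        return P, R, 1.0                                  # worst possible effectiveness
    E = 1.0 - (1.0 + beta**2) * P * R / (beta**2 * P + R)
    return P, R, E
```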
Fig. 6. Aerial image and hierarchical segmentation of a section of San Francisco.
As an illustrative example for hierarchical clustering of document databases we decided to cluster documents having the term clustering in their title. We collected 1568 abstracts from journal and conference papers. The top levels of a hierarchical solution with K_max = 60 are visualized in Fig. 8. The clusters are characterized by terms having a high frequency of occurrence and being typical at the same time. More specifically, we utilized t_l = p_l^2 / \bar p, where p_l is the frequency inside a cluster C_l and \bar p denotes the global frequency.
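The term-characterization score t_l = p_l^2 / p_bar used to label the clusters in Fig. 8 can be sketched as follows; the data layout (binary document-term matrix plus a cluster label per document) is an illustrative assumption, and non-empty clusters are assumed.

```python
# Rank index terms of cluster l by typicality t = p_l^2 / p_bar.
import numpy as np

def characteristic_terms(B, labels, l, top=5):
    """B: (N, V) binary document-term matrix; labels: cluster index per document."""
    in_cluster = B[labels == l]
    p_l = in_cluster.mean(axis=0)                         # term frequency inside C_l
    p_bar = np.maximum(B.mean(axis=0), 1e-12)             # global term frequency
    score = p_l ** 2 / p_bar
    return np.argsort(score)[::-1][:top]                  # indices of the top terms
```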
5. Conclusion

We have presented a rigorous optimization approach to similarity-based data clustering. The axiomatic approach attains a systematization of clustering criteria and yields theoretical insight which has proven to be highly relevant for practical applications. Our framework also provides a connection between graph partitioning optimization problems studied in operations research and linkage algorithms like Ward's method known from cluster analysis. In particular, we have shown that partitional methods are not limited to vectorial data and a characterization of clusters by centroids, nor do they exclude the incomplete data case or nested cluster hierarchies.

The second contribution of this paper concerns the derivation of efficient optimization heuristics based on annealing. These methods are applicable to essentially arbitrary clustering criteria, and are as universal as, for example, the agglomerative clustering scheme. Empirically, annealing methods have been shown to yield significantly better solutions than local descent algorithms like ICM. Although they are not guaranteed to find the global minimum, the solutions found are often 'good enough' if the typical modeling uncertainty of unsupervised learning problems is taken into account (in the sense that the ground truth will most of the time not perfectly correspond to a global minimum of the objective function).

To stress the generality of our optimization approach we have presented two large-scale applications from very
J. Puzicha et al. / Pattern Recognition 33 (2000) 617}634
Fig. 7. Precision vs. recall results for different clustering algorithms on the MED collection. (a) Non-hierarchical optimization methods and Ward's method, (b) hierarchical methods.
Fig. 8. Hierarchical clustering of 'clustering' documents. Numbers denote cluster sizes followed by the five most characteristic terms. The solutions K = 2, 10, 20 were selected according to the proposed pruning criterion.
Table 3
Effectiveness of document retrieval for different clustering models

Measure    H1#1 (N,K)   H1#1 (N)     H141" (N,K)  H'1 (N,K)    Ward's       Complete      Single
E*(0.5)    0.34 (70)    0.38 (56)    0.36 (60)    0.38 (65)    0.38 (84)    0.41 (168)    0.74 (426)
E*(1.0)    0.37 (38)    0.41 (56)    0.35 (38)    0.38 (40)    0.45 (75)    0.52 (140)    0.76 (424)
E*(2.0)    0.36 (36)    0.42 (44)    0.31 (32)    0.31 (32)    0.44 (33)    0.54 (76)     0.73 (426)
E(0.5)     0.61 (38)    0.65 (66)    0.66 (50)    0.68 (45)    0.69 (51)    0.78 (78)     0.94 (290)
E(1.0)     0.62 (38)    0.69 (58)    0.64 (32)    0.67 (38)    0.70 (39)    0.82 (74)     0.97 (268)
E(2.0)     0.61 (38)    0.68 (16)    0.59 (32)    0.64 (38)    0.68 (22)    0.85 (70)     0.97 (268)

Note: The number in brackets denotes the corresponding optimal value of K.
different application domains. The results on unsupervised texture segmentation show that similarity-based methods outperform other state-of-the-art techniques. The data sparsening prevents an intractable scaling: even a large number of different textures can be reliably distinguished with reasonably small random graphs. In the context of document retrieval, where similarity-based clustering methods are commonly used, we have shown that optimization methods are a serious alternative to linkage algorithms and are able to identify meaningful document clusters. In contrast to agglomerative methods they have the further advantage of not requiring a complete re-computation if new documents are added to the database.
Acknowledgements

This work was supported by the German Research Foundation (DFG) under grant BU 914/3-1 and by the Federal Ministry of Education and Science (BMBF) under grant 01 M 3021 A/4.
Appendix

Proof of Proposition 1. For notational convenience denote by $M_l$ the $l$th column of $M$. We have to show that all possible dependencies of $t$ on its arguments take the form stated in the proposition. Therefore, we rewrite $t$ in a sequence of equalities, referring to the number of the axiom applied. For a given $t$ there exist functions $\hat{t}$, $t^{(1)}$ and $t^{(2)}$ such that

$t(i,j,D_{ij},M) \stackrel{(1a)}{=} \sum_{l,k=1}^{K} M_{il}M_{jk}\,\hat{t}(D_{ij},l,k,M) \stackrel{(2)}{=} \sum_{l,k=1}^{K} M_{il}M_{jk}\,\hat{t}(D_{ij},l,k,M_l,M_k)$
$\stackrel{(1b)}{=} \sum_{l=1}^{K} M_{il}M_{jl}\,t^{(1)}(D_{ij},M_l) - \sum_{l,k=1,\,l\neq k}^{K} M_{il}M_{jk}\,t^{(2)}(D_{ij},M_l,M_k)$
$\stackrel{(1a)}{=} \sum_{l=1}^{K} M_{il}M_{jl}\,t^{(1)}(D_{ij},n_l) - \sum_{l,k=1,\,l\neq k}^{K} M_{il}M_{jk}\,t^{(2)}(D_{ij},n_l,n_k).$

A reduced set of arguments for a function is used to indicate the corresponding invariance property of the function. For example, $t^{(1)}(D_{ij},n_l)$ is defined by $t^{(1)}(D_{ij},n_l(M_l)) = t^{(1)}(D_{ij},M_l)$ and, furthermore, indicates that for all $n_l(M_l) = n_l(\hat{M}_l)$: $t^{(1)}(D_{ij},M_l) = t^{(1)}(D_{ij},\hat{M}_l)$. The weighting functions $t^{(1)}$ and $t^{(2)}$ are non-decreasing in the first argument by Axiom 2. $\square$

Proof of Proposition 2. From the shift invariance axiom we obtain the following condition:

$\sum_{l=1}^{K}\sum_{i,j=1,\,i\neq j}^{N} M_{il}M_{jl}\,t^{(1)}(n_l) = N \iff \sum_{l=1}^{K} n_l(n_l-1)\,t^{(1)}(n_l) = N.$

As will be proven in the subsequent lemma, $\sum_{l=1}^{K} f(n_l) = N$ requires $f$ to be an affine combination of $n_l$ and $N/K$. This implies $t^{(1)}(n_l)$ to be an affine combination of $t_1(n_l) = 1/(n_l-1)$ and $t_2(n_l) = N/(K n_l(n_l-1))$. $\square$

Lemma A.1. Let $f:\mathbb{R}\to\mathbb{R}$ be a differentiable function such that $\sum_{l=1}^{K} f(n_l) = N$ for all $(n_1,n_2,\dots,n_K)\in\mathbb{R}_+^K$ with $\sum_{l=1}^{K} n_l = N$. Then $f$ can be written as an affine combination of $n_l$ and $N/K$.

Proof. Calculating the directed derivative with $w\in\mathbb{R}^K$, $w_l = (K-1)/K$ for an arbitrary but fixed $l$ and $w_k = -1/K$ for all $k\neq l$, we obtain

$\frac{\partial f(n_l)}{\partial n_l} = \frac{1}{K}\sum_{k=1}^{K}\frac{\partial f(n_k)}{\partial n_k}.$

Since this has to hold for an arbitrary cluster index $l$, all the derivatives have in fact to be equal: $\partial f(n_l)/\partial n_l = \partial f(n_k)/\partial n_k$. The ansatz $f(n_l) = a\,n_l + b$ yields

$f(n_l) = \lambda n_l + (1-\lambda)\frac{N}{K} \quad \text{for } \lambda\in\mathbb{R}. \qquad\square$

Proof of Proposition 3. The decomposition property of $t^{(2)}$ allows us to apply Lemma A.1, resulting in $f(n_l) = \lambda n_l + (1-\lambda)N/K$ with

$f(n_l) = n_l \sum_{k=1,\,k\neq l}^{K} n_k\, t^{(2)}(n_l,n_k)$

and symmetric solutions obtained from interchanging the first two arguments of $t^{(2)}$. Setting $\lambda = 1$ we obtain $\sum_{k\neq l} n_k\, t^{(2)}(n_l,n_k) = 1$. To consider the dependency on the second argument we calculate directional derivatives with $w_k = 1$, $w_a = -1$, and $w_c = 0$ otherwise, where $a\neq k$. This yields

$\frac{\partial\, n_a t^{(2)}(n_l,n_a)}{\partial n_a} = \frac{\partial\, n_k t^{(2)}(n_l,n_k)}{\partial n_k} \;\Longrightarrow\; t^{(2)}(n_l,n_k) = a(n_l) + \frac{b(n_l)}{n_k}.$

Inserting this back into the original condition in order to determine the functions $a$ and $b$ results in

$(N-n_l)\,a(n_l) + (K-1)\,b(n_l) = 1 \;\Longrightarrow\; a(n_l) = \frac{1}{N-n_l} \;\vee\; b(n_l) = \frac{1}{K-1}.$

A similar calculation is carried out for the case of $\lambda = 0$. The resulting functions $a$ and $b$ are given by

$a(n_l) = \frac{N}{K(N-n_l)n_l} \quad\text{and}\quad b(n_l) = \frac{N}{K(K-1)n_l}.$

From these and the symmetric conditions for interchanging the first and second argument, we obtain 7 elementary weighting functions

$t_1 = \frac{1}{N-n_l},\quad t_5 = \frac{1}{N-n_k},\quad t_2 = \frac{1}{(K-1)n_l},\quad t_6 = \frac{1}{(K-1)n_k},\quad t_3 = \frac{N}{K(N-n_l)n_l},\quad t_7 = \frac{N}{K(N-n_k)n_k},\quad t_4 = \frac{N}{K(K-1)n_l n_k}. \qquad\square$
References

[1] A. Jain, R. Dubes, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ 07632, 1988.
[2] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.
[3] G. Lance, W. Williams, A general theory of classification sorting strategies: II. Clustering systems, Comput. J. 10 (1969) 271-277.
[4] G. McLachlan, K. Basford, Mixture Models, Marcel Dekker, New York, Basel, 1988.
[5] J. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact, well-separated clusters, J. Cybernet. 3 (1974) 32-57.
[6] S. Ahalt, P. Chen, D. Melton, Competitive learning algorithms for vector quantization, Neural Networks 3 (3) (1990) 277-290.
[7] J. Buhmann, H. Kühnel, Complexity optimized data clustering by competitive neural networks, Neural Comput. 5 (1) (1993) 75-88.
[8] P. Brucker, On the complexity of clustering problems, in: R. Henn, B. Korte, W. Oletti (Eds.), Optimierung und Operations Research, Lecture Notes in Economics and Mathematical Systems, Springer, Berlin, 1978, pp. 45-55.
[9] T. Hofmann, J.M. Buhmann, Pairwise data clustering by deterministic annealing, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1) (1997) 1-14.
[10] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (6) (1984) 721-741.
[11] G. Bilbro, W. Snyder, Mean field approximation minimizes relative entropy, J. Opt. Soc. Amer. 8 (2) (1991) 290-294.
[12] K. Rose, E. Gurewitz, G. Fox, A deterministic annealing approach to clustering, Pattern Recognition Lett. 11 (11) (1990) 589-594.
[13] T. Hofmann, J. Puzicha, J. Buhmann, Deterministic annealing for unsupervised texture segmentation, in: Proceedings of the EMMCVPR'97, Lecture Notes in Computer Science, vol. 1223, Springer, Berlin, 1997, pp. 213-228.
[14] T. Hofmann, J. Puzicha, J. Buhmann, Unsupervised texture segmentation in a deterministic annealing framework, IEEE Trans. Pattern Anal. Mach. Intell. 20 (8) (1998) 803-818.
[15] J. Ward, Hierarchical grouping to optimize an objective function, J. Amer. Statist. Assoc. 58 (1963) 236-244.
[16] M. Grötschel, Y. Wakabayashi, A cutting plane algorithm for a clustering problem, Math. Programm. Ser. B 45 (1989) 59-96.
[17] D. Geman, S. Geman, C. Graffigne, P. Dong, Boundary detection by constrained optimization, IEEE Trans. Pattern Anal. Mach. Intell. 12 (7) (1990) 609-628.
[18] J. Shi, J. Malik, Normalized cuts and image segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'97), 1997, pp. 731-737.
[19] T. Hofmann, Data clustering and beyond: a deterministic annealing framework for exploratory data analysis, Shaker Verlag, Ph.D. Thesis, 1997.
[20] T. Hofmann, J. Puzicha, J. Buhmann, An optimization approach to unsupervised hierarchical texture segmentation, in: Proceedings of the IEEE International Conference on Image Processing (ICIP'97), 1997.
[21] S. Kirkpatrick, C. Gelatt, M. Vecchi, Optimization by simulated annealing, Science 220 (4598) (1983) 671-680.
[22] J. Besag, On the statistical analysis of dirty pictures, J. Roy. Statist. Soc. Ser. B 48 (1986) 25-37.
[23] A. Blake, A. Zisserman, Visual Reconstruction, MIT Press, Cambridge, MA, 1987.
[24] D. Miller, K. Rose, Hierarchical, unsupervised learning with growing via phase transitions, Neural Comput. 8 (8) (1996) 425-450.
[25] A. Jain, F. Farrokhnia, Unsupervised texture segmentation using Gabor filters, Pattern Recognition 24 (12) (1991) 1167-1186.
[26] O. Pichler, A. Teuner, B. Hosticka, A comparison of texture feature extraction using adaptive Gabor filtering, pyramidal and tree-structured wavelet transforms, Pattern Recognition 29 (5) (1996) 733-742.
[27] J. Mao, A. Jain, Texture classification and segmentation using multiresolution simultaneous autoregressive models, Pattern Recognition 25 (1992) 173-188.
[28] J. Daugman, Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters, J. Opt. Soc. Amer. A 2 (7) (1985) 1160-1169.
[29] J. Puzicha, T. Hofmann, J. Buhmann, Non-parametric similarity measures for unsupervised texture segmentation and image retrieval, in: Proceedings of the Conference on Computer Vision and Pattern Recognition, 1997.
[30] P. Brodatz, Textures: A Photographic Album for Artists and Designers, Dover Publications, New York, 1966.
[31] K. Rose, E. Gurewitz, G. Fox, Vector quantization by deterministic annealing, IEEE Trans. Inform. Theory 38 (4) (1992) 1249-1257.
[32] J. Puzicha, J. Buhmann, Multiscale annealing for real-time unsupervised texture segmentation, Technical Report IAI-97-4, Institut für Informatik, Universität Bonn (a short version appeared in: Proceedings of ICCV'98, pp. 267-273), 1997.
[33] B. Hajek, Cooling schedules for optimal annealing, Math. Oper. Res. 13 (1988) 311-324.
[34] C. Van Rijsbergen, Information Retrieval, Butterworths, London, Boston, 1979.
[35] P. Willett, Recent trends in hierarchic document clustering: a critical review, Inform. Process. Manage. 24 (5) (1988) 577-597.
[36] P. Sneath, R. Sokal, Numerical Taxonomy, W.H. Freeman and Company, San Francisco, CA, 1973.
About the Author - JAN PUZICHA received the Diploma degree in Computer Science from the University of Bonn, Germany, in 1995. In November 1995, he joined the Computer Vision and Pattern Recognition group at the University of Bonn, where he is currently completing his Ph.D. Thesis on optimization methods for grouping and segmentation. His research interests include image processing, remote sensing, autonomous robots, data analysis, and data mining.

About the Author - THOMAS HOFMANN received the Diploma and Ph.D. degrees in Computer Science from the University of Bonn, in 1993 and 1997, respectively. His Ph.D. research was on statistical methods for exploratory data analysis. In April 1997 he joined the Center for Biological and Computational Learning at the Massachusetts Institute of Technology as a postdoctoral fellow. His research interests are in the areas of pattern recognition, neural networks, graphical models, natural language processing, information retrieval, computer vision, and machine learning.

About the Author - JOACHIM M. BUHMANN received a Ph.D. degree in theoretical physics from the Technical University of Munich in 1988. He held postdoctoral positions at the University of Southern California and at the Lawrence Livermore National Laboratory. Currently, he heads the research group on Computer Vision and Pattern Recognition at the Computer Science department of the University of Bonn, Germany. His current research interests cover statistical learning theory and its applications to image understanding and signal processing. Special research topics include exploratory data analysis, stochastic optimization, and computer vision.
Pattern Recognition 33 (2000) 635-649
Self-annealing and self-annihilation: unifying deterministic annealing and relaxation labeling

Anand Rangarajan
Image Processing and Analysis Group, Departments of Diagnostic Radiology and Electrical Engineering, Yale University, New Haven, CT, USA

Received 15 March 1999
Abstract

Deterministic annealing and relaxation labeling algorithms for classification and matching are presented and discussed. A new approach, self-annealing, is introduced to bring deterministic annealing and relaxation labeling into accord. Self-annealing results in an emergent linear schedule for winner-take-all and linear assignment problems. Self-annihilation, a generalization of self-annealing, is capable of performing the useful function of symmetry breaking. The original relaxation labeling algorithm is then shown to arise from an approximation to either the self-annealing energy function or the corresponding dynamical system. With this relationship in place, self-annihilation can be introduced into the relaxation labeling framework. Experimental results on synthetic matching and labeling problems clearly demonstrate the three-way relationship between deterministic annealing, relaxation labeling and self-annealing. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Deterministic annealing; Relaxation labeling; Self-annealing; Self-amplification; Self-annihilation; Softmax; Softassign
1. Introduction

Labeling and matching problems abound in computer vision and pattern recognition (CVPR). It is not an exaggeration to state that some form or the other of the basic problems of template matching and data clustering has remained central to the CVPR and neural networks (NN) communities for about three decades [1]. Due to the somewhat disparate natures of these communities, different frameworks for formulating and solving these two problems have emerged and it is not immediately obvious how to go about reconciling some of the differences between these frameworks so that they can benefit from each other. In this paper, we pick two such frameworks, deterministic annealing [2] and relaxation labeling [3], which arose mainly in the neural networks and pattern recognition communities, respectively. Deterministic annealing has its origins in statistical physics and more recently in
E-mail address: [email protected] (A. Rangarajan)
Hop"eld networks [4]. It has been applied with varying degrees of success to a variety of image matching and labeling problems. In the "eld of neural networks, deterministic annealing developed from its somewhat crude origins in the Hop"eld}Tank networks [4] to include fairly sophisticated treatment of constraint satisfaction and energy minimization by drawing on well-established principles in statistical physics [5]. Recently, for both matching [6] and classi"cation [7] problems, a fairly coherent framework and set of algorithms have emerged. These algorithms range from using the softmax [8] or softassign [9] for constraint satisfaction and dynamics that are directly derived from or merely mimic the expectation}maximization (EM) approach [10]. The term relaxation labeling (RL) originally referred to a heuristic dynamical system developed in Ref. [11]. RL speci"ed a discrete time dynamical system in which class labels (typically in image segmentation problems) were re"ned while taking relationships in the pixel and label array into account. As interest in the technique grew, many bifurcations, o! shoots and generalizations of the basic idea developed; examples are the product combination rule [12], the optimization approach [13], projected
0031-3203/00/$20.00 © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S0031-3203(99)00077-1
gradient descent [3], discrete relaxation [14], and probabilistic relaxation [15]. RL in its basic form is a discrete time update equation that is suitably (and fairly obviously) modified depending on the problem of interest: image matching, segmentation, or classification. The more principled deviations from the basic form of RL replaced the discrete time update rule by gradient descent and projected gradient descent [3,13] on energy functions. However, recently it has been shown [16] that the original heuristic RL dynamical system minimizes the labeling energy function. It is now fairly clear that both continuous time projected gradient descent and discrete time RL dynamical systems can be used to minimize the same labeling energy function. Much of this development prefigured or ran parallel to the evolution of deterministic annealing (DA) dynamical systems, with at least one major difference. While the concerns of continuous time versus discrete time dynamics were common to both RL and DA approaches, within the DA approaches a fundamental distinction was drawn between matching and labeling problems [17]. This distinction was almost never emphasized in RL. In labeling problems, a set of labels has to be assigned to a set of nodes with the constraint that a node should be assigned only one label. A variety of problems not necessarily restricted to CVPR require labeling constraints; some examples are central and pairwise clustering [7,18], consistent labeling [3], and graph coloring. In matching problems, on the other hand, a set of model nodes has to be assigned to a set of data nodes with the constraint that each model node should be assigned to one and only one data node and vice versa. A variety of problems require matching constraints; some examples are quadratic assignment [2,19], TSP [9,20], graph matching [21,22], graph partitioning (with minor differences) [20,23] and point matching [24,25]. The original neural network approaches used a penalty function approach at fixed temperature [4]. With the importance of deterministic annealing and exact constraint satisfaction becoming clear, these approaches quickly gave way to the softmax [20,23,26-28], softassign [9,22,29], Lagrangian relaxation [29,30] and projected gradient descent [31-34] approaches, usually performed within deterministic annealing. Here, we return to the original relaxation labeling dynamical system since, ironically, it is in the RL discrete time dynamical system that we find the closest parallel to recent discrete time deterministic annealing algorithms. Even after restricting our focus to discrete time dynamical systems, important differences like the manner in which constraint satisfaction is performed, relaxation at a fixed temperature and the nature of the update mechanism remain. A new approach, self-annealing, is presented to unify relaxation labeling and deterministic annealing. We show that the self-annealing dynamical system which is derived from a corresponding energy
function corresponds to deterministic annealing with a linear schedule. Also, the original RL update equation can be derived from the self-annealing dynamical system via a Taylor-series approximation. This suggests that a close three-way relationship exists between DA, RL and self-annealing with self-annealing acting as a bridge between DA and RL.
2. Deterministic annealing

Deterministic annealing arose as a computational shortcut to simulated annealing. Closely related to mean field theory, the method consists of minimizing the free energy at each temperature setting. The free energy is separately constructed for each problem. The temperature is reduced according to a pre-specified annealing schedule. Deterministic annealing has been applied to a variety of combinatorial optimization problems - winner take all (WTA), linear assignment, quadratic assignment including the traveling salesman problem, graph matching and graph partitioning, clustering (central and pairwise), the Ising model, etc. - and to nonlinear optimization problems as well, with varied success. In this paper, we focus on the relationship between deterministic annealing and relaxation labeling with emphasis on matching and labeling problems. The archetypal problem at the heart of labeling problems is the winner take all, and similarly for matching problems it is linear assignment that is central. Consequently, our development dwells considerably on these two problems.

2.1. The winner take all

The winner take all problem is stated as follows: Given a set of numbers $T_i$, $i\in\{1,\dots,N\}$, find $i^* = \arg\max_i (T_i, i\in\{1,\dots,N\})$ or, in other words, find the index of the maximum number. Using $N$ binary variables $s_i$, $i\in\{1,\dots,N\}$, the problem is restated as

$\max_s \sum_i T_i s_i$   (1)

$\text{s.t.} \sum_i s_i = 1, \quad \text{and} \quad s_i\in\{0,1\}, \ \forall i.$   (2)

The deterministic annealing free energy is written as follows:

$F_{\mathrm{wta}}(v) = -\sum_i T_i v_i + \lambda\Big(\sum_i v_i - 1\Big) + \frac{1}{\beta}\sum_i v_i\log v_i.$   (3)

In Eq. (3), $v$ is a new set of analog mean field variables summing to one. The transition from binary variables $s$ to analog variables $v$ is deliberately highlighted here.
Also, $\beta$ is the inverse temperature to be varied according to an annealing schedule, and $\lambda$ is a Lagrange parameter satisfying the WTA constraint. The $x\log x$ form of the barrier function keeps the $v$ variables positive and is also referred to as an entropy term. We now proceed to solve for the $v$ variables and the Lagrange parameter $\lambda$. We get (after eliminating $\lambda$)

$v_i(\beta) = \frac{\exp(\beta T_i)}{\sum_j \exp(\beta T_j)}, \quad \forall i, \ i\in\{1,\dots,N\}.$   (4)

This is referred to as the softmax nonlinearity [8]. Deterministic annealing WTA uses the nonlinearity within an annealing schedule. (Here, we gloss over the technical issue of propagating the solution at a given temperature $\beta_n$ to be the initial condition at the next temperature $\beta_{n+1}$.) When there are no ties, this algorithm finds the single winner for any reasonable annealing schedule - quenching at high $\beta$ being one example of an 'unreasonable' schedule.
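A minimal numerical sketch of the deterministic annealing WTA of Eq. (4) is given below, assuming a simple linear schedule for $\beta$; the schedule parameters and variable names are illustrative.

    import numpy as np

    # Sketch: deterministic annealing winner-take-all, Eq. (4).
    # v(beta) is the softmax of beta * T; beta follows a linear schedule.

    def softmax(x):
        x = x - np.max(x)                 # stabilize the exponentials
        e = np.exp(x)
        return e / e.sum()

    def da_wta(T, beta0=0.1, beta_rate=0.1, n_steps=200):
        beta = beta0
        for _ in range(n_steps):
            v = softmax(beta * T)         # mean-field solution at this temperature
            beta += beta_rate             # anneal (lower the temperature)
        return v

    T = np.array([0.3, 0.9, 0.5, 0.1])
    print(da_wta(T))                      # concentrates on the index of max(T)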
2.2. The linear assignment problem

The linear assignment problem is written as follows: Given a matrix of numbers $A_{ai}$, $a, i\in\{1,\dots,N\}$, find the permutation that maximizes the assignment. Using $N^2$ binary variables $s_{ai}$, $a, i\in\{1,\dots,N\}$, the problem is restated as

$\max_s \sum_{ai} A_{ai} s_{ai}$   (5)

$\text{s.t.} \sum_a s_{ai} = 1, \quad \sum_i s_{ai} = 1, \quad \text{and} \quad s_{ai}\in\{0,1\}, \ \forall a, i.$   (6)

The deterministic annealing free energy is written as follows:

$F_{\mathrm{ap}}(v) = -\sum_{ai} A_{ai} v_{ai} + \sum_a \mu_a\Big(\sum_i v_{ai} - 1\Big) + \sum_i \nu_i\Big(\sum_a v_{ai} - 1\Big) + \frac{1}{\beta}\sum_{ai} v_{ai}\log v_{ai}.$   (7)

In Eq. (7), $v$ is a doubly stochastic mean field matrix with rows and columns summing to one. $(\mu,\nu)$ are Lagrange parameters satisfying the row and column WTA constraints. As in the WTA case, the $x\log x$ form of the barrier function keeps the $v$ variables positive. We now proceed to solve for the $v$ variables and the Lagrange parameters $(\mu,\nu)$ [2,29]. We get

$v_{ai}(\beta) = \exp(\beta A_{ai} - \beta[\mu_a + \nu_i]), \quad \forall a, i, \ a, i\in\{1,\dots,N\}.$   (8)

The assignment problem is distinguished from the WTA by requiring the satisfaction of two-way WTA constraints as opposed to one. Consequently, the Lagrange parameters cannot be solved for in closed form. Rather than solving for the Lagrange parameters using steepest ascent, an iterated row and column normalization method is used to obtain a doubly stochastic matrix at each temperature [9,29]. Sinkhorn's theorem [35] guarantees the convergence of this method. (This method can be independently derived as coordinate ascent w.r.t. the Lagrange parameters.) With Sinkhorn's method in place, the overall dynamics at each temperature is referred to as the softassign [9]. Deterministic annealing assignment uses the softassign within an annealing schedule. (Here, we gloss over the technical issue of propagating the solution at a given temperature $\beta_n$ to be the initial condition at the next temperature $\beta_{n+1}$.) When there are no ties, this algorithm finds the optimal permutation for any reasonable annealing schedule.
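A small sketch of the softassign step described above follows, using the standard iterated row/column (Sinkhorn) normalization; the matrix values and the fixed iteration counts are illustrative.

    import numpy as np

    # Sketch: softassign for the linear assignment problem (Eqs. (5)-(8)).
    # Exponentiate beta*A, then alternate row and column normalizations (Sinkhorn).

    def softassign(A, beta, n_sinkhorn=50):
        v = np.exp(beta * (A - A.max()))          # positive matrix; the Lagrange terms are handled by normalization
        for _ in range(n_sinkhorn):
            v = v / v.sum(axis=1, keepdims=True)  # row normalization
            v = v / v.sum(axis=0, keepdims=True)  # column normalization
        return v                                  # approximately doubly stochastic

    A = np.array([[0.2, 0.9, 0.1],
                  [0.8, 0.3, 0.4],
                  [0.1, 0.2, 0.7]])
    for beta in (1.0, 10.0, 100.0):
        print(np.round(softassign(A, beta), 2))   # hardens towards a permutation as beta grows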
Having speci"ed the two archetypal problems, the winner take all and assignment, we turn to other optimization problems which frequently arise in computer vision, pattern recognition and neural networks. 2.3.1. Clustering and labeling Clustering is a very old problem in pattern recognition [1,36]. In its simplest form, the problem is to separate a set of N vectors in dimension d into K categories. The precise statement of the problem depends on whether central or pairwise clustering is the goal. In central clustering, prototypes are required, in pairwise clustering, a distance measure between any two patterns is needed [18,37]. Closely related to pairwise clustering is the labeling problem where a set of compatibility coe$cients are given and we are asked to assign one unique label to each pattern vector. In both cases, we can write down the following general energy function: 1 E (s)"! + C s s ai_bj ai bj -!" 2 (9) aibj + s "1, and s 3M0, 1N, ∀a, i. ai ai a (This energy function is a simpli"cation of the pairwise clustering objective function used in Refs. [18,37], but it serves our purpose here.) If the set of compatibility coe$cients C is positive de"nite in the subspace of the one-way WTA constraint, the local minima are WTAs with binary entries. We call this the quadratic WTA (QWTA) problem, emphasizing the quadratic objective with a one-way WTA constraint. For the "rst time, we have gone beyond objective functions that are linear in the binary variables s to objective functions quadratic in s. This transition is min s s.t.
very important and entirely orthogonal to the earlier transition from the WTA constraint to the permutation constraint. Quadratic objectives with binary variables obeying simplex-like constraints are usually much more difficult to minimize than their linear objective counterparts. Notwithstanding the increased difficulty of this problem, a deterministic annealing algorithm which is fairly adept at avoiding poor local minima is:

$q_{ai} \stackrel{\mathrm{def}}{=} -\frac{\partial E_{\mathrm{lab}}(v)}{\partial v_{ai}} = \sum_{bj} C_{ai,bj}\, v_{bj},$   (10)

$v_{ai}(\beta) = \frac{\exp(\beta q_{ai})}{\sum_b \exp(\beta q_{bi})}.$   (11)

The intermediate $q$ variables have an increased significance in our later discussion on relaxation labeling. The algorithm consists of iterating the above equations at each temperature. Central and pairwise clustering energy functions have been used in image classification and segmentation or labeling problems in general [18].

2.3.2. Matching

Template matching is also one of the oldest problems in vision and pattern recognition. Consequently, the subfield of image matching has become increasingly variegated over the years. In our discussion, we restrict ourselves to feature matching. Akin to labeling or clustering, there are two different styles of matching depending on whether a spatial mapping exists between the features in one image and the other. When a spatial mapping exists (or is explicitly modeled), it acts as a strong constraint on the matching [24]. The situation when no spatial mapping is known between the features is similar to the pairwise clustering case. Instead, a distance measure between pairs of features in the model and pairs of features in the image is assumed. This results in the quadratic assignment objective function - for more details see Ref. [22]:

$\min_s \ E_{\mathrm{gm}}(s) = -\frac{1}{2}\sum_{aibj} C_{ai,bj}\, s_{ai} s_{bj}$   (12)

$\text{s.t.} \sum_i s_{ai} = 1, \quad \sum_a s_{ai} = 1, \quad \text{and} \quad s_{ai}\in\{0,1\}, \ \forall a, i.$

If the quadratic benefit matrix $\{C_{ai,bj}\}$ is positive definite in the subspace spanned by the row and column constraints, the minima are permutation matrices. This result was shown in Ref. [2]. Once again, a deterministic annealing free energy and algorithm can be written down after spotting the basic form (linear or quadratic objective, one-way or two-way constraint):

$q_{ai} \stackrel{\mathrm{def}}{=} -\frac{\partial E_{\mathrm{gm}}(v)}{\partial v_{ai}} = \sum_{bj} C_{ai,bj}\, v_{bj},$   (13)

$v_{ai}(\beta) = \exp(\beta q_{ai} - \beta[\mu_a + \nu_i]).$   (14)

The two Lagrange parameters $\mu$ and $\nu$ are specified by Sinkhorn's theorem and the softassign. These two equations (one for the $q$ and one for the $v$) are iterated until convergence at each temperature. The softassign quadratic assignment algorithm is guaranteed to converge to a local minimum provided the Sinkhorn procedure always returns a doubly stochastic matrix [19]. We have written down deterministic annealing algorithms for two problems (QWTA and QAP) while drawing on the basic forms given by the WTA and linear assignment problems. The common features in the two deterministic annealing algorithms and their differences (one-way versus two-way constraints) [17] have been highlighted as well. We now turn to relaxation labeling.

3. Relaxation labeling

Relaxation labeling, as the name suggests, began as a method for solving labeling problems [11]. While the framework has been extended to many applications [15,16,38-41], the basic feature of the framework remains: Start with a set of nodes $i$ (in feature or image space) and a set of labels $\lambda$. Derive a set of compatibility coefficients (as in Section 2.3.1) $r_{ij}(\lambda,\kappa)$ for each problem of interest and then apply the basic recipe of relaxation labeling for updating the node-label ($i$ to $\lambda$) assignments:

$q_i^{(n)}(\lambda) = \sum_{j\kappa} r_{ij}(\lambda,\kappa)\, p_j^{(n)}(\kappa),$   (15)

$p_i^{(n+1)}(\lambda) = \frac{p_i^{(n)}(\lambda)\big(1 + \alpha q_i^{(n)}(\lambda)\big)}{\sum_{\kappa} p_i^{(n)}(\kappa)\big(1 + \alpha q_i^{(n)}(\kappa)\big)}.$   (16)
Here the $p$'s are the node-label ($i$ to $\lambda$) label variables, and the $q$'s are intermediate variables similar to the $q$'s defined earlier in deterministic annealing. $\alpha$ is a parameter greater than zero used to make the numerator positive (and keep the probabilities positive). The update equation is typically written in the form of a discrete dynamical system. In particular, note the multiplicative update and the normalization step involved in the transition from step $n$ to step $(n+1)$. We have deliberately written the relaxation labeling update equation in a quasi-canonical form while suggesting (at this point) similarities most notably to the pairwise clustering discrete time update equation. To make the semantic connection to deterministic annealing more obvious, we now switch to the old usage of the $v$ variables rather than the $p$'s in relaxation labeling:

$q_{ai}^{(n)} = \sum_{bj} C_{ai,bj}\, v_{bj}^{(n)},$   (17)

$v_{ai}^{(n+1)} = \frac{v_{ai}^{(n)}\big(1 + \alpha q_{ai}^{(n)}\big)}{\sum_b v_{bi}^{(n)}\big(1 + \alpha q_{bi}^{(n)}\big)}.$   (18)
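A minimal sketch of one relaxation labeling sweep in the form of Eqs. (17)-(18) is shown below; the compatibility tensor, sizes and the step parameter are toy values chosen only for illustration.

    import numpy as np

    # Sketch: one relaxation labeling sweep, Eqs. (17)-(18).
    # v has shape (labels, nodes); C has shape (labels, nodes, labels, nodes).

    def rl_sweep(v, C, alpha):
        q = np.einsum('aibj,bj->ai', C, v)        # q[a,i] = sum_bj C[a,i,b,j] v[b,j], Eq. (17)
        v_new = v * (1.0 + alpha * q)             # multiplicative update, Eq. (18)
        return v_new / v_new.sum(axis=0, keepdims=True)   # normalize over labels for each node

    K, N = 3, 4                                   # toy sizes: 3 labels, 4 nodes
    rng = np.random.default_rng(0)
    C = rng.normal(size=(K, N, K, N))
    C = 0.5 * (C + C.transpose(2, 3, 0, 1))       # symmetrize the compatibilities
    v = np.full((K, N), 1.0 / K)
    for _ in range(20):
        v = rl_sweep(v, C, alpha=0.05)            # alpha small enough to keep 1 + alpha*q positive
    print(np.round(v, 2))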
As in the QAP and QWTA deterministic annealing algorithms, a Lyapunov function exists [42,43] for relaxation labeling. We can now proceed in the reverse order from the previous section on deterministic annealing. Having written down the basic recipe for relaxation labeling, we specialize to WTA, AP, QWTA and QAP. While the contraction to WTA and QWTA may be obvious, the case of AP and QAP is not so clear. The reason: two-way constraints in AP are not handled by relaxation labeling. We have to invoke something analogous to the Sinkhorn procedure. Also, there is no clear analog to the iterative algorithms obtained at each temperature setting. Instead the label variables directly and multiplicatively depend on their previous state, which is never encountered in deterministic annealing. How do we reconcile this situation so that we can clearly state just where these two algorithms are in accord? The introduction of self-annealing promises to answer some of these questions and we now turn to its development.
4. Self-annealing

Self-annealing has one goal, namely, the elimination of a temperature schedule. As a by-product we show that the resulting algorithm bears a close similarity to both deterministic annealing and relaxation labeling. The self-annealing update equation for any of the (matching or labeling) problems we have discussed so far is derived by minimizing [44]

$F(v, p) = E(v) + \frac{1}{\alpha} d(v, p),$   (19)

where $d(v, p)$ is a distance measure between $v$ and an 'old' value $p$. (The explanation of the 'old' value will follow shortly.) When $F$ is minimized w.r.t. $v$, both terms in Eq. (19) come into play. Indeed, the distance measure $d(v, p)$ serves as an 'inertia' term with the degree of fidelity between $v$ and $p$ determined by the parameter $\alpha$. For example, when $d(v, p)$ is $\frac{1}{2}\|v - p\|^2$, the update equation obtained after taking derivatives w.r.t. $v$ and $p$ and setting the results to zero is

$p_i = v_i^{(n)}, \qquad v_i^{(n+1)} = p_i - \alpha \left.\frac{\partial E(v)}{\partial v_i}\right|_{v=v^{(n+1)}}.$   (20)

This update equation reduces to 'vanilla' gradient descent provided we approximate $\partial E(v)/\partial v_i|_{v=v^{(n+1)}}$ by $\partial E(v)/\partial v_i|_{v=v^{(n)}}$; $\alpha$ becomes a step-size parameter. However, the distance measure is not restricted to just quadratic error measures. Especially when positivity of the $v$ variables is desired, a Kullback-Leibler (KL) distance measure can be used for $d(v, p)$. In Ref. [44], the authors derive many linear on-line prediction algorithms using the KL divergence. Here, we apply the same approach to the QWTA and QAP. Examine the following QAP objective function using the KL divergence as the distance measure:

$F_{\mathrm{saqap}}(v, p) = -\frac{1}{2}\sum_{aibj} C_{ai,bj}\, v_{ai} v_{bj} + \frac{1}{\alpha}\sum_{ai}\Big(v_{ai}\log\frac{v_{ai}}{p_{ai}} - v_{ai} + p_{ai}\Big) + \sum_a \mu_a\Big(\sum_i v_{ai} - 1\Big) + \sum_i \nu_i\Big(\sum_a v_{ai} - 1\Big).$   (21)

We have used the generalized KL divergence $d(x, y) = \sum_i (x_i\log(x_i/y_i) - x_i + y_i)$, which is guaranteed to be greater than or equal to zero without requiring the usual constraints $\sum_i x_i = \sum_i y_i = 1$. This energy function looks very similar to the earlier deterministic annealing energy function (12) for QAP. However, it has no temperature parameter. The parameter $\alpha$ is fixed and positive. Instead of the entropy barrier function, this energy function has a new KL measure between $v$ and a new variable $p$. Without trying to explain the self-annealing algorithm in its most complex form (QAP), we specialize immediately to the WTA:

$F_{\mathrm{sawta}}(v, p) = -\sum_i T_i v_i + \lambda\Big(\sum_i v_i - 1\Big) + \frac{1}{\alpha}\sum_i\Big(v_i\log\frac{v_i}{p_i} - v_i + p_i\Big).$   (22)

Eq. (22) can be alternately minimized w.r.t. $v$ and $p$ (using a closed form solution for the Lagrange parameter $\lambda$), resulting in

$v_i^{(n+1)} = \frac{v_i^{(n)}\exp(\alpha T_i)}{\sum_j v_j^{(n)}\exp(\alpha T_j)}, \quad v_i^{(0)} > 0, \ \forall i, \ i\in\{1,\dots,N\}.$   (23)

The new variable $p$ is identified with $v^{(n)}$ in Eq. (23). When an alternating minimization (between $v$ and $p$) is prescribed for $F_{\mathrm{sawta}}$, the update equation (23) results. Initial conditions are an important factor. A reasonable choice is $v_i^{0} = 1/N + \xi_i$, $p_i^{0} = v_i^{0}$, $\forall i, i\in\{1,\dots,N\}$, but other initial conditions may work as well. A small random factor $\xi$ is included in the initial condition specification. To summarize, in the WTA, the new variable $p$ is identified with the 'past' value of $v$. We have not yet shown any relationship to deterministic annealing or relaxation labeling. We now write down the quadratic assignment self-annealing algorithm:
Pseudo-code for self-annealing QAP:

    Initialize v_ai to 1/N + ξ_ai, p_ai to v_ai
    Begin A: Do A until integrality condition is met or number of iterations > I_A
      Begin B: Do B until all v_ai converge or number of iterations > I_B
        q_ai ← Σ_bj C_{ai,bj} v_bj
        v_ai ← p_ai exp(α q_ai)
        Begin C: Do C until all v_ai converge or number of iterations > I_C
          Update v_ai by normalizing the rows: v_ai ← v_ai / Σ_i v_ai
          Update v_ai by normalizing the columns: v_ai ← v_ai / Σ_a v_ai
        End C
      End B
      p_ai ← v_ai
    End A

This is the full blown self-annealing QAP algorithm with Sinkhorn's method and the softassign used for the constraints, but more importantly a built-in delay between the 'old' value of $v$, namely $p$, and the current value of $v$. The main update equation used by the algorithm is

$\frac{1}{\alpha}\log v_{ai}^{(n+1)} = \sum_{bj} C_{ai,bj}\, v_{bj}^{(n)} - \mu_a - \nu_i + \frac{1}{\alpha}\log p_{ai}.$   (24)

Convergence of the self-annealing quadratic assignment algorithm to a local minimum can be easily shown when we assume that the Sinkhorn procedure always returns a doubly stochastic matrix. Our treatment follows [19]. A discrete time Lyapunov function for the self-annealing quadratic assignment algorithm is Eq. (21). (The Lagrange parameter terms can be eliminated since we are restricting $v$ to be doubly stochastic.) The change in energy is written as

$\Delta F_{\mathrm{saqap}} \stackrel{\mathrm{def}}{=} F_{\mathrm{saqap}}(v^{(n)}, p) - F_{\mathrm{saqap}}(v^{(n+1)}, p) = -\frac{1}{2}\sum_{aibj} C_{ai,bj}\, v_{ai}^{(n)} v_{bj}^{(n)} + \frac{1}{\alpha}\sum_{ai} v_{ai}^{(n)}\log\frac{v_{ai}^{(n)}}{p_{ai}} + \frac{1}{2}\sum_{aibj} C_{ai,bj}\, v_{ai}^{(n+1)} v_{bj}^{(n+1)} - \frac{1}{\alpha}\sum_{ai} v_{ai}^{(n+1)}\log\frac{v_{ai}^{(n+1)}}{p_{ai}}.$   (25)

The Lyapunov energy difference has been simplified using the relation $\sum_{ai} v_{ai} = N$. Using the update equation for self-annealing in Eq. (24), the energy difference is rewritten as

$\Delta F_{\mathrm{saqap}} = \frac{1}{2}\sum_{aibj} C_{ai,bj}\,\Delta v_{ai}\,\Delta v_{bj} + \frac{1}{\alpha}\sum_{ai} v_{ai}^{(n)}\log\frac{v_{ai}^{(n)}}{v_{ai}^{(n+1)}} \geq 0,$   (26)

where $\Delta v_{ai} \stackrel{\mathrm{def}}{=} v_{ai}^{(n+1)} - v_{ai}^{(n)}$. The first term in Eq. (26) is nonnegative due to the positive definiteness of $\{C_{ai,bj}\}$ in the subspace spanned by the row and column constraints. The second term is non-negative by virtue of being a Kullback-Leibler distance measure. We have shown the convergence to a fixed point of the self-annealing QAP algorithm.
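A compact sketch of the self-annealing iteration in the pseudo-code above follows, with the Sinkhorn loop truncated to a fixed number of sweeps; sizes, iteration counts and the benefit tensor are toy values, not those used in our experiments.

    import numpy as np

    # Sketch: self-annealing QAP (pseudo-code above). The 'old' variable p is only
    # refreshed in the outer loop, giving the built-in delay between past and present.

    def sinkhorn(v, n_iter=30):
        for _ in range(n_iter):
            v = v / v.sum(axis=1, keepdims=True)   # row normalization
            v = v / v.sum(axis=0, keepdims=True)   # column normalization
        return v

    def self_annealing_qap(C, alpha=0.5, outer=50, inner=5):
        N = C.shape[1]
        rng = np.random.default_rng(0)
        v = 1.0 / N + 0.01 * rng.random((N, N))    # 1/N plus a small random factor
        p = v.copy()
        for _ in range(outer):                     # loop A
            for _ in range(inner):                 # loop B: relax with p held fixed
                q = np.einsum('aibj,bj->ai', C, v)
                v = sinkhorn(p * np.exp(alpha * q))
            p = v.copy()                           # only now does the past catch up
        return v

    N = 5
    rng = np.random.default_rng(1)
    C = rng.normal(size=(N, N, N, N))
    C = 0.5 * (C + C.transpose(2, 3, 0, 1))        # symmetric toy benefit tensor
    print(np.round(self_annealing_qap(C), 2))      # approximately doubly stochastic; hardens towards an assignment for suitable C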
5. Self-annealing and deterministic annealing

Self-annealing and deterministic annealing are closely related. To see this, we return to our favorite example, the winner take all (WTA). The self-annealing and deterministic annealing WTAs are now brought into accord: Assume uniform rather than random initial conditions for self-annealing, $v_i^{(0)} = 1/N$, $\forall i, i\in\{1,\dots,N\}$. With uniform initial conditions, it is trivial to solve for $v_i^{(n)}$:

$v_i^{(n)} = \frac{\exp(n\alpha T_i)}{\sum_j \exp(n\alpha T_j)}, \quad \forall i, \ i\in\{1,\dots,N\}.$   (27)

The correspondence between self-annealing and deterministic annealing is clearly established by setting $\beta_n = n\alpha$, $n = 1, 2, \dots$. We have shown that the self-annealing WTA corresponds to a particular linear schedule for the deterministic annealing WTA. Since the case of AP is more involved than WTA, we present anecdotal experimental evidence that self-annealing and deterministic annealing are closely related. In Fig. 1, we have shown the evolution of the permutation norm $(1 - \sum_{ai} v_{ai}^2/N)$ and the AP free energies. A linear schedule is used for the inverse temperature $\beta$ with the initial inverse temperature $\beta_0 = \alpha$ and the linear increment $\beta_r$ also set to $\alpha$. The correspondence between DA and SA is nearly exact for the permutation norm despite the fact that the free energies evolve in a different manner. The correspondence is exact only when we match the linear schedule DA parameter to the self-annealing parameter $\alpha$. It is important that SA and DA be in lockstep, otherwise we cannot make the claim that SA corresponds to DA with an emergent linear schedule. The self-annealing and deterministic annealing QAP objective functions are quite general.
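The emergent linear schedule of Eq. (27) can be checked numerically in a few lines; the sketch below compares the self-annealing WTA iterate of Eq. (23) with the softmax at $\beta_n = n\alpha$, using toy values of $T$ and $\alpha$.

    import numpy as np

    # Sketch: self-annealing WTA (Eq. (23)) versus deterministic annealing at beta = n*alpha (Eq. (27)).

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    T = np.array([0.2, 0.7, 0.5])
    alpha = 0.3
    v = np.full_like(T, 1.0 / T.size)            # uniform initial conditions
    for n in range(1, 6):
        v = v * np.exp(alpha * T)
        v = v / v.sum()                          # self-annealing update, Eq. (23)
        print(n, np.allclose(v, softmax(n * alpha * T)))   # matches DA with beta_n = n*alpha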
Fig. 1. Left: 100 node AP with three different schedules. The agreement between self- and deterministic annealing is obvious. Right: The evolution of the self- and deterministic annealing AP free energies for one schedule.
The QAP benefit matrix $C_{ai,bj}$ is preset based on the chosen problem - inexact, weighted, graph matching, or pairwise clustering. The deterministic annealing pseudo-code follows (we have already written down the self-annealing pseudo-code in the previous section):

Pseudo-code for deterministic annealing QAP:

    Initialize β to β_0, v_ai to 1/N + ξ_ai
    Begin A: Do A until β ≥ β_f
      Begin B: Do B until all v_ai converge or number of iterations > I_B
        q_ai ← Σ_bj C_{ai,bj} v_bj
        v_ai ← exp(β q_ai)
        Begin C: Do C until all v_ai converge or number of iterations > I_C
          Update v_ai by normalizing the rows: v_ai ← v_ai / Σ_i v_ai
          Update v_ai by normalizing the columns: v_ai ← v_ai / Σ_a v_ai
        End C
      End B
      β ← β + β_r
    End A
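The deterministic annealing pseudo-code above can be sketched in the same style as the earlier self-annealing sketch; here there is no 'old' variable $p$, and an explicit inverse temperature follows a linear schedule. All parameter values are illustrative.

    import numpy as np

    # Sketch: deterministic annealing QAP (pseudo-code above), softassign inside a linear schedule.

    def sinkhorn(v, n_iter=30):
        for _ in range(n_iter):
            v = v / v.sum(axis=1, keepdims=True)
            v = v / v.sum(axis=0, keepdims=True)
        return v

    def da_qap(C, beta0=0.5, beta_rate=0.5, beta_final=25.0, inner=5):
        N = C.shape[1]
        v = np.full((N, N), 1.0 / N)
        beta = beta0
        while beta < beta_final:                    # loop A: anneal
            for _ in range(inner):                  # loop B: relax at this temperature
                q = np.einsum('aibj,bj->ai', C, v)
                v = sinkhorn(np.exp(beta * (q - q.max())))  # uniform shift only stabilizes the exponentials
            beta += beta_rate
        return v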
relaxation at a "xed temperature. Moreover, in the WTA and AP, self-annealing results in an emergent linear schedule. A similar argument can be made for QAP as well but requires experimental validation (due to the presence of bifurcations). We return to this topic in Section 7. 5.1. Self-annihilation Self-annealing results in an emergent linear schedule for the WTA and AP. In Section 2 and in the preceding discussion of the relationship between self-annealing and deterministic annealing, we glossed over the important issue of symmetry breaking. The problem of resolving ties or symmetries arises in both the WTA and AP and in graph isomorphism (a special case of QAP) [30]. Examine the special case of the WTA objective function (1) with at least two ¹ being i equal maxima. Neither the DA update Eq. (4) nor the SA update equation (23) is capable of breaking symmetry. To break symmetry in DA, it is necessary to add a self amplixcation term } c/2+ v2 which is functionally equivi i alent to adding the term (c/2)+ v (1!v ) (to the WTA) i i i [30]. A similar situation obtains for AP as well. Here, two or more competing permutations may maximize the AP energy and again it is necessary to break symmetry. Otherwise, we obtain a doubly stochastic matrix which is an average over all the equally optimal permutations. A self-ampli"cation term of the same form as in the WTA can be added to the energy function in order to break symmetry in DA. Self-annihilation is a di!erent route to symmetrybreaking than self-ampli"cation. The basic idea is to make the entropy term in SA become negative, roughly
corresponding to a negative temperature [34]. We illustrate this idea with the WTA. Examine the SA self-annihilation WTA energy function shown below:

$F_{\mathrm{sanwta}}(v, p) = -\sum_i T_i v_i + \lambda\Big(\sum_i v_i - 1\Big) + \frac{1}{\alpha}\sum_i\Big(v_i\log\frac{v_i}{p_i^\delta} - v_i + \delta p_i\Big).$   (28)
In Eq. (28), the KL divergence between $v$ and the 'old' value $p$ has been modified. Nevertheless, the new WTA objective function can still be minimized w.r.t. $v$ and $p$ and the earlier interpretation of $p$ as the 'old' value of $v$ still holds. Minimizing Eq. (28) by differentiating w.r.t. $v$ and $p$ and setting the results to zero, we get

$\frac{\partial F}{\partial v_i} = 0 \Rightarrow v_i = \frac{p_i^\delta \exp(\alpha T_i)}{\sum_j p_j^\delta \exp(\alpha T_j)}, \qquad \frac{\partial F}{\partial p_i} = 0 \Rightarrow p_i = v_i.$   (29)

It is fairly straightforward to show that $p = v$ is a minimum. Substituting the relation $p = v$ in the self-annihilation objective function, we get

$F_{\mathrm{sanwta}}(v, p(v)) = -\sum_i T_i v_i + \lambda\Big(\sum_i v_i - 1\Big) + \frac{1-\delta}{\alpha}\sum_i (v_i\log v_i - v_i).$   (30)
The crucial term in the above energy function is the summation over $(1-\delta)v_i\log v_i$. For $\delta\neq 1$, this term is not equal to zero if and only if $v_i\neq 0$ or 1. For $\delta > 1$ this term is strictly greater than zero for $v_i\in(0,1)$. Consequently, in a symmetry breaking situation, the energy can be further reduced by breaking ties while preserving the constraint that $\sum_i v_i = 1$. The update equation after setting $p = v$ is

$v_i^{(n+1)} = \frac{(v_i^{(n)})^\delta \exp(\alpha T_i)}{\sum_j (v_j^{(n)})^\delta \exp(\alpha T_j)}, \quad v_i^{(0)} > 0, \ \forall i, \ i\in\{1,\dots,N\}.$   (31)

Once again assuming uniform initial conditions for $v$, we solve for $v^{(n)}$ to obtain

$v_i^{(n)} = \frac{\exp\big[\alpha\frac{\delta^n - 1}{\delta - 1} T_i\big]}{\sum_j \exp\big[\alpha\frac{\delta^n - 1}{\delta - 1} T_j\big]}, \quad \forall i, \ i\in\{1,\dots,N\}.$   (32)

The above closed-form solution for $v$ at the $n$th step in the self-annihilation update does not have a limiting form as $n\to\infty$ for $\delta > 1$. For $\delta = 1$, we obtain the emergent linear schedule of the previous section. Examining the self-annihilation energy function (30), we may assign the final temperature to be $-(\delta-1)/\alpha$, which is the equivalent negative temperature. The reason we call this process self-annihilation is that for any $v_i\in(0,1)$, $v_i^\delta < v_i$ for $\delta > 1$. We now demonstrate the ability of self-annihilation to perform symmetry breaking. In Fig. 1, we showed the evolution of the AP self-annealing algorithm when there were no ties. The permutation norm $(1 - \sum_{ai} v_{ai}^2/N)$ decreases as expected and the AP energy $(\sum_{ai} A_{ai} v_{ai})$ increases to the maximum value (see Fig. 2). Next, we created a situation where there were multiple maxima and reran the SA algorithm. This result, shown in Fig. 3, demonstrates the inability of SA to break symmetry. However, when we set $\delta = 1.1$, the algorithm had no difficulty in breaking symmetry (Fig. 3). The tradeoff in using self-annihilation is between local minima and speed of convergence to an integer solution.
Fig. 2. Self-annealing: 50 node AP with ties. Left: permutation norm. Right: AP energy.
Fig. 3. Self-annihilation: 50 node AP with ties. δ = 1.1. Left: permutation norm. Right: AP energy.
Symmetry breaking can usually be performed in linear problems like WTA and AP by adding some noise to the WTA vector $T$ or to the AP benefit matrix $A$. However, self-annihilation is an attractive alternative due to the increased speed with which integer solutions are found.
6. Self-annealing and relaxation labeling

Rather than present the RL update equation in its 'canonical' labeling problem form, we once again return to the winner take all problem where the similarities between self-annealing and RL are fairly obvious. The RL WTA update equation is

$v_i^{(n+1)} = \frac{v_i^{(n)}(1 + \alpha T_i)}{\sum_j v_j^{(n)}(1 + \alpha T_j)}, \quad v_i^{(0)} > 0, \ \forall i, \ i\in\{1,\dots,N\}.$   (33)

Eqs. (23) and (33) are very similar. The main difference is the $1 + \alpha T_j$ factor in RL instead of the $\exp(\alpha T_j)$ factor in self-annealing. Expanding $\exp(\alpha T_j)$ using the Taylor-MacLaurin series gives

$f(\alpha) = \exp(\alpha T_j) = 1 + \alpha T_j + R_2(\alpha),$   (34)

where

$R_2(\alpha) \leq \frac{\exp(\alpha|T_j|)\,\alpha^2 T_j^2}{2}.$   (35)

If the remainder $R_2(\alpha)$ is small, the RL WTA closely approximates the self-annealing WTA. This will be true for small values of $\alpha$. Increased divergence between RL and self-annealing can be expected as $\alpha$ is increased - the faster the rate of the linear schedule, the faster the divergence. If $T_j < -1/\alpha$, the non-negativity constraint is violated, leading to breakdown of the RL algorithm. Instead of using a Taylor-series expansion at the algorithmic level, we can directly approximate the self-annealing energy function. A Taylor-series expansion of the KL divergence between the current ($v$) and previous estimate ($p$) evaluated at $v = p$ yields

$\sum_i\Big(v_i\log\frac{v_i}{p_i} - v_i + p_i\Big) \approx \sum_i\frac{(v_i - p_i)^2}{2p_i} + \sum_i O\big[(v_i - p_i)^3\big].$   (36)

This has the form of a $\chi^2$ distance [44]. Expanding the self-annealing energy function up to second order (at the current estimate $p$), we get

$E_{\chi^2}(v, p, \lambda, \alpha) = -\sum_i T_i v_i + \lambda\Big(\sum_i v_i - 1\Big) + \frac{1}{\alpha}\sum_i\frac{(v_i - p_i)^2}{2p_i}.$   (37)

This new energy function can be minimized w.r.t. $v$. The fixed points are

$\frac{\partial E}{\partial v_i} = 0 \Rightarrow -T_i + \lambda + \frac{v_i - p_i}{\alpha p_i} = 0, \qquad \frac{\partial E}{\partial p_i} = 0 \Rightarrow p_i = v_i,$   (38)

which after setting $p = v^{(n)}$ leads to

$v_i^{(n+1)} = v_i^{(n)}\big[1 + \alpha(T_i - \lambda)\big].$   (39)
Fig. 4. From self-annealing to relaxation labeling.
There are many similarities between Eqs. (39) and (33). Both are multiplicative updating algorithms relying on the derivatives of the energy function. However, the important difference is that the normalization operation in Eq. (33) does not correspond to the optimal solution for the Lagrange parameter $\lambda$ in Eq. (39). Solving for $\lambda$ in Eq. (39) by setting $\sum_i v_i = 1$, we get

$v_i^{(n+1)} = v_i^{(n)}\Big(1 + \alpha T_i - \alpha\sum_j T_j v_j^{(n)}\Big).$   (40)

By introducing the Taylor-series approximation at the energy function level and subsequently solving for the update equation, we have obtained a new kind of multiplicative update algorithm, closely related to relaxation labeling. The positivity constraint is not strictly enforced in Eq. (40) as in RL and has to be checked at each step. Note that by eschewing the optimal solution for the Lagrange parameter $\lambda$ in favor of a normalization, we get the RL algorithm for the WTA. The two routes from SA to RL are depicted in Fig. 4. A dotted line is used to link the $\chi^2$ energy function to the RL update equation since the normalization used in the latter cannot be derived from the former. Turning to the problem of symmetry breaking, RL in its basic form is not capable of resolving ties. This is demonstrated in Fig. 5 on AP. Just as in SA, self-annihilation in RL resolves ties. In Fig. 6, the permutation norm $(1 - \sum_{ai} v_{ai}^2/N)$ can be reduced to arbitrarily small values. Comparison at the WTA and AP levels is not the end of the story. RL in its heyday was applied to image matching, registration, segmentation and classification problems. Similar to the QAP formulation, the benefit matrix $C_{ai,bj}$ was introduced and preset depending on the chosen problem. Because of the bias towards labeling problems, the all-important distinction between matching and labeling was blurred. In model matching problems (arising in object recognition and image registration), a two-way constraint is required. Setting up one-to-one correspondence between features on the model and features in the image requires such a two-way assignment constraint. On the other hand, only a one-way constraint is needed in segmentation, classification, clustering and coloring problems since (a) the label and the data fields occupy different spaces and (b) many data features share membership under the same label. (Despite sharing the multiple membership feature of these labeling problems, graph partitioning has a two-way constraint because of the requirement that all multiple memberships be equal in number - an arbitrary requirement from the standpoint of labeling problems arising in pattern recognition.) Pseudo-code for the QAP RL algorithm is provided below.
Fig. 5. Relaxation labeling: 50 node AP with ties. Left: permutation norm. Right: AP energy.
Fig. 6. Relaxation labeling with self-annihilation: 50 node AP with ties. δ = 1.1. Left: permutation norm. Right: AP energy.
Pseudo-code for relaxation labeling QAP:

    Initialize v_ai to 1/N + ξ_ai, p_ai to v_ai
    Begin A: Do A until integrality condition is met or number of iterations > I_A
      q_ai ← Σ_bj C_{ai,bj} v_bj
      v_ai ← p_ai (1 + α q_ai)
      Update v_ai by normalizing the columns: v_ai ← v_ai / Σ_a v_ai
      p_ai ← v_ai
    End A

Due to the bias towards labeling, RL almost never tried to enforce two-way constraints, either using something like the Sinkhorn procedure in discrete time algorithms or using projected gradient descent in continuous time algorithms [31,34]. This is an important difference between SA and DA on one hand and RL on the other. Another important difference is the separation of past and present. Due to the close ties of both self- and deterministic annealing to simulated annealing, the importance of relaxation at a fixed temperature is fairly obvious. Otherwise, a very slow annealing schedule has to be prescribed to avoid poor local minima. Due to the lack of a temperature parameter in RL, the importance of relaxation at fixed temperature was not recognized. Examining the self-annealing and RL QAP algorithms, it is clear that RL roughly corresponds to one iteration at each temperature. This issue is orthogonal to constraint satisfaction. Even if Sinkhorn's procedure is implemented in RL - and all that is needed is non-negativity of each entry of the matrix $1 + \alpha Q$ - the separation of past ($p$) and present ($v$) is still one iteration. Put succinctly, step B is allowed only one iteration. A remaining difference is the positivity constraint. We have already discussed the relationship between the exponential and the RL term $(1 + \alpha T_i)$ in the WTA context. There is no need to repeat the analysis for QAP - note that positivity is guaranteed by the exponential whereas it must be checked in RL. In summary, there are three principal differences between self-annealing and RL: (i) the positivity constraint is strictly enforced by the exponential in self-annealing and loosely enforced in RL, (ii) the use of the softassign rather than the softmax in matching problems has no parallel in RL, and finally (iii) the discrete time self-annealing QAP update equation introduces an all-important delay between past and present (roughly corresponding to multiple iterations at each temperature) whereas RL (having no such delay) forces one iteration per temperature with consequent loss of accuracy.
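The Taylor-series relationship between the two WTA updates (Eqs. (23) and (33)) can be probed numerically; the sketch below measures how far one relaxation labeling step drifts from one self-annealing step as $\alpha$ grows, using toy data.

    import numpy as np

    # Sketch: one-step difference between the RL update (1 + alpha*T factor, Eq. (33))
    # and the self-annealing update (exp(alpha*T) factor, Eq. (23)).

    def sa_step(v, T, alpha):
        w = v * np.exp(alpha * T)
        return w / w.sum()

    def rl_step(v, T, alpha):
        w = v * (1.0 + alpha * T)     # requires 1 + alpha*T_i > 0
        return w / w.sum()

    T = np.array([0.3, -0.2, 0.8, 0.1])
    v = np.full(4, 0.25)
    for alpha in (0.01, 0.1, 1.0, 3.0):
        diff = np.abs(sa_step(v, T, alpha) - rl_step(v, T, alpha)).max()
        print(alpha, round(float(diff), 4))   # divergence grows with alpha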
7. Results

We conducted several hundred experiments comparing the performance of deterministic annealing (DA), relaxation labeling (RL), and self-annealing (SA) discrete time algorithms. The chosen problems were quadratic assignment (QAP) and quadratic winner take all (QWTA). In QAP, we randomly generated benefit matrices $\hat{C}$ (of size $N\times N\times N\times N$) that are positive definite in the subspace spanned by the row and column constraints. The procedure is as follows: Define a matrix $r \stackrel{\mathrm{def}}{=} I_N - e_N e_N^{T}/N$, where $e_N$ is the vector of all ones. Generate a matrix $R$ by
taking the Kronecker product of $r$ with itself ($R \stackrel{\mathrm{def}}{=} r\otimes r$). Rewrite $\hat{C}$ as a two-dimensional $N^2\times N^2$ matrix $\hat{c}$. Project $\hat{c}$ into the subspace of the row and column constraints by forming the matrix $R\hat{c}R$. Determine the smallest eigenvalue $\lambda_{\min}(R\hat{c}R)$. Then the matrix $c \stackrel{\mathrm{def}}{=} \hat{c} - \lambda_{\min}(R\hat{c}R)I_{N^2} + \epsilon$ (where $\epsilon$ is a small, positive quantity) is positive definite in the subspace spanned by the row and column constraints. Four algorithms were executed on the QAP. Other than the three algorithms mentioned previously, we added a new algorithm called exponentiated relaxation (ER). ER is closely related to SA. The only difference is that the inner B loop in SA is performed just once ($I_B = 1$). ER is also closely related to RL. The main difference is that the positivity constraint is enforced via the exponential. Since the QAP has both row and column constraints, the Sinkhorn procedure is used in ER just as in SA. However, RL enforces just one set of constraints. To avoid this asymmetry in algorithms, we replaced the normalization procedure in RL by the Sinkhorn procedure, thereby avoiding unfair comparisons. As long as the positivity constraint is met in RL, we are guaranteed to obtain doubly stochastic matrices. There is overall no proof of convergence, however, for this 'souped up' version of RL. The common set of parameters shared by the four algorithms was kept exactly the same: $N = 25$, $\epsilon = 0.001$, Sinkhorn norm threshold $\Delta = 0.0001$, energy difference threshold $e_{\mathrm{thr}} = 0.001$, permutation norm threshold $p_{\mathrm{thr}} = 0.001$, and initial condition $v^0 = e_N e_N^{T}/N$. The stopping criterion chosen was $p_{\mathrm{thr}} = 0.001$ and row dominance [29]. In this way, we ensured that all four algorithms returned permutation matrices. A linear schedule with $\beta_0 = \alpha$ and $\beta_r = \alpha$ was used in DA.
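The benefit-matrix construction described above is easy to reproduce; a sketch follows, with the epsilon shift applied along the diagonal and with variable names that are ours rather than the paper's.

    import numpy as np

    # Sketch: generate a QAP benefit matrix that is positive definite in the subspace
    # spanned by the row and column constraints (Section 7 construction).

    N = 25
    rng = np.random.default_rng(0)
    c_hat = rng.normal(size=(N * N, N * N))
    c_hat = 0.5 * (c_hat + c_hat.T)               # symmetrize

    r = np.eye(N) - np.ones((N, N)) / N           # r = I_N - e_N e_N^T / N
    R = np.kron(r, r)                             # R = r (x) r
    projected = R @ c_hat @ R                     # project onto the constraint subspace
    lam_min = np.linalg.eigvalsh(projected).min() # smallest eigenvalue of R c_hat R
    eps = 0.001
    c = c_hat - lam_min * np.eye(N * N) + eps * np.eye(N * N)   # shifted benefit matrix

    x = R @ rng.normal(size=N * N)                # a random vector in the constraint subspace
    print(x @ c @ x > 0)                          # True: positive definite on the subspace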
log(a)"1 in steps of 0.1. Hundred experiments were run for each of the four algorithms. The common bene"t matrix c( shared by the four algorithms was generated using independent, Gaussian random numbers. c( was then made symmetric by forming (c( #c( T)/2. The results are shown in Fig. 7a. The most interesting feature emerging from the experiments is that there is an intermediate range of a in which self-annealing performs at its best. (The negative of the QAP minimum energy is plotted on the ordinate.) Contrast this with ER and RL which do not share this feature. We conjecture that this is due to the `one iteration per temperaturea policy of both these algorithms. RL could not be executed once the positivity constraint was violated but ER had no such problems. Also, notice that the performances of both SA and DA are nearly identical after a"0.2. The emergent linear schedule derived analytically for the WTA seems to be valid only after a certain value of a. Fig. 7b shows the results of QWTA. The behavior is very similar to the QAP. In QWTA the bene"t matrices were projected onto the subspace of only one of the constraints (row or column). In other respects, the experiments were carried out in exactly the same manner as QAP. Since there is only one set of constraints, the canonical version of RL [11] was used. Note that the negative of the minimum energy is consistently higher in QWTA than QAP; this is due to the absence of the second set of constraints. Next, we studied the behavior of self-annealing with changes in problem size. In Fig. 8a, the problem size is varied from N"2 to 25 in steps of one. We normalized the QAP minimum energy at log(a)"!2 for all values of N. Not only is the overall pattern of behavior more or less the same, in addition there is an impressive
Fig. 7. Median of 100 experiments at each value of α. Left: (a) QAP. Right: (b) QWTA. The negative of the QAP and QWTA minimum energies is plotted on the ordinate.
Fig. 8. Self-annealing: Left: (a) Normalized negative QAP minimum energy plot for problem size N varying from 2 to 25 in steps of one. The performance is somewhat invariant over the broad range of α. Right: (b) Negative QAP minimum energy plot in a more finely sampled range of α.
Fig. 9. Self-annealing: Left: A contour plot of the permutation norm versus α. Right: A 'waterfall' plot of the permutation norm versus α and the number of iterations. Both plots illustrate the abrupt change in behavior around α = 0.1.
invariance to the choice of $\alpha$ over a broad range. This evidence is also anecdotal. Finally, we present some evidence to show that there is a qualitative change in the behavior of the self-annealing algorithm roughly around $\alpha = 0.15$. The energy plot in Fig. 8b and the contour and 'waterfall' plots in Fig. 9 indicate the presence of different regimes in SA. The change in the permutation norm with iteration and $\alpha$ is a good qualitative indicator of this change in regime. Our results are very preliminary and anecdotal here. We do not as yet have any understanding of this qualitative change in behavior of SA with change in $\alpha$.
8. Discussion

We have for the most part focused on the three-way relationships between the SA, DA and RL discrete-time dynamical systems. One of the reasons for doing so was the ease with which comparison experiments could be conducted. But there is no reason to stop here. Continuous-time projected gradient dynamical systems could just as easily have been derived for SA, RL and DA. In fact, continuous-time dynamical systems were derived for RL and DA in Ref. [3] and in Refs. [31,45], respectively. In a similar vein, SA continuous-time projected gradient
descent dynamical systems can also be derived. It would be instructive and illuminating to experimentally check the performances of these continuous-time counterparts, as well as of other closely related algorithms such as iterated conditional modes (ICM) [46] and simulated annealing [47,48], against the performances of the discrete-time dynamical systems used in this paper.
9. Conclusion

We have demonstrated that self-annealing has the potential to reconcile relaxation labeling and deterministic annealing as applied to matching and labeling problems. Our analysis also suggests that relaxation labeling can itself be extended in a self-annealing direction until the two become almost indistinguishable. The same cannot be said for deterministic annealing, since it has more formal origins in mean field theory. What this suggests is that there exists a class of hitherto unsuspected self-annealing energy functions from which relaxation labeling dynamical systems can be approximately derived. It remains to be seen if some of the other modifications to relaxation labeling, like probabilistic relaxation, can be related to deterministic annealing.
References
[1] R. Duda, P. Hart, Pattern Classification and Scene Analysis, Wiley, New York, NY, 1973.
[2] A.L. Yuille, J.J. Kosowsky, Statistical physics algorithms that converge, Neural Comput. 6 (3) (1994) 341-356.
[3] R. Hummel, S. Zucker, On the foundations of relaxation labeling processes, IEEE Trans. Pattern Anal. Mach. Intell. 5 (3) (1983) 267-287.
[4] J.J. Hopfield, D. Tank, 'Neural' computation of decisions in optimization problems, Biol. Cybernet. 52 (1985) 141-152.
[5] G. Parisi, Statistical Field Theory, Addison-Wesley, Redwood City, CA, 1988.
[6] A.L. Yuille, Generalized deformable models, statistical physics, and matching problems, Neural Comput. 2 (1) (1990) 1-24.
[7] K. Rose, E. Gurewitz, G. Fox, Constrained clustering as an optimization method, IEEE Trans. Pattern Anal. Mach. Intell. 15 (8) (1993) 785-794.
[8] J.S. Bridle, Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters, in: D.S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2, Morgan Kaufmann, San Mateo, CA, 1990, pp. 211-217.
[9] A. Rangarajan, S. Gold, E. Mjolsness, A novel optimizing network architecture with applications, Neural Comput. 8 (5) (1996) 1041-1060.
[10] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc. Ser. B 39 (1977) 1-38.
[11] A. Rosenfeld, R. Hummel, S. Zucker, Scene labeling by relaxation operations, IEEE Trans. Systems Man Cybernet. 6 (6) (1976) 420-433.
[12] S. Peleg, A new probabilistic relaxation scheme, IEEE Trans. Pattern Anal. Mach. Intell. 2 (4) (1980) 362-369.
[13] O. Faugeras, M. Berthod, Improving consistency and reducing ambiguity in stochastic labeling: an optimization approach, IEEE Trans. Pattern Anal. Mach. Intell. 3 (4) (1981) 412-424.
[14] E.R. Hancock, J. Kittler, Discrete relaxation, Pattern Recognition 23 (7) (1990) 711-733.
[15] W.J. Christmas, J. Kittler, M. Petrou, Structural matching in computer vision using probabilistic relaxation, IEEE Trans. Pattern Anal. Mach. Intell. 17 (5) (1995) 749-764.
[16] M. Pelillo, Learning compatibility coefficients for relaxation labeling processes, IEEE Trans. Pattern Anal. Mach. Intell. 16 (9) (1994) 933-945.
[17] B. Kamgar-Parsi, B. Kamgar-Parsi, On problem solving with Hopfield networks, Biol. Cybernet. 62 (1990) 415-423.
[18] T. Hofmann, J.M. Buhmann, Pairwise data clustering by deterministic annealing, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1) (1997) 1-14.
[19] A. Rangarajan, A.L. Yuille, S. Gold, E. Mjolsness, A convergence proof for the softassign quadratic assignment algorithm, Advances in Neural Information Processing Systems 9, MIT Press, Cambridge, MA, 1997, pp. 620-626.
[20] C. Peterson, B. Soderberg, A new method for mapping optimization problems onto neural networks, Int. J. Neural Systems 1 (1) (1989) 3-22.
[21] P.D. Simic, Constrained nets for graph matching and other quadratic assignment problems, Neural Comput. 3 (1991) 268-281.
[22] S. Gold, A. Rangarajan, A graduated assignment algorithm for graph matching, IEEE Trans. Pattern Anal. Mach. Intell. 18 (4) (1996) 377-388.
[23] D.E. Van den Bout, T.K. Miller III, Graph partitioning using annealed networks, IEEE Trans. Neural Networks 1 (2) (1990) 192-203.
[24] A. Rangarajan, H. Chui, E. Mjolsness, S. Pappu, L. Davachi, P. Goldman-Rakic, J. Duncan, A robust point matching algorithm for autoradiograph alignment, Med. Image Anal. 4 (1) (1997) 379-398.
[25] S. Gold, A. Rangarajan, C.P. Lu, S. Pappu, E. Mjolsness, New algorithms for 2-D and 3-D point matching: pose estimation and correspondence, Pattern Recognition 31 (8) (1998) 1019-1031.
[26] D.E. Van den Bout, T.K. Miller III, Improving the performance of the Hopfield-Tank neural network through normalization and annealing, Biol. Cybernet. 62 (1989) 129-139.
[27] P.D. Simic, Statistical mechanics as the underlying theory of 'elastic' and 'neural' optimisations, Network 1 (1990) 89-103.
[28] D. Geiger, A.L. Yuille, A common framework for image segmentation, Int. J. Comput. Vision 6 (3) (1991) 227-243.
[29] J.J. Kosowsky, A.L. Yuille, The invisible hand algorithm: solving the assignment problem with statistical physics, Neural Networks 7 (3) (1994) 477-490.
[30] A. Rangarajan, E. Mjolsness, A Lagrangian relaxation network for graph matching, IEEE Trans. Neural Networks 7 (6) (1996) 1365-1381.
[31] A.L. Yuille, P. Stolorz, J. Utans, Statistical physics, mixtures of distributions, and the EM algorithm, Neural Comput. 6 (2) (1994) 334-340.
[32] W.J. Wolfe, M.H. Parry, J.M. MacMillan, Hopfield-style neural networks and the TSP, IEEE Int. Conf. on Neural Networks, vol. 7, IEEE Press, 1994, pp. 4577-4582.
[33] A.H. Gee, R.W. Prager, Polyhedral combinatorics and neural networks, Neural Comput. 6 (1) (1994) 161-180.
[34] K. Urahama, Gradient projection network: analog solver for linearly constrained nonlinear programming, Neural Comput. 8 (5) (1996) 1061-1074.
[35] R. Sinkhorn, A relationship between arbitrary positive matrices and doubly stochastic matrices, Ann. Math. Statist. 35 (1964) 876-879.
[36] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ, 1988.
[37] J. Buhmann, T. Hofmann, Central and pairwise data clustering by competitive neural networks, in: J. Cowan, G. Tesauro, J. Alspector (Eds.), Advances in Neural Information Processing Systems 6, Morgan Kaufmann, San Francisco, CA, 1994, pp. 104-111.
[38] L.S. Davis, Shape matching using relaxation techniques, IEEE Trans. Pattern Anal. Mach. Intell. 1 (1) (1979) 60-72.
[39] L. Kitchen, A. Rosenfeld, Discrete relaxation for matching relational structures, IEEE Trans. Systems Man Cybernet. 9 (1979) 869-874.
[40] S. Ranade, A. Rosenfeld, Point pattern matching by relaxation, Pattern Recognition 12 (1980) 269-275.
[41] K. Price, Relaxation matching techniques - a comparison, IEEE Trans. Pattern Anal. Mach. Intell. 7 (5) (1985) 617-623.
[42] M. Pelillo, On the dynamics of relaxation labeling processes, IEEE Int. Conf. on Neural Networks (ICNN), vol. 2, IEEE Press, 1994, pp. 606-1294.
[43] M. Pelillo, A. Jagota, Relaxation labeling networks for the maximum clique problem, J. Artificial Neural Networks 2 (4) (1995) 313-328.
[44] J. Kivinen, M. Warmuth, Additive versus exponentiated gradient updates for linear prediction, J. Inform. Comput. 132 (1) (1997) 1-64.
[45] K. Urahama, Mathematical programming formulations for neural combinatorial optimization algorithms, J. Artificial Neural Networks 2 (4) (1996) 353-364.
[46] J. Besag, On the statistical analysis of dirty pictures, J. Roy. Statist. Soc. B 48 (3) (1986) 259-302.
[47] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (6) (1984) 721-741.
[48] S. Li, H. Wang, K. Chan, M. Petrou, Minimization of MRF energy with relaxation labeling, J. Math. Imaging Vision 7 (2) (1997) 149-161.
About the Author: ANAND RANGARAJAN received the B.Tech. degree in electronics engineering from the Indian Institute of Technology, Madras, India in 1984, and the M.S. and Ph.D. degrees in 1986 and 1991, respectively, in electrical engineering from the University of Southern California. From 1990 to 1992 he was a postdoctoral associate in the Departments of Diagnostic Radiology and Computer Science, Yale University. From 1992 to 1995, he held a joint research faculty position in both departments. He is now an Assistant Professor in the Image Processing and Analysis Group, Departments of Diagnostic Radiology and Electrical Engineering, Yale University. In 1992, he chaired a Neural Information Processing Systems (NIPS) post-meeting workshop entitled "Deterministic Annealing and Combinatorial Optimization" in Vail, CO, and in 1995 he co-chaired a NIPS post-meeting workshop entitled "Statistical and Structural Models in Network Vision". His research interests are medical imaging, neural networks, computer vision, and the scientific study of consciousness.
Pattern Recognition 33 (2000) 651-669
Data visualization by multidimensional scaling: a deterministic annealing approach
Hansjörg Klock, Joachim M. Buhmann*
Rheinische Friedrich-Wilhelms-Universität, Institut für Informatik III, Römerstraße 164, D-53117 Bonn, Germany
Received 15 March 1999
Abstract
Multidimensional scaling addresses the problem of how proximity data can be faithfully visualized as points in a low-dimensional Euclidean space. The quality of a data embedding is measured by a stress function which compares proximity values with Euclidean distances of the respective points. The corresponding minimization problem is non-convex and sensitive to local minima. We present a novel deterministic annealing algorithm for the frequently used objective SSTRESS and for Sammon mapping, derived in the framework of maximum entropy estimation. Experimental results demonstrate the superiority of our optimization technique compared to conventional gradient descent methods. © 2000 Published by Elsevier Science Ltd. All rights reserved.
Keywords: Multidimensional scaling; Visualization; Proximity data; Sammon mapping; Maximum entropy; Deterministic annealing; Optimization
1. Introduction Visualizing experimental data arises as a fundamental pattern recognition problem for exploratory data analysis in empirical sciences. Quite often the objects under investigation are represented by proximity data, e.g. by pairwise dissimilarity values instead of feature vectors. Such data occur in psychology, linguistics, genetics and other experimental sciences. Multidimensional scaling (MDS) is known as a collection of visualization techniques for proximity data which yield a set of representative data points in a suitable embedding space. These points are selected in such a way that their mutual distances match the respective proximity values as faithfully as possible. In the more familiar case of data represented by feature vectors, MDS can be used as a visualization tool. It establishes a mapping of these points to an
* Corresponding author.
E-mail addresses: [email protected] (H. Klock), [email protected] (J.M. Buhmann).
informative low-dimensional plane or manifold on the basis of pairwise Euclidean distances in the original feature space. Due to the relational nature of the data representation, the visualization poses a di$cult optimization problem. Section 2 provides a brief introduction to the multidimensional scaling concept. For a more detailed treatment of the subject the reader is referred to the monographs of Borg and Groenen [1] and Cox and Cox [2]. Kruskal has formulated the search for a set of representative data points as a continuous optimization problem [3]. Deterministic algorithms, the most frequently used candidates to solve such a problem, often converge quickly but display a tendency to get trapped in local minima. Stochastic techniques like simulated annealing treat the embedding coordinates as random variables and circumvent local minima at the expense of computation time. The merits of both techniques, speed and the capability to avoid local minima, are combined by the deterministic annealing approach. This design principle for optimization algorithms is reviewed in Section 3. Sections 4 and 5 present the new algorithms for Kruskal's stress minimization and for Sammon mapping. The
0031-3203/00/$20.00 © 2000 Published by Elsevier Science Ltd. All rights reserved. PII: S0031-3203(99)00078-3
applicability of the novel techniques to realistic problems is demonstrated in Section 6.
2. Multidimensional scaling

Multidimensional scaling refers to a class of algorithms for exploratory data analysis which visualize proximity relations of objects by distances between points in a low-dimensional Euclidean space. Proximity values are represented in the following as dissimilarity values. The reader is referred to Hartigan [4] for a detailed discussion on proximity structures. Mathematically, the dissimilarity of object i to object j is defined as a real number δ_ij. Throughout this paper we assume symmetric dissimilarity values, i.e., δ_ij = δ_ji. The MDS algorithm determines a spatial representation of the objects, i.e., each object i is represented by coordinates x_i ∈ R^M in an M-dimensional space. We will use X = {x_1, ..., x_N} to denote the entire embedding configuration. The distance between two points x_i and x_j of X is usually measured by the Euclidean distance d_ij ≡ d(x_i, x_j) = ||x_i - x_j||. Quite often, the raw dissimilarity data are not suitable for Euclidean embedding and an additional processing step is required. To model such data transformations we assume a monotonic non-linear transformation D(δ_ij) of dissimilarities into disparities. Ideally, after an iterative refinement of D(.) and X, the transformation D(.) should project the dissimilarities δ_ij to disparities that closely match the distances d_ij of the embedded points, i.e., d_ij ≈ D(δ_ij). As discussed by Klock et al. [5], a transformation of the dissimilarities δ_ij can be necessary to compensate a dimensionality mismatch between dissimilarities in the (hypothetical) data space and Euclidean distances in the embedding space.

2.1. Objective functions

Let us assume that the transformed dissimilarities D_ij = D(δ_ij) match sufficiently well metric distances in an embedding space. Under this condition it makes sense to formulate MDS as an optimization problem with the cost function

$$H(\{x_i\}) = \sum_{i=1}^{N} \sum_{k=1}^{N} w_{ik}\,(d_{ik} - D_{ik})^{2}. \qquad (1)$$

The factors w_ik are introduced to weight the disparities individually. This is useful in order to gauge the scale of the stress function, i.e., to normalize the absolute values of the disparities D_ij. Depending on the data analysis task at hand, it might be appropriate to use a local, a global or an intermediate normalization

$$w^{(l)}_{ik} = \frac{1}{N(N-1)\,D_{ik}^{2}}, \qquad w^{(g)}_{ik} = \frac{1}{\sum_{l,m=1}^{N} D_{lm}^{2}}, \qquad w^{(m)}_{ik} = \frac{1}{D_{ik}\sum_{l,m=1}^{N} D_{lm}}. \qquad (2)$$
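The three normalizations of Eq. (2) can be written down directly; the sketch below assumes a disparity matrix with zero diagonal, and the scheme labels and function name are ours.

```python
import numpy as np

def mds_weights(D, scheme="intermediate"):
    """Weights w_ik of Eq. (2) for a disparity matrix D (zero diagonal, w_ii = 0)."""
    D = np.asarray(D, dtype=float)
    N = D.shape[0]
    off = ~np.eye(N, dtype=bool)          # ignore the diagonal
    W = np.zeros_like(D)
    if scheme == "local":
        W[off] = 1.0 / (N * (N - 1) * D[off] ** 2)
    elif scheme == "global":
        W[off] = 1.0 / np.sum(D[off] ** 2)
    elif scheme == "intermediate":
        W[off] = 1.0 / (D[off] * np.sum(D[off]))
    else:
        raise ValueError(scheme)
    return W
```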
The different choices correspond to a minimization of relative, absolute or intermediate error [6]. The weighting w_ik might also be used to discount disparities with a high degree of experimental uncertainty. For the sake of simplicity w_ii = 0 for all i is assumed in the sequel. A common choice for Eq. (1) is

$$H^{\mathrm{MDS}}(\{x_i\}) = \sum_{i=1}^{N} \sum_{k=1}^{N} w_{ik}\,\big(\|x_i - x_k\|^{2} - D_{ik}\big)^{2}, \qquad (3)$$
as adopted by the ALSCAL algorithm (Alternating Least Squares Scaling) [7]. The squared Euclidean distances are used for computational simplicity. Note that one expects D_ik = d²_ik if the δ_ik sufficiently correspond to metric distances, i.e., the squaring of dissimilarities is subsumed into the choice of the function D. Eq. (3) is known as SSTRESS in the literature [2,7]. A more natural choice seems to be

$$H^{\mathrm{S}}(\{x_i\}) = \sum_{i=1}^{N} \sum_{k=1}^{N} w_{ik}\,\big(\|x_i - x_k\| - D_{ik}\big)^{2}, \qquad (4)$$
which is referred to as Sammon mapping [8]. Sammon used the intermediate normalization from Eq. (2) to search for a non-linear dimension reduction scheme, i.e., the dissimilarities D_ik are computed from a set of vectors {m_i ∈ R^n : 1 ≤ i ≤ N} in the n-dimensional input space. From the viewpoint of an embedding problem, i.e., finding an optimal X, there is no need to distinguish between MDS and dimension reduction via pairwise distances. Despite the fact that many different choices of the distance function are possible, e.g. based on other metrics, we will restrict the discussion in this paper to the minimization of Eqs. (3) and (4). MDS methods which minimize an objective function of this type are commonly referred to as least squares scaling (LSS) and belong to the class of metric multidimensional scaling algorithms. The term metric characterizes the type of transformation D(.) used to preprocess the dissimilarities and does not refer to a property of the embedding space [2]. Fig. 1 gives an idea of how MDS might be used in practice. Starting with the dissimilarity matrix (a) of 226 protein sequences from the globin family (dark grey levels correspond to small dissimilarities), embeddings are derived by minimizing Eq. (3) with global (b), intermediate (c) or local (d) weighting. The embeddings clearly reveal the cluster structure of the data with different accuracy in the representation of inter- and intra-cluster dissimilarities. Often it is not possible to construct an explicit functional form D(.) such that the mapped dissimilarities
Fig. 1. Similarity matrix (top-left) of 226 protein sequences of the globin family. Dark grey levels correspond to high similarity values. The other three figures show the embeddings derived by multidimensional scaling: (top-right) global, (bottom-left) intermediate and (bottom-right) local normalization of the stress function H^MDS.
D_ij of an empirical data set match sufficiently well metric distances. In such a situation the space of possible transformations D(.) has to be enlarged and should only be restricted by the monotonicity constraint δ_ij < δ_lk ⇒ D(δ_ij) ≤ D(δ_lk). Order-preserving but otherwise unconstrained transformations of the dissimilarities define the class of non-metric MDS algorithms invented by Shepard [9] and Kruskal [10]. In Kruskal's approach not the transformation D(.) but the disparity matrix
is modi"ed. The objective function di!ers slightly from Eq. (4). 2.2. Alternatives to gradient descent approaches Other algorithms discussed in the literature do not rely on explicit gradient descents. One of these methods, aimed at minimizing a stress function of the Sammon type (4), is known by the acronym SMACOF (Scaling
by MAjorizing A COmplicated Function) [11-13]. It is based on an iterative majorization algorithm that introduces ideas from convex analysis. Instead of minimizing the stress function H(X) directly, a majorizing function G(X, Y) is derived with

$$G(X, Y) \ge H(X), \qquad \forall X, Y \in \Omega, \qquad (5)$$
where Ω denotes the space of all coordinates. Equality holds for Y = X. During the iterative update, Y is the configuration from the previous step. The iterative majorization gives rise to a non-increasing sequence of stress values with linear convergence to a local minimum of the stress function [12]. The approach can be extended to account for arbitrary Minkowski distances [14]. The algorithms discussed up to this point are local minimizers sharing the problem of frequently getting stuck in a local minimum of the complicated energy landscape. Only a few global minimization strategies have been developed for MDS, the most prominent algorithm perhaps being the tunneling method [15]. This deterministic scheme allows the algorithm to escape local minima by "tunneling" to new configurations X with the same stress, possibly providing a starting point for further stress reduction. A second group of papers deals with the application of stochastic optimization techniques to MDS. Among these approaches there is an application of simulated annealing [16], sharing with our approach the concept of maximum entropy inference (see below). The hybrid MDS algorithm of Mathar and Zilinskas combines genetic algorithms with local optimization procedures [17].

2.3. Deficits of multidimensional scaling and Sammon mapping

An often discussed deficit of the classical multidimensional scaling techniques such as Sammon mapping is their inherent batch character [18,19]. A run of the program will only yield an embedding of the corresponding data without direct generalization capabilities. To project new data, the program has to be restarted on the pooled data, because a projection of additional data will modify the embedding of the old data as well. Another, perhaps more urgent deficit is the amount of proximity values that characterize large data sets. For non-linear dimension reduction, the standard technique clusters the data beforehand and visualizes the resulting cluster prototypes. This coarse-graining of a large data set by clustering, already proposed by Sammon in his original paper [8], is unsatisfactory and often unacceptable. The need to overcome this drawback has recently initiated a number of developments [18-21]. These approaches share the common idea to use the Sammon stress function as relative supervisor to train a nonlinear mapping. Such mappings can be implemented as a radial
basis function or a backpropagation network. If y = f(x; W) is the output vector, mapped by a function f which depends on a set of weights W, the stress becomes

$$H(\{x_i\}) = \sum_{i=1}^{N} \sum_{k=1}^{N} w_{ik}\,\big(\|f(x_i; W) - f(x_k; W)\| - D_{ik}\big)^{2}. \qquad (6)$$
Differentiating with respect to W yields a set of equations that can be used to adjust the weights W. Besides a batch approach, updating can be performed online by presenting pairs of patterns. Although the ideas discussed in this paper apply to these approaches as well, we will subsequently restrict our attention to the derivation of a novel framework for optimization of the classical stress functions.
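Before turning to the maximum entropy machinery, the two stress functions of Eqs. (3) and (4) are straightforward to evaluate; the following sketch (plain NumPy, function names ours) computes SSTRESS and the Sammon stress for a configuration X, a disparity matrix D and a weight matrix W such as the one produced by the weighting sketch above.

```python
import numpy as np

def squared_dists(X):
    """Matrix of squared Euclidean distances between the rows of X."""
    G = X @ X.T
    d2 = np.diag(G)[:, None] + np.diag(G)[None, :] - 2.0 * G
    return np.maximum(d2, 0.0)          # clip tiny negative round-off

def sstress(X, D, W):
    """Eq. (3): sum_ik w_ik (||x_i - x_k||^2 - D_ik)^2."""
    return float(np.sum(W * (squared_dists(X) - D) ** 2))

def sammon_stress(X, D, W):
    """Eq. (4): sum_ik w_ik (||x_i - x_k|| - D_ik)^2."""
    return float(np.sum(W * (np.sqrt(squared_dists(X)) - D) ** 2))
```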
3. The maximum entropy approach to optimization

3.1. Stochastic optimization and deterministic annealing

Since the introduction of simulated annealing in a seminal paper of Kirkpatrick et al. [22], stochastic maximum entropy approaches to optimization problems have found widespread acceptance in the pattern recognition and computer vision community as an alternative to gradient descent or other deterministic techniques. In application to MDS, the search for optimal solutions is a simulated random walk through the space Ω ⊂ R^{N·M} of possible embeddings X ∈ Ω. If implemented by a Markov Chain Monte Carlo method such as the Metropolis algorithm, this process converges to an equilibrium probability distribution known as the Gibbs distribution with density
$$P^{G}(X) = \exp\!\left(-\frac{1}{T}\,\big(H(X) - F\big)\right), \qquad F = -T\,\log \int_{\Omega} \exp\!\left(-\frac{1}{T}\,H(X)\right) dX. \qquad (7)$$
If we denote by P_Ω the space of probability densities over Ω, then the Gibbs density P^G minimizes an objective function over P_Ω called the generalized free energy

$$F_{P} = \langle H\rangle_{P} - T\,S(P) \equiv \int_{\Omega} P(X)\,H(X)\,dX + T \int_{\Omega} P(X)\,\log P(X)\,dX. \qquad (8)$$
⟨H⟩_P and S denote the expected energy and the entropy of the system with state space Ω and probability density P. The computational temperature T serves as a Lagrange multiplier to control the expected energy ⟨H⟩. Obviously, entropy maximization with fixed expected costs minimizes F_P [23].
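That the Gibbs density minimizes the generalized free energy of Eq. (8) can be checked numerically on a toy discrete state space; the energies below are random and purely illustrative.

```python
import numpy as np

def free_energy(P, H, T):
    """F_P = <H>_P - T S(P) = sum_x P(x) H(x) + T sum_x P(x) log P(x)."""
    P = np.asarray(P, dtype=float)
    return float(np.sum(P * H) + T * np.sum(P * np.log(P)))

rng = np.random.default_rng(0)
H = rng.normal(size=8)                 # energies of an 8-state toy system
T = 0.5
gibbs = np.exp(-H / T); gibbs /= gibbs.sum()
uniform = np.full_like(H, 1.0 / len(H))
print(free_energy(gibbs, H, T) <= free_energy(uniform, H, T))   # True
```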
Simulated and deterministic annealing: Eq. (8) motivates the idea to slowly reduce the temperature during an optimization process. Analogous to an experimental annealing of solids, solutions for an optimization problem are heated and cooled in simulations. To prevent the system from falling into poor local minima, one starts at a high temperature ¹ where the free energy landscape (8) is dominated by the entropy term and appears to be smoothed out. A decrease of the temperature then gradually reveals the structure of the original cost function de"ned by H. In simulated annealing, the interesting expectation values of the system parameters, e.g., the expected embedding coordinates in MDS, are estimated by sampling the Gibbs distribution PG using a Monte Carlo method. For a logarithmically slow decrease of the temperature convergence to a global optimum has been proven [24], but in practice simulated annealing is well-known for being slow compared to deterministic approaches. It is the aim of deterministic annealing to cure this drawback by exactly or approximately calculating the relevant expectation values w.r.t. the Gibbs distribution. Since the convincing success of an early application to data clustering [25,26], deterministic annealing has been applied to other combinatorial or continuous optimization problems such as pairwise data clustering [27], graph matching [28] and multidimensional scaling [5,29]. The interested reader is referred to [27] for a more detailed discussion.
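For illustration only, a Metropolis-type simulated annealing run over embedding coordinates might look as follows. This is the stochastic counterpart sketched above, not the deterministic annealing algorithm derived in this paper; the unweighted stress, step size and cooling schedule are arbitrary choices of ours.

```python
import numpy as np

def stress(X, D):
    """Unweighted Sammon-type stress sum_ik (||x_i - x_k|| - D_ik)^2."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return float(np.sum((d - D) ** 2))

def metropolis_annealing(D, dim=2, T0=1.0, Tmin=1e-3, cooling=0.95,
                         sweeps_per_T=50, step=0.1, seed=0):
    """Random-walk Metropolis sampling of the Gibbs density exp(-H(X)/T)
    over embedding coordinates, with an exponential cooling schedule."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    X = rng.normal(size=(N, dim))
    E, T = stress(X, D), T0
    while T > Tmin:
        for _ in range(sweeps_per_T):
            i = rng.integers(N)
            Xp = X.copy()
            Xp[i] += step * rng.normal(size=dim)        # perturb one site
            Ep = stress(Xp, D)
            if Ep < E or rng.random() < np.exp((E - Ep) / T):
                X, E = Xp, Ep                           # Metropolis acceptance
        T *= cooling                                    # cool down
    return X, E
```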
EM-Algorithm: The introduction of the mean field parameters Θ_i suggests an alternating algorithm to estimate the expectation values of the embedding coordinates. Iteratively, the parameters Θ_i are optimized given a vector of statistics Φ_i that contains all relevant information about the other sites (M-like step). This step is followed by a recomputation of the statistics Φ_k, k ≠ i, on the basis of the new parameters Θ_i (E-like step). The resulting alternation algorithm can be viewed as a generalized expectation-maximization algorithm [30].
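The alternation just described can be written as a generic skeleton; `compute_statistics` and `optimize_parameters` are placeholders for the model-specific E-like and M-like steps (the concrete statistics and parameter updates are derived in Section 4), and the schedule constants are illustrative.

```python
import numpy as np

def anneal_em(N, compute_statistics, optimize_parameters,
              T0=10.0, Tmin=1e-3, cooling=0.8, sweeps=5, seed=0):
    """Generic alternation: E-like recomputation of the per-site statistics and
    M-like re-optimization of the mean field parameters, visited sequentially
    in random order while the temperature decreases exponentially."""
    rng = np.random.default_rng(seed)
    params = [None] * N                      # one parameter vector per site
    T = T0
    while T > Tmin:
        for _ in range(sweeps):
            for i in rng.permutation(N):
                stats_i = compute_statistics(i, params, T)       # E-like step
                params[i] = optimize_parameters(i, stats_i, T)   # M-like step
        T *= cooling                          # exponential annealing schedule
    return params
```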
4. Derivation of the mean field approximations

Utilizing the symmetry D_ik = D_ki and neglecting constant terms, an expansion of H^MDS yields the expected costs

$$\langle H^{\mathrm{MDS}}\rangle = \sum_{i,k=1}^{N} w_{ik}\Big[\,2\langle \|x_i\|^{4}\rangle - 8\langle \|x_i\|^{2} x_i\rangle^{T}\langle x_k\rangle + 2\langle \|x_i\|^{2}\rangle\langle \|x_k\|^{2}\rangle + 4\,\mathrm{Tr}\big[\langle x_i x_i^{T}\rangle\langle x_k x_k^{T}\rangle\big] - 4 D_{ik}\big(\langle \|x_i\|^{2}\rangle - \langle x_i\rangle^{T}\langle x_k\rangle\big)\Big]. \qquad (11)$$

Tr[A] denotes the trace of matrix A. Expectation values in Eq. (11) are taken w.r.t. the factorized distribution P⁰ (9), i.e.,
N = SgT" < g(x ) q (x D # )dx . i i i i i i/1 ~=
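Given the per-site moments of a factorized density, the expectation of Eq. (11) can be evaluated directly. In the sketch below the moment arrays are hypothetical inputs; in the paper they come from the model densities discussed in Section 4.2.

```python
import numpy as np

def expected_sstress(W, D, m1, m2, m3, m4):
    """Expected SSTRESS of Eq. (11) under a factorized density.
    m1[i] = <x_i>, m2[i] = <x_i x_i^T>, m3[i] = <||x_i||^2 x_i>, m4[i] = <||x_i||^4>."""
    N = W.shape[0]
    s2 = np.array([np.trace(m2[i]) for i in range(N)])   # <||x_i||^2> = Tr <x_i x_i^T>
    total = 0.0
    for i in range(N):
        for k in range(N):
            total += W[i, k] * (2.0 * m4[i]
                                - 8.0 * m3[i] @ m1[k]
                                + 2.0 * s2[i] * s2[k]
                                + 4.0 * np.trace(m2[i] @ m2[k])
                                - 4.0 * D[i, k] * (s2[i] - m1[i] @ m1[k]))
    return float(total)
```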
3.2. Mean field approximation

4.1. A statistics for any mean field approximation
P
,
A
P0(XD#)log )
B
P0(X D #) dX. PG(X)
(10)
Before discussing computationally tractable model densities q_i(x_i | Θ_i), the statistics Φ = (Φ_1, ..., Φ_N) have to be calculated for an unconstrained mean field approximation. Using Eq. (11) we determine the Kullback-Leibler divergence between P⁰ and the Gibbs density P^G,

$$D(P^{0}\,\|\,P^{G}) = \sum_{i=1}^{N} \langle \log q_i(x_i \mid \Theta_i)\rangle + \frac{1}{T}\big[\langle H^{\mathrm{MDS}}\rangle - F\big]. \qquad (13)$$

The correct free energy F of the system does not depend on the mean field parameters and can be neglected in the minimization problem. Variation with respect to the parameters θ_ip of P⁰ leads to a system of transcendental equations

$$0 = T\,\frac{\partial D(P^{0}\|P^{G})}{\partial \theta_{ip}} = a_i^{0}\,\frac{\partial \langle \|x_i\|^{4}\rangle}{\partial \theta_{ip}} + \hat{h}_i^{T}\,\frac{\partial \langle \|x_i\|^{2} x_i\rangle}{\partial \theta_{ip}} + \frac{\partial}{\partial \theta_{ip}}\,\mathrm{Tr}\big[H_i \langle x_i x_i^{T}\rangle\big] + h_i^{T}\,\frac{\partial \langle x_i\rangle}{\partial \theta_{ip}} + T\,\frac{\partial}{\partial \theta_{ip}}\langle \log q_i\rangle, \quad 1 \le i \le N. \qquad (14)$$
0"¹
Fig. 2. Algorithm I.
All terms independent of the parameters h are collected ip in a vector of statistics ' "(a0, h , H , hK ) with i i i i i N hK "!8 + w Sx T, i ik k k/1
N a0"2 + w , i ik k/1
(15)
algorithm reaches with HS"0.00413 the best local minimum obtained in all experiments.1 Of course, if the dissimilarities are Euclidean distances of points in R2, the algorithm "nds a perfect reconstruction at ¹"0. 4.2. Model densities
N h "8 + w (D Sx T!SDDx DD2x T), i ik ik k k k k/1
(16)
N H " + w [8Sx xTT#4(SDDx DD2T!D )I]. i ik k k k ik k/1
(17)
I denotes the unit matrix. The reader should note that the derivation up to this point does not depend on the choice of the model density (9). ' is a statistics to compute any i mean "eld approximation to the Gibbs density PG with cost function (3). We propose an algorithm (Fig. 2) to compute the statistics '"(' ,2,' ) and the parameter estimates 1 N #"(# ,2,# ) in an iterative fashion. 1 N The algorithm decreases the temperature parameter exponentially while an estimate of the statistics ' (E-like step) is alternated with an optimization of the parameters # (M-like step). This can be carried out in parallel (with potential convergence problems caused by oscillations) or sequentially with an immediate update of the statistics '. The sequential variant of the generalized EM-algorithm with a random site visitation schedule and immediate update is known to exhibit satisfactory convergence properties [31]. It converges to a local minimum of the KL divergence since the parameters # are uniquely i determined by the statistics ' which do not explicitly i depend on # . i Fig. 3a}d displays the expected coordinates for the Iris data set at four di!erent temperatures. At ¹+0 the
The derivation of the MDS algorithm is completed by inserting the derivatives of the expectation values Sx T, Sx xTT, Sx DDx DD2T, SDDx DD4T and the entropy i i i i i i S"!Slog q (x D# )T into the stationary Eq. (14). i i i Depending on the chosen model q (x D# ) these values can i i i be computed analytically or they have to be estimated by Monte Carlo integration. The rest of this section will be devoted to a detailed discussion of some variants in the choice of q (x D# ). i i i Exact model: The ansatz # "' for the factorizing i i density:
A
B B
1 1 exp ! f (x ) , q0(x D # )" i i i ¹ i i Z0 i
P
Z0" i
=
A
1 dx exp ! f (x ) , i ¹ i i ~=
f (x )"a0DDx DD4#DDx DD2xThK #Tr[x xTH]#xTh i i i i i i i i i i i
(18)
(19) (20)
can be used in principle, since the factorial density is directly parameterized by the statistics ' . From (18) the i
1 Presumably the global minimum, but there is no proof. Also note that in anticipation of Section 5 we used the Sammon stress function here.
Fig. 3. Evolution of the embedding of the Iris data set at different temperature levels.
mean "eld approximation F of the free energy F is 0 given by N F "!¹ + log Z0 0 i i/1 N = 1 "!¹ + log exp ! f (x ) dx . (21) i ¹ i i ~= i/1 The ansatz (18) exactly estimates the marginals of the Gibbs density (7) with the stress function HMDS and, therefore, is called the exact model in this paper. The moments of x are dependent on the mean "eld parai
P
A
B
meters # "' . The former are related to the free energy i i F by the so-called self-consistency equations, i.e., the 0 derivatives of F with respect to the elements h , H , hK 0 i i i and a0 of the "eld vector i LF LF 0"Sx T, 0"Sx xTT, i i i Lh LH i i LF LF 0"SDDx DD2x T, 0"SDDx DD4T. (22) i i i Lh La0 i i Unfortunately the integral (21) cannot be evaluated analytically. A Taylor-series expansion of the argument f (x ) i i
of the exponential at the minima x with +f Dxip"0 yields ip i satisfactory results for low temperatures. At intermediate temperatures, however, the Gibbs distribution can be strongly skewed. The skew might introduce severe approximation errors if the number of modes is not estimated correctly, as indicated by numerical instabilities found in our simulations. Dirac model: To derive tractable approximations for the statistics ' we consider the Dirac delta distribution q (x Dl )"d(x !l ), i i i i i
(23)
centered at the location l . This model can be considered i as the zero temperature limit ¹P0 of the density (18) with moments Sx T"l ; Sx xTT"l lT; i i i i i i SDDx DD2x T"DDl DD2l ; i i i i
SDDx DD4T"DDl DD4. i i
(24)
Inserting the derivatives with respect to l into the stai tionary equations (14) yields the gradient of an M-dimensional potential ¹I (q )"a0DDl DD4#hK Tl DDl DD2# Tr [l lTH ]#hTl . (25) i i i i i i i i i i i i I quanti"es the partial costs of assigning site i to the i model q given the statistics ' . It is a fourth degree vector i i polynomial that can be minimized by gradient descent methods, e.g. conjugate gradient [32] or the technique described in the appendix to explicitly compute all minima. Gaussian models: The Dirac model for q (x Dl ) is indei i i pendent of the temperature and, consequently, does not exploit the smoothing e!ects of deterministic annealing at "nite ¹. A re"ned model based on a multivariate Gaussian with expectation value l and covariance i R correctly captures "nite ¹ e!ects and, thereby, i preserves the bene"ts of deterministic annealing,
A
B
(26)
with Z "DR D12 (2n)M2 . i i Here DR D denotes the determinant. In practice, however, i the full multivariate Gaussian model can be restricted to a radial basis function model with a diagonal covariance matrix R "p2I. The moments of this isotropic model i i q (x D l ,p ) are given by i i i i Sx T"l , i i
(27)
Sx xTT"p2I#l lT, i i i i i
(28)
SDDx DD2x T"Kp2l #DDliDD2l , i i i i i
(29)
(30) (31)
with K"M#2. Inserting these moments into the stationary Eqs. (14) yields LD "DDliDD2(4a0l #hK )#2l lThK i i i i i i Ll i #[2H #4Kp2a0I]l #h #Kp2hK , (32) i i i i i i i LD ¹ "4a0KMp3#(4Ka0DDl DD2#2KlThK i i i i i i Lp i M¹ . (33) #2Tr[H ])p ! i i p i As for the Dirac model the stationary Eqs. (32), (33) can be identi"ed with the gradient of the partial costs ¹
¹I (q )"a0SDDx DD4T#hK TSx DDx DD2T i i i i i i i #Tr[Sx xTTH ]#hTSx T!M¹ log p (34) i i i i i i w.r.t. the mean "eld parameters l and p . Note that for i i a "xed value p2 Eq. (32) de"nes the gradient of a quartic i vector potential in l as in the Dirac case. On the other i hand, given a "xed value of l , Eq. (33) amounts to i a quadratic equation in p2 with the unique solution i
S
p p2"! # i 2
p2 #q¹, 4
(35)
where (4Ka0DDl DD2#2KlThK #2Tr[H ]) i i i i i p" 4a0KM i and 1 q" 4a0K i
q (x Dl , R ) i i i i 1 1 " exp ! Tr[R~1(x !l )(x !l )T] i i i i i Z 2 i
Sx4T"2Mp4#4DDliDD2p2#(Mp2#DDliDD2)2, i i i i M !Slog q T" (1#log p2#log 2n) i i 2
(36)
since p2'0, q'0 and therefore !p/2(Jp2/4#q¹ for all p. Eq. (35) makes immediately clear how the temperature ¹ acts as a `fuzzi"era of the model distribution by introducing a "nite variance p2 for all temperi atures ¹'0. In the performed MDS experiments, the system (32, 33) of equations has been solved in an iterative fashion, alternating the computation l given p2 i i and p2 given l . i i 5. Approximations for Sammon mapping In contrast to the SSTRESS cost function (3), where an optimization algorithm based on local "x-point iterations exists, the usual way to minimize the costs of Sammon mapping is by gradient descent. These
approaches are known to be computationally expensive and prone to local minima. Also convergence critically depends on the chosen step-size [33,34]. A method to compute an optimal step size with proven convergence can be found in the literature [35], but at the expense of large computational cost. As outlined in the previous section, the interactions in the cost function HMDS can be completely eliminated introducing a vector of statistics ' . These statistics capi ture all relevant information to "nd an optimal embedding of one selected point keeping the other sites xxed. This strategy is not directly applicable in the case of Sammon mapping for the following reason: Expanding the stress function (4)
maximum in this situation and a naive "xed-point iteration based on Eq. (41) would perform an undesired gradient ascent.
HS"+ w (DDx !x DD2!2D DDx !x DD#D2 ) (37) ik i k ik i k ik ik and di!erentiating with respect to x i LHS N D ik "4 + w (x !x ) 1! (38) ik i k Lx DDx !x DD i i k k/1 reveals that the Euclidean distance DDx !x DD introduces i k a strong coupling of the sites by its occurrence in the denominator of the stationary equations. Furthermore, LHS/Lx is plagued by discontinuities at x "x . The i i k Hessian matrix (w.r.t. to a single point) is given by
(43) F(x)"max [yx!Fw(y)]. y Fw denotes the conjugate of F, derived by the Legendre}Fenchel transformation [38,39]
A
B
CA
B D
L2HS N D ik "4 + w I 1! ik Lx xT DDx !x DD i i i k k/1 (x !x )(x !x )T k i k . #D i (39) ik DDx !x DD3 i k Interestingly for the one-dimensional case it is constant except for the points x "x , where the gradient is not i k de"ned. To derive simple stationary equations for the moments as in case of HMDS, the denominator of the fraction in Eq. (38) cannot be approximated as a constant * "DDx !x DD, (40) ik i k computed e.g. from the expectations of the previous step. The reason is as follows: The corresponding cost function
A
B
D H K S"2+ w 1! ik DDx !x DD2 (41) ik i k * ik i,k de"nes a paraboloid with respect to the coordinates of a selected point x with a Hessian matrix given by i L2H K S N D "2I + w 1! ik . (42) ik Lx xT * i i ik k/1 The (constant) Hessian is not strictly positive de"nite, e.g. as soon as D '* holds for enough sites k, the rightik ik hand side of Eq. (42) might become negative de"nite at the (one) extremum of H K S, i.e., the paraboloid #ips its orientation. The stationary equations describe a local
A
B
5.1. Algebraic transformation This section describes a new approach to Sammon mapping based on a "x-point preserving algebraic transformation of objective functions [36]. Originally, this transformation has been used to linearize a quadratic term in the cost function [36,37]. In the context of Sammon mapping this approach preserves the quartic nature of the cost function while discarding the inconvenient square root term in Eq. (37). The key idea is to express a convex term F(x) in a cost function by
Fw(y)"max [yx!F(x)] (44) x of F(x). The conjugate Fw(y) is also a convex function in the dual variable y. Geometrically, Eq. (43) de"nes a representation of F by its tangent space. Applying this transformation to the cost function (37), we eliminate the convex second term of each summand by the transformation
C
D
X !2JX Pmax ! ik!* , ik ik * ik *ik X " : DDx !x DD2, 1)i, k)N, (45) ik i k introducing an auxiliary variable * . Additional ik straightforward algebra yields the expression
A
B
X arg max ! ik!* "JX . (46) ik ik * ik *ik A second transformation is applied to the cost function (37) in order to enforce the existence of at least one local minimum. For this purpose, the "rst term of Eq. (37) has to be rewritten as X2 2X P ik#*I . (47) ik ik *I ik Optimal values of the auxiliary parameters *I satisfy the ik condition
A
B
X2 ik#*I "X . (48) arg min ik ik *I I ik ik * The reader should note that *I have to assume a minik imum due to the concavity of the "rst term in X2 . ik In summary, the transformations (45), (47) turn the minimization of HS into the computation of a saddle point, i.e., a local maximum w.r.t. the auxiliary
parameters M* N and a minimum w.r.t. the parameters ik M*I N as well as the coordinates X: ik
A A
N w DDx !x DD4 i k #*I H I S" + ik ik 2 *I ik i,k/1 !2D ik
B
B
DDx !x DD2 i k #* #2D2 . ik ik * ik
(49)
Inserting Eqs. (46) and (48) into Eq. (49) shows that the minima of the original and the saddle-points of the transformed objective function can be identi"ed. The gradient of the transformed cost function H I S
A
B
LH I S N DDx !x DD2 D i k ! ik (x !x ) "4 + w ik i k Lx *I * i ik ik k/1
(50)
equals the gradient (38) of HS at *"*015. But Eq. (49) has distinct computational advantages, i.e., in contrast to Eq. (41) which might only de"ne one local maximum, Eq. (49) guarantees the existence of at least one local minimum. Consequently, the set of stationary equations can be solved to yield a currently optimal x keeping the other i points and the auxiliary parameters "xed, and an iterative optimization scheme analogous to the case of HMDS can be de"ned. We will denote by *"M* , *I N the complete set of ik ik auxiliary variables and by *015 their current optimal value as de"ned by DDx !x DD. For "xed *, the transi k formed stress (49) of Sammon mapping reveals an appealing similarity with HMDS. Neglecting the constant terms, the only di!erence turns out to be the additional weighting by 1/* and 1/*I , respectively. Note that in ik ik the zero-temperature limit (no stochastic variables) *2 and *I can be identi"ed, as it is done for the rest of ik ik this subsection. For the derivation of the deterministic annealing algorithm, we have to distinguish strictly between both sets of variables. To understand the e!ects of the transformation, we analyze the cost of embedding in R two points with mutual dissimilarity D "1. The "rst point is kept "xed 01 at x "0. The graph in Fig. 4 displays the costs of 0 embedding the second point at locations x . The bold 1 line depicts the Sammon costs. Note the characteristic peak at x"0 due to the discontinuity of the derivative of DDx !x DD at x "x . The discontinuity is smoothed out 1 0 1 0 by the approximation, shown for di!erent values of * 01 (thin lines). We note the following important properties that hold for the M-dimensional case as well: f The approximation is exact for the optimal value of * , and the approximated cost function smoothly 01 approaches the Sammon costs at DDx !x DD"* . 1 0 01 f For large values of DDx !x DD, the approximation 1 0 yields an upper bound on the local Sammon costs for all values of the parameter * . 01
Fig. 4. Visualization of the approximation of Sammon mapping for the simple case of embedding two points in one dimension (see text). The first point is kept fixed at x_0 = 0. The graph displays the cost of embedding the second point at locations x_1. Bold: Sammon costs. Thin: approximation for different values of Δ_01.
f This upper bound does not hold for small values * (D /2 if DDx !x DD(minMD /2,* N, but the reik 01 1 0 01 01 sulting error is always bounded. We suspect that the discontinuities of the derivative are related to the computational problems of Sammon mapping. Consider the gradient (38) of HS with respect to a single point x . Each term of the sum will introduce i a hypersphere around x de"ned by the denominator k DDx !x DD. Passing the surface of one of these hyperi k spheres, one encounters a discontinuity of the gradient. If the absolute value of the gradient is not too large, the discontinuity can reverse the direction of the gradient, implying the existence of another local minimum. Consequently the number of local minima is related with the number N of points to be embedded. This contrasts with the maximal number of 2M#1 extrema encountered when embedding a single point with the SSTRESS function (see Appendix), where M is the embedding dimension. 5.2. Deterministic annealing for Sammon mapping To develop a deterministic annealing algorithm, the embedding coordinates are considered to be random variables. The similarity of HMDS and H I S strongly motivates a derivation analogous to Section 4. It has to be emphasized at this point that the free energy of the transformed stress function F I S"SH I ST!¹S
(51)
does not provide an upper bound for the true free energy FS de"ned by HS. But since the saddle-points of F I S and FS coincide in the zero-temperature limit, the minimization of an upper bound on F I S will still solve our optimization problem.
Fig. 5. Algorithm II.
The auxiliary variables are now determined by a variation of the Kullback}Leibler divergence (13), entering the latter via the transformed expected costs2
C A
N w SDDx !x DD4T i k #*I SH I ST" + ik ik *I 2 ik i,k/1 SDDx !x DD2T i k #* !2D . (52) ik ik * ik At "nite temperatures, *2 and *I will assume di!erent ik ik values:
BD
*I "JSDDx !x DD4T, * "JSDDx !x DD2T. (53) ik i k ik i k Introducing e!ective weights w8 and dissimilarities ik DI de"ned by ik w *I w8 " ik , DI "D ik, (54) ik 2*I ik ik * ik ik we can immediately identify H I S with HMDS up to constant terms as far as the minimization w.r.t. X is concerned. Applying the techniques treated in the previous paragraphs, we iterate the EM-loop of HMDS and the adaptation of the auxiliary parameters *. This suggest the following deterministic annealing algorithm (shown in the algorithm box (Fig. 5)) to compute an embedding based on the Sammon mapping cost function. As for HMDS, the algorithm decreases the computational temperature ¹ exponentially while iterating in an
2 Constant terms have been neglected.
asynchronous update scheme the estimation of the statistics ' (E-like step) and the optimization of the mean "eld parameters # (M-like step). Again the iteration proceeds until the gain in KL divergence falls below a threshold e. The update of the conjugate variables * and *I can be performed before re-entering the EM-like loop after convergence. In our practical implementation the update is performed in conjunction with the exponential decrease of ¹, but also with an immediate update directly before the E-like step. We did not experience any instabilities.
6. Simulation results In the following, we discuss a number of data sets and show typical embeddings derived by Sammon mapping and SSTRESS-based metric MDS. For three of the data sets, one with very low stress (Iris), one with intermediate stress (Virus) and another with very high stress (Random) we performed a large number of runs (1000 each) with random starting conditions in order to derive reliable statistics. The experiments consistently demonstrate the success of the deterministic annealing approach. 6.1. Iris data A well-known data set widely used in the pattern recognition literature is the Iris data [40]. It contains three classes of 50 instances each, where each class refers to a type of iris plants. The data is four-dimensional, and consists of measurements of sepal and petal length. One class is linearly separable from the other two, but the
Table 1
Statistics of an experiment of 1000 runs of Sammon mapping on the Iris data set with (1) gradient descent (LVQ-PAK), (2) zero-temperature and (3) deterministic annealing algorithm

Algorithm   Mean [10^-3]   Std. Dev. [10^-3]   Max. [10^-3]   Min. [10^-3]
Gradient    5.281          0.841               9.999          4.129
Zero        5.084          0.777               8.904          4.130
Annealing   4.255          0.289               5.208          4.129
Table 2
Statistics of an experiment of 1000 runs of Sammon mapping on the Virus data set with (1) gradient descent (LVQ-PAK), (2) zero-temperature and (3) deterministic annealing algorithm

Algorithm   Mean      Std. Dev.   Min.      Max.
Gradient    0.04558   0.003266    0.04156   0.05827
Zero        0.04407   0.002611    0.04156   0.05184
Annealing   0.04157   0           0.04157   0.04157
Fig. 6. Sammon mapping applied to the Iris data for non-linear dimension reduction. Histograms of 1000 runs with deterministic annealing (gray) and gradient descent (white).
latter are not. Two feature vectors are identical and were removed before we computed the dissimilarity matrix. We applied Sammon mapping to this data in order to derive a two-dimensional representation. The resulting embedding is shown in Fig. 3d. We compared the results of the deterministic annealing algorithm with the classical gradient descent approach. For the latter we used the program sammon supplied with the widely used LVQPAK [41] software package. We ran both algorithms 1000 times. Each run of the sammon program lasted 1000 iterations. There was hardly any improvement by increasing that number. Each run of the deterministic annealing algorithm performed 80 cycles through the coordinate set (Table 1). Fig. 6 depicts the histogram of "nal stress values. While the gradient descent optimization produces solutions with broad distribution in quality, deterministic annealing reached the best two bins in nearly 90% of the runs. Further improvement is expected for a re"ned (although more time-consuming) annealing schedule. 6.2. Virus data A second experiment was performed on a data set described in [34]. The data consists of 60 vectors with 18
Fig. 7. Virus data, Sammon mapping. Histograms of 1000 runs with gradient descent (dark gray), zero-temperature (light gray) and deterministic annealing algorithm (white with stripes).
entries describing the biochemical features of a virus under investigation. The Tomabovirus subset exhibits a large number of poor local minima when analyzed with the original Sammon mapping algorithm [34]. A comparison of our (zero temperature) results with the solutions produced by the program sammon are summarized as follows: The zero temperature version of the algorithm avoided very poor local minima but produced a broad distribution. The results for sammon were marginally worse. Deterministic annealing, however, found the best solution in almost all cases (see Table 2). These experiments support the view that deterministic annealing eliminates the need for a good starting solution. Fig. 7 shows the histograms of the corresponding runs. 6.3. Embedding of random dissimilarities Random dissimilarities pose a particularly di$cult problem for embedding since a lot of Euclidean constraints are violated. We have performed this experiment in order to demonstrate the power of the deterministic annealing technique in situations where the energy landscape becomes very rugged. Dissimilarities in the data set have been randomly drawn from a bimodal Gaussian mixture with k "1.0 and k "2.0, both mixture com1 2 ponents with standard deviation p"0.1. It turns out
Fig. 8. Bimodal random data: Histograms of the "nal SSTRESS of 1000 runs of the deterministic annealing algorithm (gray) versus 1000 runs of the zero temperature version (white) with local weighting. Table 3 Statistics of an experiment of 1000 runs of the SSTRESS-based multidimensional scaling on the bi-modal random data set with (1) zero-temperature and (2) deterministic annealing algorithm Algorithm
Mean
Std. Dev.
Min.
Max.
Zero Annealing
0.4652 0.4609
0.00204 0.00094
0.4714 0.4653
0.4584 0.4577
that the probability to reach the global optimum by a random starting solution shrinks signi"cantly compared to the Virus data set. Histograms of deterministic annealing solutions and zero temperature solutions are shown in Fig. 8. 95% of the deterministic annealing solutions can be found in the top 10% range of the gradient descent solutions. This experiment was performed with the SSTRESS objective function (3). The statistics of the experiment are presented in Table 3. 6.4. Other experiments Protein data: Another real-world data set which we used for testing consists of 226 protein sequences. The dissimilarity values between pairs of sequences have been determined by a sequence alignment program based on biochemical and structural information. The sequences belong to di!erent globin families abbreviated by the displayed capital letters. The "nal stress of about SSTRESS"10% is considerably higher than for the Iris and Virus data set. Fig. 1 displays both a grey level visualization of the dissimilarity matrix (dark values denote high similarity) which have been sorted according to a clustering solution and the discovered embedding
which is in good agreement with the similarity values of the data. Note the signi"cant di!erences between the three embeddings. Results are consistent with the biochemical classi"cation. Embedding of a face recognition database: Another visualization experiment was motivated by the development of a face recognition system. A database of 492 persons3 was used to obtain a dissimilarity matrix by a variant of a facial graph matching algorithm based on Gabortransform features [42]. The images were restricted to the central region of the face and did not include signi"cant background information. Additional preprocessing of the dissimilarity values was required to remove artifacts resulting from a dimensional mismatch [5] in the dissimilarity distribution. There is no prominent cluster structure, and the stress was comparatively high (around 10%), which is indicative for a too low dimension M of the embedding space. Despite these shortcomings, regions with images containing distinctive features such as a pair of glasses, smiling or an opened mouth showing the teeth can be well separated even in this low-dimensional representation as can be seen in Fig. 11. These distinctive regions are also supported by the results of a pairwise clustering of the data [27]. The experiment is intended to demonstrate how MDS can serve as a tool for exploratory data analysis in data mining applications. It allows the user to get an impression what properties are selected by his (dis-)similarity measure when analyzing relational data. Together with data clustering tools this procedure might reveal additional facets of the data set under investigation. 6.5. A note on performance CPU time: Annealing techniques have the reputation of being notoriously slow. The algorithms described in this paper support a di!erent picture: Table 4 presents the average CPU time needed to compute embeddings for three of the data sets discussed above. For, e.g. the Iris data set and depending on the convergence parameter e as well as the annealing schedule (g"0.8 within an exponential schedule), the total execution time of 203 seconds CPU time on a 300 MHz Linux Pentium-II is indeed comparable to the CPU time of sammon (149 s/5000 iterations). A rich potential for optimization resides in the expensive update of the site statistics ' (E-like step) which is of i order O(N2). Omitting the update of those sites which do not change during the M-like step can help to reduce execution time. A systematic evaluation of such potentials by selection strategies is part of current research.
3 FERET database, P.J. Phillips, U.S. Army Research Lab.
Speed-accuracy tradeoff: Despite the potential for further optimization of the CPU time requirements, it is worth considering an application of our current DA implementation even if computation time is limited. Fig. 9 shows the average Sammon STRESS obtained with gradient descent, zero temperature and annealing on the Iris data set. Neither zero temperature nor annealing produces acceptable results in less than about 20 s. But as soon as this time is available, results are better on average than those of Sammon for both zero-temperature and annealed optimization. Note also that the variance has been significantly reduced, i.e., solutions with higher than average stress are less likely to occur (cf. Table 5). Parameter sensitivity: To evaluate the robustness properties of the annealing algorithm, we performed a number of experiments with suboptimal annealing and convergence
parameters in order to enforce fast convergence. Table 5 lists the average stress obtained in 100 runs with the respective values of
- the start temperature T_0,
- the annealing parameter g,
- the convergence threshold e.
Table 4
Average CPU time elapsed on a standard 300 MHz Linux Pentium-II while computing three of the described embeddings with (1) gradient descent (LVQ-PAK) on Sammon mapping (5000 iterations), (2) zero-temperature as well as (3) deterministic annealing on Sammon mapping

Algorithm           Iris (s)   Globin (s)   Virus (s)
Gradient sammon     149        -            9.7
Zero Sammon         200        340          21
Annealing Sammon    203        353          24
Iris (s)
Globin (s)
Virus (s)
Gradient sammon Zero Sammon Annealing Sammon
149 200 203
} 340 353
9.7 21 24
Fig. 9. Improvement of the average solution as a function of the available CPU time. Dashed: Sammon, dotted: zero temperature, solid: annealing for low and high initial temperatures (upper curve: ¹ "0.1, lower curve: ¹ "10). In the latter case 0 0 CPU time has been controlled by a variation of the annealing parameter g.
Table 5
Average stress and standard deviation after 100 runs of the deterministic annealing algorithm on the Iris data set, subject to a variation of the annealing parameters T_0, η and ε. For each combination of the parameters, the table lists the average final stress (in 10⁻³), the corresponding standard deviation (in brackets, in 10⁻³) and the average CPU time (in seconds) needed to compute an embedding.

                ε = 1.0              ε = 0.1              ε = 0.01             ε = 0.001
η               Stress        CPU    Stress        CPU    Stress        CPU    Stress        CPU

T_0 = 10
0.2             12.7 [1.80]    20    8.63 [2.67]    23    5.56 [1.23]    27    4.58 [0.52]    36
0.4             6.30 [1.21]    35    5.05 [0.68]    37    4.56 [0.27]    41    4.32 [0.23]    58
0.6             4.65 [0.65]    59    4.38 [0.15]    61    4.27 [0.16]    67    4.18 [0.01]    96
0.8             4.17 [0.01]   126    4.16 [0.01]   128    4.17 [0.01]   140    4.15 [0.01]   203

T_0 = 0.1
0.2             21.3 [10.1]    14    7.91 [1.75]    19    5.12 [0.67]    22    4.91 [0.52]    29
0.4             8.14 [1.21]    24    6.46 [1.30]    29    4.98 [0.93]    32    4.66 [0.51]    42
0.6             5.56 [0.87]    40    5.76 [1.17]    45    5.19 [1.09]    48    4.65 [0.63]    59
0.8             4.71 [0.55]    86    5.40 [0.63]    91    5.05 [0.61]    91    4.55 [0.50]   110

T_0 = 0.001
0.2             137  [62.5]     8    79.1 [63.0]    10    33.2 [27.4]    16    27.2 [21.7]    19
0.4             25.4 [20.2]    14    14.4 [13.6]    17    6.98 [1.65]    24    7.19 [1.62]    31
0.6             9.67 [1.91]    22    7.80 [1.40]    25    5.97 [1.35]    33    5.57 [1.27]    39
0.8             5.94 [1.05]    44    5.52 [0.98]    47    5.08 [0.76]    53    5.01 [0.72]    62
The starting temperature of course has a certain effect on the quality of the final solution. Apparently the annealing parameter η has a strong effect on the CPU time as well as on the quality of the final result. In addition, the convergence parameter ε exhibits a considerable influence on the solution quality. Interestingly, the effect of ε on the CPU time is not as prominent as that of η. We suspect that waiting for true convergence before cooling is particularly important at higher temperatures, when structural parameters like the orientation of complete clusters are determined (see next section). Furthermore, true convergence at higher temperatures seems to lead to faster convergence at lower temperatures afterwards.
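To make the role of these parameters concrete, the following Python sketch shows a generic exponential annealing loop of the kind studied in Table 5. It is an illustration under stated assumptions, not the authors' implementation: the callables em_like_step and stress are placeholders for the EM-like update and the cost evaluation, and the convergence test is a simple relative-change criterion.

    def anneal(em_like_step, stress, T0=10.0, eta=0.8, eps=0.01,
               T_final=1e-4, max_inner=1000):
        """Exponential annealing schedule controlled by T0, eta and eps.

        em_like_step(T) performs one EM-like update at temperature T and
        returns the current configuration; stress(config) evaluates it.
        """
        T = T0
        config = em_like_step(T)
        while T > T_final:
            prev = stress(config)
            for _ in range(max_inner):      # wait for (approximate) convergence
                config = em_like_step(T)
                cur = stress(config)
                if abs(prev - cur) <= eps * max(abs(prev), 1e-12):
                    break
                prev = cur
            T *= eta                        # cool: T_{k+1} = eta * T_k
        return config

Larger T0, slower cooling (η closer to 1) and a stricter ε all increase the number of inner iterations, which is consistent with the CPU-time pattern reported in Table 5.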
6.6. Structural differences of local minima configurations

Are there significant structural differences between the embedding with the smallest stress and those configurations which are supposed to be good local minima in terms of the stress? To answer this question, we performed Procrustes analysis [2] on pairs of embeddings (translation, rotation and scaling) to optimally match the configurations in a linear sense. As a typical example consider Fig. 10. For the Iris data set it displays the difference between the best solution found in all experiments, with stress H = 4.13 × 10⁻³ (presumably the optimal solution), and a local minimum configuration with
Fig. 10. Difference vectors between the coordinates of two embeddings computed for the Iris data with Sammon mapping: optimal configuration found (stress = 0.00413) and a local minimum (stress = 0.00505). The ends of each line denote the position of the same object in the two embeddings.
stress H = 5.05 × 10⁻³. Corresponding points are connected by lines. We find a complete reflection of the first cluster (the separate one) between the two solutions. The two other clusters do not differ significantly. Nevertheless, three points of the intermediate cluster are embedded in a completely different neighborhood in the suboptimal configuration. A data analyst who relies exclusively on this visualization might therefore draw unwarranted conclusions in this situation. The large distance between the different embeddings of the three points in the intermediate cluster is supposed to
be a consequence of the reflection of the first cluster. In order to find a better minimum, we would then have to undo the reflection, i.e., rearrange the first cluster completely. Clearly this will not happen if the configuration is already in a stable state: the embedding process gets stuck in this local minimum. Deterministic annealing helps to avoid such situations, since the reflectional symmetry of the clusters is broken at the beginning of the experiment, at high temperatures, when the points inside each cluster are still at an entropy-dominated position. If, e.g., the second and third cluster have just determined
Fig. 11. Face recognition database: embedding of a distance matrix of 492 faces by Sammon mapping (300 randomly chosen faces are used for visualization).
their symmetry, the symmetry of the first cluster can be adjusted with little effort. At low temperatures, however, such global reconfigurations are unlikely if the stress of the embedding is small compared to the transition temperature.
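The Procrustes alignment used above for comparing pairs of embeddings can be sketched as follows. This is a minimal illustration of ordinary Procrustes analysis (translation, rotation, isotropic scaling), not the authors' code; note that this variant allows reflections, which would otherwise have to be excluded by correcting the sign of the rotation's determinant.

    import numpy as np

    def procrustes_align(X, Y):
        """Align configuration Y to X by translation, rotation and scaling;
        returns the aligned copy of Y and the per-point difference vectors,
        as visualized in Fig. 10."""
        X = np.asarray(X, float)
        Y = np.asarray(Y, float)
        Xc = X - X.mean(axis=0)
        Yc = Y - Y.mean(axis=0)
        # optimal orthogonal transform from the SVD of the cross-covariance
        U, _, Vt = np.linalg.svd(Xc.T @ Yc)
        R = U @ Vt
        # optimal isotropic scaling
        s = np.trace(Xc.T @ (Yc @ R.T)) / np.trace(Yc.T @ Yc)
        Y_aligned = s * (Yc @ R.T) + X.mean(axis=0)
        return Y_aligned, X - Y_aligned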
7. Conclusion

A novel algorithm for least-squares multidimensional scaling based on the widely used SSTRESS objective has been developed in the framework of maximum entropy inference [5]. The well-known optimization principle 'deterministic annealing' has been generalized to continuous optimization problems. An algebraic transformation enables us to adapt the approach to Sammon mapping. Thus it covers the two most widely used MDS criterion functions. A large number of MDS experiments support our claim that annealing techniques display superior robustness properties compared to conventional gradient descent methods, both on synthetic and on real-world data sets. The computational complexity of the new algorithms is comparable to standard techniques. As the algorithms minimize an upper bound on the free energy defined by the respective cost function, convergence is guaranteed independently of any critical parameters such as the step size in the gradient descent approach. Our current research focuses on techniques to alleviate the computational burden posed by a large number N of objects, e.g. N ≈ 10,000-50,000 for realistic biochemical or document databases. Active sampling strategies will enable the estimation of the statistics on the basis of a sparsely sampled dissimilarity matrix, and an integration of clustering and embedding yields the approximation of site-site by site-cluster interactions. Another line of research extends the deterministic annealing principle to alternative cost functions for MDS, e.g. other choices for the metric of the embedding space.
Acknowledgements M. Vingron provided us with the protein database. H. K. would like to thank T. Hofmann for valuable discussions. This work has been supported in part by the Federal Ministry of Education and Research.
Appendix. Minimization of the partial costs

The minimization of the potentials (25) and (34), which define the partial cost of embedding x_i with a local density model q_i(x_i | Θ_i), plays an important role for the convergence of the algorithm, since a straightforward minimization by Newton-Raphson or conjugate gradient is likely to find only a local minimum.
We therefore present a technique to explicitly enumerate all extrema, which will be feasible at least for moderate embedding dimensions. The derivation uses the fact that the cost function (1) is invariant with respect to translations and rotations of the configuration {x_i}. Replacing {x_i} by

{x̂_i | x̂_i = R(x_i − y)},  R ∈ SO_n, y ∈ ℝ^M,    (A.1)
the costs do not change: H({x_i}) = H({x̂_i}). Given the partial costs in the general form

f_i(ν_i) = ‖ν_i‖⁴ + ‖ν_i‖² ν_iᵀ ĥ_i + ν_iᵀ H_i ν_i + ν_iᵀ h_i,    (A.2)
a canonic choice for R and y can be derived that simplifies Eq. (A.2) significantly. We discuss the case of the Dirac model (25) here. To obtain an equation of the form (A.2) for the Gaussian model from Eq. (32), effective fields have to be defined which subsume the additional terms depending on the model variance σ_i². The first step is to calibrate y such that the coefficients of ĥ_i vanish:

0 = ĥ_i = −(4/a_i) Σ_{k≠i} w_ik ⟨x_k − y⟩.    (A.3)
This leads to the choice

y_i = (1/a_i⁰) Σ_{k≠i} w_ik ⟨x_k⟩.    (A.4)
If one translates the configuration by y, the coordinate moments change as follows (omitting the index i):

⟨x̂⟩ = ⟨x⟩ + y,
⟨x̂ x̂ᵀ⟩ = ⟨x xᵀ⟩ + ⟨x⟩yᵀ + y⟨x⟩ᵀ + yyᵀ,
⟨x̂ x̂ᵀ x̂⟩ = ⟨x xᵀ x⟩ + ⟨|x|²⟩y + 2⟨x xᵀ⟩y + 2yyᵀ⟨x⟩ + |y|²⟨x⟩ + yyᵀy.    (A.5)
The variables of the translated system are marked with a hat. Consequently, the translated statistics have to be computed according to Eq. (A.5). Rotating the coordinate system into the eigensystem of the symmetric matrix H by an orthogonal matrix V yields a diagonal matrix D,

D = Diag(λ_1, …, λ_n) = VᵀHV,    (A.6)
with λ_i, 1 ≤ i ≤ n, being the eigenvalues of B. After division by a_i⁰, translation and rotation, the potential has the form

f(ν) = ‖ν‖⁴ + νᵀDν + νᵀh,    (A.7)
omitting the index i and the hat above ν for conciseness. To compute the extrema of this potential, set the components of the gradient to zero:

∂_a f(ν) = 4 ν_a ‖ν‖² + 2 λ_a ν_a + h_a = 0,  1 ≤ a ≤ M.    (A.8)
If ρ ≠ −λ_a/2, the solution for ν is

ν_a = −h_a / (4ρ + 2λ_a);  ρ = ‖ν‖².    (A.9)

For (A.9) to hold, ρ must fulfill the condition

ρ = ‖ν‖² = Σ_{a=1}^{M} ν_a² = Σ_{a=1}^{M} h_a² / (4ρ + 2λ_a)²,    (A.10)

which is equivalent to

ρ ∏_{b=1}^{M} (4ρ + 2λ_b)² − Σ_{a=1}^{M} h_a² ∏_{b≠a} (4ρ + 2λ_b)² = 0.    (A.11)

This is a polynomial of degree 2M+1 in one variable. Its roots have to be evaluated numerically, e.g. by Laguerre's method [32]. Applying the inverse rotation and the inverse translation to (A.9), the extrema of (A.2) can be determined directly from the roots ρ_q, 1 ≤ q ≤ 2M+1. If h = 0, the obvious solutions of (A.8) are ρ = 0 and

ρ = −λ_a/2,  1 ≤ a ≤ M, λ_a ≤ 0.    (A.12)

By rearranging (A.8), solutions for ν are obtained,

ν = ±√(−λ_a) e_a,  1 ≤ a ≤ M, λ_a ≤ 0,    (A.13)

where e_a is the a-th unit vector. Again, applying the inverse rotation and the inverse translation yields the results in the original coordinate system.
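As an illustration of this enumeration, the following Python sketch builds the radial polynomial (A.11) and collects the candidate stationary points in the rotated and translated frame. It is illustrative only: variable names are not from the paper, numpy root finding stands in for Laguerre's method, and the special case h = 0 of Eqs. (A.12)-(A.13) is not handled.

    import numpy as np

    def enumerate_extrema(lam, h):
        """Stationary points of f(v) = ||v||^4 + v'Diag(lam)v + v'h (Eq. (A.7))
        obtained from the roots of the radial polynomial (A.11)."""
        lam = np.asarray(lam, float)
        h = np.asarray(h, float)
        M = len(lam)
        P = np.polynomial.Polynomial
        lin = [P([2.0 * l, 4.0]) for l in lam]          # 4*rho + 2*lam_b
        poly = P([0.0, 1.0])                            # rho
        for factor in lin:
            poly = poly * factor * factor               # rho * prod (4rho+2lam_b)^2
        for a in range(M):
            term = P([h[a] ** 2])
            for b in range(M):
                if b != a:
                    term = term * lin[b] * lin[b]
            poly = poly - term                          # subtract h_a^2 * prod_{b!=a}
        candidates = []
        roots = poly.roots()
        for rho in roots[np.isreal(roots)].real:
            if rho < 0:                                 # rho = ||v||^2 must be >= 0
                continue
            denom = 4.0 * rho + 2.0 * lam
            if np.any(np.abs(denom) < 1e-12):
                continue
            candidates.append(-h / denom)               # Eq. (A.9)
        return candidates

    def best_minimum(lam, h):
        """Pick the candidate with the smallest partial cost."""
        lam = np.asarray(lam, float)
        cost = lambda v: np.dot(v, v) ** 2 + np.dot(v, lam * v) + np.dot(v, h)
        cands = enumerate_extrema(lam, h)
        return min(cands, key=cost) if cands else None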
References
[1] I. Borg, P. Groenen, Modern Multidimensional Scaling, Springer Series in Statistics, Springer, Berlin, 1997.
[2] T.F. Cox, M.A.A. Cox, Multidimensional Scaling, Monographs on Statistics and Applied Probability, vol. 59, Chapman & Hall, London, 1994.
[3] J.B. Kruskal, Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Psychometrika 29 (1) (1964) 1-27.
[4] J.A. Hartigan, Representations of similarity matrices by trees, J. Am. Statist. Assoc. 62 (1967) 1140-1158.
[5] H. Klock, J.M. Buhmann, Multidimensional scaling by deterministic annealing, in: M. Pelillo, E.R. Hancock (Eds.), Proc. EMMCVPR'97, Lecture Notes in Computer Science, vol. 1223, Springer, Berlin, 1997, pp. 245-260.
[6] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[7] Y. Takane, F.W. Young, Nonmetric individual differences multidimensional scaling: an alternating least squares method with optimal scaling features, Psychometrika 42 (1) (1977) 7-67 (ALSCAL).
[8] J.W. Sammon, A nonlinear mapping for data structure analysis, IEEE Trans. Comput. C-18 (5) (1969) 401-409.
[9] R.N. Shepard, The analysis of proximities: multidimensional scaling with an unknown distance function I, Psychometrika 27 (1962) 125-140.
[10] J.B. Kruskal, Nonmetric multidimensional scaling: a numerical method, Psychometrika 29 (2) (1964) 115-129.
[11] J. De Leeuw, Applications of convex analysis to multidimensional scaling, in: J.R. Barra, F. Brodeau, G. Romier, B. van Cutsen (Eds.), Recent Developments in Statistics, North-Holland, Amsterdam, 1977, pp. 133-145.
[12] J. De Leeuw, Convergence of the majorization method for multidimensional scaling, J. Classification 5 (1988) 163-180.
[13] W.J. Heiser, A generalized majorization method for least squares multidimensional scaling of pseudodistances that may be negative, Psychometrika 38 (1991) 7-27.
[14] P.J.F. Groenen, R. Mathar, W.J. Heiser, The majorization approach to multidimensional scaling, J. Classification 12 (12) (1995) 3-19.
[15] P.J.F. Groenen, The majorization approach to multidimensional scaling: some problems and extensions, PhD thesis, Leiden University, 1993.
[16] R.W. Klein, R.C. Dubes, Experiments in projection and clustering by simulated annealing, Pattern Recognition 22 (2) (1989) 213-220.
[17] R. Mathar, A. Zilinskas, A class of test functions for global optimization, J. Global Optim. 5 (1994) 195-199.
[18] J. Mao, A.K. Jain, Artificial neural networks for feature extraction and multivariate data projection, IEEE Trans. Neural Networks 6 (2) (1995) 296-317.
[19] M.E. Tipping, Topographic mappings and feed-forward neural networks, PhD thesis, University of Aston in Birmingham, 1996.
[20] D. Lowe, Novel 'topographic' nonlinear feature extraction using radial basis functions for concentration coding in the 'artificial nose', in: Proc. 3rd IEE Int. Conf. on Artificial Neural Networks, IEE, London, 1993.
[21] A.R. Webb, Multidimensional scaling by iterative majorization using radial basis functions, Pattern Recognition 28 (5) (1995) 753-759.
[22] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing, Science 220 (1983) 671-680.
[23] E.T. Jaynes, Information theory and statistical mechanics, Phys. Rev. 106 (1957) 620-630.
[24] S. Geman, D. Geman, Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Machine Intell. 6 (1984) 721-741.
[25] K. Rose, E. Gurewitz, G. Fox, Statistical mechanics and phase transitions in clustering, Phys. Rev. Lett. 65 (8) (1990) 945-948.
[26] J.M. Buhmann, H. Kühnel, Vector quantization with complexity costs, IEEE Trans. Inform. Theory 39 (4) (1993) 1133-1145.
[27] T. Hofmann, J.M. Buhmann, Pairwise data clustering by deterministic annealing, IEEE Trans. Pattern Anal. Machine Intell. 19 (1) (1997) 1-14.
[28] S. Gold, A. Rangarajan, A graduated assignment algorithm for graph matching, IEEE Trans. Pattern Anal. Machine Intell. 18 (4) (1996) 377-388.
[29] J.M. Buhmann, T. Hofmann, Central and pairwise data clustering by competitive neural networks, in: Advances in Neural Information Processing Systems, vol. 6, Morgan Kaufmann, Los Altos, CA, 1994, pp. 104-111.
[30] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Royal Statist. Soc. Ser. B (Methodological) 39 (1977) 1-38.
[31] R.M. Neal, G.E. Hinton, A new view of the EM algorithm that justifies incremental and other variants, in: M.I. Jordan (Ed.), Learning in Graphical Models, NATO ASI Series D, Kluwer Academic Publishers, Dordrecht, 1998, pp. 355-368.
[32] W. Press, S. Teukolsky, W. Vetterling, B. Flannery, Numerical Recipes in C, 2nd ed., Cambridge University Press, Cambridge, 1992.
[33] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ 07632, 1988.
[34] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, 1996.
[35] H. Niemann, J. Weiss, A fast converging algorithm for nonlinear mapping of high-dimensional data to a plane, IEEE Trans. Comput. C-28 (1979) 142-147.
[36] E. Mjolsness, C. Garrett, Algebraic transformations of objective functions, Neural Networks 3 (1990) 651-669.
[37] A. Rangarajan, E.D. Mjolsness, A Lagrangian relaxation network for graph matching, IEEE Trans. Neural Networks 7 (6) (1996) 1365-1381.
[38] I.M. Elfadel, Convex potentials and their conjugates in analog mean-field optimization, Neural Computation 7 (1995) 1079-1104.
[39] G. Strang, Introduction to Applied Mathematics, Wellesley, Cambridge, MA, 1986.
[40] R.A. Fisher, The use of multiple measurements on taxonomic problems, Ann. Eugenics 7 (1936) 179-188.
[41] T. Kohonen, H. Hynninen, J. Kangas, H. Laaksonen, K. Torkkola, LVQ-PAK: the learning vector quantization program package, Technical Report A30, Helsinki University of Technology, Laboratory of Computer and Information Science, FIN-02150 Espoo, Finland, 1996.
[42] M. Lades, J.C. Vorbrüggen, J.M. Buhmann, J. Lange, Ch. von der Malsburg, R.P. Würtz, W. Konen, Distortion invariant object recognition in the dynamic link architecture, IEEE Trans. Comput. 42 (1993) 300-311.
About the Author - HANSJÖRG KLOCK (Warneboldt) received his Diploma degree in Physics from the University of Göttingen, Germany in 1993 with a Diploma thesis on articulatory speech synthesis. In 1993/1994 he joined the Speech Research group at the III. Institute of Physics in Göttingen. Since November 1994 he has been with the Computer Vision and Pattern Recognition group of the University of Bonn, where he is currently completing his Ph.D. thesis with a focus on modeling and optimization aspects in multidimensional scaling. His research interests also include signal processing, wavelets and video coding.

About the Author - JOACHIM M. BUHMANN received a Ph.D. degree in theoretical physics from the Technical University of Munich in 1988. He has held postdoctoral positions at the University of Southern California and at the Lawrence Livermore National Laboratory. Currently, he is a professor for applied computer science at the University of Bonn, Germany, where he heads the research group on Computer Vision and Pattern Recognition. His current research interests cover statistical learning theory and its applications to image understanding and signal processing. Special research topics include exploratory data analysis, stochastic optimization, and computer vision.
Pattern Recognition 33 (2000) 671-684
Object localization using color, texture and shape
Yu Zhong, Anil K. Jain*
Department of Computer Science, Michigan State University, E. Lansing, MI 48824, USA
Received 15 March 1999
Abstract

We address the problem of localizing objects using color, texture and shape. Given a hand-drawn sketch for querying an object shape, and its color and texture, the proposed algorithm automatically searches the image database for objects which meet the query attributes. The database images do not need to be presegmented or annotated. The proposed algorithm operates in two stages. In the first stage, we use local texture and color features to find a small number of candidate images in the database, and identify regions in the candidate images which share similar texture and color as the query. To speed up the processing, the texture and color features are directly extracted from the Discrete Cosine Transform (DCT) compressed domain. In the second stage, we use a deformable template matching method to match the query shape to the image edges at the locations which possess the desired texture and color attributes. This algorithm is different from other content-based image retrieval algorithms in that: (i) no presegmentation of the database images is needed, and (ii) the color and texture features are directly extracted from the compressed images. Experimental results demonstrate the performance of the algorithm and show that substantial computational savings can be achieved by utilizing multiple image cues. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Deformable template matching; Color; Shape; Texture; Feature extraction; Compressed domain; Image database; Discrete Cosine Transform (DCT)
1. Introduction

We are now living in the age of multimedia, where digital libraries are beginning to play a more and more important role. In contrast to traditional databases, which are mainly accessed by textual queries, digital libraries, including image and video databases, require representation and management using visual or pictorial cues. The current trend in image and video database research reflects this need. A number of content-based image database retrieval systems have been designed and built using pictorial cues including shape, texture, and color. Among them, QBIC (Querying by Image Content) [1] can query large on-line image databases using image content (color, texture, shape, and geometric composi-
* Corresponding author. Tel.: +517-353-6484; fax: +517-432-1061.
E-mail address: [email protected] (A.K. Jain)
tion). It uses both semantic and statistical features to describe the image content. Photobook [2] is a set of interactive tools for browsing and searching image databases. It uses both semantic-preserving content-based features and text annotations for querying. The Virage search engine enables search using texture, color and composition for images and videos [3,4]. A novel image region segmentation method was used in Ref. [5] to facilitate automatic region segmentation for color/texture based image retrieval. Vinod and Murase proposed to locate an object by matching the corresponding DCT coefficients in the transformed domain [6]. Color, texture and shape features have also been applied to index and browse digital video databases [7]. For all these applications, object shape, as an important visual cue for human perception, plays a significant role. Queries typically involve a set of curves (open or closed) which need to be located in the images or video frames of the database. In most of the image retrieval approaches, the challenge is to extract appropriate features such that they are
0031-3203/00/$20.00 © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S0031-3203(99)00079-5
representative of a specific image attribute and, at the same time, are able to discriminate images with different attributes. The color histogram [8] is a commonly used color feature; responses to specially tuned spatial and orientation filters are widely used to characterize a texture. Invariant moments and histograms of edge turning angles are used as shape features [9]. Once features are extracted to characterize the image property of interest, the matching and retrieval problem is reduced to computing the similarity in the feature space and finding database images which are most similar to the query image. However, it is not always clear whether a given set of features is appropriate for a specific application. Feature-based methods can be applied only when the object of interest (and the associated features) has been segmented from the background. Deformable template-based methods [10-14] do not compute any specific shape features. Various deformable template models have been proposed to perform tasks including image registration, object detection and localization, feature tracking, and object matching. These deformable models are popular because (i) they combine both structural knowledge and local image features, and (ii) they are versatile in incorporating intra-object class variations. We have proposed one such method for shape matching [11]. The advantage of this method is that it does not compute specific shape features, and no segmentation of the input image is necessary. However, the generality of the approach and the avoidance of segmentation are achieved at the cost of expensive computation. As a result, the Deformable Template Matching (DTM) method is currently more suited for off-line retrieval tasks than for online retrievals. In order to make the DTM method feasible for online retrievals, we have adopted a hierarchical retrieval scheme which integrates three important image content cues: shape, texture, and color. In the first (screening) stage, the database is browsed using some simple and efficient matching criteria. In particular, texture and color features are used as supplemental clues to help locate promising regions in the image which are likely to contain the desired objects. This eliminates a large portion of the database images from further screening. Once a small set of candidate regions is obtained, we then use the deformable template matching method to localize the objects in the proximity of these regions in the second stage. A diagram of this system is given in Fig. 1. This hierarchical mechanism can improve both efficiency and accuracy. The motivation of this work is threefold: (i) the region cues (texture and color) may come naturally as a constraint in the retrieval task, (ii) the region cues may be used to expedite the localization process: the deformable template matching process need not be executed where the region cues are quite different from the desired ones,
and (iii) region-based matching methods are more robust to misalignment and position shift than edge-based methods. We use the region information to obtain some good yet coarse initializations. The contributions of this work are as follows: (i) we extract color and texture features directly from the compressed image data, (ii) we use the region attributes to direct the shape-based search to save computational costs, and (iii) we sensibly fuse multiple content cues to efficiently retrieve images from a non-annotated image database where the only information available is the bit stream of the images. The remainder of the paper is organized as follows. In Section 2 we describe the screening process using color and texture, where these features are extracted from the DCT domain to browse the database and retrieve a small number of images as well as to identify specific locations for the object of interest in these images. In Section 3 we describe the deformable template approach to the shape matching problem, where the query shape is used as a prototype template which can be deformed. We integrate the color, texture, and shape matching in Section 4 and present the two-stage matching algorithm. Experimental results are presented in Section 5. Section 6 summarizes the paper and proposes future work.
2. Matching using color and texture

Texture and color features have been used in several content-based image database systems to retrieve objects or images of a specific texture and color composition [2,15-17]. We use texture and color cues in addition to shape information to localize objects. For example, one may be interested in finding a fish with a particular shape, color and texture. The texture and color information can be specified in terms of a sample pattern, as in the case 'I want to retrieve all fish images with the same color and texture as the fish in this picture'. When such image region information is available, we use these features to quickly screen the input image and retrieve a small set of candidate positions where we can initialize the deformable template-based shape matching process. As the color and texture cues are used as supplemental tools for examining an image for the presence of a candidate object, we need to use features which are easy to compute and, at the same time, characterize the desired color and texture properties. For this purpose, we extract the features from the block DCT coefficients of an image. These coefficients can be obtained directly from DCT compressed images and videos (JPEG [18], MPEG [19]) without first decompressing them. This is very appealing since more and more images and videos are stored in compressed format for efficient storage and transfer [7,20].
Fig. 1. Image retrieval system using color, texture, and shape.
2.1. DCT compressed images

DCT-based image compression techniques encode a two-dimensional image by the block DCT coefficients. To compress an image, the DCT coefficients of each N×N image block (macroblock) are computed and quantized. These compression techniques take advantage of the fact that most of the high-frequency components of the transformed image are close to zero. The low-order coefficients are quantized to save bits, and then further compressed using either the Huffman coding or the
arithmetic coding method. The JPEG images and the intra frames of MPEG videos are compressed this way, with the value of N set to 8. The DCT coefficients {c_uv} of an N×N (N is usually a power of 2) image region {I_xy, 0 ≤ x < N, 0 ≤ y < N} are computed as follows:

c_uv = (1/N) K_u K_v Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} I_xy cos(πu(2x+1)/2N) cos(πv(2y+1)/2N),    (1)
where u and v denote the horizontal and vertical frequencies (u, v = 0, 1, …, N−1), and K_u, K_v = 1/√2 for u, v = 0 and K_u = K_v = 1 otherwise. The DC component (c_00) of the transformed coefficients represents the average of the spatial domain signals I_xy in the macroblock, and the AC components (c_uv, u ≠ 0 or v ≠ 0) capture the frequency (characterized by u and v) and directionality (by tuning the u and v values) properties of the N×N image block. One property of the discrete cosine transform is that, for a typical image, its energy is dominated by the low-frequency components. This means that the coefficients of the high-frequency components are close to zero, and therefore negligible in most cases. Most of the information is contained in the low-frequency components, which represent a 'coarse' or 'blurred' version of the spatial image. We will now show how we extract texture and color features from DCT coefficients.

2.2. Texture features

An image region is textured if it contains some repetitive gray-level pattern. Texture is usually characterized by the spatial variation, directionality, and coarseness in the image. Textured images provide rich information about the image content. It is desirable to determine whether texture-based methods are suitable for processing the given image [21]. The multichannel filtering approach has been used extensively in texture analysis. This includes the Gabor-filter-based approach of Jain and Farrokhnia [22], the wavelet transform model of Chang and Kuo [23], and the subband approach of Jernigan and D'Astous [24], to name a few. As the discrete cosine transform converts the spatial image information into the spatial frequency domain, we define texture features as the spectrum energies in different channels of a local macroblock. The absolute values of the AC components of the quantized DCT coefficients of each macroblock index the channels of the spectrum. We use them as texture features which are expected to capture the spatial variation and directionality of the image texture. The DC component, which is the average greyscale value of the macroblock, is not considered a texture measure. This is reasonable because we usually subtract the mean to normalize the image before extracting texture features.
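To make this concrete, the following Python sketch computes the block DCT and the texture features described above. It is an illustration, not the authors' implementation; scipy's orthonormal DCT differs from Eq. (1) by a constant scale factor, which does not affect relative comparisons between blocks.

    import numpy as np
    from scipy.fft import dct

    def block_dct(block):
        """Orthonormal 2-D DCT-II of an N x N image block (cf. Eq. (1),
        up to a constant normalisation factor)."""
        return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

    def texture_features(block):
        """Absolute AC coefficients of the block DCT, used as channel
        energies; the DC term c_00 is excluded as in Section 2.2."""
        c = np.abs(block_dct(block.astype(float)))
        c[0, 0] = 0.0            # drop the DC component
        return c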
2.3. Color features

The YCrCb color model is widely used to encode color images in TV and video and in various compression standards, including JPEG and MPEG. This color space is obtained by applying a linear transformation to the RGB color space, where the Y plane represents the luminance information, and the Cr and Cb planes encode the chrominance differences. The advantage of this color model is that the human eye is usually more sensitive to the luminance changes than to the chrominance changes. As a result, the chrominance frames can be encoded at a lower bit rate than the luminance frame for compression purposes, without significantly affecting the quality of the perceived image. In line with the JPEG and MPEG standards, we use the YCrCb model for representing color images. We use the DC components of the DCT coefficients of the three frames Y, Cr and Cb to represent the color of a macroblock. We note that although the intensity (the Y plane) is subject to lighting conditions, the Cr and Cb components are more robust indicators of the color attribute. However, for image retrieval tasks, it is necessary to distinguish between bright red and dark red. So, the intensity also plays a role in color perception. We should note that although we use the DC component of the DCT for representing the color attribute and the AC components for texture, we believe that texture and color properties are mingled together. The variation in color results in color texture, so it is difficult to draw a clear boundary between color and texture.

2.4. Feature selection

There are N² DCT coefficients for an N×N image block; for an 8×8 macroblock, there are thus 64 coefficients. Not all the coefficients contain useful information. As mentioned earlier, for a typical image a large number of the high-frequency components have negligible coefficients. We use the following two different criteria to choose only M features out of the N² total number of features (M << N²):

1. We take the M lowest-frequency components. That is, we pick |c_10|, |c_01|, |c_20|, |c_11|, |c_02|, and so on, until we have selected M features;
2. Find the M features which maximize the energy for the query image as follows:
   (i) obtain the quantized DCT coefficients for all the DCT blocks in the query object region;
   (ii) compute the absolute values of the AC components as features;
   (iii) sum up the energies for each frequency component over all the DCT blocks in the region;
   (iv) select those M features that have the most energy over all the blocks.
The texture features are extracted separately for each of the three color frames (Y, Cr, Cb). It turns out that in most cases the two criteria select the same set of features, except that when the query image presents very fine texture, the second criterion results in a feature set which outperforms the first one. We have used the first feature selection method in our experiments for its simplicity.
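A possible realisation of the first selection criterion, together with the assembly of a per-macroblock feature vector (M AC magnitudes per colour plane plus the three DC colour features), is sketched below. The ordering of coefficients and the function names are assumptions for illustration only; with M = 5 this yields the 18-dimensional feature vector used in the experiments of Section 5.

    import numpy as np

    def lowest_frequency_ac(coeffs, M):
        """Criterion 1 of Section 2.4: the M lowest-frequency AC magnitudes
        (|c_10|, |c_01|, |c_20|, |c_11|, |c_02|, ...)."""
        N = coeffs.shape[0]
        idx = [(u, v) for u in range(N) for v in range(N) if (u, v) != (0, 0)]
        idx.sort(key=lambda uv: (uv[0] + uv[1], -uv[0]))   # order by increasing frequency
        return np.array([abs(coeffs[u, v]) for u, v in idx[:M]])

    def block_feature_vector(y_dct, cr_dct, cb_dct, M=5):
        """Concatenate texture features (M AC magnitudes per colour plane)
        with the three DC components used as colour features (Section 2.3)."""
        texture = np.concatenate([lowest_frequency_ac(c, M)
                                  for c in (y_dct, cr_dct, cb_dct)])
        color = np.array([y_dct[0, 0], cr_dct[0, 0], cb_dct[0, 0]])
        return np.concatenate([texture, color])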
2.5. Representing the query image region

The query image is represented by a set of feature vectors. For every N×N block in the query image, we compute a feature vector according to the feature extraction criterion in Section 2.4. Note that we allow the macroblocks to overlap so that the blocks densely cover the query region, and all the N×N blocks in the query region are covered. The DCT coefficients of a non-aligned block can be computed from the DCT coefficients of its four overlapping, aligned macroblocks using the algorithm proposed by Chang and Messerschmitt [25]. Each feature vector includes both the color and texture features which are extracted as specified in Sections 2.2 and 2.3. So, for a query region of size M×M (M ≥ N), we obtain (M−N+1) × (M−N+1) feature vectors. If there is a large number of feature vectors, we cluster all the feature vectors and only keep the features corresponding to the cluster centers, to maintain a small set of representative features.
2.6. Similarity computation

We have represented the query region attributes using a set of feature vectors (Section 2.5) which characterize color and texture. In the same manner, we can also extract a set of feature vectors to represent a region in the test image, one vector for each macroblock in this region. Then we can match the query region to an arbitrary region in the database image by comparing the two characteristic feature vector sets. We have derived a symmetric distance measure between the query feature set Q and a test region feature set R. First, we define the color and texture distances of the ith feature vector in set R to vector set Q as the minimum distance to each of the vectors in Q:

dist_text(R_i, Q) = min_{j∈Q} (1/N) Σ_{k=0}^{N−1} (f_text^ik − f_text^jk)² / var_text_k,    (2)

dist_color(R_i, Q) = min_{j∈Q} (1/3) Σ_{k=1}^{3} (f_color^ik − f_color^jk)² / var_color_k,    (3)

where R_i denotes the ith feature vector in R, f_text^ik (f_color^ik) denotes the texture (color) feature k for vector i, and var_text_k (var_color_k) denotes the variance of texture (color) feature k in the database. The weighted distance measure is used because the DC component usually has a very large variation, the low-frequency AC features have a smaller variation, and the high-frequency AC components have the least variation. We weight the contribution of each feature by the variance of that feature component computed from all the macroblocks in the database images. (This is equivalent to the Mahalanobis distance with a diagonal covariance matrix.) The distance of the ith vector in R to the query set Q is the summation of the distances in color and texture:

Dist(R_i, Q) = dist_text(R_i, Q) + dist_color(R_i, Q).    (4)

The distance of set R to set Q is defined as the average distance of vectors in R to Q:

Dist(R, Q) = Σ_{i=1}^{N_R} Dist(R_i, Q) / N_R,    (5)

where N_R is the number of feature vectors in R. Note that this distance is asymmetric. We define a symmetric distance measure between R and Q as follows:

DIST(R, Q) = (1/2) (Dist(R, Q) + Dist(Q, R)).    (6)
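The distance computations of Eqs. (2)-(6) can be sketched in a few lines of Python. The array layout (texture features first, the three colour features last) and the variable names are assumptions for illustration, not the authors' code.

    import numpy as np

    def set_to_set_distance(R, Q, var_text, var_color, n_color=3):
        """Symmetric dissimilarity DIST(R, Q) of Eq. (6); R and Q are lists of
        feature vectors, var_text / var_color hold database-wide variances."""
        def directed(A, B):                                   # Dist(A, B), Eq. (5)
            total = 0.0
            for a in A:
                a_text, a_col = a[:-n_color], a[-n_color:]
                d_text = min(np.mean((a_text - b[:-n_color]) ** 2 / var_text)
                             for b in B)                      # Eq. (2)
                d_col = min(np.mean((a_col - b[-n_color:]) ** 2 / var_color)
                            for b in B)                       # Eq. (3)
                total += d_text + d_col                       # Eq. (4)
            return total / len(A)
        return 0.5 * (directed(R, Q) + directed(Q, R))        # Eq. (6)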
3. Deformable template matching

Shape-based matching is a difficult problem in content-based retrieval due to the following factors:

- For a query shape, one generally has no prior information about its presence in database images, including the number of occurrences and its location, scale, and orientation.
- Often, the desired object has not been segmented from the background in the image.
- There is a need to accommodate both rigid and non-rigid deformations in the query shape.
- Most quantitative shape features cannot efficiently represent different query shapes.

We have proposed a deformable template matching model to retrieve objects using hand-drawn sketches [11], where prior knowledge of an object shape is described by a hand-drawn prototype template T_0 which consists of its representative contours. The shape variations in an object class are accommodated using a set of probabilistic deformation transformations on the template. The deformations are parameterized by a set of deformation parameters ξ. A deformed template T(T_0, ξ), which is derived from the prototype T_0 and the values of the deformation parameters ξ, then interacts with the input image I via a directional edge potential field E_edge(I) computed from the salient edge features (edge positions and directions). A Bayesian scheme, which is based on the prior knowledge and the likelihood (edge information) in the input image, is employed to find a match between the deformed template and objects in the image. The fitness of the template to a subimage of the input edge map is measured by an objective function L which consists of two parts: (i) a function P(ξ) of the deformation parameters ξ, which penalizes deviations from the reference query template, and (ii) an error term
E(T(T_0, ξ), E_edge(I)) between the deformed template T(T_0, ξ) and the input edge map (position, direction) E_edge(I), which measures the discrepancy of the deformed template to the input edge map. The matching proceeds by minimizing the objective function with respect to the deformation and pose parameters of the template. Interested readers are referred to Ref. [11] for more details. To determine the presence of a desired object in the neighborhood of a given location, the prototype template is initialized in the proximity of this location. The gradient descent method is used to find the smallest value of the objective function L with respect to the deformation parameters and the other transformation parameters (translation, rotation, and scale). If L is less than a threshold value, then the desired object is assumed to be present, and the final configuration of the deformed template gives the detected object shape and its location; otherwise, it is decided that the desired object is not present. A multiresolution algorithm searches for the desired object in a coarse-to-fine manner. We use the above-mentioned deformable template approach [11] to perform shape matching. Some deformed versions of a hand-drawn sketch are shown in Fig. 2 to illustrate the deformations that are allowed in this approach: Fig. 2(a) is the prototype template on a grid, and Figs. 2(b)-(d) are the deformed templates obtained using the deformation transform in Ref. [11]. In spite of the multiresolution scheme, deformable template matching is computationally expensive. To improve the performance, we use the texture and color features to prune the search space for the template localization process. We apply the deformable template matching process only at those image locations which match the query region in texture and color.
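The overall structure of the matching step can be sketched as follows. This is a hedged stand-in, not the authors' formulation: the quadratic penalty P, the fixed displacement basis, and the derivative-free optimiser replace the paper's probabilistic deformation model and gradient descent, and all names are illustrative.

    import numpy as np
    from scipy.optimize import minimize

    def match_template(prototype, edge_potential, n_deform=4, lam=1.0):
        """Minimise L = P(xi) + E(deformed template, edge map) over pose
        (translation, rotation, log-scale) and deformation coefficients xi.

        prototype      : (n, 2) array of template contour points
        edge_potential : callable mapping an (n, 2) point array to a scalar
                         discrepancy (small near salient image edges)
        """
        n = len(prototype)
        centred = prototype - prototype.mean(axis=0)
        rng = np.random.default_rng(0)
        basis = rng.standard_normal((n_deform, n, 2)) / np.sqrt(n)   # toy modes

        def objective(theta):
            tx, ty, ang, logs = theta[:4]
            xi = theta[4:]
            deformed = centred + np.tensordot(xi, basis, axes=1)
            c, s = np.cos(ang), np.sin(ang)
            R = np.array([[c, -s], [s, c]])
            pts = np.exp(logs) * deformed @ R.T + np.array([tx, ty])
            return lam * np.sum(xi ** 2) + edge_potential(pts)       # P(xi) + E

        res = minimize(objective, np.zeros(4 + n_deform), method='Powell')
        return res.x, res.fun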
4. Integrating texture, color and shape

We have integrated texture, color, and shape cues to improve the performance of the retrieval process. The integrated system operates in two stages. Since region-based matching methods are relatively robust to minor displacements as long as the two matching regions
substantially overlap, we browse the database using color and texture in the first stage, so that only a small set of images, and a small number of locations in the candidate images, are identified. In the second stage, the identified regions with the desired texture and color are used to direct the shape-based search, so that the iterative matching process is only performed in the proximity of those candidate locations. The integrated matching algorithm is described as follows; a code sketch of the two-stage loop is given after the listing.

Region-based screening.
- Compute feature vectors for the query region:
  - extract the quantized DCT coefficients for the macroblocks in the sample region;
  - compute DCT coefficients for the other displaced 8×8 blocks from the DCT coefficients of the 4 overlapping macroblocks;
  - form the color and texture feature vectors for each block, as described in Section 2;
  - if the number of sample blocks exceeds a threshold, cluster the sample feature vectors; keep the cluster centers as the representative sample feature vectors.
- Find similar images in the database:
  - for each database image,
  - for each macroblock in the database image: compute the color and texture feature vectors;
  - place the masked query shape at evenly spaced positions, and over a discretized set of orientations;
  - compute the distance between the query texture and color attributes and the masked input image region as described in Section 2.6. If the distance is less than a threshold, initialize the shape-based matching.

Shape-based matching.
- Initialize the query template at the configurations computed in the previous stage and run for M iterations; if the final objective function value is less than a threshold, report the detection.
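As referenced above, a compact sketch of the two-stage loop is given below. All function and variable names are illustrative placeholders, not part of the original system: region_distance stands for the measure of Section 2.6 and shape_match for the deformable-template step of Section 3.

    def two_stage_retrieval(database, query_features, query_shape,
                            region_threshold, shape_threshold,
                            region_distance, shape_match):
        """Stage 1 screens candidate regions by colour/texture;
        stage 2 runs deformable template matching only at those regions."""
        detections = []
        for image, regions in database:
            for region in regions:                       # stage 1: screening
                if region_distance(region, query_features) >= region_threshold:
                    continue
                # stage 2: initialise the template at the candidate location
                score, shape = shape_match(image, query_shape, init=region)
                if score < shape_threshold:
                    detections.append((image, shape, score))
        return detections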
Fig. 2. Deformations of a hand-drawn template.
5. Experimental results

We have applied the integrated retrieval algorithm to an image database containing 592 color images of people, animals, birds, fishes, flowers, outdoor and indoor scenes, etc. These images are of varying sizes, from 256×384 to 420×562. They have been collected from different sources including the Kodak Photo CD, web sites (Electronic Zoo/Net Vet-Animal Image Collection URL: http://netvet/wusti.edu/pix.htm), and HP Labs. Some sample images from the database are shown in Fig. 3. To gain some insight into the DCT spectra that we have used as texture and color features, Fig. 4 shows the absolute values of the block DCT coefficients of a color image containing houses (Fig. 4(a)). Figs. 4(b)-(d) show the absolute values of the DCT coefficients for the three color components separately. Each small image (block) corresponds to the spectrum of a specific channel, that is,
one feature for all the macroblocks in the image. The x-axis (across the features) indicates horizontal variations, and the y-axis (across the features) indicates vertical variations, with increasing frequencies from left to right and top to bottom. So, the block at the top left corner corresponds to the DC component, which is the averaged and subsampled version of the input image, and the small images in the top row, from left to right, correspond to channels of zero vertical frequency and increasing horizontal frequencies. This figure shows that the top left channels, which represent the low-frequency components, contain most of the energy, while the high-frequency channels, which are located at the bottom right corner of each figure, are mostly blank. It also indicates that the channel spectra capture the directionality and coarseness of the spatial image; for all the vertical edges in the input image, there is a corresponding high-frequency component in the horizontal frequencies, and
Fig. 3. Sample images from the database. They have been 'scaled' for display purposes.
Fig. 4. Features extracted from the block DCT coefficients. (a) 250×384 input color image; (b) DCT features for the Y frame (intensity); (c) DCT features for the Cr frame (chrominance); (d) DCT features for the Cb frame (chrominance).
vice versa. Furthermore, diagonal variations are captured by the channel energies around the diagonal line. This example illustrates that the DCT domain features do characterize the texture and color attributes. We now show the retrieval results using only texture and color, as described by the first stage of the integrated algorithm. Fig. 5 shows one example of color matching, where the image in the subwindow in Fig. 5(a) is the query sample, and Fig. 5(b) gives the top-4 retrieved
images from the database. The three DC components of the color frames are used as the color features. Fig. 6 shows one matching result using the texture features. Five features are selected from each of the Y, Cr, and Cb frames, so that a total of 15 features are used. Fig. 6(a) specifies the query textured region, Fig. 6(b) shows the matching macroblocks in the same image, and Fig. 6(c) shows the retrieved regions with similar texture.
Fig. 5. Retrieval based on color: (a) query color is specified by the rectangular region; (b) top-4 retrieved images from the database which contain blocks of similar color.
Fig. 6. Retrieval based on texture: (a) query texture is specified by the rectangular region; (b) matching macroblocks are marked with crosses in the query image; (c) the other nine (besides (a)) retrieved images from the database which contain regions of similar texture.
One example of object localization using color and shape is illustrated in Fig. 7, where the rectangular region in Fig. 7(a) specifies the sample color. Matching macroblocks in the same image are identified by 'x', as shown in Fig. 7(c). Note that almost all the blocks on the fish from which the query is extracted are marked. Part of another fish with a similar blueish color is also marked. No blocks in the background pass the color matching test. Shape matching using the hand-drawn sketch in Fig. 7(b) is then performed around the two detected regions. The final matched results are shown in Fig. 7(d). The final configuration of the deformed templates agrees for the most part with the fish boundaries. The deviations of the template from the fish boundary are due to the edges extracted in the textured background. Note that although there is another striped fish in the image, it is not localized due to its different color. We show another example of the integrated retrieval in Fig. 8. One region is extracted from a cardinal to specify the query color and texture, as shown in Fig. 8(a).
Fig. 7. Retrieval based on color and shape: (a) query color is specified by the rectangular region; (b) sketch for the shape; (c) matching macroblocks are marked with crosses in the query image; (d) two retrieved shapes.
Fig. 8. Retrieval based on color, texture, and shape: (a) query region is specified by the rectangular region; (b) sketch for the shape; (c) retrieved shape.
Table 1
Performance of the two-stage algorithm; the database contains 592 color images. Computation time denotes CPU time per 256×384 image on an SGI Indigo 2

            Images retrieved    Computation time
Stage 1     11%                 0.1 s
Stage 2     1.2%                1.76 s
A sketch of a side view of a bird is used as the shape template (Fig. 8(b)). One cardinal image is retrieved from the database using the combined shape and region information (Fig. 8(c)). The performance of the system is summarized in Table 1. Using texture and color, we can eliminate a large portion (89%) of the database images. A total of 18 color and texture features are used. Given a query image, it typically takes about 180 s to perform a retrieval on our database containing 592 images on an SGI Indigo 2 workstation. Query images are successfully retrieved.
6. Conclusion

We have proposed an algorithm for object localization using shape, color, and texture. Shape-based deformable template matching methods have potential in object retrieval because of their versatility and generalizability in handling different classes of objects and different instances of objects belonging to the same shape class. However, one disadvantage in adopting them in content-based image retrieval systems is their computational cost. We have proposed efficient methods to compute texture and color features to direct the initialization of the shape-based deformable template matching method. These texture and color features can be directly extracted from compressed images. This filtering stage allows the deformable template matching to be applied to a very small subset of the database images, and only to a few specific positions in the candidate images. Preliminary experimental results show computational gains using these supplemental features. The proposed method assumes no preprocessing of the image database. The input is the raw image data. We believe that our system can be used as an auxiliary tool to annotate, organize, and index the database using color, texture, and shape attributes off-line, where features (shape, color and texture) of retrieved objects are computed and stored to index the database. We are currently investigating whether shape matching can also be performed in the compressed domain, which may be feasible now that edge detectors are available for compressed data. We are also trying to extract more reliable texture features, which can capture texture structures that go beyond the size of a DCT macroblock. The performance of the retrieval system can be further improved by adopting learning and efficient search techniques in the first stage [26] to reduce the search complexity from linear to logarithmic.

Acknowledgements

The authors would like to thank Dr. Hongjiang Zhang of HP Labs for providing some of the test images.
References
[1] W. Niblack, R. Barber, W. Equitz, The QBIC project: querying images by content using color, texture, and shape, Proceedings of the SPIE Conference on Storage and Retrieval for Image and Video Databases, vol. 1908, 1993, pp. 173-187.
[2] A. Pentland, R.W. Picard, S. Sclaroff, Photobook: tools for content-based manipulation of image databases, Proceedings of the SPIE Conference on Storage and Retrieval for Image and Video Databases II, 2185-05, February 1994.
[3] J.R. Bach, C. Fuller, A. Gupta, The Virage image search engine: an open framework for image management, Proceedings of the SPIE, vol. 2670, Feb. 1996, pp. 76-87.
[4] A. Hampapur, A. Gupta, B. Horowitz, C.F. Shu, C. Fuller, J. Bach, M. Gorkani, R. Jain, Virage video engine, Proceedings of the SPIE: Storage and Retrieval for Image and Video Databases V, San Jose, CA, 1997, pp. 188-197.
[5] W.Y. Ma, B.S. Manjunath, Netra: a toolbox for navigating large image databases, in: Proceedings of the International Conference on Image Processing (ICIP), vol. 1, Santa Barbara, CA, Oct. 1997, pp. 568-571.
[6] V.V. Vinod, H. Murase, Object location using complementary color features: histogram and DCT, Proceedings of the 13th International Conference on Pattern Recognition (ICPR), Vienna, Austria, 1996, pp. 554-559.
[7] H.J. Zhang, C.Y. Low, S.W. Smoliar, Video parsing and browsing using compressed data, Multimedia Tools and Applications 1 (1) (1995) 89-111.
[8] M.J. Swain, D.H. Ballard, Color indexing, Int. J. Comput. Vision 7 (1) (1991) 11-32.
[9] A. Vailaya, Y. Zhong, A.K. Jain, A hierarchical system for efficient image retrieval, Proceedings of the 13th International Conference on Pattern Recognition (ICPR), Vienna, Austria, 1996, pp. 356-360.
[10] U. Grenander, M.I. Miller, Representation of knowledge in complex systems, J. Roy. Statist. Soc. (B) 56 (3) (1994) 1-33.
[11] A.K. Jain, Y. Zhong, S. Lakshmanan, Object matching using deformable templates, IEEE Trans. Pattern Anal. Mach. Intell. 18 (1996) 267-278.
[12] M. Kass, A. Witkin, D. Terzopoulos, Snakes: active contour models, Int. J. Comput. Vision 1 (4) (1988) 321-331.
[13] B.C. Vemuri, A. Radisavljevic, From global to local, a continuum of shape models with fractal priors, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New York City, NY, June 1993, pp. 307-313.
[14] A.L. Yuille, P.W. Hallinan, D.S. Cohen, Feature extraction from faces using deformable templates, Int. J. Comput. Vision 8 (2) (1992) 133-144.
[15] M. Das, E. Riseman, FOCUS: searching for multi-colored objects in a diverse image database, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition '97 (CVPR), 1997, pp. 756-761.
[16] M.M. Gorkani, R.W. Picard, Texture orientation for sorting photos 'at a glance', Proceedings of the 12th International Conference on Pattern Recognition, Jerusalem, Israel, Oct. 1994, pp. A459-A464.
[17] J. Huang, R. Kumar, M. Mitra, Image indexing using color correlograms, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition '97 (CVPR), 1997, pp. 762-768.
[18] G.K. Wallace, The JPEG still picture compression standard, Commun. ACM 34 (4) (1991) 31-44.
[19] D.L. Gall, MPEG: a video compression standard for multimedia applications, Commun. ACM 34 (4) (1991) 47-58.
[20] B. Shen, I.K. Sethi, Direct feature extraction from compressed images, Proceedings of the SPIE Conference on Storage and Retrieval for Image and Video Databases IV, vol. 2670, 1995.
[21] K. Karu, A.K. Jain, R.M. Bolle, Is there any texture in the image?, Pattern Recognition 29 (9) (1996) 1437-1446.
[22] A.K. Jain, F. Farrokhnia, Unsupervised texture segmentation using Gabor filters, Pattern Recognition 24 (12) (1991) 1167-1186.
[23] T. Chang, C. Kuo, Texture analysis and classification with tree-structured wavelet transform, IEEE Trans. Image Process. 2 (1994) 429-441.
[24] M.E. Jernigan, F. D'Astous, Entropy-based texture analysis in the spatial frequency domain, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 237-243.
[25] S.F. Chang, D. Messerschmitt, A new approach to decoding and compositing motion compensated DCT-based images, Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Minneapolis, 1993, pp. 421-424.
[26] J. Weng, On comprehensive visual learning, Proceedings of the NSF/ARPA Workshop on Performance vs. Methodology in Computer Vision, Seattle, WA, June 1994, pp. 152-166.
About the Author - YU ZHONG received the B.S. and M.S. degrees in Computer Science and Engineering from Zhejiang University, Hangzhou, China in 1988 and 1991, the M.S. degree in Statistics from Simon Fraser University, Burnaby, Canada, in 1993, and the Ph.D. degree in Computer Science from Michigan State University, East Lansing, Michigan, in 1997. She is currently a postdoctoral fellow at Carnegie Mellon University. Her research interests include image/video processing, pattern recognition, and computer vision.

About the Author - ANIL JAIN is a University Distinguished Professor and Chair of the Department of Computer Science at Michigan State University. His research interests include statistical pattern recognition, Markov random fields, texture analysis, neural networks, document image analysis, fingerprint matching and 3D object recognition. He received the best paper awards in 1987 and 1991 and certificates for outstanding contributions in 1976, 1979, 1992, and 1997 from the Pattern Recognition Society. He also received the 1996 IEEE Trans. Neural Networks Outstanding Paper Award. He was the Editor-in-Chief of the IEEE Trans. on Pattern Analysis and Machine Intelligence (1990-1994). He is the co-author of Algorithms for Clustering Data, Prentice-Hall, 1988, has edited the book Real-Time Object Measurement and Classification, Springer-Verlag, 1988, and co-edited the books Analysis and Interpretation of Range Images, Springer-Verlag, 1989, Markov Random Fields, Academic Press, 1992, Artificial Neural Networks and Pattern Recognition, Elsevier, 1993, 3D Object Recognition, Elsevier, 1993, and BIOMETRICS: Personal Identification in Networked Society, to be published by Kluwer in 1998. He is a Fellow of the IEEE and IAPR, and has received a Fulbright research award.
Pattern Recognition 33 (2000) 685-704
Genetic algorithms for ambiguous labelling problems
Richard Myers, Edwin R. Hancock*
Department of Computer Science, University of York, York YO1 5DD, UK
Received 15 March 1999
Abstract

Consistent labelling problems frequently have more than one solution. Most work in the field has aimed at disambiguating early in the interpretation process, using only local evidence. This paper starts with a review of the literature on labelling problems and ambiguity. Based on this review, we propose a strategy for simultaneously extracting multiple related solutions to the consistent labelling problem. In a preliminary experimental study, we show that an appropriately modified genetic algorithm is a robust tool for finding multiple solutions to the consistent labelling problem. These solutions are related by common labellings of the most strongly constrained junctions. We have proposed three run-time measures of algorithm performance: the maximum fitness of the genetic algorithm's population, its Shannon entropy, and the total Hamming distance between its distinct members. The results to date indicate that when the Shannon entropy falls below a certain threshold, new solutions are unlikely to emerge, and that most of the diversity in the population disappears within the first few generations. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Consistent labelling; Genetic algorithms; Ambiguity; Line labelling; Graph matching
1. Introduction
According to Marr's principle of least commitment, a hallmark of intelligence is the ability to simultaneously entertain several hypotheses until there is sufficient evidence to drop all but one [1]. This paper concerns ambiguous consistent labelling problems, and suggests a framework for maintaining populations of related solutions based on the genetic algorithm.
1.1. Consistent labelling
The consistent labelling problem was formulated by Haralick and Shapiro in the 1970s. A set of units must be assigned labels subject to constraints [2,3]; examples include graph colouring, subgraph isomorphism, inexact matching, the Boolean satisfiability problem and scene labelling. The problem is known to be NP-complete and
* Corresponding author. Tel.: +1904-433-374; fax: +1904-432-767.
E-mail address: [email protected] (E.R. Hancock)
is often solved using deterministic search [2,4]. Operators such as forward checking and back marking [4], and Waltz filtering (discrete relaxation) [5], which prune incompatible unit-label assignments from the search space, improve the efficiency of search. However, search is of little use when no totally consistent solution exists, such as is the case with inexact matching or analysis of 'impossible' scenes; and neither search nor discrete relaxation use global contextual evidence, relying instead on pre-defined local constraint dictionaries. Most recent work involving consistent labelling has adopted Hummel and Zucker's paradigm for the case where the compatibility coefficients are symmetric: the problem is to find a set of unit-label assignments which maximises some global consistency measure [6]; this is usually done by gradient ascent [6-10]. Gradient ascent techniques are appropriate when there are no local optima between the initial guess and the solution; this is not usually the case, i.e. gradient ascent requires a good initialisation. It may therefore be preferable to use techniques known to possess global convergence properties such as simulated annealing [11,12], mean field annealing [13,14] or genetic search [15], which is the method
0031-3203/00/$20.00 © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S0031-3203(99)00080-1
used here. A weakness of global optimisers is that they do not generally take into account the initial labelling assignment. However, it should be said that it is possible to initialise a genetic algorithm in a non-random manner. In this paper, we always use random initialisation.
1.2. Ambiguity
Many consistent labelling problems have more than one possible solution. This was recognised in Waltz's original paper [5], but no strategy for handling ambiguity was developed. In the machine vision literature, ambiguity has been seen as a 'bad thing', to be resolved locally as quickly as possible, rather than as a necessary part of scene interpretation. Waltz used search to extract an arbitrary solution [5]; Hummel and Zucker used a simple definition of 'unambiguous labelling' as a sine qua non for consistency [6]; and Faugeras and Berthod developed a measure of ambiguity which was minimised in their relaxation scheme [7].
Much work concerning ambiguity has been done by linguists and psychologists, since language understanding is fraught with ambiguity [16]. MacDonald and coworkers suggest Hummel and Zucker's relaxation framework [6] as a computational model for the disambiguation of sentences based on lexical constraints [16]. Observed frequency and context appear to be the major factors in determining the final interpretation of a word [7,18]; Kawamoto has used a connectionist model to demonstrate this dependency [17]. Ambiguities also occur in visual perception. Connectionist systems have been used to model visual perceptual alternation of ambiguous visual stimuli, in which the interpretation of drawings such as the Necker cube and Schröder staircase (see Fig. 1) periodically switches between several alternatives [19-22]. Bialek and Deweese show that the alternation rate depends on a priori hypotheses [17-22]. Kawabata has observed that the visual fixation point determines the perception of depth and alternation rates in such figures [23]. He suggests that a local interpretation at the fixation point propagates to generate a stable global interpretation. These observations chime with the selective attention hypothesis [24,25], in which a priori expectations combined with focussed attention lead to stable unambiguous interpretations of ambiguous figures.
Calliari and Ferrie [26] have recently developed a model-based vision system which can cope with ambiguity. The system makes a set of initial guesses which are refined by subsequent data gathering. This approach has produced promising results, and would seem to complement an active vision strategy. Ambiguity is a major issue in perceptual organisation: Williams and Hanson use an integer linear programming formalism to represent a space of legal labellings from which an optimal one is selected [27]. Kumaran et al. [28] use simulated annealing to find the best of a set of possible organisations of the scene.
Fig. 1. Two ambiguous drawings: (a) Necker cube, (b) Schröder staircase.
Early disambiguation may be appropriate if there is compelling local evidence for a particular interpretation, but if not, backtracking is generally inefficient [4]. Although the use of global contextual information in scene interpretation is a major unsolved problem in machine vision, premature commitment to a particular interpretation does not help; rather, it makes the problem worse. Following the principle of least commitment, the initial stage of scene interpretation should yield several plausible, and perhaps related, solutions from which the system can choose without having to backtrack.
1.3. Paper overview
The aim of the work reported here is to investigate the effectiveness of genetic algorithms as a means of locating multiple solutions to ambiguous labelling problems. Our aim is to explore whether population-based optimisation methods can provide greater solution yields than multiple random starts. We focus on two different labelling problems. The first of these is furnished by Huffman-Clowes line labelling. As we have already pointed out, this is a well-known and extensively studied ambiguous labelling problem. Conventionally, ambiguities are exhaustively generated using the Waltz filtering algorithm. In other words, the line-labelling problem furnishes a convenient example in which the fractional solution yield of the genetic algorithms can be evaluated. However, one of the limitations of the line-labelling problem is the relatively weak nature of the constraints residing in the junction dictionaries for the four line-labels. Our second problem is that of graph matching. Here the dictionaries of consistent subgraph matches provide much stronger constraints. Unfortunately, the exhaustive enumeration of ambiguity is not feasible. We therefore use this second example only to provide additional information concerning the population statistics.
In general, very little modification should be necessary to make a genetic algorithm work for line labelling. An evolutionary process usually has two levels: the genotypic level is the level of encodings - chromosomes or
bitstrings; the phenotypic level is the level of observed characteristics. In standard formulations, the precise nature of the problem is invisible to the algorithm: crossover and mutation operate at a genotypic level; selection at the phenotypic. The only stage at which problem knowledge is required is the translation of genotype to phenotype, which is abstracted via a fitness function. It has become clear that the performance of genetic algorithms can be enhanced by adding a local search step at each iteration [29-31]. Gradient ascent is a standard technique for optimising the consistency of labellings in a relaxation framework [6-9]. Its advantages are speed and simplicity, but it suffers from a major disadvantage in that it cannot escape from local optima. Almost all of the major optimisation techniques which have been developed over the years are attempts to circumvent this problem. Nevertheless, gradient ascent is the method of choice when a local optimum suffices, or when contextual information can provide an initial guess which is close to the global solution.
A combination of the genetic algorithm and gradient ascent should be particularly well-suited to line labelling since the constraints are local. The cross-over used must not be too disruptive, because individuals in local optima will tend to have more or less contiguous regions of consistency: a non-disruptive cross-over will cause these regions to coalesce over time. The gradient ascent step will maximise the size of these regions prior to cross-over. Although highly disruptive cross-overs such as uniform cross-over [32] or Eshelman's HUX, in which exactly half of the differing bits are exchanged between the parents [33], explore the search space better, such a crossover may not be appropriate in a hybrid algorithm because it would undo much of the work of the gradient ascent step. Much exploration of the search space is undertaken in the gradient ascent step: the members of the population will be forced into local optima, so the cross-over need not have great exploratory power; indeed, cross-over should be conservative to avoid disturbing the consistent regions. In this framework, the genetic algorithm is seen as an adjunct to gradient ascent rather than the other way around. For this reason, multi-point cross-over should be preferred when gradient ascent is used. However, in non-hybrid genetic algorithms, the need to adequately explore the search space may dictate that a uniform cross-over be chosen [32].
Although the eventual convergence of genetic algorithms using elitist selection is guaranteed [34], it may take arbitrarily long. Some way of ascertaining the current status of the algorithm is needed. The essence of the genetic algorithm is that the cross-over and mutation operators generate diverse solutions which are tested by the selection operator. The notion of 'diversity' in a population really incorporates two distinct attributes: the degree of clustering and the extent to which the individuals span the search space.
Our experimental study focusses on two main issues. The first of these is to consider which choice of genetic algorithm gives the best solution yield. There are many algorithm variants reported in the literature. Broadly speaking, these can be viewed as deriving from different choices of cross-over and selection operators. Moreover, the different algorithms are governed by the choice of population size and mutation rate. We provide a comparative study which points to the best choice of algorithm and parameter settings for optimal solution yield. The second aspect of our study focusses on the run-time population characteristics. Here our aim is to investigate different population statistics which can be used to monitor solution yield. We consider three alternatives, namely maximum fitness, population entropy and inter-pattern Hamming distance.
The outline of this paper is as follows. Section 2 casts line labelling into an optimisation framework. In Section 3 we explain how the implied optimisation problem can be mapped onto a population-based genetic algorithm. Details of our population-based measures are motivated and presented in Section 4. Section 5 describes experiments for the line-labelling problem. These are augmented in Section 6 with some additional experimentation using graph matching as an example. Finally, Section 7 summarises our conclusions and outlines our future plans.
2. Line labelling by optimisation
Line drawing interpretation has been an active area of investigation in machine vision for over 25 years: it was the work of Huffman and Clowes on the consistent labelling of line drawings of polyhedral scenes that led Waltz to his seminal discrete relaxation algorithm [5,35,36]. Waltz's contribution was to show how a dictionary of consistent junction labellings could be used in an efficient search for consistent interpretations of polyhedral objects. Such dictionaries are derived from the geometric constraints on the projection of 3D scenes onto 2D planes [5,37]. The interpretation of line drawings remains an important topic in machine vision, and has obvious applications in document analysis, processing architects' sketches, engineering drawings and so on. Following the work of Huffman, Clowes and Waltz, Sugihara developed a grammar for skeletal polyhedra [37]. Malik has extended the theory to include curved surfaces [38], and Williams has used labelled line drawings to reconstruct smooth objects [39]. Kirousis has developed several efficient algorithms for determining 'labellability' and labelling [40]. Most recently, Parodi and Piccioli have developed a method for reconstructing 3D scenes from labelled line drawings given known vanishing points [41].
Hancock and Kittler have built on the work of Faugeras and Berthod [7] and
Fig. 2. Legal labellings for a FORK junction.
Hummel and Zucker [6] by developing a Bayesian framework for measuring consistency [8]. This framework can be applied at various levels in image analysis from pixel labelling operations through edge and line labelling to relational matching. Its novelty lies in using an explicit dictionary representation of constraints, as adopted by Waltz, in conjunction with a Bayesian model of the constraint corruption process. The constraint corruption model is based on the premise that the representation of an initially consistent scene is subject to the action of a memoryless label-error process, i.e. a label-corruption process in which successive events are statistically independent [8]. With this model they formulated a probabilistic measurement of the consistency of a labelling: scene interpretation was done by searching for the label configuration which optimised the probability criterion; this was originally done in [8] by gradient ascent. In a recent preliminary study, Hancock has applied this framework to labelling polyhedral scenes [42].
Suppose that a polyhedral scene under consideration consists of lines drawn from a set $U = \{u_1, \ldots, u_n\}$. Each junction in the scene can be characterised by the set of indices $J_k$ of the lines from which it is constructed. We can form a set $J = \{J_1, \ldots, J_K\}$ whose elements are the tuples of line indices making up each junction. Each of the ELL, TEE, FORK or ARROW junction types has a distinct dictionary which is a compilation of the permitted label configurations. Suppose that $\Lambda_k$ denotes the dictionary for the $k$th junction. If the label set applying to the scene interpretation task is $\Lambda = \{+, -, \rightarrow, \leftarrow\}$, then the cardinality of the junction dictionary $|\Lambda_k|$ is usually much smaller than the number of possible configurations $|\Lambda|^{|J_k|}$. For example, there are only five consistent labellings for a FORK junction (Fig. 2), whereas $4^3 = 64$ combinatorial possibilities exist.
A candidate solution to this labelling problem is a list of labels, $L = \langle \lambda_1, \ldots, \lambda_n \rangle$, where $\lambda_i \in \Lambda$. According to Hancock and Kittler's relaxation framework [8], the global probabilistic criterion is given by summing the probabilities $\Gamma(L_k)$ associated with the labellings $L_k \subseteq L$ of each junction,

$$P(L) = \frac{1}{|J|} \sum_{k=1}^{|J|} \Gamma(L_k). \qquad (1)$$

The probabilities of the individual junction labellings are computed using a model of the label corruption mechanism. This label-error process assumes that the label on each line is subject to the action of memoryless corruption which occurs with probability $p$. The consequence of this model is that the consistency of the junction labellings is gauged by an exponential function of their Hamming distances to the various dictionary items. Suppose that $H_{k,\lambda}$ denotes the Hamming distance between the current labelling $L_k$ of the junction $J_k \in J$ and the dictionary item $\lambda \in \Lambda_k$. The Bayesian model leads to the following expression for the junction probability, $\Gamma$:

$$\Gamma(L_k) = \frac{(1-p)^{|J_k|}}{|\Lambda_k|} \sum_{\lambda \in \Lambda_k} \left[ \frac{p}{1-p} \right]^{H_{k,\lambda}}. \qquad (2)$$

The parameter of this criterion is the probability of memoryless label errors, $p$. We can re-write the above expression to make the exponential role of Hamming distance explicit:

$$\Gamma(L_k) = \frac{(1-p)^{|J_k|}}{|\Lambda_k|} \sum_{\lambda \in \Lambda_k} \exp\left[ -H_{k,\lambda} \ln\frac{1-p}{p} \right]. \qquad (3)$$

As the error probability $p$ decreases towards zero, labellings lying outside the dictionary make smaller contributions. In the limit of zero label error probability, the global criterion counts the number of consistent junctions. Of greater interest are the observations that for small values of $p$ (i.e. large values of $\ln((1-p)/p)$), the exponential becomes dominated by the term involving the smallest Hamming distance; and that maximising $\sum \exp[-H_{\min}]$ is equivalent to minimising $\sum H_{\min}$ [8]. Thus we can maximise the consistency of a labelling by minimising its cost

$$C(L) = \sum_{k=1}^{|J|} \min_{\lambda \in \Lambda_k} H_{k,\lambda}. \qquad (4)$$
3. Line labelling with a genetic algorithm
Optimisation algorithms based on Darwinian evolution have been proposed by several authors [15,43-45], but it is Holland's formulation [15] which is regarded as the standard. Genetic algorithms simulate evolution to solve problems: candidate solutions model organisms which exist in an environment modelled by the problem itself. Good solutions to a problem 'evolve' over time. The variety of organisms in the world suggests that the problem of survival has many good solutions. It is tempting, therefore, to suppose that a genetic algorithm would
produce several alternative optimal solutions. However, this behaviour has not generally been observed: one solution becomes dominant since selection biases the population in favour of fit individuals. This genetic drift can be observed even when survival of individuals is equiprobable. A genetic algorithm could also be suitable for 'impossible' objects, where the drawings are not consistently labellable but we nevertheless wish to find one or more of the 'next best' labellings.
The algorithm takes a set of bit-strings, the chromosomes or individuals, and iteratively applies cross-over (mixing) and mutation (random change) operators to them. At every iteration, the fitness of all individuals is evaluated according to some problem-specific measure. Individuals are then selected for the next generation based on their scores. Most implementations terminate when either a specified number of iterations has been performed or a maximally fit individual has emerged. The algorithm has several control parameters. These are the cross-over rate, which is the probability of information exchange between individuals; the mutation rate, which in this study is the probability of a single bit-change in an individual; and the population size. The type of cross-over used may also be considered to be a parameter. Where the maximum number of iterations is fixed, this too is a parameter.
Recall from the previous section that a candidate solution to the labelling problem is a list of labels, $L = \langle \lambda_1, \ldots, \lambda_n \rangle$, where $\lambda_i \in \Lambda$. If this list is given a binary encoding, $E(L) : L \mapsto I$, where $I \in \{0,1\}^{n\lceil \log_2 |\Lambda| \rceil}$, then the problem can be solved using a genetic algorithm, provided some suitable fitness measure $F(I) : I \mapsto [0,1]$ can be derived.
3.1. Fitness measure
We can derive a fitness measure directly from the labelling cost in Eq. (4). To turn $C(L)$ into a fitness measure for use in a genetic algorithm (i.e. one with range [0, 1]), we exponentiate:

$$F_L(I) = \exp[-\beta\, C(E^{-1}(I))]. \qquad (5)$$
This measure falls off rapidly with increasing cost. The steepness of the fall-off can be adjusted by changing the scaling parameter, $\beta$ (in the work reported here, $\beta = 1$). The function never tolerates more than a few label errors regardless of the number of junctions; for example, $F_L$ has a value of 1 when there are no errors, 0.37 for errors involving one junction, 0.14 for errors involving two junctions, 0.05 for errors involving three junctions, and 0.00 for errors involving six or more junctions. A small illustrative sketch of the cost and fitness computations is given below.
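The following is a minimal sketch of the cost of Eq. (4) and the fitness of Eq. (5), not the authors' implementation. The data structures (a junction as a tuple of line indices, a dictionary as a list of label tuples) and the function names are illustrative assumptions, and the dictionary in the example is a placeholder rather than a faithful transcription of Fig. 2.

```python
from math import exp

def junction_cost(labelling, junctions, dictionaries):
    """Cost of Eq. (4): sum over junctions of the minimum Hamming
    distance between the junction's current labels and any dictionary item."""
    total = 0
    for lines, dictionary in zip(junctions, dictionaries):
        current = tuple(labelling[i] for i in lines)          # labels on this junction's lines
        total += min(sum(a != b for a, b in zip(current, item))
                     for item in dictionary)                   # min Hamming distance to the dictionary
    return total

def fitness(labelling, junctions, dictionaries, beta=1.0):
    """Fitness of Eq. (5): exponential of the negative labelling cost."""
    return exp(-beta * junction_cost(labelling, junctions, dictionaries))

# Example: a single FORK junction on lines 0, 1, 2.
fork_dictionary = [('+', '+', '+'), ('-', '-', '-')]  # placeholder; in practice all five legal FORK labellings
print(fitness(['+', '+', '+'], [(0, 1, 2)], [fork_dictionary]))   # 1.0 (consistent)
print(fitness(['+', '+', '-'], [(0, 1, 2)], [fork_dictionary]))   # exp(-1) ~ 0.37 (one label error)
```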
3.2. Cross-over
Cross-over operators generate two offspring from two parent chromosomes. There are two main classes: uniform cross-overs exchange information in a bitwise manner; multi-point cross-overs exchange whole sequences of bits at a time. The cross-over strategy is derived from consideration of the algorithmic variant used, and the relationship between regions in the individual chromosomes and lines in the drawing to be labelled. In a standard genetic algorithm, disruptive cross-overs (i.e. uniform) have been shown to explore the search space better [32,33]. However, in a hybrid genetic algorithm with gradient ascent, much exploration will be accomplished by the gradient ascent step, which will tend to create 'islands of consistency'. In this case, a more conservative cross-over (i.e. multi-point), which will preserve and coalesce these islands, should be used.
The use of multi-point cross-over raises the more subtle question of how the structure of the chromosome relates to the structure of the drawing. The cross-over will recombine chunks of chromosome: neighbouring bits will segregate together, a phenomenon known as linkage in genetics. It is therefore important that those loci which are close in the chromosome should correspond to lines which occupy the same region of the drawing, i.e. lines which are relatively closely connected. This is not a problem with synthetic data, since humans have a natural tendency to segment line drawings and number junctions and arcs accordingly: thus data can be primed subconsciously to yield solutions. However, the same is not true of real-world data, such as edge-detector output. Our method uses a heuristic to number the arcs. In general, TEE junctions represent occlusions of part of the scene by an overlying plane [35]. A crude segmentation can be achieved by numbering the arcs depth-first, backtracking at TEE junctions. For our drawings, this makes strongly linked loci in the chromosome map to broadly similar locales in the drawing. However, the inverse relation does not necessarily hold. A sketch of the two cross-over variants discussed here follows.
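As an illustration of the distinction drawn above, here is a minimal sketch of a conservative two-point cross-over and of Eshelman's HUX, which exchanges exactly half of the differing bits [33]. The function names are assumptions chosen for illustration.

```python
import random

def two_point_crossover(parent_a, parent_b):
    """Conservative multi-point cross-over: swap the segment between two cut points."""
    i, j = sorted(random.sample(range(1, len(parent_a)), 2))
    return (parent_a[:i] + parent_b[i:j] + parent_a[j:],
            parent_b[:i] + parent_a[i:j] + parent_b[j:])

def hux_crossover(parent_a, parent_b):
    """Highly disruptive HUX cross-over: exchange exactly half of the differing bit positions."""
    differing = [k for k, (a, b) in enumerate(zip(parent_a, parent_b)) if a != b]
    swap = set(random.sample(differing, len(differing) // 2))
    child_a = [b if k in swap else a for k, (a, b) in enumerate(zip(parent_a, parent_b))]
    child_b = [a if k in swap else b for k, (a, b) in enumerate(zip(parent_a, parent_b))]
    return child_a, child_b

# Example with two 8-bit chromosomes.
print(two_point_crossover(list('00000000'), list('11111111')))
print(hux_crossover(list('00000000'), list('11111111')))
```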
4. Monitoring the progress of genetic search
Although the eventual convergence of genetic algorithms using elitist selection is guaranteed [34], it may take arbitrarily long. Some way of ascertaining the current status of the algorithm is needed. The simplest statistics are the maximum and mean fitnesses of individuals. The maximum fitness clearly shows how close the population is to the solution: the mean fitness rapidly approaches the maximum fitness as a result of selection pressure; when a new optimum is found, the mean fitness tends to lag behind the maximum fitness and is not therefore an especially useful statistic.
Probably because of the lack of a coherent, robust theory for genetic algorithms, there has been relatively little effort put into devising measures of the algorithm's progress at run-time. Many researchers use average fitness to measure the performance (e.g. Ref. [46]).
Table 1
Properties of entropy S and average Hamming distance H̄ when a string x is replaced by a string y (N_t(x) denotes the number of copies of x at time t)

  Case                                    S                                        H̄
  a  N_t(x) = 1, N_{t+1}(y) = 1           unchanged                                unknown
  b  N_t(x) > 1, N_{t+1}(y) = 1           increased                                unknown
  c  N_t(x) = 1, N_{t+1}(y) > 1           decreased                                unknown
  d  N_t(x) > 1, N_{t+1}(y) > 1           increased if N_t(x) > N_t(y) + 1,        unknown
                                          unchanged if N_t(x) = N_t(y) + 1,
                                          decreased if N_t(x) < N_t(y) + 1

This is somewhat naïve since the average fitness will either rapidly approach the maximum fitness as the population converges on an optimum, or provide no specific information if the population is distributed over several local optima. When the positions of the optima are known, the numbers of individuals occupying them or close to them can measure the convergence. However, the positions of optima are usually unknown (or there would not be a problem to solve), and the definition of 'close' may entail ungeneralisable assumptions (e.g. Ref. [47]). Louis and Rawlins use the average Hamming distance between members of the population as a measure of diversity [48]. They successfully use this to give an upper bound on the convergence time of the algorithm, but the measure gives no indication of whether the algorithm is actively exploring the search space or stagnating. Furthermore, as they observed, (traditional) crossover, a key operator in the genetic algorithm, does not affect the average Hamming distance.
The essence of the genetic algorithm is that the crossover and mutation operators generate diverse solutions which are tested by the selection operator. The notion of 'diversity' in a population really incorporates two distinct attributes: the degree of clustering and the extent to which the individuals span the search space.
4.1. Clustering
From an information-theoretic point of view, the genetic algorithm's search space is the alphabet from which a population of symbols is drawn. We wish to obtain information about this space by considering the population. The Shannon entropy is a natural measure of how much information about the space is contained in the population [49], and corresponds to the degree of clustering (a 'cluster' is a bag of identical strings in this case). The Shannon entropy is defined as follows for a bag (population) $\Psi$ of strings, which is a subset of a search space $\Omega$. Let $p_i$ be the proportion of the $i$th distinct string in $\Psi$, such that $1/|\Psi| \le p_i \le 1$ for all $i$ and $\sum_i p_i = 1$.
The Shannon entropy $S$ is given by

$$S = -\sum_{i=1}^{|\Psi|} p_i \log p_i. \qquad (6)$$
The base of the logarithm depends on the number of possible values of each element in a string. For a standard genetic algorithm this is 2, but since we may not always use a binary encoding, it seems sensible to use the natural logarithm and measure the information in 'natural units' [49]. The entropy measures clustering: it is 0 when $\Psi$ contains identical strings; otherwise it is positive, and maximal when all the strings in $\Psi$ are distinct, in which case $S = S_{\max} = \log|\Psi|$.
Consider replacing some string x with a new string y and the effects of this on the entropy, S, and the average Hamming distance, H̄. There are four cases, shown in Table 1, where we use $N_t(x)$ to denote the number of strings x at time t in the population. According to Shannon's observation that any averaging operation will monotonically increase the entropy [49], if $N_t(x) > N_t(y) + 1$, S must increase when an x is replaced by a y. The entropy monotonically increases as new information is introduced (cases a and b), and monotonically decreases as information is removed (cases a and c). The former behaviour corresponds to exploration of the search space by the genetic algorithm; the latter to convergence of the algorithm. Even when no distinct string has been added or removed, changes in S are predictable. By contrast, H̄ is unpredictable in all cases and furthermore tells us nothing about the homogeneity of the population. In fact, H̄ is equivalent to $2nq(1-q)$, where q is the proportion of high bits amongst the distinct strings in the population, and hence says very little about the distribution of the strings themselves.
4.2. Span
As a first approximation, we can measure the extent to which the population spans the search space by considering the total inter-cluster Hamming distance, $H_T$, which
compares favourably with H̄ because it will be increased by any cross-over event which adds new clusters without deleting existing ones. We define $H_T$ by rewriting $\Psi$ as $\bigcup_i \psi_i$, where $\psi_i$ is the $i$th cluster in $\Psi$. $H_T$ is given by

$$H_T = \sum_{i=1}^{k} \sum_{j=i+1}^{k} H(\psi_i, \psi_j), \qquad (7)$$

where $k$ is the number of clusters (distinct strings) in $\Psi$. $H_T$ will almost certainly be changed by mutation, reflecting the way in which these operators sample the search space. A minimal sketch of both diversity measures is given below.
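To make the two measures concrete, the following is a minimal sketch of the population entropy of Eq. (6) (in natural units) and the total inter-cluster Hamming distance of Eq. (7); the helper names are illustrative assumptions rather than the authors' code.

```python
from collections import Counter
from itertools import combinations
from math import log

def population_entropy(population):
    """Shannon entropy of Eq. (6) over a list of equal-length strings, in natural units."""
    counts = Counter(population)
    n = len(population)
    return -sum((c / n) * log(c / n) for c in counts.values())

def total_intercluster_hamming(population):
    """Total Hamming distance of Eq. (7) between the distinct strings (clusters)."""
    clusters = list(set(population))
    return sum(sum(a != b for a, b in zip(x, y))
               for x, y in combinations(clusters, 2))

pop = ['0011', '0011', '1100', '0111']
print(population_entropy(pop))            # mixed population, so the entropy is positive
print(total_intercluster_hamming(pop))    # sum of pairwise distances between the 3 distinct strings
```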
5. Experiments
The algorithm was tested on three labelling problems with and without gradient ascent and Waltz filtering. Several different parameter sets were tried. The number of iterations required to find a solution and the solution yields were recorded, as were the entropy and total inter-cluster Hamming distance. We do not give timing data for the algorithm because, first, such data are generally highly implementation dependent and, second, our main concern is not algorithm efficiency. Suffice to say that G generations of a genetic algorithm with population size P running on a single processor will require O(PG) cross-overs, mutations and fitness evaluations, all of which scale linearly with problem size; we expect the characteristic operation to be fitness evaluation in this case. In the case of the hybrid algorithm, the characteristic operation is definitely fitness evaluation since it involves a quadratic hillclimbing step.
5.1. Method
A generational algorithm was used. The initial population was created at random, and at each generation, all individuals were subject to cross-over and mutation at rates determined by the control parameters.
The population for successive generations was selected from the current population and its offspring. 'Roulette-wheel' selection was used. The algorithm terminated after a set number of iterations regardless of whether any solutions had been found. The algorithm used was a variant of Eshelman's CHC algorithm [33] in which selection for reproduction is done at random, the parent generation is allowed to mutate, and then parents and offspring compete equally for places in the next generation. 'Incest prevention' was not used. HUX cross-over was used in some experiments. The algorithm was run on the problems shown in Fig. 3. These problems can be made arbitrarily larger by adding disconnected copies; this is reasonable because the algorithm does not 'know' that the two drawings are identical: it just sees more lines. The local nature of the constraints means that disconnected copies are almost as difficult as connected copies. In the work reported here, two copies of each drawing had to be labelled. Several parameter sets were tested with and without gradient ascent, and with and without Waltz filtering [5]. Statistics were gathered over sets of 1000 trials.
5.1.1. Control parameters
Control parameters for the genetic algorithm are notoriously difficult to set [50]. The literature recommends two alternative parameter suites as set out in Table 2. These parameters are based on the standard test suite for the genetic algorithm developed by DeJong [51]. Several other sets were tried (Table 3). A schematic of one generation of the scheme described above is sketched below.
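The following is a simplified, illustrative sketch of one generation of the scheme described in Section 5.1 (random pairing, cross-over and mutation, an optional gradient-ascent refinement for the hybrid variant, then roulette-wheel selection over parents and offspring). The helper functions `crossover`, `mutate`, `gradient_ascent` and `fitness` are assumptions standing in for the operators of Sections 2-3; this is not the authors' code, and it omits details such as mutating the parent generation.

```python
import random

def roulette_select(individuals, fitnesses, n):
    """Fitness-proportionate ('roulette-wheel') selection of n survivors."""
    return random.choices(individuals, weights=fitnesses, k=n)

def one_generation(population, fitness, crossover, mutate, gradient_ascent=None,
                   crossover_rate=0.9, mutation_rate=0.03):
    """One generation: parents and offspring compete equally for places."""
    offspring = []
    shuffled = random.sample(population, len(population))    # random pairing of parents
    for a, b in zip(shuffled[::2], shuffled[1::2]):
        if random.random() < crossover_rate:
            a, b = crossover(a, b)
        offspring += [mutate(a, mutation_rate), mutate(b, mutation_rate)]
    pool = population + offspring
    if gradient_ascent is not None:                           # hybrid variant: local refinement step
        pool = [gradient_ascent(x) for x in pool]
    return roulette_select(pool, [fitness(x) for x in pool], len(population))
```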
Fig. 3. Test drawings.
Table 2
Parameter sets from the literature

                    DJS (DeJong and Spears [52])    Gref (Grefenstette [53])
Population size     100                             30
Cross-over type     2 point                         Uniform
Cross-over rate     0.6                             0.9
Mutation rate       0.001                           0.01
Table 3
Additional parameter sets

                    Set A      Set B      Set C    Set D     Set E
Population size     100        100        100      100       100
Cross-over type     Uniform    Weighted   HUX      1 point   2 point
Cross-over rate     0.9        0.9        0.9      0.9       0.9
Mutation rate       0.03       0.03       0.03     0.03      0.03
5.2. Results
The results are summarised in Tables 4 and 5 (no consistent labellings were found for the impossible object). The algorithm performed best with gradient ascent, and especially well when this was combined with multi-point crossover (Sets D and E), having the highest convergence rate and highest yields. Waltz filtering completely confounded the algorithm. The multi-point cross-overs generally outperformed the uniform ones.
5.2.1. Progress measures
Fig. 5 shows sample plots of the maximum fitness, entropy and total inter-cluster Hamming distance of single successful (left column) and unsuccessful (right column) trials. The correlation between the entropy and the total inter-cluster Hamming distance was found to be high (above 0.9) with the gradient ascent hybrid and lower with the plain algorithm (around 0.7). The correlation between the two measures did not depend on the success of the algorithm. Fig. 6 shows the average population entropy over 1000 trials for plain and hybrid algorithms.
5.3. Discussion
5.3.1. Labelling
The most convincing results were produced when the algorithm was augmented by gradient ascent. All populations converged within five generations on average. This might suggest that the rôle of the genetic algorithm is not significant. However, 100,000 restarts of gradient ascent from the same initial conditions only resulted in 84 and 59 consistent labellings for each problem (about 8 and 5%). It is quite clear from this that the hillclimber is getting stuck in local optima, an escape route from which is provided by the genetic algorithm. Yields were highest with multi-point cross-over: this suggests that the algorithm is combining consistent sub-labellings, something which uniform cross-overs would impair. The number of generations to convergence (five) compares favourably with the 20 or so needed by the 'multi-niche crowding' algorithm used by Vemuri [46].
The failure of the algorithm with Waltz filtering may appear surprising: Waltz filtering is known to prune the search space of consistent labellings. However, genetic algorithms work by exploring the fitness landscape; Waltz filtering sharpens this landscape since partially consistent labellings are regarded as being unacceptable. Thus the algorithm is faced with a landscape consisting of several deep troughs, the local minima, from which it cannot readily escape through mutation. The population rapidly converges and no progress can be made.
5.3.2. Similarity of solutions
The solutions found tended to be invariant with respect to FORK junctions. The results of a typical trial
Table 4
Results for the wedding cake problem

                        DJS      Gref     Set A    Set B    Set C    Set D    Set E
Standard           c    2.30%    17.8%    29.3%    30.2%    30.4%    35.5%    38.8%
                   ȳ    0.06     0.54     2.10     2.27     1.87     3.17     3.45
                   ḡ    595      528      281      269      305      237      245
With gradient      c    99.2%    76.1%    99.4%    97.8%    99.2%    100%     100%
ascent             ȳ    17.0     3.34     17.3     13.5     17.6     25.2     33.0
                   ḡ    2.47     3.45     2.37     2.54     2.34     2.29     2.22

Note: c is the proportion of trials yielding consistent labellings, ȳ is the average solution yield over all trials, ḡ is the average generation at which the first solutions are found. No solutions were found with Waltz filtering (c: 0% in all cases).
Table 5
Results for the groove 2 problem

                        DJS      Gref     Set A    Set B    Set C    Set D    Set E
Standard           c    3.80%    23.3%    38.3%    37.4%    33.3%    42.6%    42.9%
                   ȳ    0.04     0.34     1.02     0.99     0.80     1.11     1.10
                   ḡ    687      508      230      270      250      244      224
With gradient      c    98.6%    75.9%    99.2%    99.4%    98.4%    99.9%    99.9%
ascent             ȳ    9.78     3.23     15.1     13.4     15.3     17.8     19.8
                   ḡ    2.96     4.23     2.76     2.77     2.77     2.47     2.61

Note: c is the proportion of trials yielding consistent labellings, ȳ is the average solution yield over all trials, ḡ is the average generation at which the first solutions are found.
Fig. 4. Related labellings. Labellings of line-triples with strong chromosomal linkage (proximity) found in 11 distinct solutions. Note that the lines incident at FORK junctions only have one label, but the others may have several. Lines are labelled in numerical order.
which found 11 distinct labellings for one of the two 'wedding cakes' are given in Fig. 4. The convex interpretation of the two FORKs predominates. This cannot be explained simply by the proximity of the arcs in the drawing (and hence their strong linkage in the chromosomes), since other arc-groups (e.g. 15-17) do not show this consistency. It is likely that a random change in the labelling of a consistently labelled junction will yield a less good labelling. Consider an ELL junction: there are 16 combinatorial labelling possibilities; six have Hamming distances of zero from the Huffman dictionary (i.e. they are consistent), and ten have Hamming distances of one; none have Hamming distances of two. This means that a random replacement of a consistent labelling has a probability of 5/15 ≈ 0.33 of yielding another consistent labelling and a probability of 10/15 ≈ 0.67 of yielding a labelling with a single error. By contrast, a FORK junction has 64 combinatorial possibilities of which five are consistent; the outcomes of a replacement of a consistent labelling are: another consistent labelling with probability 4/63 ≈ 0.06, a labelling with Hamming distance one with probability 39/63 ≈ 0.62, or a labelling with Hamming distance two with probability 20/63 ≈ 0.32. Thus, the expectation of the Hamming distance from a consistent labelling following a labelling change is about 0.67 for an ELL junction and 1.25 for a FORK junction, so FORKs can be said to be more strongly constrained than ELLs.
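As a quick check of the expectations just quoted, using only the junction counts given in the text:

```python
# Expected Hamming distance after randomly replacing a consistent labelling,
# using the counts quoted above (15 alternatives for an ELL, 63 for a FORK).
ell_expectation = (0 * 5 + 1 * 10) / 15             # ~0.67
fork_expectation = (0 * 4 + 1 * 39 + 2 * 20) / 63   # ~1.25
print(ell_expectation, fork_expectation)
```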
We would therefore expect the labellings of FORK junctions to be relatively immune to the effects of gradient ascent, cross-over and selection; and the final population will probably only contain individuals with one labelling for any particular FORK. Our results reinforce the findings of Trueswell and others with respect to the propagation of interpretation. Trueswell and coworkers have suggested that rapid disambiguation occurs in regions of strong constraint [54]; Kawabata has suggested that a local interpretation tends to propagate when humans are faced with ambiguous scenes [23]. With this in mind, FORK junctions can be seen as models for strongly constrained localities which tend to dictate the interpretation of their surroundings.
Fig. 5. Measurements on a genetic algorithm. Left column: successful run, right column: unsuccessful run. A log scale is used for the inter-cluster Hamming distance.
This chimes with the notion that the alternative interpretations of a drawing should all be plausible given a priori evidence, and suggests that the search can be controlled by seeding the initial population appropriately.
5.3.3. Progress measures
As can be seen from Fig. 5, for populations of 100 individuals, the entropy always starts at 4.6. This is reassuring: the first generation is initialised at random, and for a population size of 100, the maximum entropy is ln 100 = 4.61. As the population becomes saturated, the entropy usually falls to some minimum below about 2, but the variations in entropy and total inter-cluster Hamming distance after saturation indicate that the algorithm is still attempting to explore the search space. The presence of a set of relatively fit individuals reduces the likelihood that new chromosomes will persist.
Some, but not all, of the major peaks in entropy coincide with jumps in the maximum fitness, i.e. finding a new optimum. Those peaks which do not presumably represent unsuccessful forays in the search space. Those peaks which do coincide with jumps in maximum fitness may either precede or follow them. This can be explained by proposing several methods by which new optimal solutions can arise. The algorithm may explore some fruitful avenue in the search space, causing an increase in entropy, then an optimal solution may be found following a cross-over or mutation. Thus an entropy peak can precede a fitness jump. Alternatively, a new solution may arise de novo without extensive search. There will be a fitness jump with no entropy peak. However, if the copy number of the new solution increases over the next few
generations, the entropy peak will succeed the fitness jump. A peak occurs because the initial copy number is 1. Replacing a string from a large cluster with one from a smaller one will increase the entropy, but at some point the cluster containing the new string becomes sufficiently large that adding to it reduces the entropy. Hence the peak.
Fig. 6 shows that the behaviour of the entropy is remarkably consistent between trials: there is an abrupt decrease from the maximum to around 2/3 of the maximum over the first few generations, followed by a fall to some relatively constant minimum value (<2) after 20 to 40 generations. This minimum is typically lower (<1) in successful trials. New optima are rarely found once the entropy minimum has been reached. The initial selection removes most of the diversity from the population: the total inter-cluster Hamming distance falls from around 100,000 to around 1000 and the entropy loses 1/3 of its initial value. This is almost certainly the reason for the high correlations observed between entropy and total inter-cluster Hamming distance. The especially high correlations observed with gradient ascent may arise from the fact that the clusters are relatively stable since they all represent locally optimal solutions.
6. Graph matching
To provide some additional experiments we focus on the problem of graph matching. We furnish this example to illustrate how the performance of the genetic algorithm scales when both the number of available labels and the number of ambiguous solutions increase.
Fig. 6. Average entropy of the population for (a) 1000 runs of the plain algorithm and (b) 200 runs of the algorithm with gradient ascent. Lines between points are drawn for clarity: the data are discrete. Dashed lines indicate 1 standard deviation on either side of the solid lines.
Table 6
Algorithm variations used

Experiment        A         B          C         D         E        F        G
Algorithm         GA+GD     GA+GD      GA        CHC+GD    CHC      GA+GD    Restarts
Population        100       100        100       50        50       100      100
Iterations        10        10         4000      10        6000     10       10,000
Crossover         2 point   Geometric  2 point   HUX       HUX      None     –
Crossover rate    0.9       0.9        0.9       1.0*      1.0*     0        –
Mutation rate     0.3       0.3        0.3       0.35      0.35     0.3      –

*In fact, because of incest prevention, the effective cross-over rate is only about 0.3.
We adopt a simplified version of the inexact matching criterion developed by Wilson and Hancock [10]. In our formulation, we consider only symbolic, i.e. relational, constraints: there is no dependence on node attributes. The basic idea underlying this consistency measure is to compare the symbolic matches residing on the neighbourhoods of a data graph with their counterparts in a model graph. Suppose that the data graph $G_1 = (V_1, E_1)$ has node set $V_1$ and edge set $E_1$. In order to accommodate the possibility of clutter nodes in the data graph, we use a null label 0 to augment the set of model graph nodes. The basic unit of comparison is the neighbourhood, which consists of the nodes connected to a centre object $j$ by data graph edges, i.e. $C_j = \{j\} \cup \{i \mid (i,j) \in E_1\}$. If the model graph is denoted by $G_2 = (V_2, E_2)$, then the state of match between the two graphs is represented by the function $f : V_1 \rightarrow V_2 \cup \{0\}$. The matched realisation of the neighbourhood $C_j$ is represented by the configuration of symbols $\Gamma_j = \bigcup_{i \in C_j} f(i)$.
Wilson and Hancock's basic idea was to invoke the concept of a label-error process to facilitate the comparison of the matched neighbourhoods in the data graph with their counterparts in a model graph. This label-error process assumes that mis-assigned matches occur with a probability $p$ while null-matches occur with a probability $l$. The consequence of this model is that the consistency between the matched data graph neighbourhood $\Gamma_j$ and the model graph neighbourhood $S_k$ is gauged by two quantities. The first of these is the Hamming distance $H(\Gamma_j, S_k) = \sum_{l \in S_k} (1 - \delta_{f(l),l})$ between the assigned matches and the matches demanded by the 'dictionary item' $S_k$. The second quantity is the number of null matches $\Phi(\Gamma_j)$ currently assigned to the nodes of the data-graph neighbourhood $C_j$. These Hamming distances are used to compute a global probability of match using the following formula:

$$P_G = \frac{1}{|V_1| \times |V_2|} \sum_{j \in V_1} (1-p)^{|C_j|} \sum_{k \in V_2} \exp\left[ -\bigl( \alpha\, \Phi(\Gamma_j) + \beta_e\, H(\Gamma_j, S_k) \bigr) \right]. \qquad (8)$$

The exponential constants appearing in the above expression are related to the uniform probability of matching errors and the null-match probability in the following manner:

$$\beta_e = \ln \frac{1-p}{p} \qquad (9)$$

and

$$\alpha = \ln \frac{(1-p)(1-l)}{l}. \qquad (10)$$

The parameter $p$ is gradually reduced towards zero with increasing iterative epochs of the genetic algorithm. This has the effect of gradually hardening the constraints residing in the dictionary. In particular, departures from
Fig. 7. Data graph. Nodes 20 and 21 are clutter nodes.
Table 7
Results for graph matching 1: l = 0.0001

Experiment                  A          B          C          D          E          F          G
Evaluations per individual  13,400     –          11,200     2380       2670       5080       10,000
Maximum fitness             0.590313   0.590313   0.544882   0.590313   0.590267   0.590313   0.227017
Number of trials            100        100        1          100        5          100        53
Average yield               82.6%      63.1%      1%         48.2%      12%        63%        1%
Mean fitness                =max       =max       0.440464   =max       0.464111   =max       0.205699
Modal fitness               =max       =max       0.454085   =max       =max       =max       =max
Table 8
Results for graph matching 2: l automatically assigned (0.095338)

Experiment                  A          B          C          D          E          F          G
Evaluations per individual  13,400     –          11,200     3150       2660       5080       10,000
Maximum fitness             0.361458   0.361458   0.339230   0.361458   0.359988   0.361458   0.142249
Number of trials            100        100        1          100        1          100        74
Average yield               81.8%      63.1%      1%         82.8%      14%        62%        1%
Mean fitness                =max       =max       0.273756   =max       0.289400   =max       0.136558
Modal fitness               =max       =max       0.289177   =max       0.314319   =max       =max
zero Hamming distance become increasingly energetically unfavourable. Once $p < l$, residual matching errors migrate into the null category. As we shall demonstrate later, this induces a phase transition which manifests itself as a dip in the different diversity plots.
The quantity $P_G$ lends itself naturally to the definition of a population membership probability. Suppose that $P_G^{(i)}$ denotes the global configurational probability for the $i$th member of the pool (population) of graphs. By normalising the sum of clique configuration probabilities over the population of matches, we arrive at the following probability for randomly admitting the $i$th solution to the pool of graphs $P$:

$$P_s = \frac{P_G^{(i)}}{\sum_{i \in P} P_G^{(i)}}. \qquad (11)$$
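The following is a minimal sketch of the global match probability of Eqs. (8)-(10). It assumes the neighbourhood Hamming distances H(Γ_j, S_k) and null-match counts Φ(Γ_j) have already been computed for the current match; the function name and argument layout are illustrative assumptions rather than the authors' implementation.

```python
from math import exp, log

def global_match_probability(hamming, null_counts, neighbourhood_sizes, p, l):
    """P_G of Eq. (8). hamming[j][k] holds H(Gamma_j, S_k) for data node j and model node k;
    null_counts[j] holds Phi(Gamma_j); neighbourhood_sizes[j] holds |C_j|."""
    beta_e = log((1 - p) / p)               # Eq. (9)
    alpha = log((1 - p) * (1 - l) / l)      # Eq. (10)
    n_data, n_model = len(hamming), len(hamming[0])
    total = 0.0
    for j in range(n_data):
        inner = sum(exp(-(alpha * null_counts[j] + beta_e * hamming[j][k]))
                    for k in range(n_model))
        total += (1 - p) ** neighbourhood_sizes[j] * inner
    return total / (n_data * n_model)

# Toy example: two data nodes, two model nodes, with illustrative precomputed distances.
print(global_match_probability(hamming=[[0, 2], [1, 0]],
                               null_counts=[0, 1],
                               neighbourhood_sizes=[3, 2],
                               p=0.1, l=0.0001))
```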
6.1. Experiments
In our evaluation of the graph matching process, we have focussed on three optimisation algorithms. These were the basic genetic search procedure described in the previous subsection, Eshelman's CHC algorithm [33], and multiple restarts of gradient ascent. In Table 6 we refer to these as 'GA', 'CHC' and 'Restarts'. We have also investigated the effects of adding a hill-climbing step ('GD' in Table 6) to the genetic search procedure and the CHC algorithm. Our experiments have been conducted with a 20-node synthetic nearest-neighbour (planar) graph to which 10% clutter nodes have been added to simulate the effects of noise, as shown in Fig. 7. Again, algorithm efficiency is not our primary concern, but we note here that for graph matching, the fitness evaluation is quadratic in the number of nodes, and that the hill-climbing step in this case is quartic.
The algorithm variants and the associated parameters used in our experiments are summarised in Table 6. It must be stressed that no attempt was made to fine-tune any of the algorithm parameters. In all cases, the probability of null-matches ($l$) was set at either 0.0001 (effectively zero) or half the relative cardinalities of the graphs, $||V_D| - |V_M||\,/\,0.5(|V_D| + |V_M|)$. For genetic algorithms with cross-over, about 10,000 cost function evaluations were allowed. Omitting the cross-over step reduces the number of evaluations required; the other algorithms were run to about 5000 evaluations. Note that the CHC algorithm uses half the population of the standard genetic algorithm. Experiment F used no cross-over: it was gradient ascent augmented with mutation and stochastic selection operations.
6.2. Discussion
6.2.1. Matching
The maximal fitness score without null-matches (l = 0.0001) (see Table 7) is 0.590313; Fig. 11 shows some sample matches found. The nodes in the lower
Fig. 8. Measurements on a genetic algorithm. Left column: l = 0.0001, right column: l = 0.095338. A log scale is used for the inter-cluster Hamming distance.
(uncorrupted) portion of the graph are consistently correctly matched. The nodes which are most often mismatched are 1, 12, 14, 20 and 21. All of these nodes are either clutter nodes or connected by more than 1 clutter edge. Since our cost function only allows 1 null-match
per superclique, it is not surprising that nodes 1, 12 and 14 are mismatched since the cardinality of matching supercliques may not differ by more than 1. Nodes 20 and 21 should be labelled with nulls, but since our cost function discriminates against null-matches, we do not
Fig. 9. Measurements on CHC with gradient descent. Left column: l = 0.0001, right column: l = 0.095338. A log scale is used for the inter-cluster Hamming distance.
expect the algorithm to get these right. The maximal fitness score with null-matches (l = 0.095338) (see Table 8) is 0.361458. Although this appears lower than 0.590313, it actually reflects the same number
of errors since the value of l contributes to the cost function. We may say therefore that the performance of the algorithm on matching is as good as can be expected.
Fig. 10. Measurements on CHC without gradient descent. Left column: l = 0.0001, right column: l = 0.095338. A log scale is used for the inter-cluster Hamming distance.
When null-matching was excluded, the genetic algorithm with gradient ascent, CHC with gradient ascent, and 'stochastic' gradient ascent gave the best matching results. Results from the genetic algorithm without gradient ascent were slightly suboptimal. Multiple restarts of gradient ascent did not yield good matches. When null-matching was allowed, all the gradient ascent methods except multiple restarts found optimal solutions. The
Fig. 11. Typical matches found. This figure shows a random sample of the many solutions of fitness 0.590313.
non-gradient ascent methods performed less well, and multiple restarts performed poorly. Thus, as far as quality of matching is concerned, any of the stochastic optimisers with gradient ascent are adequate for the task.
6.2.2. Ambiguity
The test problem is highly ambiguous: a rough calculation shows that there are tens of thousands of possible solutions. The solution yields from the genetic algorithm with two-point cross-over and gradient ascent were high for both non-null and null matching. Fig. 11 gives typical examples. The yields for the genetic algorithm with geometric cross-over and stochastic gradient ascent were around 20% lower, and those of the pure genetic algorithm and multiple restarts very low. The main conclusion that can be drawn is that the hill-climbing step is important for sustaining population diversity and maintaining ambiguous solutions. This is attributable to the fact that it effectively distributes the solutions in the population to the local optima of the fitness functions. This has the effect of 'pushing' the solutions apart. In this respect it outperforms CHC alone.
6.2.3. Diversity measures
There are several striking differences between the diversity plots for the graph-matching problem shown in Fig. 8 and those already shown for line-labelling in Fig. 5. In the first instance, the fitness measure grows more slowly with iteration number. This feature is attributable to the greater number of labels employed in the case of graph matching. In other words, there are more label swaps to be considered. However, although the process is slower to converge, the population diversity is significantly larger. This is reflected by both the entropy and the inter-cluster Hamming distance. Rather than rapidly decaying, in the case of graph matching both measures are relatively flat, only displaying a mid-epoch dip.
We now make some comments concerning the differences between the diversity measures for the genetic algorithm and the CHC algorithm. Fig. 8 shows the
diversity measures for the genetic algorithm with hill-climbing. Figs. 9 and 10 give the diversity measures for the CHC algorithm with and without gradient ascent. The pronounced dip in entropy and total inter-cluster Hamming distance occurs roughly halfway through the algorithm. This is confirmed by other experiments with higher iteration limits. This structure corresponds to a phase transition induced by the onset of the condition p < l as the error-probability is annealed with iteration number. It is only at this point that null-labels enter the match. These plots confirm the conclusion that gradient ascent sustains diversity better than CHC. Combining gradient ascent with CHC results in further improvements.
7. Conclusion
Consistent labelling problems frequently have more than one solution. In order that global contextual information be brought to bear in image analysis, several interpretations of locally ambiguous regions should be maintained. We have argued that most work in the field has aimed at disambiguating such regions early in the interpretation process, using only local evidence.
Our primary contribution has been to show that the genetic algorithm is a robust tool for solving the line-labelling problem and hence other consistent labelling problems. When combined with gradient ascent and using a multi-point crossover, the algorithm robustly finds multiple solutions to the problem. These solutions are related by common labellings of FORK junctions, which are the most strongly constrained of all junction types considered. The number of generations to convergence of the algorithm compares very favourably with that reported for multi-niche crowding, which also finds several solutions [46]. These conclusions are reinforced by the graph-matching study.
There is no solid theory to predict the behaviour of genetic algorithms or suggest appropriate parameter values. As a result, most of the run-time performance measures found in the literature are naïve. We have proposed three run-time performance measures: the maximum fitness of the population, the Shannon entropy of the population, and the total Hamming distance between distinct clusters of individuals. The maximum fitness and Shannon entropy provide useful information about the status of the algorithm. The total inter-cluster Hamming distance appears to be highly correlated with the Shannon entropy, especially with the gradient ascent hybrid. The results to date indicate that a population with a Shannon entropy of less than 2 has become saturated, and that new solutions are unlikely to emerge from such a population for some considerable time. Furthermore, most of the diversity in the population disappears in the first few iterations.
References
[1] D. Marr, Vision, Freeman, New York, 1982. [2] R.M. Haralick, L.G. Shapiro, The consistent labelling problem: Part 1, IEEE Pattern Anal. Mach. Intell. 1 (1979) 173-184. [3] R.M. Haralick, L.G. Shapiro, The consistent labelling problem: Part 2, IEEE Pattern Anal. Mach. Intell. 2 (1980) 193-203. [4] R.M. Haralick, G.L. Elliott, Increasing search tree efficiency for constraint satisfaction problems, Proc. 6th Int. Joint Conf. on Art. Intell., 1979, 356-364. [5] D. Waltz, Understanding line drawings of scenes with shadows, in: P.H. Winston (Ed.), Psychology of Computer Vision, McGraw-Hill, New York, 1975, pp. 19-91. [6] R.A. Hummel, S.W. Zucker, On the foundations of relaxation labeling processes, IEEE Pattern Anal. Mach. Intell. 5 (1983) 267-287. [7] O.D. Faugeras, M. Berthod, Improving consistency and reducing ambiguity in stochastic labeling: An optimisation approach, IEEE Pattern Anal. Mach. Intell. 3 (1981) 412-424. [8] E.R. Hancock, J. Kittler, Discrete relaxation, Pattern Recognition 23 (1990) 711-733. [9] R.C. Wilson, E.R. Hancock, Graph matching by discrete relaxation, in: E.S. Gelsema, L.N. Kanal (Eds.), Pattern Recognition in Practice, 4, Elsevier, Amsterdam, 1994, pp. 165-176. [10] R.C. Wilson, E.R. Hancock, Structural matching by discrete relaxation, IEEE Pattern Anal. Mach. Intell. 19 (1997) 634-648. [11] C.D. Gelatt, S. Kirkpatrick, M.P. Vecchi, Optimisation by simulated annealing, Science 220 (1983) 671-680. [12] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Pattern Anal. Mach. Intell. 6 (1984) 721-741. [13] D. Geiger, F. Girosi, Parallel and deterministic algorithms from MRFs: surface reconstruction, IEEE Pattern Anal. Mach. Intell. 13 (1991) 401-412. [14] A.L. Yuille, J.J. Kosowsky, Statistical physics algorithms that converge, Neural Comput. 6 (1994) 341-356. [15] J.H. Holland, Adaptation in Natural and Artificial Systems, MIT Press, Cambridge, MA, 1975. [16] N.J. Pearlmutter, M.C. MacDonald, M.S. Seidenberg, The lexical nature of syntactic ambiguity resolution, Psychol. Rev. 101 (1994) 676-703. [17] A.H. Kawamoto, Nonlinear dynamics in the resolution of lexical ambiguity: A parallel distributed processing account, J. Memory Language 32 (1993) 474-516. [18] N.J. Pearlmutter, M.C. MacDonald, Individual differences and probabilistic constraints in syntactic ambiguity resolution, J. Memory Language 34 (1995) 521-542. [19] J.A. Feldman, D.H. Ballard, Connectionist models and their properties, Cognitive Sci. 6 (1982) 205-254. [20] M. Riani, F. Masulli, E. Simonotto, Neural network models of perceptual alternation of ambiguous patterns, in: S. Levialdi, V. Cantoni, L.P. Cordella, G. Sanniti di Baja (Eds.), Progress in Image Analysis, World Scientific, Singapore, 1990, pp. 751-758. [21] M. Riani, E. Simonotto, Stochastic resonance in the perceptual interpretation of ambiguous figures - a neural network model, Phys. Rev. Lett. 72 (1994) 3120-3123.
[22] W. Bialek, M. Deweese, Random switching and optimal processing in the perception of ambiguous signals, Phys. Rev. Lett. 74 (1995) 3077–3080.
[23] N. Kawabata, Visual fixation points and depth perception, Vision Res. 18 (1978) 853–854.
[24] N. Kawabata, T. Mori, Disambiguating ambiguous figures by a model of selective attention, Biol. Cybernet. 67 (1992) 417–425.
[25] K.L. Horlitz, A. O'Leary, Satiation or availability – effects of attention, memory and imagery on the perception of ambiguous figures, Perception Psychophys. 53 (1993) 668–681.
[26] F.G. Callari, F.P. Ferrie, Active recognition: using uncertainty to reduce ambiguity, Proceedings of the 13th International Conference on Pattern Recognition, 1996, pp. 925–929.
[27] L.R. Williams, A.R. Hanson, Perceptual completion of occluded surfaces, Comput. Vision and Image Understanding 64 (1996) 1–20.
[28] K. Kumaran, D. Geiger, L. Parida, Visual organisation for figure/ground separation, CVPR 1996, pp. 155–160.
[29] L.S. Davis (Ed.), A Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York, 1991.
[30] C. Graves, D. Whitley, R. Beveridge, K. Mathias, Test driving three 1995 genetic algorithms: new test functions and geometric matching, J. Heuristics 1 (1995) 77–104.
[31] R.C. Wilson, A.D.J. Cross, E.R. Hancock, Genetic search for structural matching, in: B. Buxton, R. Cipolla (Eds.), Proceedings of the Fourth European Conference on Computer Vision, vol. 1, 1996, pp. 514–525.
[32] G. Syswerda, Uniform crossover in genetic algorithms, in: Proceedings of the Third International Conference on Genetic Algorithms, 1989, pp. 2–9.
[33] L.J. Eshelman, The CHC adaptive search algorithm: How to have safe search when engaging in nontraditional genetic recombination, in: G.J.E. Rawlins (Ed.), Foundations of Genetic Algorithms, vol. 1, Morgan Kaufmann, Los Altos, CA, 1991, pp. 265–283.
[34] G. Rudolph, Convergence analysis of canonical genetic algorithms, IEEE Trans. Neural Networks 5 (1994) 96–101.
[35] D.A. Huffman, Impossible objects as nonsense sentences, in: B. Meltzer, D. Michie (Eds.), Machine Intelligence, vol. 6, Edinburgh University Press, 1971, pp. 295–323.
[36] M.B. Clowes, On seeing things, Artificial Intelligence 2 (1971) 79–116.
[37] K. Sugihara, Picture language for skeletal polyhedra, Comput. Graphics Image Process. 8 (1978) 382–405.
[38] J. Malik, Interpreting line drawings of curved objects, Int. J. Comput. Vision 1 (1987) 73–103.
[39] L.R. Williams, Topological reconstruction of a smooth manifold-solid from its occluding contour, in: ECCV 92, 1992, pp. 36–47.
[40] L.M. Kirousis, Effectively labeling planar projections of polyhedra, IEEE Pattern Anal. Mach. Intell. 12 (1990) 123–130.
[41] P. Parodi, G. Piccioli, 3D shape reconstruction by using vanishing points, IEEE Pattern Anal. Mach. Intell. 18 (1996) 211–217.
[42] E.R. Hancock, An optimisation approach to line labelling, in: S. Impedovo (Ed.), Progress in Image Analysis and Processing, vol. 3, World Scientific, Singapore, 1994, pp. 159–165.
[43] A.S. Fraser, Simulation of genetic systems by automatic digital computers, Austral. J. Biol. Sci. 10 (1957) 484–491.
[44] H.J. Bremermann, The evolution of intelligence. The nervous system as a model of its environment, Technical report, Department of Mathematics, University of Washington, Contract No. 477(17), 1958.
[45] R. Toombs, J. Reed, N.A. Barricelli, Simulation of biological evolution and machine learning, J. Theoret. Biol. 17 (1967) 319–342.
[46] V.R. Vemuri, W. Cedeño, T. Slezak, Multiniche crowding in genetic algorithms and its application to the assembly of DNA restriction-fragments, Evolutionary Comput. 2 (1995) 321–345.
[47] D.R. Bull, D. Beasley, R.R. Martin, A sequential niche technique for multimodal function optimisation, Evolutionary Comput. 1 (1993) 101–125.
[48] S.J. Louis, G.J.E. Rawlins, Syntactic analysis of convergence in genetic algorithms, in: D. Whitley (Ed.), Foundations of Genetic Algorithms, vol. 2, Morgan Kaufmann, Los Altos, CA, 1993, pp. 141–151.
[49] C.E. Shannon, A mathematical theory of communication, Bell System Tech. J. 27 (1948) 379–423.
[50] L.J. Eshelman, J.D. Schaffer, R.A. Caruna, R. Das, A study of control parameters affecting online performance of genetic algorithms for function optimisation, in: Proceedings of the Third International Conference on Genetic Algorithms, 1989, pp. 51–60.
[51] K.A. DeJong, An analysis of the behaviour of a class of genetic adaptive systems, Ph.D. Thesis, University of Michigan, Department of Computer and Communication Sciences, 1975.
[52] K.A. DeJong, W.M. Spears, An analysis of the interacting rôles of population size and crossover in genetic algorithms, in: Proceedings of the First Workshop on Parallel Problem Solving from Nature, Springer, Berlin, 1990.
[53] J.J. Grefenstette, Optimisation of control parameters for genetic algorithms, IEEE SMC 16 (1986) 122–128.
[54] M.K. Tanenhaus, J.C. Trueswell, S.M. Garnsey, Semantic influences on parsing: use of thematic rôle information in syntactic disambiguation, J. Memory Language 33 (1994) 285–318.
About the Author*RICHARD MYERS took his B.A. in Natural Sciences from the University of Cambridge in 1989. In 1995 he gained an M.Sc. with distinction in Information Processing at the University of York. He is currently working towards a D.Phil. in the Computer Vision Group at the Department of Computer Science at the University of York. The main topic of his research is the use of genetic algorithms to solve consistent labelling problems arising in the machine vision domain. In 1997 he spent two months working at NEC Corporation in Kawasaki, Japan, sponsored by a REES/JISTEC fellowship. His interests include evolutionary computation, perceptual organisation and labelling problems.
About the Author*EDWIN HANCOCK gained his B.Sc. in physics in 1977 and Ph.D. in high energy nuclear physics in 1981, both from the University of Durham, UK. After a period of postdoctoral research working on charm-photo-production experiments at the Stanford Linear Accelerator Centre, he moved into the fields of computer vision and pattern recognition in 1985. Between 1981 and 1991, he held posts at the Rutherford-Appleton Laboratory, the Open University and the University of Surrey. Dr. Hancock is currently Reader in the Department of Computer Science at the University of York. He leads a group of some 15 researchers in the areas of computer vision and pattern recognition. He has published about 180 refereed papers in the fields of high energy nuclear physics, computer vision, image processing and pattern recognition. He was awarded the 1990 Pattern Recognition Society Medal and received an honorable mention in 1997. Dr. Hancock serves as an Associate Editor of the journal Pattern Recognition and has been a guest editor for the Image and Vision Computing Journal. He is currently guest-editing a special edition of the Pattern Recognition journal devoted to energy minimisation methods in computer vision and pattern recognition. He chaired the 1994 British Machine Vision Conference and has been a programme committee member for several national and international conferences.
Pattern Recognition 33 (2000) 705–714
Probabilistic relaxation and the Hough transform J. Kittler* Centre for Vision, Speech and Signal Processing, School of Electronic Engineering, Information Technology and Mathematics, University of Surrey, Guildford GU2 5XH, UK Received 15 March 1999
Abstract

We discuss the recent developments in probabilistic relaxation which is used as a tool for contextual sensory data interpretation. The relationship of this technique with the Hough transform is then established, focusing on the Generalised Hough Transform (GHT). We show that the label probability updating formula of the probabilistic relaxation process exploiting binary relations between object primitives, under the assumption that the primitives convey weak context, exhibits very close similarity to the voting function employed by a computationally efficient GHT. We argue that the relationship could be exploited by importing the positive features of the respective techniques to benefit one another. Specific suggestions for enhancing the respective techniques are mentioned. They include the adoption of the representational efficiency of the Hough transform to reduce the computational complexity of probabilistic relaxation. Vice versa, in the case of the Generalised Hough transform it is pointed out that the effect of an unknown object transformation could be dealt with by means of a parallel cooperative interpretation rather than by means of an exhaustive search through the parameter space of a transformation group. This has implications in terms of both the storage and computational requirements of the GHT. It also opens the possibility of using the GHT for detecting objects subject to more complex transformations such as affine or perspective. The relationship also suggests the possibility of using alternative voting functions which may speed up the object detection process. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Hough transform; Probabilistic relaxation; Evidence accumulation; Binary relations
1. Introduction Probabilistic relaxation refers to a family of techniques designed to achieve a global consistency when interpreting a network of interacting objects. Its origins go back to the seminal paper of Rosenfeld et al. [1] which in turn was inspired by the work of Waltz [2] concerned with discrete relaxation. Waltz studied the problem of how to impose a global consistency on the labelling of idealised line drawings where the objects or object primitives are assumed to be given and therefore can be labelled unambiguously. Rosenfeld et al. extended this work to a more realistic scenario where the objects to be labelled have to
* Tel.: 01483-259294; fax: 01483-259554. E-mail address: [email protected] (J. Kittler)
be extracted from noisy data and therefore their identity could be genuinely ambiguous. Their endeavour resulted in the replacement of the hard labels used by Waltz by label probabilities. This softening of labels appeared to have also computational benefits as the labelling process could be accomplished by a local iterative updating of each label probability, instead of the exhaustive search required by discrete relaxation. The potential and promise of probabilistic relaxation as demonstrated in Ref. [1] spurred a lot of interest in the approach which has been sustained over the last two decades. The effectiveness of probabilistic relaxation has been demonstrated on numerous applications including line and edge detection [3–6]. For a comprehensive list of applications the reader is referred to the review article of Kittler and Illingworth [7]. Notwithstanding its practical potential, the early applications of probabilistic relaxation unveiled many
0031-3203/00/$20.00 © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S0031-3203(99)00081-3
theoretical problems, including inherent bias of the updating process which was exhibited in no information experiments, questions relating to convergence, interpretation of the computed probabilities, and the specification of the compatibility coefficients and support functions [8–10]. A detailed account of the attempts to overcome these problems can be found in Ref. [7]. Most of these problems were overcome in two key papers published in the 1980's [11,12]. In Ref. [11] Hummel and Zucker laid down the theoretical foundations of probabilistic relaxation by formally defining the notion of consistency and by showing that under certain assumptions the optimisation of a simple functional was synonymous with improving the consistency of object labelling. They also developed a constrained optimisation procedure to optimise the functional by extending the work of Faugeras and Berthod [13]. In spite of this progress the relaxation process design methodology remained very heuristic until the publication of the work of Kittler and Hancock [12] which was aimed at providing theoretical underpinning of probabilistic relaxation using the Bayesian framework. It led to the development of an evidence combining formula which fuses observations and a priori contextual information in a theoretically sound manner. The polynomial combinatorial complexity of the approach has been avoided by means of introducing and developing the concept of label configuration dictionary. The methodology has been validated on problems of edge and line postprocessing [5,6]. Two important criticisms of probabilistic relaxation, namely that 1. the process does not utilise measurement information with the exception of the initialisation stage where observations are used to compute the initial, noncontextual probabilities for the candidate labels at each object, and 2. the richest source of contextual observational information contained in object relations (binary relations) is never tapped, have been overcome most recently [14,15]. In this newly developed form the probabilistic relaxation has been demonstrated to have wide applicability from relaxation problems on a lattice [16] to graph matching problems [17]. The approach contrasts with discrete relaxation techniques exemplified by the iterative conditional modes algorithm (ICM) of Besag [18] and the works of Blake [19], Blake and Zisserman [20], Koch et al. [21], and Witkin et al. [22]. Their stochastic optimisation counterpart represented by the method developed by Geman and Geman [23] is based on the simulated annealing technique introduced in Ref. [24]. More recent are the attempts to use the mean field theory [25] to simplify and
speed up the discrete optimisation process. An alternative to stochastic optimisation is offered by the idea of a label error process which has been introduced by Hancock and Kittler [26] to cope with the two fundamental problems of Waltz's original algorithm [2]: (i) inadmissible label configurations introduced by initial object labelling and (ii) the optimisation process deadlock. The idea has been extended from lattice structures to general object arrangements in Refs. [27,28]. The paper overviews the recent developments in probabilistic relaxation and then exposes the relationship of this technique with the Hough transform [29], focusing on the generalised Hough transform [30]. We show that the label probability updating formula of the probabilistic relaxation process exploiting binary relations between object primitives, under the assumption that the primitives convey weak context, exhibits very close similarity to the voting function employed by the modified GHT [31]. The relationship can be exploited by importing the positive features of the respective techniques to benefit one another. For instance, the representational efficiency of the Hough transform could suggest how to reduce the computational complexity of probabilistic relaxation. Vice versa, in the case of the Generalised Hough transform we shall demonstrate that the effect of an unknown object transformation can be dealt with by means of a parallel cooperative interpretation rather than by means of an exhaustive search through the parameter space of a transformation group. This has implications in terms of both the storage and computational requirements of the GHT. It also opens the possibility of using the GHT for detecting objects subject to more complex transformations such as affine or perspective [32]. The relationship also suggests the possibility of using alternative voting functions which may speed up the object detection process. The paper is organised as follows. First, mathematical preliminaries are introduced in Section 2. Section 3 contains a review of the recent results in probabilistic relaxation which incorporate observations on relational information about the interacting objects. Section 4 relates probabilistic relaxation to the Hough transform. Finally, Section 5 concludes with a summary of the paper.
2. Preliminaries Probabilistic relaxation addresses one of the most important aspects of machine perception, namely object recognition. The term object is understood here in a very general sense and covers not only 2D and 3D geometric entities but also physical phenomena and events. Moreover, as objects are often represented as composites of object primitives we shall view the problem of object recognition in machine perception as one of labelling
object primitives. Effectively, the same type of primitive can be labelled in different ways depending on the object it is a part of. The set of admissible labels for each object primitive will then depend on the goal of interpretation, which should specify the hypothesised objects either by means of external input from the user of the sensory system, or by object invocation from bottom up processing. Naturally, the complexity of objects may be such that a single-level representation may be inadequate for capturing the details of the object description. It will then be necessary to adopt a hierarchical representation whereby object primitives at one level become objects at a lower level of representation. Bearing this fluidity of terminology in mind, in the future discussion, rather than referring to object primitives and their identity, we shall simply talk about object labelling where a collection of objects defines some perceptually significant whole.

Let us consider a set of measurement vectors $x_j$, representing, respectively, objects $a_j$, $j = 1, \ldots, N$ arranged in a network with a particular neighbourhood system. Each component of vector $x_j$ denotes one of three types of measurements:

1. Binary relation measurements $A_{ji}^k$, $k = 1, 2, \ldots, m$ between the $j$th and $i$th objects.
2. Unary relation measurements $y_j^l$, $l = 1, 2, \ldots, r$ from which the binary relations are derived.
3. Unary relation measurements $v_j^i$, $i = 1, 2, \ldots, n$ which augment the observational evidence about node $j$ but do not serve as a basis for deriving binary relation measurements $A_{ji}^k$.

Let us arrange these measurements into vectors as follows:

$$\mathbf{A}_j = \begin{bmatrix} A_{j1} \\ \vdots \\ A_{j(j-1)} \\ A_{j(j+1)} \\ \vdots \\ A_{jN} \end{bmatrix}, \qquad (1)$$

where

$$A_{ji} = [A_{ji}^1, \ldots, A_{ji}^m]^T. \qquad (2)$$

For the unary relations we have

$$y_j = [y_j^1, \ldots, y_j^r]^T \qquad (3)$$

and

$$v_j = [v_j^1, \ldots, v_j^n]^T. \qquad (4)$$
Thus $x_j$ is an $[m(N-1)+r+n]$-dimensional vector which can be written as

$$x_j = \begin{bmatrix} v_j \\ y_j \\ \mathbf{A}_j \end{bmatrix}. \qquad (5)$$

We wish to assign each object $a_j$ a label $\theta_j$. We shall consider the problem of object labelling in the framework of Bayesian decision theory. The theoretical result underpinning the design of any object classification system is the Bayes decision rule (e.g. Ref. [33]). In its general form it specifies how best decisions about class membership of objects can be made taking into account the probability distribution of the measurements providing observational evidence about the objects. Following the conventional Bayesian approach, object $a_i$ would be assigned to class $\omega_r$ based on the information conveyed by measurement vectors $v_i$ and $y_i$ according to the minimum error decision rule [33]. However, objects by definition do not exist in isolation. Thus the distinguishing feature of object labelling problems in machine perception is that we deal with a large network of objects which interact with each other. The a priori world knowledge or context can be used to help to disambiguate decisions based simply on noisy features of individual objects. For instance, in text recognition individual characters are an integral part of larger objects such as words or sentences formed by character groups. Word dictionary and rules of grammar dictate which combinations of characters, and implicitly which individual characters, are possible. In contextual labelling we effectively associate with each object a decision-making process which attempts to combine evidence from observations made on the object together with the contextual information conveyed by the other objects in the network to deduce which label assignment is most appropriate from the point of view of the available measurement information, local constraints and global labelling consistency. Thus, in contrast, here we wish to decide about label $\theta_i$ using not only the information contained in unary relation measurements relating to object $a_i$ but also any context conveyed by the network. In other words, we wish to utilise also the binary relation measurements, i.e. the full measurement vector $x_i$ plus all the information about the other objects in the network contained in $x_j$, $\forall j \neq i$. This is a general statement of the problem, but in order to develop contextual labelling schemes our formulation will have to be somewhat more precise.

The contextual labelling problem can be formulated either as object centred or message centred interpretation. In object-centred interpretation the emphasis is on one node at a time. Contextual information is used to reduce the ambiguity of labelling a single object. Note
708
J. Kittler / Pattern Recognition 33 (2000) 705}714
that object-centred interpretation does not guarantee that the global interpretation makes sense. For example, individually most likely object categories in a character recognition problem will not necessarily combine into valid words. The use of context merely reduces the chance of the global labelling being inconsistent. In contrast, message centred interpretation is concerned with getting the message conveyed by sensory data right. In our text recognition problem the main objective of message-centred labelling would be to label characters so that each line of text gives a sequence of valid words.

Generally speaking, in message-centred interpretation we search for a joint labelling $\theta_1 = \omega_{\theta_1}, \theta_2 = \omega_{\theta_2}, \ldots, \theta_N = \omega_{\theta_N}$ which explains observations $x_1, x_2, \ldots, x_N$ made on the objects in the network. The most appropriate measure of fit between data and interpretation (but by no means the only one) is the a posteriori probability $P(\theta_1 = \omega_{\theta_1}, \ldots, \theta_N = \omega_{\theta_N} \mid x_1, \ldots, x_N)$.

The object-centred counterpart computes, instead, $P(\theta_i = \omega_{\theta_i} \mid x_1, x_2, \ldots, x_N)$, the a posteriori probability of label $\theta_i$ given all the observations, which can be rewritten as

$$P(\theta_i = \omega \mid x_1, \ldots, x_N) = \frac{p(x_1, \ldots, x_N \mid \theta_i = \omega)\,\hat{p}(\theta_i = \omega)}{p(x_1, \ldots, x_N)}, \qquad (6)$$
where p( (h "u) is the a priori probability of label h takj i ing value u. Note that the denominator in Eq. (6) can be dismissed. We can expand the "rst term of the numerator over all possible labellings in the usual fashion, i.e. p(x ,2, x Dh "u)"+ 2+ 2+ p(x ,2, x , 1 N i 1 N )1 )j )N h ,2, h ,2, h Dh "u) 1 j N i "+ 2+ 2+ p(x ,2, x Dh ,2, h 1 N 1 i )1 )j )N "u,2,h ) N P(h ,2, h ,2, h Dh "u), ∀jOi, 1 j N i (7) where ) is the set of labels admitted by object a . For i i simplicity, we shall assume that ) "Mu , u ,2, u N") ∀i, i 0 1 M where u is the null label used to label objects for which 0 no other label is appropriate. Thus we "nally "nd
where the "rst term of the product in the numerator, the conditional joint probability density function p(x ,2, x Dh ,2, h ) of measurement vectors x ,2, x 1 N 1 N 1 N models the measurement process. The second term embodies our a priori knowledge of the likelihood of various combinations of labels occurring. It is our global, world model. Thus computing the probability of a particular label u on a single object a amounts to scanning through i all the possible combinations of labels h ,2, h with 1 N label h set to u and summing up the corresponding i products of the respective joint measurement and label probabilities. Finally, an important and physically realistic assumption regarding the unary measurement process distribution is that the outcomes of measurements are conditionally independent. p(v ,y ,2, v , y D h ,2, h ,2, h ) 1 1 N N 1 i N N " < p(v ,y D h "u i). i i i h i/1
(9)
For binary relations, on the other hand, we assume that

$$p(A_{i1}, \ldots, A_{iN} \mid \theta_1, \ldots, \theta_i, \ldots, \theta_N) = \prod_{j \neq i} p(A_{ij} \mid \theta_i, \theta_j). \qquad (10)$$
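Before turning to the relaxation scheme, it is worth making the combinatorial cost of Eq. (8) concrete. The sketch below is illustrative only: the functions `likelihood` and `prior` are hypothetical stand-ins for a measurement model satisfying Eqs. (9) and (10) and for the world model. It evaluates the unnormalised contextual posterior by exhaustive enumeration, visiting on the order of $|\Omega|^N$ joint labellings, which is exactly why the simplifications of the next section are needed.

```python
from itertools import product

def contextual_posterior(i, omega, labels, likelihood, prior, N):
    """Unnormalised P(theta_i = omega | x_1..x_N), Eq. (8), by exhaustive enumeration.

    likelihood(joint) -> p(x_1..x_N | theta_1..theta_N)
    prior(joint)      -> P(theta_1..theta_N)
    'joint' is a tuple of N labels; the loop fixes theta_i = omega and sums the rest.
    """
    total = 0.0
    for joint in product(labels, repeat=N):
        if joint[i] != omega:
            continue
        total += likelihood(joint) * prior(joint)
    return total
```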
3. Probabilistic relaxation

Under some mild conditional independence assumptions concerning measurements $v_j$, $y_j$ and $A_{ij}$, $\forall j$, the object-centred labelling formulation (8) leads to an iterative probability updating formula [15]

$$P^{(n+1)}(\theta_i = \omega_{\theta_i}) = \frac{P^{(n)}(\theta_i = \omega_{\theta_i})\, Q^{(n)}(\theta_i = \omega_{\theta_i})}{\sum_{\omega_j \in \Omega} P^{(n)}(\theta_i = \omega_j)\, Q^{(n)}(\theta_i = \omega_j)}, \qquad (11)$$

where $P^{(n)}(\theta_i = \omega_{\theta_i})$ denotes the probability of label $\omega_{\theta_i}$ at object $a_i$ at the $n$th iteration of the updating process and the quantity $Q^{(n)}(\theta_i = \omega_a)$ expresses the support the label $\theta_i = \omega_a$ receives at the $n$th iteration step from the other objects in the scene, taking into consideration the binary relations that exist between them and object $a_i$. Eq. (11) represents a generic probabilistic relaxation process. After the first iteration ($n = 1$) the computed entity is the contextual a posteriori class probability $P(\theta_i = \omega_{\theta_i} \mid x_1, x_2, \ldots, x_N)$. With the increasing value of $n$ the updating scheme drives the probabilistic labelling into a hard labelling.
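One sweep of the updating rule (11) is little more than a multiply-and-normalise step per object. The fragment below is a minimal sketch, assuming label probabilities are stored per object in nested dictionaries and that the support function Q is supplied by the caller (for example the support of Eq. (12) or the sum rule of Eq. (19) derived below); the data layout is an illustrative choice, not the paper's.

```python
def relaxation_sweep(P, labels, Q):
    """One application of the updating rule (11) to every object.

    P is a dict of dicts: P[i][omega] is the current probability of label omega at object a_i.
    Q(i, omega, P) returns the support for the hypothesis theta_i = omega.
    """
    P_next = {}
    for i, dist in P.items():
        raw = {omega: dist[omega] * Q(i, omega, P) for omega in labels}
        norm = sum(raw.values())
        # Fall back to a uniform distribution if every hypothesis receives zero support.
        P_next[i] = {omega: (val / norm if norm > 0 else 1.0 / len(labels))
                     for omega, val in raw.items()}
    return P_next
```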
The support Q(n)(h "u i) is de"ned as i h Q(n)(h "u i)" + i h uhj,j|Ni
G
H
P(h "u j,∀j3N ) P(n)(h "u j)p(A Dh "u i,h "u j) j h i < j h ij i h j h , p( (h "u i) p( (h "u j) i h j h j|Ni
where p(A D h "u i,h "u j) is the compatibility coefij i h j h "cient quantifying the mutual support of the labelling (h "u i,h "u j). N denotes the index set of all nodes h i i h j excluding the node i, i.e. N "M1, 2, 2, i!1, i#1,2, NN. i
(13)
It is worth noting that when binary relations are not used the support function (12) becomes the standard evidence combining formula developed in Ref. [12],
pixel label probabilities using formula (11) contextual information would be drawn from increasingly larger neighbourhoods of each pixel. A more dramatic, complementary reduction in the computational complexity is achieved by noting that in practice many potential label con"gurations in the contextual neighbourhood of an object are physically inadmissible. By listing the admissible labellings in a dictionary, the above support function can be evaluated by summin g up only over the entries (h "ukj, ∀j3N ), ∀k i j h in the dictionary [5], i.e.
G
H
Z(uhi) P(h "u j,∀j3N ) P(n)(h "ukj)p(A Dh "u i,h "ukj) j h i < j h ij i h j h , Q(n)(h "u i)" + i h p( (h "u i) p( (h "ukj) i h j h k/1 j|Ni i.e.
G
H
1 P(n)(h "u j) j h Q(n)(h "u i)" + < i h p ( (h "u ) p( (h "u j) i j i i h j|Ni j h uh ,j|N ]P(h "u j,∀j3N ). (14) j h i On the other hand, when no additional unary relation measurements are available apart from the set used for generating the binary measurements, the support reduces to Q(n)(h "u i) i h 1 " + p ( (h "u ) i hi uhj,j|Ni
G
H
] < p(A Dh "u i,h "u j) P(h "u j,∀j3N ). (15) ij i h j h j h i j|Ni The probability updating rule (15) in this particular case will act as an ine$cient maximum value selection operator. Thus the updating process can be terminated after the "rst iteration, the maximum contextual a posteriori label probability selected and set to unity while the probability of all the other labels to zero. The support function (12) exhibits exponential complexity. In practice its use, depending on application, could be limited only to a contextual neighbourhood in the vicinity of the object being interpreted. Such a measure is appropriate for instance in the case of edge and line postprocessing where the objects to be labelled are pixel sites. A small neighbourhood, say a 3 by 3 window may be su$cient to provide the necessary contextual information. In any case by iteratively updating the
(12)
(16)
where Z(u i) denotes the number of dictionary entries h with label h set to u i. i h In many labelling problems neither of the above simpli"cations of the support function is appropriate. For instance, in correspondence matching tasks or object recognition all features of an object interact directly with each other. Moreover, without measurements, no labelling con"guration is a priori more likely than any other. Then it is reasonable to assume that the prior probability of a joint labelling con"guration can be expressed as P(h "u j,∀j3N )" < p( (h "u j). (17) j h i j h j|Ni Substituting Eq. (17) into Eq. (12) and noting that each factor in the product in the above expression depends on the label of only one other object apart from the object a under consideration, we can simplify the support comi putation as Q(n)(h "u ) i a " < + P(n)(h "u )p(A Dh "u ,h "u ). (18) j b ij i a j b i b j|N u |) It is interesting to note that through this simpli"cation the exponential complexity of the problem is eliminated. A further simpli"cation can be made under the assumption that the contextual support provided by the neighbouring objects is weak, i.e. it di!ers only marginally from some nominal value p represent0 ing indi!erence. In such situations the product evidence combination rule can be approximated by
a sum rule Q(n)(h "u ) i a " + + P(n)(h "u )[ p(A D h "u , h "u )!p ], j b ij i a j b 0 j|Ni ub|) (19) which resembles the original support function suggested in Ref. [1]. The updating rule in Eq. (19) represents a benevolent information fusion operator, in contrast to the severe fusion operator constituted by the product rule. The iteration scheme can be initialised by considering as P(0)(h "u i) the probabilities computed by using the i h unary attributes only, i.e. (20) P(0)(h "u i)"P(h "u i D v ,y ). i h i i i h We discuss this initialisation process in detail elsewhere [15]. The problem of estimating the binary relation distributions is addressed in Refs. [34,35]. The computational complexity of the iterative process can be reduced by pruning the binary relations taking into account auxiliary information about the problem being solved [36]. The probabilistic relaxation has been used successfully in a number of application domains. It has been applied to object recognition based on colour using a colour adjacency graph representation [37]. The technique has been used to solve the correspondence problem in stereo matching [17]. It has been found useful in shape recognition [38] and is applicable also to 3D object recognition [32]. It has been demonstrated to have the potential to establish an accurate registration between infrared image data obtained by an airborne sensor and a digital map for the purposes of autonomous navigation [17] and in vision-based docking [39]. In the following section we discuss the relationship of the probabilistic relaxation with the Hough transform.
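For concreteness, the weak-context support of Eq. (19) amounts to a double sum over neighbouring objects and their candidate labels. The sketch below is a minimal illustration under stated assumptions: the compatibility density p_A, the indifference value p0 and the neighbourhood structure are supplied by the caller, and the names are ours. It can be passed as the support function Q to the sweep sketched after Eq. (11).

```python
def sum_rule_support(i, omega_a, P, neighbours, labels, p_A, p0):
    """Eq. (19): additive ('benevolent') support for the hypothesis theta_i = omega_a.

    P[j][omega_b]               -> current probability P(theta_j = omega_b)
    p_A(i, j, omega_a, omega_b) -> p(A_ij | theta_i = omega_a, theta_j = omega_b)
    p0                          -> nominal compatibility value representing indifference
    """
    support = 0.0
    for j in neighbours[i]:
        for omega_b in labels:
            support += P[j][omega_b] * (p_A(i, j, omega_a, omega_b) - p0)
    return support

def initial_probabilities(objects, labels, unary_posterior):
    """Eq. (20): initialise P^(0) from the unary attributes only.

    unary_posterior(i, omega) -> P(theta_i = omega | v_i, y_i), assumed to be given.
    """
    return {i: {omega: unary_posterior(i, omega) for omega in labels} for i in objects}
```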
4. Generalised Hough transform

The generalised Hough transform (GHT) devised by Ballard [30] and applied to many problems [29,40] can be briefly described in terms of graph terminology as follows. An object model is represented in terms of a look-up table. An entry of the look-up table represents a binary relation between a virtual reference node and an object primitive node. The unary properties of the object node are used as an index into the look-up table. An accumulator array is set up over the domain of parameters which characterise the reference node and the array is initialised. During the recognition process the unary relations on a node are used to index into the precomputed object look-up table and the corresponding binary relations are used to calculate the parameters
of the reference node which would be consistent with the evidence furnished by the node. The corresponding cell of the accumulator has its counter incremented by one. When all the evidence is mapped into the parameter space via the look-up table, local peaks are searched for in the accumulator array. Each such peak of sufficient height indicates an instance of a hypothesised object.

To exemplify the process on a 2D shape-matching problem, an object shape, in its simplest form, is represented in terms of its boundary points. The tangent angle at a boundary point is used as the index into a look-up table. The corresponding entry in the look-up table would be the radial distance between the boundary point and a shape reference point, and the direction from the boundary point to the reference point, expressed as the angle measured with respect to the x-axis. As the representation is not rotation and scale invariant, if an instance of the hypothesised shape can be subject to a similarity transformation, the dimensionality of the parameter space and the evidence accumulation process would have to take this into account accordingly.

The original GHT scheme as described above is very inefficient in many respects:

• As the shape reference point has no unary relation properties, the description of the shape boundary points and the reference point is asymmetrical. In consequence, the binary relations between a shape boundary point and the shape reference are expressed in terms of a world coordinate system (e.g. shape coordinate system axes) rather than in a relative coordinate system. The practical consequence of this asymmetry is the need to cater for all the possible rotations of the shape by means of an additional parameter of the Hough space (over and above the translation and scale parameters).
• A pointwise representation of a shape boundary is demanding in terms of look-up table size and evidence accumulation.
• Any multidimensional parameter space requires a huge accumulator array associated with it. The storage space grows exponentially with the number of parameters characterising the unknown reference point.

These shortcomings motivated Lee et al. [31] to propose a method whereby an object is represented in terms of object primitives such as straight-line segments approximating the boundary. The object reference point is confined to be coincident with one of the object primitives which should be present in the image. Each image feature (object primitive) has a counter associated with it. Image evidence for a given hypothesis is accumulated by identifying, via the binary relations stored in the look-up table, the corresponding candidate reference node and
verifying whether this node exists in the image. Provided the reference node is contained in the image, the available image evidence should coherently vote for the object reference node. A vote exceeding a prespecified threshold would be indicative of an instance of the hypothesised object in the image. In order to make the approach resilient to occlusion, which could cause the object reference node to be missing from the image so that the proposed GHT process would fail to detect the instantiated object, one can use several reference points, each associated with a distinct feature of the model.

Such a representation overcomes all the above-listed disadvantages. It is more economic, since it is expressed in terms of object primitives rather than boundary points. As object primitives are used as reference points, the object description is fully symmetric, i.e. the same representation is employed both for object primitives and object reference points. A symmetrical representation facilitates the use of relative coordinate systems and this obviates the need to make provision for the object rotation parameter. Provided an object reference point is not occluded (it is contained in the image), its location, when identified, is known. Therefore, there is no need to set up an accumulator array to find its parameters. The problem of searching the accumulator array for local maxima in the voting function also disappears.

Suppose that an object is represented by a set $\Omega$ of $M$ primitives, $\Omega = \{\omega_k \mid k = 1, \ldots, M\}$, and each primitive is treated as a reference node. It is interesting to note that the Hough transform can be viewed to compute evidential support for the hypothesis that observed image primitive $a_i$, $i = 1, \ldots, N$ corresponds to a reference node $\omega$, i.e. $\theta_i = \omega$, as

$$H(\theta_i = \omega) = \sum_{j=1}^{N} \sum_{\omega_{\theta_j} \in \Omega} P(\theta_j = \omega_{\theta_j})\, q(A_{ij} \mid \theta_i = \omega, \theta_j = \omega_{\theta_j}), \qquad (21)$$

where $q(A_{ij} \mid \theta_i, \theta_j)$ represents the compatibility of the binary relation $A_{ij}$ of nodes $a_i$ and $a_j$, interpreted as $\theta_i = \omega$ and $\theta_j = \omega_{\theta_j}$, with the entry in the look-up table. The probability $P(\theta_j = \omega_{\theta_j})$ reflects the information content of the unary measurement associated with object $a_j$. If the measurement identifies precisely the appropriate entry in the look-up table, the probability value will be one. This corresponds exactly to the Generalised Hough mapping process described earlier. For a given primitive, the unary relation picks appropriate entries from the look-up table, and provided a binary relation registered in the table is satisfied, the primitive contributes a measure of supporting evidence for the hypothesis considered. When, from the point of view of indexing into the look-up table, the information content of the unary measurement is minimal, the probability will have a
form distribution, i.e. P(h "u j)"1/M. For a given j h primitive, this corresponds to stepping through all the possible transformations (e.g. rotations for line segments under a similarity transformation) of the unary relation that have to be carried out in the classical GHT. Through the unary relations, each candidate transformation will establish new correspondences between the model and image object primitives. Note however that there is a signi"cant di!erence between the accumulator based GHT and the modi"ed GHT. In the accumulator-based GHT all the possible transformations would contribute votes somewhere in the accumulator array and they could create nonnegligible peaks. For this reason, a separate accumulator plane is used for each rotation hypothesis. In the modi"ed GHT there is a "xed set of locations de"ned by the observed data. The only thing that matters is whether an observation supports any of these reference point locations. If a piece of data supports some other location not on the list, this fact will not be recorded and therefore the possibility of detecting ghost shape instances will be minimised. The set of observed image primitives de"nes a very limited set of hypothesis which have to be veri"ed using the voting process. Conventionally, the voting kernel would be a top hat function but recently it has been recognised that in order to ensure robustness to contamination the kernel shape should be smooth. It has been shown in Refs. [41}43] that a parabolic kernel smoothly approaching zero at the boundaries of a window de"ning the width of the feature error distribution has such properties but a number of other robust kernels have been suggested in the literature. Now the voting function in Eq. (21) has a close similarity with the sum updating formula (19). In principle, the two computational processes are the same. However, their implementation in practice would di!er. The Hough transform tends to look simultaneously for an instance of an object and its pose (transformation between the model and its instance in the data). Thus for every candidate transformation the probability P(h "u j) would pick a unique correspondence between j h nodes and formula (21) simply performs a hypothesis veri"cation. If the accumulated evidence is less than a prespeci"ed threshold level, the pose hypothesis is rejected and a next candidate is considered. The probabilistic relaxation scheme in Eq. (19), on the other hand, compounds all the pose hypothesis by allowing P(h "u j) to be distributed over all the possible j h correspondences. From the point of view of the Hough transform this process is based on the premise that all the incorrect pose hypothesis will contribute as supporting evidence in an incoherent way and therefore they will not obscure the true solution to the data interpretation problem. All the reference points (object primitives) are interpreted in parallel. This results in a redistribution of the probability mass over the candidate hypothesis, given
unary relations. This redistribution impacts on the interpretation in the next iteration of the voting process. Thus the parallel process of interpretation facilitates cooperative voting. At each iteration the distribution of the label probabilities sharpens and eventually the relaxation algorithm will be performing the verification process in exactly the same way as the Hough transform. Here the pose of the object is not acquired explicitly. However, it can easily be determined from the primitive correspondences established through object hypothesis verification.

It is pertinent to point out that the test statistic in the Hough transform involves a smooth kernel function which weights the errors. It is a redescending kernel which minimises the effect of any contaminating distribution on the accuracy of the pose estimate. In contrast, probabilistic relaxation involves an error distribution function. Interestingly, a typical error distribution such as a Gaussian is very similar in shape to a redescending parabolic kernel typically used by the HT.
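To make the voting computation concrete, the following fragment sketches Eq. (21) with a smooth redescending (parabolic) kernel standing in for the compatibility measure q. It is illustrative only: the model look-up table, the scalar binary-relation extractor and the kernel width are assumptions introduced here, and the function names are ours rather than part of the formulation above.

```python
def parabolic_kernel(err, width):
    """Redescending kernel: maximal at zero error, falling smoothly to zero at the window edge."""
    u = err / width
    return max(0.0, 1.0 - u * u)

def ght_vote(num_primitives, unary_prob, lookup, binary_relation, width):
    """Evidential support H(theta_i = omega) of Eq. (21) for every primitive i.

    unary_prob[j][omega_j]    -> P(theta_j = omega_j) derived from the unary measurement on a_j
    lookup[(omega, omega_j)]  -> expected (scalar) binary relation between reference node omega
                                 and a primitive labelled omega_j (the model look-up table)
    binary_relation(i, j)     -> observed scalar relation A_ij between image primitives i and j
    """
    reference_labels = {key[0] for key in lookup}
    H = {}
    for i in range(num_primitives):
        for omega in reference_labels:
            vote = 0.0
            for j in range(num_primitives):
                if j == i:
                    continue
                for omega_j, p in unary_prob[j].items():
                    expected = lookup.get((omega, omega_j))
                    if expected is None:
                        continue
                    err = abs(binary_relation(i, j) - expected)
                    vote += p * parabolic_kernel(err, width)
            H[(i, omega)] = vote
    return H
```

A peaked unary_prob reproduces the hypothesis-verification behaviour described for the modified GHT, while a uniform unary_prob (1/M per label) corresponds to stepping through all candidate transformations.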
By the same token, one could formulate the Hough transform as a voting process employing other voting functions. For instance, product function (18) could be used instead of Eq. (19). This would change the nature of the evidence accumulation process from a benevolent fusion to a severe fusion, as each inconsistent hypothesis relating to an object primitive would inhibit a global object interpretation in which the incorrectly identified primitive would participate. In the extreme case, this voting function would accept an object hypothesis only if it was 100% supported by the observed primitives. It is clear that in the presence of noise, over- or under-segmentation resulting in a failure to detect object primitives, or occlusion and clutter, a hypothesised object would never be detected. In practice this problem is contained by introducing the concept of a null label which acts as a wild card. The assignment of a wild card to an object primitive is associated with an indifferent vote which stops the node from inhibiting partially supported object interpretations. The advantage of the product voting function over the sum vote is a faster convergence of the relaxation process.

So far the discussion has focused on the relationship of probabilistic relaxation and the generalised Hough transform. The main reason for this is the fact that the GHT utilises binary relations, which are of utmost interest from the point of view of capturing the contextual information in probabilistic relaxation. However, there are relaxation processes which do not exploit relational information in measurements. They incorporate only the prior information on label configurations. Updating rule (14) is a typical example of such a case. It is possible to establish a relationship between this type of relaxation process and the standard Hough transform which is used for detecting parametric curves where only unary relations matter. Taking the line detection problem as an example, an edge pixel would be an object (line) primitive and the unary relations would be the edgel position and orientation. Starting from Eq. (14) it is possible to derive a test statistic typically used in soft kernel Hough transform voting. However, in this particular case it is far from clear how the relationship might benefit the Hough transform method.

In summary, the relationship established between probabilistic relaxation and the generalised Hough transform, especially in its modified form [31], has a number of practical implications. First of all, it suggests that the effect of an unknown object transformation can be dealt with by means of a parallel cooperative interpretation rather than by means of an exhaustive search through the parameter space of a transformation group. This has implications in terms of both the storage and computational requirements of the GHT. It also opens the possibility of using the GHT for detecting objects subject to more complex transformations such as affine or perspective [32]. The relationship also suggests the
possibility of using alternative voting functions which may speed up the object detection process. Vice versa, probabilistic relaxation has been shown to be an over-redundant interpretation process, with every single image primitive used as a potential object reference point. By eliminating some of the redundancy, both computational and storage requirements can be further reduced. This possibility will be investigated in the future in connection with practical applications.
5. Conclusions

In the paper, the recent developments in probabilistic relaxation, which is commonly used as a tool for contextual sensory data interpretation, were overviewed. The relationship of this technique with the Hough transform [29] was then established, focusing on the generalised Hough transform [30]. We showed that the label probability updating formula of the probabilistic relaxation process exploiting binary relations between object primitives, under the assumption that the primitives convey weak context, exhibits very close similarity to the voting function employed by the modified GHT [31]. We argued that the relationship could be exploited by importing the positive features of the respective techniques to benefit one another. For instance, the representational efficiency of the Hough transform could suggest how to reduce the computational complexity of probabilistic relaxation. Vice versa, in the case of the generalised Hough transform we demonstrated that the effect of an unknown object transformation could be dealt with by means of a parallel cooperative interpretation rather than by means of an exhaustive search through the parameter space of a transformation group. This has implications in terms of both the storage and computational requirements of the GHT. It also opens the possibility of using the GHT for detecting objects subject to more complex transformations such as affine or perspective [32]. The relationship also suggested the possibility of using alternative voting functions which may speed up the object detection process.
Acknowledgements This work was supported by the Science and Engineering Research Council, UK (GR/161320).
References

[1] A. Rosenfeld, R. Hummel, S. Zucker, Scene labeling by relaxation operations, IEEE Trans. Systems Man Cybernet. SMC-6 (1976) 420–433.
[2] D.L. Waltz, Understanding line drawings of scenes with shadows, in: P.H. Winston (Ed.), The Psychology of Computer Vision, McGraw-Hill, New York, 1975.
[3] S. Peleg, A. Rosenfeld, Determining compatibility coefficients for curve enhancement relaxation processes, IEEE Trans. Systems Man Cybernet. SMC-8 (1978) 548–555.
[4] S. Zucker, R. Hummel, A. Rosenfeld, An application of relaxation labelling to line and curve enhancement, IEEE Trans. Comput. C-26 (1977) 394–404.
[5] E.R. Hancock, J. Kittler, Edge labeling using dictionary-based relaxation, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-12 (1990) 165–181.
[6] E.R. Hancock, J. Kittler, Relaxation refinement of intensity ridges, Proceedings of 11th International Conference on Pattern Recognition, 1992, pp. 459–463.
[7] J. Kittler, J. Illingworth, A review of relaxation labelling algorithms, Image Vision Comput. 3 (1985) 206–216.
[8] R.M. Haralick, An interpretation of probabilistic relaxation, Comput. Vision, Graphics Image Process. 22 (1983) 388–395.
[9] R.L. Kirby, A product rule relaxation method, Comput. Graphics Image Process. 13 (1985) 158–189.
[10] S. Peleg, A new probabilistic relaxation scheme, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-2 (1980) 362–369.
[11] R. Hummel, S. Zucker, On the foundations of relaxation labeling process, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-5 (1983) 267–286.
[12] J. Kittler, E.R. Hancock, Combining evidence in probabilistic relaxation, Int. J. Pattern Recognition Artif. Intell. 3 (1989) 29–51.
[13] O. Faugeras, M. Berthod, Improving consistency and reducing ambiguity in stochastic labeling, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-3 (1981) 412–423.
[14] J. Kittler, P. Papachristou, M. Petrou, Combining evidence in dictionary based probabilistic relaxation, Proceedings of the Eighth Scandinavian Conference on Image Analysis, Tromso, 1993.
[15] J. Kittler, W.J. Christmas, M. Petrou, Probabilistic relaxation for matching problems in computer vision, Proceedings of the Fourth International Conference on Computer Vision, Berlin, 1993.
[16] J. Kittler, P. Papachristou, M. Petrou, Probabilistic relaxation in line postprocessing, Proceedings of the Workshop on Statistical Methods in Pattern Recognition, Tromso, 1993.
[17] W.J. Christmas, J. Kittler, M. Petrou, Structural matching in computer vision using probabilistic relaxation, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-17 (1995) 749–764.
[18] J. Besag, On the statistical analysis of dirty pictures, J. Roy. Statist. Soc. Series B 48 (1986) 259–302.
[19] A. Blake, The least disturbance principle and weak constraints, Pattern Recognition Lett. 1 (1983) 393–399.
[20] A. Blake, A. Zisserman, Visual Reconstruction, MIT Press, Cambridge, MA, 1987.
[21] C. Koch, J. Marroquin, A. Yuille, Analog neuronal networks in early vision, Proc. Natl. Acad. Sci. 83 (1986) 4263–4267.
[22] A. Witkin, D. Terzopoulos, M. Kass, Signal matching through scale space, Int. J. Comput. Vision (1987) 133–144.
[23] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-6 (1984) 721–741.
[24] S. Kirkpatrick, C.D. Gellatt, M.P. Vecchi, Optimization by simulated annealing, Science 220 (1983) 671–680.
[25] D. Geiger, F. Girosi, Parallel and deterministic algorithms from MRF's: surface reconstruction, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-13 (1991) 181–188.
[26] E.R. Hancock, J. Kittler, Discrete relaxation, Pattern Recognition 23 (1990) 711–733.
[27] R.C. Wilson, A.N. Evans, E.R. Hancock, Relational matching by discrete relaxation, Image Vision Comput. 13 (1995) 411–422.
[28] R.C. Wilson, E.R. Hancock, Relational matching with dynamic graph structures, Proceedings of the Fifth International Conference on Computer Vision, Cambridge, 1995, pp. 450–456.
[29] J. Illingworth, J. Kittler, A survey of the Hough Transform, Comput. Vision Graphics Image Process. 44 (1988) 87–116.
[30] D.H. Ballard, Generalising the Hough Transform to detect arbitrary shapes, Pattern Recognition 13 (1981) 111–122.
[31] H.M. Lee, J. Kittler, K.C. Wong, Generalised Hough Transform in object recognition, Proceedings of the 11th International Conference on Pattern Recognition, 1992, pp. 285–289.
[32] Z. Shao, J. Kittler, Shape recognition using invariant unary and binary relations, in: C. Arcelli, L.P. Cordella, G. Sanniti di Baja (Eds.), Visual Form, World Scientific, Singapore, 1997.
[33] P.A. Devijver, J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall, Englewood Cliffs, NJ, 1982.
[34] W.J. Christmas, J. Kittler, M. Petrou, Probabilistic feature-labelling schemes: modelling compatibility coefficient distributions, Image Vision Comput. 14 (1996) 617–625.
[35] M. Pelillo, M. Refice, Learning compatibility coefficients for relaxation labelling, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-16 (1994) 933–945.
[36] W.J. Christmas, J. Kittler, M. Petrou, Labelling 2-D geometric primitives using probabilistic relaxation: reducing the computational requirements, Electron. Lett. 32 (1996) 312–314.
[37] J. Matas, R. Marik, J. Kittler, Colour-based object recognition under spectrally non-uniform illumination, Image Vision Comput. 13 (1995) 663–669.
[38] W.J. Christmas, J. Kittler, M. Petrou, Location of objects in a cluttered scene using probabilistic relaxation, in: C. Arcelli, L.P. Cordella, G. Sanniti di Baja (Eds.), Aspects of Visual Form Processing, World Scientific, Singapore, 1994, pp. 119–128.
[39] W.J. Christmas, J. Kittler, M. Petrou, Error propagation for 2D-to-3D matching with application to underwater navigation, Proceedings of the Seventh British Machine Vision Conference, 1996, pp. 555–564.
[40] A. Califano, R. Mohan, Multidimensional indexing for recognising visual shapes, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1991, pp. 28–34.
[41] J. Princen, J. Illingworth, J. Kittler, Hypothesis testing: a framework for analysing and optimising Hough transform performance, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) 329–341.
[42] J. Illingworth, G. Jones, J. Kittler, M. Petrou, J. Princen, Robust statistical methods of 2D and 3D image description, Ann. Math. Artif. Intell. 10 (1994) 125–148.
[43] P.L. Palmer, J. Kittler, M. Petrou, A Hough transform algorithm with a 2D hypothesis testing kernel, Proceedings of the 11th IAPR International Conference on Pattern Recognition, 1992, pp. 276–279.
About the Author*JOSEF KITTLER graduated from the University of Cambridge in Electrical Engineering in 1971 where he also obtained his Ph.D. in Pattern Recognition in 1974 and the ScD degree in 1991. He joined the Department of Electronic and Electrical Engineering of Surrey University in 1986 where he is a Professor, in charge of the Centre for Vision, Speech and Signal Processing. He has worked on various theoretical aspects of Pattern Recognition and on many applications including automatic inspection, ECG diagnosis, remote sensing, robotics, speech recognition, and document processing. His current research interests include Pattern Recognition, Image Processing and Computer Vision. He has co-authored a book with the title "Pattern Recognition: A Statistical Approach" published by Prentice-Hall. He has published more than 300 papers. He is a member of the Editorial Boards of Pattern Recognition Journal, Image and Vision Computing, Pattern Recognition Letters, Pattern Recognition and Artificial Intelligence, and Machine Vision and Applications.
Pattern Recognition 33 (2000) 715–723
Toward global solution to MAP image restoration and segmentation: using common structure of local minima Stan Z. Li* School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore Received 15 March 1999
Abstract

In this paper, an iterative optimization algorithm, called the Comb algorithm, is presented for approximating the global solution to MAP image restoration and segmentation. The Comb derives new initial configurations based on the best local minima found so far and leads a local search towards the global minimum. Experimental comparisons show that the Comb produces solutions of quality comparable to simulated annealing. © 2000 Published by Elsevier Science Ltd. All rights reserved.

Keywords: Combinatorial optimization; Genetic algorithm; Image restoration; Image segmentation; Markov random fields (MRFs); Maximum a posteriori (MAP)
1. Introduction

Image restoration aims to recover a degraded image, and segmentation aims to partition an image into regions of similar image properties. Efficient restoration and segmentation are very important for numerous image analysis applications. Both problems can be posed generally as one of image estimation, where the underlying image or segmentation map is to be estimated from the degraded image. Due to various uncertainties, an optimal solution is sought. A popular optimality criterion is the maximum a posteriori (MAP) probability principle, in which both the prior distribution of the true image class and the conditional (likelihood) distribution of the data are taken into account. Contextual constraints, i.e. constraints between pixels, are important in image analysis. Markov random fields (MRFs), or equivalently Gibbs distributions, provide a convenient tool for modeling prior distributions of images which encode contextual constraints.
* Tel.: +65-790-4540; fax: +65-791-2687. E-mail address: [email protected] (S.Z. Li)
Maximizing the posterior is equivalent to minimizing the energy function in the corresponding Gibbs distribution. The MAP principle and MRFs together form the MAP–MRF framework [1,2]. Minimization methods are an important part of the energy minimization approach. When the pixels of the image to be recovered take discrete values, as is the case dealt with in this paper, the minimization is combinatorial. It is desirable to find the global minimum; however, no practical algorithm guarantees a global minimum, and the difficulty increases due to the contextual constraints used in MAP–MRF image estimation. Combinatorial optimization methods often used in the statistical image analysis literature include the iterated conditional modes (ICM) [3] and simulated annealing (SA) [1,4]. The deterministic ICM uses the steepest descent strategy to perform a local search. Although it quickly finds a local minimum, the solution quality depends much on the initialization; some initializations are better than others. An extension to steepest descent is the multi-start method: initialize a set of random configurations drawn from a uniform distribution, apply steepest descent independently to every configuration,
and choose, among all the resulting local minima, the one with the lowest energy value as the final solution. SA is a stochastic algorithm, as opposed to the deterministic ones above. It has been shown that SA with a sufficiently slow schedule finds a global solution with probability approaching one [1], but such a slow schedule is impractical in most applications, and therefore a quick annealing schedule is usually used in practice. There have been developments in population-based methods such as genetic algorithms (GAs) [5] in recent years. Unlike the above-mentioned methods, which operate on a single configuration, a population-based method maintains and operates on a population of individuals, i.e. a collection of configurations. Two operations are used to produce offspring: crossover and mutation. The resulting offspring update the population according to the survival-of-the-fittest principle. Heuristics can be incorporated into GAs to constitute hybrid GAs [5]; combining local search with a GA yields a hybrid GA also called a memetic algorithm [6,7]. Applications of GAs in the image and vision area have also appeared; see for example Refs. [8–10]. In this paper, we present a new random search method, called the Comb method, for combinatorial optimization. We assume that an energy function has been given, which in this paper is formulated based on MRF theory for image restoration and segmentation. The Comb method maintains a number of the best local minima found so far, as a population-based method does. It uses common structure of the local minima to infer the structure of the global minimum. In every iteration, it derives one or two new initial configurations based on the Common structure (common labels) of the Best local minima (hence "Comb"): if two local minima have the same label (pixel value) at a pixel location, the label is copied to the corresponding location in the new configuration; otherwise a label randomly chosen from either local minimum is assigned to it. The configuration thus derived contains about the same percentage of common labels as the two local minima (assuming the two have about the same percentage of common labels). But the derived configuration is no longer a local minimum, and thus further improvement is possible. The new local minimum then updates the existing best ones. This process is repeated until some termination condition is satisfied. The resulting Comb algorithm is equivalent to a GA hybridized with steepest descent, in which the Comb initialization works like a uniform crossover operator. There have been various interpretations of the crossover operation; the idea of encouraging common structure in the Comb initialization provides a new perspective for interpreting the crossover operation in GAs. Experimental results on both image restoration and segmentation are provided to compare the Comb method with ICM, HCF [11] and SA. The results show that
the Comb yields better-quality solutions than ICM and HCF, and solutions comparable to SA. The rest of the paper is organized as follows: Section 2 describes the Comb method for MAP–MRF image restoration and segmentation; Section 3 presents the experimental comparisons; conclusions are given in Section 4.
2. The Comb method for MAP–MRF image estimation

In this section, we define local minima of an energy function and present the Comb method for obtaining good local minima. Before doing so, we first describe an energy function formulated for MAP–MRF image restoration and segmentation.

2.1. MAP–MRF image restoration and segmentation

Let S = {1, ..., m} index the set of sites corresponding to image pixels and denote the underlying image as f = {f_1, f_2, ..., f_m} = {f_i | i ∈ S}. In our image estimation problem, every pixel can take on a discrete value in the label set L = {1, ..., M}, i.e. f_i ∈ L. Therefore, f is a configuration in the solution space F = L^m. The spatial relationship of the sites, each of which is indexed by a single number in S, is determined by a neighborhood system N = {N_i | i ∈ S}, where N_i is the set of sites neighboring i. A single site or a set of neighboring sites forms a clique, denoted by c. In this paper, only up to pair-site cliques defined on the 8-neighborhood system are considered. The type of the underlying image f can be blob-like regions or a texture pattern. Different types are due to different ways of interaction between pixels, i.e. due to different contextual interactions. Such contextual interactions can be modeled as MRFs or Gibbs distributions of the form P(f) = Z^{-1} e^{−Σ_{c∈C} V_c(f)}, where V_c(f) is the potential function for clique c, C is the set of all cliques, and Z is the normalizing constant. Among various MRFs, the multi-level logistic (MLL) model is a simple yet powerful mechanism for encoding a large class of spatial patterns such as textured or non-textured images. In MLL, the pair-site clique potentials take the form V_2(f_i, f_{i'}) = β_c if the sites of clique {i, i'} = c have the same label, and V_2(f_i, f_{i'}) = −β_c otherwise, where β_c is the parameter for type-c cliques; the single-site potentials are defined by V_1(f_i) = α_I, where α_I is the potential for label I = f_i. When the true pixel values are contaminated by independent and identically distributed (i.i.d.) Gaussian noise, the observed data, or the likelihood model, is d_i = f_i + e_i, where e_i ~ N(0, σ²) is zero-mean Gaussian noise with standard deviation σ. With these prior and likelihood models, the energy in the posterior
distribution P(f | d) ∝ e^{−E(f)} is

E(f) = Σ_{i∈S} (f_i − d_i)² / (2σ²) + Σ_{{i}∈C} V_1(f_i) + Σ_{{i,i'}∈C} V_2(f_i, f_{i'}).   (1)

The MAP estimate for the restoration or segmentation is defined as

f* = arg min_{f∈F} E(f).   (2)
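For concreteness, the following sketch (ours, not the author's code) evaluates the posterior energy of Eq. (1) for an MLL prior on the 8-neighborhood system; the array layout and the choice of one β parameter per pair-site clique type are illustrative assumptions.

```python
import numpy as np

def posterior_energy(f, d, alpha, beta, sigma):
    """Posterior energy E(f) of Eq. (1): Gaussian likelihood + MLL prior.

    f     : (H, W) integer label image with values in {0, ..., M-1}
    d     : (H, W) observed noisy image
    alpha : length-M array of single-site potentials alpha_I
    beta  : four pair-site parameters, one per clique type (right, down, diag, anti-diag)
    sigma : standard deviation of the i.i.d. Gaussian noise
    """
    f = np.asarray(f)
    d = np.asarray(d, dtype=float)
    e = np.sum((f - d) ** 2) / (2.0 * sigma ** 2)        # likelihood term
    e += np.asarray(alpha)[f].sum()                      # single-site potentials V_1

    # pair-site MLL potentials: V_2 = beta_c if the two labels agree, -beta_c otherwise
    shifts = [(0, 1), (1, 0), (1, 1), (1, -1)]           # 8-neighborhood, each pair counted once
    H, W = f.shape
    for (dy, dx), b in zip(shifts, beta):
        y0, y1 = max(0, -dy), H - max(0, dy)
        x0, x1 = max(0, -dx), W - max(0, dx)
        same = f[y0:y1, x0:x1] == f[y0 + dy:y1 + dy, x0 + dx:x1 + dx]
        e += np.where(same, b, -b).sum()
    return e
```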
The above minimization is combinatorial because the minimization of E(f) is performed in the discrete space F. In this paper, we assume that the objective function, i.e. the energy function E(f), has been fully defined, which means that the MRF and noise parameters α, β and σ are known; we concentrate on the problem of minimizing E(f).

2.2. Local minima

The minimal solution defined in Eq. (2) is meant to be the global one, i.e. the one with the lowest possible energy value. Finding the global solution of a combinatorial optimization problem is usually intractable, and therefore a local solution, hopefully a good one, is found instead. A local minimum f is defined with respect to a neighborhood N(f), where N(f) is composed of all the configurations neighboring f. We may define the following "k-neighborhood of a configuration f*":

N^k(f*) = {x ∈ F | x differs from f* by at most k labels}.   (3)

For example, assuming that L = {a, b, c} is the set of allowable labels for every f_i (i = 1, ..., 5), then {a, b, a, a, a} is a 1-neighbor of {a, a, a, a, a} and {b, a, a, a, c} is a 2-neighbor of it. Note that {a, b, a, a, a} is also a 2-neighbor of {a, a, a, a, a} (due to the phrase "at most" in the definition), whereas a 2-neighbor need not be a 1-neighbor. A point f* is a local minimum of E with respect to N if

E(f*) < E(f)   ∀ f ∈ N(f*).   (4)
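The membership test behind Eq. (3) can be written down directly; the small helper below only illustrates the definition and the worked example above, and is not part of the algorithm.

```python
def differs_by_at_most_k(x, f_star, k):
    """True if configuration x lies in the k-neighborhood N^k(f*) of Eq. (3),
    i.e. x differs from f* in at most k sites."""
    return sum(a != b for a, b in zip(x, f_star)) <= k

# the worked example from the text, with labels {'a', 'b', 'c'}
assert differs_by_at_most_k("abaaa", "aaaaa", 1)        # a 1-neighbor ...
assert differs_by_at_most_k("abaaa", "aaaaa", 2)        # ... is also a 2-neighbor
assert differs_by_at_most_k("baaac", "aaaaa", 2)        # a 2-neighbor ...
assert not differs_by_at_most_k("baaac", "aaaaa", 1)    # ... need not be a 1-neighbor
```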
A local minimum is also the global minimum if the neighborhood is defined to include all the other configurations, N(f*) = F − {f*}. In practice, local solutions are often defined with respect to the 1-neighborhood N¹, because finding such a local solution is computationally more economical than using a higher k-neighborhood. Steepest descent is a fast algorithm for finding a local solution. Let E(f_i | f − {f_i}) be the local energy of label f_i given all the other labels. The algorithm iteratively minimizes E(f_i | f − {f_i}) for each i ∈ S; a sequential version of such an algorithm is given in Fig. 1.
Fig. 1. A steepest descent algorithm for N1.
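A minimal rendering of the sequential steepest-descent loop of Fig. 1, written here as an illustrative sketch; the interface of the local-energy function and the sweep-based termination test are our own assumptions.

```python
import numpy as np

def steepest_descent(f0, local_energy, labels, max_sweeps=100):
    """Sequential steepest descent over the 1-neighborhood N^1 (cf. Fig. 1).

    f0           : initial configuration, a 1-D array of labels
    local_energy : function (f, i, label) -> E(label | f - {f_i}), the local energy
                   obtained when site i takes `label` while all other sites keep f
    labels       : the label set L
    Returns a local minimum with respect to N^1.
    """
    f = np.array(f0)
    for _ in range(max_sweeps):
        changed = False
        for i in range(f.size):                      # visit the sites sequentially
            best = min(labels, key=lambda lab: local_energy(f, i, lab))
            if best != f[i]:
                f[i] = best                          # greedy update of site i
                changed = True
        if not changed:                              # no site can be improved:
            break                                    # f is a local minimum w.r.t. N^1
    return f
```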
The "coding method" [12] may be incorporated into ICM for a parallel implementation. Although fast, steepest descent has two major drawbacks. First, after reaching a local minimum, steepest descent can no longer improve it. This is unlike SA, in which an occasional jump from a lower-energy configuration to a higher one is allowed, so that improvements remain possible after a local minimum is reached. Second, the local solution found by steepest descent depends very much on the initial configuration. A good initialization scheme is always desired. The question is how to derive a good initial configuration. The Comb provides a solution.

2.3. The Comb method

The Comb method maintains a number N of best local minima found so far, denoted F = {f^[1], ..., f^[N]}, as a population-based method does. In every iteration, it derives a new initial configuration from F and performs steepest descent from it. If the local minimum found is better than an existing one in F, it replaces it. Ideally, we desire that all local minima in F converge towards the global minimum f^[global], in which case we must have

f_i^[n] = f_i^[global],   1 ≤ n ≤ N,   (5)
for all i ∈ S. We call f_i^[global] the minimal label at i. To achieve Eq. (5), all the labels at i, {f_i^[n] | ∀n}, should finally converge to the minimal label f_i^[global]. The Comb is performed towards this objective. The following heuristic is the basis for deriving new initial configurations: although f^[1], ..., f^[N] are local minima, they share some structure with the global minimum f^[global]. More specifically, some local minima f^[n] carry the minimal label f_i^[n] = f_i^[global] at some sites i ∈ S. Fig. 2 shows the (approximate) global minimum for a MAP–MRF restoration problem and some local minima found using the multi-start method with initially random configurations.
Fig. 2. The global minimum (upper-left) and five local minima. The local minima share some structure with the global minimum.
Table 1
Percentage (rounded to integers) of the sites (pixels) i ∈ S at which at least k local minima f^[n] have the same label f_i^[n] = f_i^[global] as the global minimum f^[global]

k    0    1    2    3    4    5    6    7    8    9    10
%   100   98   95   87   75   60   43   28   16    7    2
A statistic over N = 10 local minima was computed to see how many minimal labels they contain. Table 1 shows this statistic in terms of the percentage of sites (pixels) i ∈ S at which at least k local minima f^[n] carry the minimal label f_i^[n] = f_i^[global]. The Comb initialization aims to derive configurations having a substantial number of minimal labels, so as to improve F towards the objective of Eq. (5). Although a configuration with a larger number of minimal labels does not necessarily have a lower energy value, we hope that it provides a good basis to start with, i.e. that it can serve as a good initial configuration. The Comb algorithm is described in Fig. 3. The initialization at the beginning of the algorithm is done according to a uniform distribution, as in the multi-start method. This is followed by iterations of four steps. First, two local minima in F, f^[a] and f^[b] (a ≠ b), are randomly selected according to a uniform distribution. Second, a new initial configuration f^[0] is derived from f^[a] and f^[b] using the standard Comb initialization, which will be explained shortly. Then, steepest descent is applied to f^[0] to produce a local minimum f*.
Fig. 3. The Comb algorithm.
Finally, the set F is updated by f*: if E(f*) < max_{f∈F} E(f), then the configuration arg max{E(f) | f ∈ F}, i.e. the one with the highest energy value (higher than E(f*)), is replaced by f*. The termination condition may be that all configurations in F are the same, or that a certain number of local minima have been generated. The algorithm returns the best local minimum in F, i.e. the one having the lowest energy. The central part of the Comb method is the derivation of new initial configurations. The Comb aims to
derive f^[0] in such a way that f^[0] contains as many minimal labels as possible. Because the minimal labels are not known a priori, the Comb attempts to use common structure, or common labels, of f^[a] and f^[b] to infer the minimal labels. We say that f_i^comm is a common label of f^[a] and f^[b] if f_i^comm = f_i^[a] = f_i^[b]. The Comb makes the hypothesis that f_i^comm is a minimal label whenever f_i^[a] = f_i^[b]. The Comb initialization schemes are as follows:

(i) The basic Comb initialization. For each i ∈ S, if f_i^[a] and f_i^[b] are identical, then set f_i^[0] = f_i^[a]; otherwise set f_i^[0] = rand(L), a label randomly drawn from L. The use of the common label in the initial configuration encourages the enlargement of common structure in the local minimum to be found subsequently.

(ii) The standard Comb initialization. The basic Comb initialization is accepted with probability 1 − q, where 0 < q < 1. The probabilistic acceptance of common labels diversifies the search and prevents F from converging to a local minimum too soon. The standard Comb initialization is shown in the upper part of Fig. 3, where rand[0, 1] stands for a uniformly distributed random number in [0, 1].

How many minimal labels, then, are there in f^[0] as a result of copying common labels, i.e. as a result of inferring minimal labels from common labels? In supervised tests where the (near) global minimum is known, we find that the percentage of minimal labels in f^[0] is usually only slightly (about 1.0–2.0 points) lower than those in f^[a] and f^[b]. That is, the number of minimal labels retained in f^[0] is about the same as in f^[a] and f^[b]. Given this, and given that f^[0] is no longer a local minimum as f^[a] and f^[b] are, there is room to improve f^[0] by a subsequent local minimization. This makes it possible to yield a better local minimum from f^[0].

There are two parameters in the Comb algorithm, N and q. The solution quality increases, i.e. the minimized energy value decreases, as the size N of F increases from 2 to 10, but remains about the same (probabilistically) for larger values of N; a larger N also leads to more computational load. Therefore, we choose N = 10. Empirically, when q = 0 the algorithm converges sooner or later to a unique configuration, and choosing a smaller N makes such convergence quicker; but q = 0 often gives a premature solution. The value q = 0.01 is empirically a good choice.

The Comb algorithm corresponds to a hybrid GA as described in Fig. 4. The standard Comb initialization is effectively the same as a crossover operation followed by a mutation operation, the major and minor operations in genetic algorithms (GA) [5]. More exactly,
- the basic Comb corresponds to uniform crossover, and
- the probabilistic acceptance in the standard Comb corresponds to mutation.
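A compact sketch of the Comb initialization schemes (i)–(ii) and of the main loop of Fig. 3, as we read them from the text; the plain-Python representation and the helper signatures are assumptions, not the author's implementation.

```python
import random

def comb_initialization(fa, fb, labels, q=0.01):
    """Standard Comb initialization: copy common labels of two local minima
    with probability 1-q, otherwise draw a random label from L."""
    f0 = []
    for la, lb in zip(fa, fb):
        if la == lb and random.random() > q:     # keep the common (hypothesised minimal) label
            f0.append(la)
        else:
            f0.append(random.choice(labels))     # diversify at non-common or rejected sites
    return f0

def comb(energy, local_search, labels, size, pop_size=10, q=0.01, iters=10000):
    """Comb main loop: maintain the pop_size best local minima found so far,
    derive a new initial configuration from two of them, improve it by local
    search, and use the result to update the population."""
    # start from random configurations, as in the multi-start method
    pop = [list(local_search([random.choice(labels) for _ in range(size)]))
           for _ in range(pop_size)]
    for _ in range(iters):
        fa, fb = random.sample(pop, 2)                 # two local minima picked uniformly
        f_new = list(local_search(comb_initialization(fa, fb, labels, q)))
        worst = max(pop, key=energy)
        if energy(f_new) < energy(worst):              # replace the highest-energy member
            pop[pop.index(worst)] = f_new
        if all(p == pop[0] for p in pop):              # all configurations identical
            break
    return min(pop, key=energy)
```

Here `local_search` is a steepest-descent routine such as the one sketched after Fig. 1, and `energy` is the posterior energy E(f) of Eq. (1).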
In GA, two offspring, f^[01] and f^[02], are produced as the result of crossover. In the uniform crossover, either of the following two settings is accepted with equal probability: (i) f_i^[01] = f_i^[a] and f_i^[02] = f_i^[b]; (ii) f_i^[01] = f_i^[b] and f_i^[02] = f_i^[a]. So, if f_i^[a] = f_i^[b], we must have f_i^[01] = f_i^[02] = f_i^[a], just as in the Comb initialization. This encourages common labels, because common labels are copied to the new initial configurations, whereas non-common labels are subject to swap. The above discussion concerns the uniform crossover; it should be noted that even the simplest one-point crossover also works in a way that encourages common labels. This suggests that the essence of both the Comb and the GA is captured by the use of common structure of local minima. This is supported by the fact that the original Comb and the GA-like Comb yield comparable results: in the GA-like Comb algorithm (Fig. 4), when f_i^[a] ≠ f_i^[b], f_i^[01] and f_i^[02] inherit the values of f_i^[a] and f_i^[b], as with a crossover operator. However, setting f_i^[01] and f_i^[02] to a random label rand(L), i.e. not necessarily inheriting f_i^[a] and f_i^[b], leads to comparable results, as long as the common labels are copied to f^[0] when f_i^[a] = f_i^[b]. Moreover, whether one initial configuration f^[0] or two initial configurations f^[01] and f^[02] are derived does not matter; both schemes yield comparable results. In summary, the Comb and the GA-like Comb produce comparable results; this suggests that retaining common labels is important, and it provides an interpretation of the crossover operation in GA. The Comb is also better than the multi-start method: running a steepest descent algorithm a number of times using the Comb initialization yields a better solution than running it the same number of times using the independent initialization of multi-start. The Comb is much more efficient at descending to good local minima because it makes use of the best local minima found so far.
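For comparison, a sketch of the uniform crossover plus mutation step of the GA-like Comb (cf. Fig. 4); the names and the per-site mutation rate q are illustrative assumptions.

```python
import random

def uniform_crossover(fa, fb, labels, q=0.01):
    """Uniform crossover of two parent configurations followed by a light mutation
    with rate q; common labels are always copied to both offspring."""
    c1, c2 = [], []
    for la, lb in zip(fa, fb):
        if random.random() < 0.5:          # either inheritance pattern, with equal probability
            x, y = la, lb
        else:
            x, y = lb, la
        # mutation: occasionally replace a label by a random one to diversify the search
        c1.append(random.choice(labels) if random.random() < q else x)
        c2.append(random.choice(labels) if random.random() < q else y)
    return c1, c2
```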
3. Experimental results

In the following, we present two experiments, one for MAP–MRF image restoration and the other for segmentation, to compare the performance of the following algorithms: (1) the Comb of this paper; (2) ICM [3]; (3) HCF [11] (the parallel version); (4) SA with the Metropolis sampler [4]. For the Comb, the parameters are N = 10 and q = 0.01. The implementation of SA is based on a procedure given in Ref. [13]. The schedule for SA is set to T^(t+1) ← 0.999 T^(t) with T^(0) = 10⁴. The initial configuration for ICM, HCF and SA is taken as the ML estimate, whereas the configurations in F for the Comb are entirely random. The termination condition for the Comb is that all configurations in F are the same or that 10 000 new local minima have been generated.
Fig. 4. A GA-like Comb algorithm.
Fig. 5. Restoration of image 1. (a) Original clean image. (b) Observed noisy image (input data). (c) Comb solution. (d) ICM solution. (e) HCF solution. (f) SA solution.
con"gurations in F are the same or that a number of 10 000 new local minima have been generated. The "rst set of experiments is for MAP}MRF restoration performed on three synthetic images of M"4 gray levels, shown in Figs. 5}7. The original have the label set L"M1, 2, 3, 4N, and the pixel gray values also in M1, 2, 3, 4N. Table 2 gives the clique potential parameters a and b ,2, b for generating the three types of I 1 4 textures and the standard deviation p of the Gaussian noise.
The second experiment compares the algorithms in performing MAP–MRF segmentation of the 256×240 Lenna image into a tri-level segmentation map. The results are illustrated in Fig. 8. The input image is the original Lenna image corrupted by i.i.d. Gaussian noise with standard deviation 10. The observation model is assumed to be a Gaussian distribution superimposed on the mean values 40, 125 and 200 for the three-level segmentation. An isotropic MRF prior is used, with the four β parameters being (−1, −1, −1, −1).
Fig. 6. Restoration of image 2. Legends same as in Fig. 5.
Fig. 7. Restoration of image 3. Legends same as Fig. 5.
Table 2
The MRF parameters (α and β) and the noise parameter σ used to generate the three images

Image    σ    α_I    β_1    β_2    β_3    β_4
No. 1    1     0     −1     −1     −1      1
No. 2    1     0     −2     −2      1      1
No. 3    1     0      1      1      1      1
Table 3 compares the quality of the restoration and segmentation solutions in terms of the minimized energy values. We can see that the Comb outperforms ICM and HCF and is comparable to SA. A subjective evaluation of the resulting images agrees with this objective numerical comparison. The quality of the Comb solutions is generally also better than that produced by a continuous method based on the augmented Lagrangian developed previously [14]. The Comb, as a random search method, needs many iterations to converge, the number increasing as q decreases. All the Comb solutions with q = 0.01 are obtained when the limit of 10 000 generated local minima is reached. This is about 1000 times more than the fast-converging ICM and HCF. Nonetheless, the Comb takes about 1/20 of the computational effort needed by SA.
Table 3
The minimized energy values for the restoration of images 1–3 and the segmentation of the Lenna image

Image      Comb        ICM         HCF         SA
No. 1    −12 057    −10 003    −10 269    −11 988
No. 2    −10 944     −8 675     −9 650    −11 396
No. 3    −27 511    −25 881    −26 629    −27 526
Lenna   −175 647   −171 806   −167 167   −173 301
The Comb algorithm does not rely on the quality of its initial configurations to achieve better solutions: the maximum likelihood estimate can lead to a better solution for algorithms that operate on a single configuration, such as ICM, HCF and SA, but not necessarily for the Comb.
4. Conclusions

The Comb attempts to derive good initial configurations from the best local minima found so far in order to achieve better solutions. To do so, it uses common structure of the local minima to infer label values of the global minimum. An initial configuration thus derived has about the same number of minimal labels as the two local minima from which it is derived. However, it is no longer a local minimum, and thus its quality can be improved by a subsequent local minimization.
Fig. 8. Segmentation of the Lenna image. (a) The noisy Lenna image. (b) ML con"guration. (c) Comb solution. (d) ICM solution. (e) HCF solution. (f) SA solution.
This makes it possible to yield a better local minimum and thus to improve the solution quality step by step. The comparisons show that the Comb produces better results than ICM and HCF, though at higher computational cost, and results comparable to SA at lower cost. This suggests that the Comb can provide a good alternative to the well-known global minimizer SA. Further, the Comb algorithm is applicable, in principle, to many optimization problems in vision and pattern recognition.
Acknowledgements This work was supported by NTU AcRF projects RG 43/95 and RG 51/97.
References

[1] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (6) (1984) 721–741.
[2] S.Z. Li, Markov Random Field Modeling in Computer Vision, Springer, New York, 1995.
[3] J. Besag, On the statistical analysis of dirty pictures (with discussions), J. Roy. Stat. Soc. Ser. B 48 (1986) 259–302.
[4] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing, Science 220 (1983) 671–680.
[5] D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[6] P. Moscato, On evolution, search, optimization, genetic algorithms and martial arts: towards memetic algorithms, C3P Report 826, Caltech Concurrent Computation Program, 1989.
[7] N.J. Radcliffe, P.D. Surry, Formal memetic algorithms, in: T. Fogarty (Ed.), Evolutionary Computing: AISB Workshop, Lecture Notes in Computer Science, Springer, Berlin, 1994, pp. 1–14.
[8] B. Bhanu, S. Lee, J. Ming, Adaptive image segmentation using a genetic algorithm, IEEE Trans. Systems Man Cybernet. 25 (1995) 1543–1567.
[9] Y. Huang, K. Palaniappan, X. Zhuang, J.E. Cavanaugh, Optic flow field segmentation and motion estimation using a robust genetic partitioning algorithm, IEEE Trans. Pattern Anal. Mach. Intell. 17 (1995) 1177–1190.
[10] G. Roth, M.D. Levine, Geometric primitive extraction using a genetic algorithm, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) 901–905.
[11] P.B. Chou, P.R. Cooper, M.J. Swain, C.M. Brown, L.E. Wixson, Probabilistic network inference for cooperative high and low level vision, in: R. Chellappa, A. Jain (Eds.), Markov Random Fields: Theory and Applications, Academic Press, Boston, 1993, pp. 211–243.
[12] J. Besag, Spatial interaction and the statistical analysis of lattice systems (with discussions), J. Roy. Stat. Soc. Ser. B 36 (1974) 192–236.
[13] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in C, 2nd Edition, Cambridge University Press, Cambridge, 1988.
[14] S.Z. Li, MAP image restoration and segmentation by constrained optimization, IEEE Trans. Image Process. 7 (12) (1998) 1780–1785.
About the Author — STAN Z. LI received the B.Eng degree from Hunan University, China, in 1982, the M.Eng degree from the National University of Defense Technology, China, in 1985, and the Ph.D. degree from the University of Surrey, UK, in 1991, all in EEE. He is currently a senior lecturer at Nanyang Technological University, Singapore. His research interests include computer vision, pattern recognition, image processing, multimedia information retrieval and optimization methods.
Pattern Recognition 33 (2000) 725–740
A region-level motion-based graph representation and labeling for tracking a spatial image partition
Marc Gelgon (a,b,*), Patrick Bouthemy (a)
(a) IRISA/INRIA, Campus universitaire de Beaulieu, 35042 Rennes Cedex, France
(b) Nokia Research Center, Tampere, Finland
Received 15 March 1999
Abstract

This paper addresses two image sequence analysis issues under a common framework. These tasks are, first, motion-based segmentation and, second, updating and tracking over time of a spatial partition of an image. By spatial partition, we mean a partition whose constituent regions satisfy an intensity-, color- or texture-based homogeneity criterion. Several issues in dynamic scene analysis or in image sequence coding can motivate this kind of development. A general-purpose methodology involving a region-level motion-based graph representation of the partition is presented. This graph is built from the topology of the spatial segmentation map. A statistical motion-based labeling of its nodes is carried out and formalized within a Markovian approach. Groups of spatial regions with consistent motion are identified using this labeling framework, leading to a motion-based segmentation that is both useful in itself and for propagating the spatial partition over time. Results on synthetic and real-world image sequences are shown, and provide a validation of the proposed approach. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Image sequence analysis; Motion-based segmentation; Partition tracking; Markov random fields
1. Introduction and related work

Image segmentation, regardless of the segmentation criterion, is among the most fundamental tasks faced in image analysis. The problem of performing this segmentation on a whole set of successive frames is also frequently met. In this paper, we tackle two problems under a common framework. First, we aim at motion-based segmentation of an image sequence [1]. Second, we address the problem of updating and tracking spatial image partitions over time [2]. By image partition, we mean a set of disjoint regions, the union of which forms the
* This study is supported in part by DGA (Délégation Générale pour l'Armement – French Ministry of Defense) through a student grant. Corresponding author: Nokia Research Centre, Tampere, Finland. E-mail addresses: [email protected] (M. Gelgon), [email protected] (P. Bouthemy)
image. This partition can result from an intensity-based, color-based or texture-based segmentation. Studies in motion analysis have shown that motion-based segmentation benefits from including not only motion but also the intensity cue, particularly to retrieve region boundaries accurately. Hence, knowledge of the spatial partition can improve the reliability of the motion-based segmentation. Conversely, if the motion-based partition of an image is recovered and properly exploited, temporal tracking of a spatial partition of this image can be done more efficiently than if spatial regions were tracked individually. As a consequence of these two remarks, we propose a scheme that builds both a spatial-based and a motion-based partition of an image, and that tracks both of them over time. Depending on the application goal of concern, the output partition relevant for the user may be either the spatial partition or the motion-based one. Such a scheme requires the construction of a relevant structure exploiting the motion information which relates two successive image partitions. This paper mainly
focuses on this stage, which consists of a region-level motion-based graph representation and labeling. To this end, region-level contextual information has to be formalized and exploited. The introduction of a region-level motion-based valued graph is proposed. The application presented here is concerned with texture-based segmentation in infra-red image sequences as well as grey-level and color segmentation in the visible domain, with motion as the inter-frame transformation. Besides, such an updating-tracking scheme can facilitate the determination itself of the spatial partition map at each instant, in terms of quality of results and saving of computational time, by providing an appropriate prediction step. Several issues can motivate this kind of development. For instance, in a surveillance task, extracting small moving objects is easier within spatially homogeneous tracked regions than directly from the image; in this case, the sequence of spatial partitions is the relevant output. In other cases, it is the motion-based regions that form meaningful entities in terms of content understanding. The results we present include these two cases. Besides interpretation purposes, object-based coding applications can benefit from such an achievement, and the expanding field of content-based video indexing may also be aimed at, as will be mentioned in the conclusion. We now review some previous approaches concerned with segmentation taking both intensity and motion into account, with region grouping, or with region tracking. Taking both intensity and motion information into account in segmentation procedures is, among other reasons, motivated by the ability of intensity cues to locate boundaries accurately and to cope with image areas with poor intensity gradient information; these are often shortcomings of segmentation exploiting only motion information. On the other hand, motion-based segmentation generally leads to a semantic description of the image, involving fewer and often more significant regions than a spatial segmentation. In several approaches, intensity is involved at pixel level through a spatial segmentation, providing a set of regions that are handled by a region-level motion-based scheme. In [3–6], a spatial segmentation stage is followed by a motion-based region-merging phase. In [3], regions are grouped by iterating estimation of the dominant motion and grouping of the regions that conform to that motion, while in [4] a k-medoid clustering algorithm is used. Other methods involve, in contrast, motion-based intermediate regions. A variety of methods have been proposed in this direction, generally carrying out region grouping also on a motion-based criterion. A k-means clustering algorithm in motion parameter space was used in [7]. With clustering methods in particular, determination of the number of clusters is a key issue; this problem was addressed in [8] with an MDL-based approach. An explicit region-level
merging procedure has been embedded in a Markovian framework in [9,10]. Differing from two-level approaches, [11] proposes to perform spatial segmentation as an iterative process operating with progressive graphs. Given the partition at the current iteration, the adjacency graph is built and labeled on a spatial criterion, using stochastic dynamics and exploiting the desired connectivity of regions to reduce the space to be searched. The labeled graph provides an initial partition for the next iteration. By this means, the final partition displays both accurate boundaries and a reduced number of regions. Another possibility is to introduce spatial and motion information both at pixel level. In [12], both types of constraints, along with geometrical ones, are included in the same energy function in a Markovian–Bayesian scheme. The handling of image segmentation maps over time has already received some attention. Short-term approaches were chosen for motion-based segmentation in [13], and in [5] for grey-level segmentation. A longer-term view is introduced in [14], where temporal integration of frames is achieved by recursive registration of successive images; however, no formal tracking stage was introduced in these works. Occlusions and crossings have been coped with in [15], but only for a small number of regions, and by tracking them independently. A 2D mesh model of an object of interest was employed in [16] to track its motion, intensity and boundary. The hierarchical region segmentation schemes discussed above are often applied to the segmentation of two frames only [4,8]. When applied to a whole sequence, in [3,5] only the first pair of frames is concerned with the layer of intermediate regions. Whereas this can be justified since in [3,5] it is the motion-based partition which is the output partition of interest, the problem stated here requires an intermediate spatial region partition for all images. Besides, some of these methods are not incremental, i.e. they cannot straightforwardly benefit, at t+1, from the segmentation map obtained at t. The method presented here yields a contribution in this direction, because it is incremental at both levels of the hierarchy (spatial and motion partitions). This paper is organized as follows. Section 2 introduces the proposed tracking algorithm and its advantages over an elementary tracking method. Several alternative spatial segmentation schemes are described in Section 3. Section 4 details the central stage of the approach, i.e. how a motion-based graph representation of the image can be built. The use of this graph for tracking is the scope of Section 5. Results obtained both on a patch of natural textures undergoing synthetic motions and on various real-world image sequences are presented in Section 6. Finally, Section 7 contains concluding remarks.
2. Principles of the proposed approach

We present in this section the main features of our original method for motion-based segmentation and partition tracking; Fig. 1 gives an overview of its general framework. A spatial partition for the first frame of the image sequence is first required. It can be part of the given input data, or built from the statistical Markovian segmentation we propose here, which relies on texture, grey-level or color as possible alternative criteria. A spatial region graph is then derived from the spatial image partition P (region adjacency graph, Fig. 1b). The nodes of this graph are then considered as sites of a region-level Markov random field (MRF), and are assigned motion labels using a statistical regularization scheme. A 2D motion model is estimated within each region, and the optimal motion label configuration is sought using an energy minimization approach, such that regions undergoing similar (resp. different) motions are given the same (resp. different) labels (Fig. 1c). This label map is considered in turn as a region-level, motion-based partition P_m, from which a second graph (the tracking graph) is derived. This graph, valued by motion information measured on the resulting regions (Fig. 1e), is the one used for tracking. Mechanisms are included for building predicted label maps at both pixel and region level; these predictions also exploit temporal filtering of the measured motion parameters. The predicted label maps at both pixel and region level provide initial label configurations at their respective levels which are close to the optimal configurations, ensuring a fast energy minimization step.
Fig. 2. Diagram of an elementary short-term tracking scheme, relying on alternate partition prediction and updating phases.
The comparison between our algorithm and an elementary (or short-term) tracking scheme (sketched in Fig. 2) highlights its advantages. The latter, as the one we defined in Ref. [13], relies on alternate image partition prediction and updating phases. Given a partition determined at time t, motion is estimated within each region; a predicted segmentation map is then projected at t+1 using the estimated affine motion model of each region, and then updated. With this simple short-term tracking scheme, the initial label configuration used for the energy minimization process delivering the image segmentation at time t+1, supplied by means of the described prediction technique, is generally close to the optimal one.
Fig. 1. The various structures involved in the method and how they are related to one another.
It enables the deterministic relaxation step, performing the minimization of the considered energy function, to converge quickly to an adequate local minimum, i.e. to update the partition in a satisfactory way. Moreover, provided no special event (such as an occlusion or a crossing) occurs, the label associated with each region remains identical from t to t+1 in a straightforward manner. Yet, occlusions are not handled, since such a scheme has no "memory": if a region disappears and then reappears, it will be assigned a new, different label. Furthermore, this method does not involve region-level contextual information; this causes the algorithm to under-perform, for instance, for a region whose motion cannot be accurately measured, or in the case of over-segmentation. In addition, the global evolution of the partition structure is not accounted for. While keeping the advantages of the simple tracking scheme, our approach brings new benefits. Firstly, in poorly textured areas, insufficient intensity gradient information is available for the differential motion estimation method to supply accurate motion estimates. Let us recall that this was not such a critical issue in Ref. [13], since the pixel-level segmentation criterion was not texture, grey-level or color but directly motion. In this case as in general, if several regions can be jointly considered for motion estimation, they are likely to provide more intensity gradient information, which helps deliver more accurate motion estimates; the use of polynomial motion models of higher degree than affine models may also become feasible. Secondly, as far as long-term tracking is concerned, involving region representation, recursive filtering and an explicit, formalized temporal evolution model, the tracking graph structure is much simpler than the spatial region graph, while carrying all the useful information. Another important feature of our method is the following: in contrast with approaches that operate a progressive and irreversible simplification of the partition topology through merging, the labeling approach presented here keeps track of the spatial regions composing a motion-based grouping, so that nodes that are identically labeled at a given iteration or at a given instant can subsequently be labeled differently along the energy minimization. Also, region-level contextual information can be taken into account through a contribution to the energy function. The sections to come present the different stages of the proposed scheme. This paper does not address thorough long-term tracking of regions, nor region trajectography.
3. Spatial segmentation scheme

For the spatial image segmentation stage, we propose three approaches, depending on the sequences to be
processed. The proposed region homogeneity criterion can be texture, grey-level or color. The appropriate choice naturally depends on the nature of the image content and on the availability of color; as a general rule, color should be used rather than grey-level. During the partition updating phase, the number of regions p is determined on-line. We outline here the main features of the spatial segmentation algorithm, indicating its specificities for each homogeneity criterion. The segmentation method operates within a Bayesian estimation framework. Let E = {E_s, s ∈ S} be the label field defined on the set S of sites s corresponding to the image discrete grid, i.e. sites are pixels. Let O = {O_s, s ∈ S} be the observation field. Let e = {e_s, s ∈ S} (resp. o = {o_s, s ∈ S}) be a realization of E (resp. O). Given a neighbourhood system, (E, O) is modeled as a Markov random field. The optimal label field ê is derived according to the maximum a posteriori (MAP) criterion. Owing to the equivalence between Gibbs distributions and MRFs, this optimal label configuration in fact results from ê = arg min_{e∈Ω} U(e, o), where Ω is the set of all possible realizations of E and U(e, o) is the so-called energy function encompassing the interactions between labels and observations as well as prior information on the label field [17]. Let C be the set of all cliques c (a clique is a subset of sites which are mutual neighbors). We use a second-order neighborhood, but only two-site cliques are considered. The energy function U(e, o) is expressed as a sum of two terms, which both break into sums of local potentials defined on cliques: U(e, o) = U_1(e, o) + U_2(e).

3.1. Texture-based data-driven term

U_1(e, o) expresses the relation between the observations at hand and the labels to be determined. It is given by

U_1(e, o) = Σ_{s∈S} V_1(o(B_s), o(R_{e_s})),   (1)
where V_1(o(B_s), o(R_{e_s})) conveys the likelihood of a particular label being assigned to site s, given the observations o; o(B_s) is the set of observation vectors in a local window B_s centered at s, and o(R_{e_s}) is the set of observation vectors corresponding to the sites currently labeled e_s and forming region R_{e_s}. We outline the definition of this potential for texture and color (next subsection); the potential V_1 for the grey-level segmentation method is derived by simplification of the color segmentation potential below. The unsupervised texture segmentation method described in Ref. [18] is employed. No prior information is required about the nature of the textures. In order to select the appropriate set of texture features to build the observation vectors {o_s = [o_s^1, ..., o_s^m], s ∈ S}, two classes
of images are considered. For significantly textured images, like the infrared images of Fig. 6 or the Brodatz texture patchwork of Fig. 5, statistical features extracted from co-occurrence matrices were added to grey-level and variance features. The potential V_1 is defined as follows:
V_1(o(B_s), o(R_{e_s})) = Σ_{i=1}^{m} v^(i),  with  v^(i) = +1 if d(o^(i)(R_{e_s}), o^(i)(B_s)) > a^(i)  and  v^(i) = −1 if d(o^(i)(R_{e_s}), o^(i)(B_s)) < a^(i),   (2)

where d(·,·) stands for the Kolmogorov–Smirnov distance between the distributions estimated respectively on the local window and on the entire current region R_{e_s}, and i is the observation vector component index. The thresholds a^(i) are predetermined constants.

3.2. Color-based data-driven term

Denoting by (r, g, b) the red, green and blue components of a pixel, we have selected a representation proposed in Ref. [19]. The three selected axes and the quantification used are as follows:

rg = r − g (16 quantification levels),   (3)
by = 2b − r − g (16 quantification levels),   (4)
wg = r + g + b (8 quantification levels).   (5)

The choice of this color space was driven by the satisfactory results obtained with regard to its complexity [19], while acknowledging that recent studies have proposed more effective representations of color. The introduction of color should not be considered a major contribution of the paper, but rather an interesting alternative to grey-level segmentation, with a view to building a motion-based partition. Let i, j, k be three indices on the three chosen color axes and let C_{i,j,k}(s) and H_{i,j,k}(R_{e_s}) denote respectively the local and global three-dimensional color histograms of the above attributes. The potential V_1 is then defined as follows:

V_1(e_s, o_s, o(R_{e_s})) = Σ_i Σ_j Σ_k | H_{i,j,k}(R_{e_s}) − C_{i,j,k}(s) |².   (6)

3.3. Regularization term

The regularization term U_2(e) reflects the a priori constraint on the label map. We have

U_2(e) = Σ_{⟨s,t⟩∈C} V_2(s, t),  where  V_2(s, t) = κ(1 − 2δ(e_s − e_t)),   (7)

κ is a predetermined positive constant, and δ(e_s − e_t) = 1 if e_s = e_t, 0 otherwise. By penalizing local configurations where two neighboring labels are different, homogeneous regions are globally favored.

3.4. Energy minimization

Energy minimization is performed using a modified ICM algorithm [20]. A binary stability label is attached to each site, all of which are initially unstable. A site s is randomly selected among the unstable sites. The set Λ_s of candidate labels that may be assigned to site s includes the labels currently assigned in the neighborhood ν(s) of site s, the current label e_s and an outlier label φ. This last label enables the creation of new regions [13,18]. The local energy variation ΔU_s when considering a candidate label r is given by:

- For r ≠ φ,
ΔU_s(r) = Σ_{i=1}^{m} v^(i) + Σ_{c_s} V_2(c_s),  with  v^(i) = +1 if d(o^(i)(R_{e_s=r}), o^(i)(B_s)) > a^(i)  and  v^(i) = −1 otherwise,   (8)
where c_s designates the subset of cliques c containing s.

- For r = φ, we extend the definition of ΔU_s(r) as follows:
ΔU_s(φ) = Σ_{t∈ν(s)} κ[1 − 2δ(φ − e_t)] + Φ,   (9)
where Φ is a constant potential associated with the outlier label.

The optimal label r̂ among these candidates is supplied by

r̂ = arg min_{r∈Λ_s} ΔU_s(r).   (10)

Besides, in the case of grey-level or color, a multiscale strategy is employed [21]. Once the relaxation process is completed, new labels are attributed to the connected subsets of sites carrying the φ-label whose size exceeds a preset threshold.
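A schematic rendering of the label update of Eqs. (8)–(10): for one unstable site, every candidate label is scored by the Kolmogorov–Smirnov data term plus the regularization term, and the minimizer is kept. The data structures, the use of scipy's ks_2samp, and the omission of the constant Φ of Eq. (9) are our simplifying assumptions.

```python
from scipy.stats import ks_2samp

def best_label(candidates, window_samples, region_samples, kappa, neighbour_labels, thresholds):
    """Pick the label minimizing the local energy variation of Eqs. (8)-(10) (sketch).

    candidates      : neighbour labels + current label + the outlier label None
    window_samples  : list of per-feature sample arrays observed in the window B_s
    region_samples  : dict label -> list of per-feature sample arrays of region R_label
    thresholds      : the per-feature thresholds a^(i)
    """
    def delta_u(r):
        # regularization: +kappa for each neighbour with a different label, -kappa otherwise
        reg = sum(kappa if t != r else -kappa for t in neighbour_labels)
        if r is None:                          # outlier label: only the prior part (Eq. (9))
            return reg
        data = 0.0
        for w, rs, a in zip(window_samples, region_samples[r], thresholds):
            d = ks_2samp(w, rs).statistic      # Kolmogorov-Smirnov distance, feature by feature
            data += 1.0 if d > a else -1.0
        return data + reg                      # Eq. (8)

    return min(candidates, key=delta_u)        # Eq. (10)
```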
4. Building of a region-level motion-based graph
730
M. Gelgon, P. Bouthemy / Pattern Recognition 33 (2000) 725}740
in the spatial partition. G"MMN , 2, N N, MA , 2, A NN. 1 p 1 q
(11)
We aim at assigning a motion label to every node in the graph, with a view to partitioning this graph into node subsets corresponding to groupings of regions of coherent motion. Each grouping is hence numbered by its label. The labeling of the graph is formalized within a Markovian framework. To this purpose, we identify the nodes of the graph to the sites of a region-level MRF. The cliques are deduced in a straightforward manner from the arcs of the graph. Let l"Ml , 2, l N be the set 1 p of sites and !"Mc , 2, c N be the set of binary cliques. 1 q We now focus on the de"nition of a suitable energy function for our region grouping objective.
As in the case of the pixel-level energy function for the spatial segmentation stage, the region-level energy function ;@ is split up into several terms. It involves a observation-label interaction term, a geometric interaction term and a regularization term. However, the interaction term is here also de"ned over a binary clique. The choice of binary clique is explained as grouping regions according to motion consistency is done by considering pairs of neighboring regions. The energy function is expressed as ;@(e@,o@)" + <@ (e@(c ), o@(c )) 1 j j cj|! (12)
where e@(c ) stands for the pair of labels attached to the j clique c (c "Ml ,l N), and o@ for the region-level observaj j k k{ tions, which we will examine below. This is an elegant and #exible way to formalize the merging of regions of similar motions. Potential <@ will express a discrepancy 1 measure between the two motion model "elds attached to the sites l and l composing clique c . <@ takes into k k{ j 2 account the geometric degree of adjacency between adjacent regions, and ;@ favours a reduced number of 3 regions. The motion estimation technique and the chosen discrepancy measure are now presented. 4.1.1. Parametric motion estimation The inter-frame transformation between frame I at t time t and frame I at time t#1 is modeled by a set of t`1 2D a$ne motion models, one per region. The displacement vector at pixel site s"(x, y) in region R which k gravity center g "(xk, yk), is expressed as k g g
A
B
ak #ak (x!xk)#ak (y!yk) 2 g 3 g d k t`1 (s)" 0 (# )t ak #ak (x!xk)#ak (y!yk) 1 4 g 5 g
Y ) t`1"arg min + o(r(s, *#n)), (*#n k t k *#nk s|Rk(t)
(14)
where r(s, *#n)"I(s#d Y k(s), t#1)!I(s, t) k # #+I(s#d kY (s), t#1) ) d k(s), # *#
4.1. Energy function dexnition and minimization
# + <@ (e@(c ))#;@ (e@), 2 j 3 cj|!
in which the motion parameter vector (# )t`1"(ak , 2, ak ) is estimated on each region kt 0 5 R , k"1, 2, p, using the robust multi-resolution esk timator described in Ref. [22]. A M-estimator criterion is minimized by means of an iterative reweighted leastsquares technique embedded in a multiresolution framework. If #nY designates the estimate of # at iteration n, k k we have # "#nY #*#n, and the estimate of the ink k k crement *#n is given by k
(13)
(15)
Y
Y #*#nY where o() is Tukey's function. Then, #n`1"#n k k k and the process is iterated. This method only involves the computation of the spatio-temporal derivatives of the intensity function. An estimation of the covariance matrix associated to the motion parameter vector is also provided. 4.1.2. Construction of a motion-based distance between regions Owing to the robustness of the estimator used our motion measurement is rather insensitive to minor errors in region border determination, secondary motions due to small mobile objects if any within the region. A possible way of comparing the motions of two regions involves estimating a motion model on the union of regions. This can provide valuable information, but induces a combinatorial computational cost. In order to characterize the di!erence between the estimated motions within two neighbouring regions R and R , we prefer considering the two motion "elds k k{ issued from the motion models estimated within each region. Let us note that doing so, we do not resort to `displaced frame di!erencea-type criteria. We extend these "elds over the support corresponding to the union of the two regions. The discrepancy between these two extended "elds, denoted by D(c ), is expressed as the j average, over the union of the two regions, of a weighed distance e between the velocity vectors that form these "elds: 1 + e(d k(s),d k{(s)). D(c )" # # j card(R XR ) k k{ s|(RkXRk{)
(16)
<@ aims at assigning identical motion labels to nodes 1 when the attached motions are similar, and di!erent labels when motions are strongly di!erent. A binary penalty value resulting from a test on the hypothesis that
M. Gelgon, P. Bouthemy / Pattern Recognition 33 (2000) 725}740
731
Fig. 3. <@ potential as a function of the di!erence D(c ) between the two motion model "elds, for identical labels and di!erent labels. 1 j Parameter values: q"2 and i"7.
two estimated motions really correspond to two really identical underlying motions is de"ned in Ref. [10]. In contrast, a progressive transition is introduced here, and potential <@ is de"ned as in relation (17). This function is 1 plotted in Fig. 3; it is expressed as follows: <@ (e@(c ), o@(c )) 1 j j
G
1
i " 1#e q (D(cj)~q) 1 1! 1#eiq (D(cj)~q)
if e@ "e@ , k k{
(17)
if e@ Oe@ . k k{
4.1.3. Regularization terms <@ corresponds to the regularization term. To take 2 into account the `degreea of adjacency between two regions, two geometrical features are computed per region pair R : the length of the common border, k, k{ denoted by m , and the distance between the region k, k{ gravity centers (Fig. 4). They are combined into a geometrical `compacity factora g of the region pair: k, k{ m k, k{ g " . k, k{ m #DDg !g DD k, k{ k k{ 2
(18)
This factor takes part in the de"nition of the potential <@ : 2
G
!b.g , b'0 k, k{ <@ (e@(c ))" 2 j 0
if e@ "e@ , k k{ if e@ Oe@ . k k{
(19)
Fig. 4. Measure of the adjacency degree between two neighboring regions R and R , based on the length m of the common k k{ k, k{ boundary and the distance between the gravity centers.
The third energy term expresses a prior constraint on the partition structure, consisting of a penalty proportional to the number of motion-based region groups. Denoting by E the set of all di!erent labels assigned to nodes, dE the number of elements in this set, and setting a constant j, this energy term is de"ned as follows: ;@ (e@)"j ) dE. 3
(20)
The a priori introduced here is that there should be few regions. This idea that generally the fewer the models to
732
M. Gelgon, P. Bouthemy / Pattern Recognition 33 (2000) 725}740
explain the data the better, is discussed extensively in Ref. [23]. This term does not a!ect adjustment of motion boundaries, but attempts to reduce the appearance of spurious motion-based regions composed of a single spatial region. The relative small number of regions allows us to utilize an energy minimization technique based on the HCF method [20]. During the labeling process, motion models are not re-estimated on groups, but per spatial region only initially once and for all. For the "rst frame of the sequence, all regions are initially given di!erent motion labels. Sites are visited according to their rank in an unstability stack [20]. Candidate labels at a given site include the current label at this site and the labels currently assigned to the neighbor sites. An extraneous label is also proposed. For each candidate label, the local energy variation involved is computed. For the extraneous label, the potentials de"ned in Eqs. (17) and (19) are calculated considering e@ Oe@ (the computation k k{ of these potential do not require the knowledge of the precise labels). The label giving rise to the highest decrease in local energy variation is then selected. The addition of the extraneous label to the list of candidate labels makes possible a correct on-line determination of the number of relevant motion entities. We arbitrarily chose to label disconnected site subsets with di!erent labels. If necessary, this choice can easily be set on or o!. According to the "nal label values, the labeled graph is partitioned into subsets of identically labeled nodes. A second graph G , called `tracking grapha, can be m deduced from this partition. A node in G is associated m to each subset and, if at least one arc in G joins the two subsets, then the two corresponding nodes in G are linked by an arc. The next section describes m how the partition tracking stage can exploit this graph G . m 5. Partition tracking using the graph G
m
Tracking of spatial regions aims "rst at establishing a correspondence between these regions in successive frames. It can also increase reliability, e$ciency and consistency of features attached to the regions to be tracked, such as geometry and motion. To this end, the tracking graph is introduced. Its purpose is two-fold. First, it allows to maintain label consistency over time, both for pixel-level and region-level labeling. Secondly, it may improve the reliability of motion estimates through temporal recursive "ltering. 5.1. Region-level label map prediction We "rst examine how G can be predicted at t#1. We seek to build a relevant label con"guration to initialize
the motion-based region-level relaxation at t+1. Given a spatial partition P_t at time t, the spatial partition P_{t+1} at t+1 can be split into two subsets. Let P_{t+1|t} include the spatial regions already present in P_t, and let \bar{P}_{t+1|t} include the spatial regions that emerged at t+1. We have P_{t+1} = P_{t+1|t} ∪ \bar{P}_{t+1|t}. Prior to motion model estimation at t+1, no information is available to favor any particular labeling for spatial regions created at t+1. We hence attach a new initial motion label to the corresponding nodes. On the other hand, the prior belief that the motion-based region grouping should be maintained from t to t+1 suggests that, for regions that survive from t to t+1, node labels be initialized at t+1 with the label obtained at t. If we denote by ẽ'_{k,t+1} the predicted label attached to the node corresponding to region R_k,

$$\tilde{e}'_{k,t+1} = \begin{cases} e'_{k,t} & \text{if } R_k \in P_{t+1|t},\\ \text{a new label} & \text{if } R_k \in \bar{P}_{t+1|t}. \end{cases} \qquad (21)$$
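A minimal Python sketch of the initialization rule of Eq. (21), assuming the correspondence between surviving regions at t and t+1 is already known; the names survived_from and fresh_label are illustrative, not the paper's.

```python
import itertools

def predict_region_labels(labels_t, regions_t1, survived_from):
    """Initialize region-level motion labels at t+1 (Eq. (21)).
    labels_t: dict region id at t -> motion label obtained at t
    regions_t1: iterable of region ids at t+1
    survived_from: dict region id at t+1 -> region id at t (absent if new)
    """
    fresh_label = itertools.count(max(labels_t.values(), default=0) + 1)
    predicted = {}
    for r in regions_t1:
        if r in survived_from:                 # region already present in P_t
            predicted[r] = labels_t[survived_from[r]]
        else:                                  # region that emerged at t+1
            predicted[r] = next(fresh_label)
    return predicted
```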
Owing to the scheme defined in Section 4, in which spatial regions are not irreversibly merged, the groupings determined for a given frame can be called into question for the next one. The label configuration is then appropriately updated during the energy minimization step, an update of the number of motion-based regions being jointly performed. Affine motion models are then estimated on each motion-based region group. We consider the temporal evolution of the parameters of the motion models as stochastic processes. For each region grouping R'_n, the six estimated motion parameters are considered uncorrelated and serve as measurements supplied to six independent Kalman filters. A first-order derivative temporal evolution model is selected here, as in Ref. [15] for a similar usage. The evolution of the state of a motion parameter a_i can be approximated by the following dynamic system:
$$\begin{bmatrix} a_i \\ \dot{a}_i \end{bmatrix}(t+1) = \begin{bmatrix} 1 & dt \\ 0 & 1 \end{bmatrix} \begin{bmatrix} a_i \\ \dot{a}_i \end{bmatrix}(t) + \begin{bmatrix} e_1 \\ e_2 \end{bmatrix}, \qquad (22)$$

$$a_i^{\mathrm{measured}}(t) = a_i(t) + \zeta(t). \qquad (23)$$
Process noise is modeled by the zero-mean Gaussian vector (e_1, e_2)^T. This vector is characterized by its covariance matrix R, hence by the variance σ²_R:
$$R = \sigma_R^2 \begin{bmatrix} \dfrac{dt^3}{3} & \dfrac{dt^2}{2} \\[4pt] \dfrac{dt^2}{2} & dt \end{bmatrix}. \qquad (24)$$
The measurement noise ζ(t), also modeled as zero-mean Gaussian noise, is characterized by its variance σ²_ζ.
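As an illustration of Eqs. (22)-(24), the sketch below implements one such constant-velocity Kalman filter for a single motion parameter. It is a generic textbook filter written to match the state and noise models above, not the authors' code, and the numerical defaults (dt, noise variances) are placeholders.

```python
import numpy as np

class MotionParameterKalman:
    """Kalman filter for one affine motion parameter a_i with state
    (a_i, da_i/dt), constant-velocity evolution model (Eq. (22)) and
    scalar measurement a_i^measured = a_i + zeta (Eq. (23))."""

    def __init__(self, dt=1.0, sigma2_R=0.01, sigma2_zeta=0.01):
        self.F = np.array([[1.0, dt], [0.0, 1.0]])            # state transition
        self.Q = sigma2_R * np.array([[dt**3 / 3, dt**2 / 2],
                                      [dt**2 / 2, dt]])       # process noise, Eq. (24)
        self.H = np.array([[1.0, 0.0]])                        # only a_i is observed
        self.R = np.array([[sigma2_zeta]])                     # measurement noise variance
        self.x = np.zeros((2, 1))                              # state estimate
        self.P = np.eye(2)                                      # state covariance

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return float(self.x[0, 0])                              # predicted parameter value

    def update(self, measured):
        y = np.array([[measured]]) - self.H @ self.x            # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)                # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return float(self.x[0, 0])                              # filtered parameter value
```

At each frame, predict() would supply the value used to initialize the motion estimator at t+1, and update() would incorporate the newly estimated parameter as a measurement, in line with the description that follows.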
At time t, the Kalman filter provides a predicted motion model at t+1 that can serve as an initial value for the motion estimator (measurement) at t+1 on coherent region groups (Fig. 1e). Initialization of the state vector makes use of the first three measurements. The predicted and then filtered parameters of the recursive filters at a given node of G_m can be identically attributed to all nodes of G in the corresponding node subset. The filtered parameter vectors are passed on to their respective regions in the spatial partition, in order to provide a prediction for this spatial partition at t+1.

5.2. Pixel-level label map prediction

We now explain the spatial partition prediction technique. Let Θ̂^{t|t}_k stand for the filtered motion parameter vector from t to t+1. Given the estimated spatial label map ê_t at time t, the predicted spatial label field ẽ_{t+1} is derived from a motion-oriented propagation of labels [13]:

$$\tilde{e}_{t+1}(s) = \hat{e}_t\bigl(s + d_{\hat{\Theta}^{t|t}_{\hat{e}_t(s)}}(s)\bigr).$$
6. Extension to multiple motion models per spatial region

Though there is in general some intensity spatial gradient, or texture variation, on motion boundaries, it may not always be significant enough to give rise to a spatial boundary. This is illustrated in Fig. 8, in which the woman on the right is swinging her left arm upwards. A spatial segmentation seems inappropriate to retrieve the arm. As a result, a single motion descriptor cannot correctly describe the apparent motion, and the partition prediction is inaccurate in this particular area. The inadequacy of a single motion model may also occur when the complexity of the apparent motion, due for instance to significant depth variation, depth discontinuities or non-rigid motions, is beyond the descriptive ability of the 2D affine motion model used here. We propose to alleviate this issue as follows. On every spatial region R_k, we detect sub-regions that do not conform to the estimated motion model Θ̂_k, using the Markovian multiscale technique described in Ref. [24]. Only sub-regions of a significant size are kept.

The graph G must thus be built upon the topology of the spatial partition augmented with these additional detected sub-regions. For each region R_k, there may exist one or several non-conforming sub-regions R^{nc}_{k,z}, where z is the sub-region index and nc is a superscript denoting non-conformity. A motion model Θ_{k,z} is estimated on every sub-region. Node labeling proceeds as presented in Section 4. Motion models can then be estimated on coherent region sets. The graph G accounts for these detected sub-regions by also including a node for each of them. Let s be a pixel site in R_k, R_k being composed of several sub-regions: one sub-region R^c_k conforming to the dominant motion in R_k, and possibly several sub-regions R^{nc}_{k,z} non-conforming to this motion model. The motion-oriented propagation of labels in R_k simply takes into account which sub-region s belongs to:

$$\tilde{e}_{t+1}(s) = \begin{cases} \hat{e}_t\bigl(s + d_{\hat{\Theta}^{t|t}_{R^c_k}}(s)\bigr) & \text{if } s \in R^c_k,\\[4pt] \hat{e}_t\bigl(s + d_{\hat{\Theta}^{t|t}_{R^{nc}_{k,z}}}(s)\bigr) & \text{if } s \in R^{nc}_{k,z}. \end{cases} \qquad (25)$$
Since s + d_Θ(s) usually points to a location with non-integer coordinates, the label is assigned to the four nearest sites on the image grid. Sites that receive no label or multiple labels are respectively assigned "uncovered" and "occlusion" labels. Both labels are considered neutral by the relaxation algorithm. The number of iterations required in the pixel-level and region-level energy minimization steps is automatically determined by the algorithm and depends on the complexity of both partitions.
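The sketch below illustrates this motion-oriented label propagation for a single affine model per region, including the assignment to the four nearest grid sites and the "uncovered"/"occlusion" conventions. The affine_displacement helper and the sentinel values (-1, -2) are illustrative choices, not the paper's implementation.

```python
import numpy as np

UNCOVERED, OCCLUSION = -1, -2   # illustrative neutral labels

def affine_displacement(theta, x, y):
    """2D affine motion model: displacement (dx, dy) at site (x, y),
    with theta = (a1, a2, a3, a4, a5, a6)."""
    a1, a2, a3, a4, a5, a6 = theta
    return a1 + a2 * x + a3 * y, a4 + a5 * x + a6 * y

def propagate_labels(label_map_t, models):
    """Predict the label map at t+1 by propagating each label along the
    motion of its region; each propagated label is written to the four
    nearest grid sites; empty sites become UNCOVERED, sites receiving
    conflicting labels become OCCLUSION."""
    h, w = label_map_t.shape
    votes = [[set() for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            lab = label_map_t[y, x]
            dx, dy = affine_displacement(models[lab], x, y)
            xf, yf = x + dx, y + dy
            for yy in (int(np.floor(yf)), int(np.ceil(yf))):
                for xx in (int(np.floor(xf)), int(np.ceil(xf))):
                    if 0 <= yy < h and 0 <= xx < w:
                        votes[yy][xx].add(lab)
    pred = np.full((h, w), UNCOVERED, dtype=int)
    for y in range(h):
        for x in range(w):
            if len(votes[y][x]) == 1:
                pred[y, x] = votes[y][x].pop()
            elif len(votes[y][x]) > 1:
                pred[y, x] = OCCLUSION
    return pred
```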
7. Results

The proposed scheme has been validated on both synthetic and real-world image sequences. The parameters α(i) were set to 0.2 once and for all, empirically inferred from the definition of the Kolmogorov-Smirnov distance. The pixel-level regularization constant μ can reasonably be set between 0.2 and 0.4 (it is set to 0.3 in practice). The control over the number of spatial regions is mainly left to φ, whose setting depends on the selected segmentation criterion (texture, grey-level or color). The same value has proven satisfactory for most tested sequences, for a given criterion. τ is the most important parameter, since it controls the tolerated discrepancy between motions. As a general rule, one should use small values of τ for spatial partition tracking and for video coding. For dynamic content analysis (interpretation), there may be different motion-based partitions that are all meaningful from some point of view. It is hard to give a general rule since, for instance, discrimination of particularly slow motion requires a very low value of τ, which may cause undesired regions due to some parallax effect in another sequence. The values of τ chosen for the sequences shown in this section are given in the table below. Since motion is the most important criterion and regularization should in practice determine the label only in ambiguous cases, we set β and λ both to 0.1. The measurement and process noise variances for temporal filtering are taken constant and both equal to 0.01. Such a parametrized model is satisfactory for the sequences tested, but if prior knowledge were available about the dynamics of the tracked objects, or if some scheme were added to learn these parameters, values better suited to each sequence could be chosen.
Parameter settings common to all sequences:

    Spatial segmentation:   α(i) = 0.2,  μ = 0.3
    Motion-based grouping:  β = 0.1,   λ = 0.1
    Temporal filtering:     σ²_ζ = 0.01,  σ²_R = 0.01

Per-sequence settings:

    Sequence        τ     Criterion
    Patchwork       0.2   Texture
    Power Station   0.2   Texture
    Renata          2     Grey-level
    Interview       2     Grey-level
    Mobi            2     Color
    Car             2     Color
Fig. 5. Patchwork sequence: (a) true region numbering; (b)-(e) original images with the estimated vector subfields corresponding to the motion models and the motion boundaries superimposed, at times t = 4, t = 16, t = 20 and t = 38.
The method was first applied to synthetic sequences. The different regions of a 256×256 image texture, Patchwork, made up of natural textures taken from the Brodatz album, are imparted different and time-varying affine motions. Image intensities are first quantized to 20 levels. The texture features used are the mean value, the local variance, and a statistic extracted from co-occurrence matrices of grey-level values computed on 7×7 pixel local windows, namely correlation [18]. The true region labels are given in Fig. 5a. Regions 1 and 2, in the foreground, undergo horizontal translational motion, first accelerating then slowing down. Region 3 is imparted a combined translation, rotation and divergence, while region 4 undergoes combined divergence and translation. A first increasing, then decreasing divergence is applied to region 5. The determined motion boundaries and the estimated motion model fields superimposed on the original images are shown in Fig. 5 for various frames. Three groupings are initially formed: respectively (1), (2) and (3, 4, 5) (Fig. 5b). The region-level label configuration then varies along the sequence. Indeed, as the motion in region 3 becomes strongly different from the motion imparted to regions 4 and 5, region 3 becomes a separate motion entity (Fig. 5c), in accordance with the ground truth. A new motion-based region is further created (Fig. 5d), because of the increasing
strength of the divergence applied to region 5. Close to the end of the sequence, regions 2 and 5 become almost static, and thus form a grouping with the very slowly moving region 4. In this example, regions undergoing similar motions are correctly grouped and the region-level labeling is consistent over the sequence. The number of groupings is also updated in agreement with the ground truth. The accuracy of the retrieved boundaries and of the estimated motion models is satisfactory. The Power Station infra-red image sequence (Fig. 6, of size 500×236 pixels) corresponds to the surveillance domain, in which it is of interest to structure the frames into regions of homogeneous appearance, so as to adapt and facilitate the subsequent detection of small moving objects. Since only camera motion is present, motion discontinuities are only due to differences in depth relative to the camera, or to different surface orientations. Texture is taken as the segmentation criterion, all parameters being as for the Patchwork sequence. Results are shown in Fig. 6 for two frames. Motion grouping can be observed between regions that are located at similar depths. Labels and boundaries are consistently maintained throughout the sequence. In the Renata sequence (Fig. 7, size 360×288 pixels), the woman is moving to the right, while being tracked by the
Fig. 6. Power Station sequence (infra-red, courtesy of SAT): original images with (a) superimposed texture region boundaries, (b) spatial segmentation maps and (c) motion-based region groupings for frames 1 and 15.
Fig. 7. Renata sequence: (a)-(c) spatial regions and (d)-(f) original images with motion-based boundaries superimposed, for frames 9, 19 and 29.
camera. The spatial regions are shown in Figs. 7a-c and the motion-based region boundaries in Figs. 7d-f, for frames 9, 19 and 29, respectively. Again, the labels of the spatial regions are temporally consistent, and the motion entity (the woman) is correctly extracted, which demonstrates the efficiency of the updating-tracking method. In the Interview sequence (Fig. 8, of size 337×268 pixels), the woman on the right is getting up, while bringing her left arm upwards. Meanwhile, the camera is tilting upwards, more slowly than the woman's motion, causing a downwards apparent motion for the rest of the scene. At the beginning of the sequence, the woman gets up quickly, then progressively more slowly. This result illustrates how the motion-based grouping can be improved by creating sub-regions on a motion criterion. Fig. 8a contains the spatial partition. It can be seen that the left arm is included in the same region as a large area of background, and that the hand is attached to the lower part of the body. The assumption of motion
unicity within spatial regions is here clearly broken at least twice. In Fig. 8b, the result of sub-region detection is shown. Both problems, related to the arm and to the hand, are alleviated, and the arm and the hand are correctly grouped and delimited after the motion grouping process (Fig. 8c), as is most of the rest of the body. Some dark background is nevertheless attached to the moving body. Because of its uniformity, and because the moving occluding contour is the only available information, the resulting estimated motion model is similar to that of the woman. Figs. 8d and e compare the motion model fields as estimated on spatial regions and as estimated on motion-based regions. The rotational motion of the arm is correctly extracted in the latter, whereas in the former it is formed of almost translational piece-wise sub-fields, which are a poorer description.
Fig. 8. Interview sequence: (a) original image (frame 74) with spatial region boundaries, (b) spatial region boundaries with sub-region boundaries, (c) motion-based region contours, (d) motion model fields as estimated on spatial regions and sub-regions and (e) as estimated on motion-based region groupings.
Ref. [25]. We selected this method because it represents a class of techniques that take a different approach, in the sense that it relies on motion only. Fig. 9a is to be compared with Fig. 9c, and Fig. 9b with Fig. 9d. Our new approach is more accurate in the localization of boundaries (e.g. right arm, head), but is less efficient in regions where motion estimation or sub-region detection cannot be achieved correctly. The effect of temporal filtering can be illustrated with the following example. The temporal evolution of the vertical translation parameter of the region group corresponding to the woman is considered. Comparison of the measured and filtered parameters (Fig. 10) shows how the estimated motion can be temporally stabilized by filtering. Temporally smooth apparent motion variations are indeed physically more realistic than the more erratic raw motion estimates. The use of Kalman filtering is often beneficial. Occasionally, though, it may not be as effective, because the chosen evolution model and its parameters may not be well suited to the real parameter evolution. For the Interview sequence, processing each pair of frames takes around 80 s using non-optimized code on an UltraSparc, of which 4 s are devoted to the motion model
estimation and 4 s to the region-level computations (distance calculations and energy minimization); the rest of the time is in fact spent updating the spatial segmentation. In the Mobi sequence (Fig. 11, size 337×268 pixels, MPEG-1 decompressed, color), the train is pushing a rolling spotted ball leftwards, while the calendar is pulled upwards. The camera is panning and tracks the train. The original image for the first frame, the color boundaries and the motion-based boundaries are respectively shown in Figs. 11a, b and c. It can be seen that the motion-based groups obtained mainly correspond to meaningful motion entities (background, ball, train), and that they are accurately retrieved. The Car1 sequence (Fig. 12) is also a color sequence. The original image for the first frame, the color boundaries and the motion-based boundaries are respectively shown in Figs. 12a, b and c.
1 We would like to thank INA (Institut National de l'Audiovisuel, Département Innovation, France) for providing this sequence.
Fig. 9. Interview sequence: comparison of the region-level motion-based partition method with a direct pixel-level motion-based segmentation technique [25]. Motion region boundaries (a) at time t = 30 and (b) t = 62 for the technique presented in this paper, (c) at time t = 30 and (d) t = 62 for the pixel-level motion-based technique.
8. Conclusion
Fig. 10. Vertical translation motion parameter (a_1) associated with the motion-based region group corresponding to the woman: comparison between measured and filtered estimates. X-axis: frame number.
A global method for motion-based segmentation and spatial image partition updating and tracking has been presented, through the definition of a motion-based graph representation of the spatial partition as the key tool for prediction and tracking. Having estimated a motion model on every spatial region, region grouping is formalized as an energy minimization problem, taking motion, geometric and contextual information into account. Motion-based region boundaries and the number of region groups are jointly determined and updated along the sequence. Promising results have been obtained on image sequences of relatively high complexity, providing a good structuring of the content in terms of mobile elements. In comparison with usual region tracking techniques, the original introduction of region-level context allows
Fig. 11. Mobi sequence: for the first frame of the sequence, the figure shows the original image (a), the spatial boundaries (b) and the motion boundaries (c).
Fig. 12. Car sequence: for the first frame of the sequence, the figure shows the original image (a), the spatial boundaries (b) and the motion boundaries (c).
motion estimation accuracy and map prediction coherence and quality to be improved. Also, considering that the general task at hand is to track a given spatial partition, an extension of the method to the case in which a spatial region has to be described by several motion models has been proposed. In Ref. [26], we have proposed a way to cope with occlusions and crossings. Should some regions be occluded, tracking could then rely only on the predicted motion and predicted geometry of these regions, and benefit further from the simplicity of structure of the motion-based graph G_m relative to the spatial graph G, as demonstrated in the Interview, Mobi and Renata sequences, for instance. This also provides a high-level representation and interpretation of the dynamic content of the image sequence. In the context of content-based video indexing, this work has contributed to structuring a video in terms of relevant spatio-temporal regions [27]. This leads to video summaries and to moving objects indexed from their motion. The underlying spatial segmentation directly provides texture or color information for each region, and the distance used to compare local and region statistics
could also be employed to compare queries and extracted regions. Besides, a motion descriptor is attributed to every region, permitting queries combining texture and motion.
References

[1] M. Gelgon, P. Bouthemy, A region-level graph labeling approach to motion-based segmentation, Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Puerto Rico, June 1997, pp. 514-519.
[2] M. Gelgon, P. Bouthemy, A region-level motion-based graph representation and labeling for tracking a spatial image partition, Proceedings of IAPR Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR), Venice, May 1997, Lecture Notes in Computer Science, vol. 1223, Springer, Berlin.
[3] S. Ayer, P. Schroeter, J. Bigün, Segmentation of moving objects by robust motion parameter estimation over multiple frames, Proceedings of Third European Conference on Computer Vision, Stockholm, May 1994, pp. 316-327.
[4] F. Dufaux, F. Moscheni, A. Lippman, Spatio-temporal segmentation based on motion and static segmentation, Proceedings of Second IEEE International Conference on Image Processing, Washington, October 1995, pp. 306-309.
[5] V. Garcia-Garduno, C. Labit, On the tracking of regions over time for very low bit rate image sequence coding, Proceedings of Picture Coding Symposium PCS'94, Sacramento, CA, September 1994, pp. 257-260.
[6] L. Wu, J. Benois-Pineau, Ph. Delagnes, D. Barba, Spatio-temporal segmentation of image sequences for object-oriented low bit-rate image coding, Signal Process. Image Commun. 8 (1996) 513-543.
[7] J.Y.A. Wang, E.H. Adelson, Representing moving images with layers, IEEE Trans. Image Process. 3 (5) (1994) 625-638.
[8] H. Zheng, D. Blostein, Motion-based object segmentation and estimation using the MDL principle, IEEE Trans. Image Process. 4 (9) (1995) 1223-1235.
[9] C. Hennebert, V. Rebuffel, P. Bouthemy, A hierarchical approach for scene segmentation based on 2D motion, Proceedings of the 13th International Conference on Pattern Recognition, Vienna, August 1996, pp. 218-222.
[10] W. Xiong, C. Graffigne, A hierarchical method for detection of moving objects, Proceedings of First IEEE International Conference on Image Processing, Austin, November 1994, pp. 795-799.
[11] J. Wang, Stochastic relaxation on partitions with connected components and its application to image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 20 (6) (1998) 619-636.
[12] M.J. Black, Combining intensity and motion for incremental segmentation and tracking over long image sequences, Proceedings of Second European Conference on Computer Vision, Santa Margherita Ligure, Italy, May 1992, pp. 485-493.
[13] P. Bouthemy, E. François, Motion segmentation and qualitative dynamic scene analysis from an image sequence, Int. J. Comput. Vision 10 (2) (1993) 157-182.
[14] M. Irani, B. Rousso, S. Peleg, Detecting and tracking multiple moving objects using temporal integration, Proceedings of Second European Conference on Computer Vision, Santa Margherita Ligure, Italy, May 1992, pp. 282-287.
[15] F. Meyer, P. Bouthemy, Region-based tracking using affine motion models in long image sequences, CVGIP: Image Understanding 60 (2) (1994) 119-140.
[16] C. Toklu, A.T. Erdem, M.I. Sezan, A.M. Tekalp, Tracking motion and intensity variations using hierarchical 2D mesh modeling for synthetic object transfiguration, Graphical Models Image Process. 58 (6) (1996) 553-573.
[17] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (6) (1984) 721-741.
[18] C. Kervrann, F. Heitz, A Markov random field model-based approach to unsupervised texture segmentation using local and global spatial statistics, IEEE Trans. Image Process. 4 (6) (1995) 856-862.
[19] M.J. Swain, D. Ballard, Color indexing, Int. J. Comput. Vision 7 (1) (1991).
[20] P.B. Chou, C.M. Brown, The theory and practice of Bayesian image modelling, Int. J. Comput. Vision 4 (1990) 185-210.
[21] F. Heitz, P. Pérez, P. Bouthemy, Multiscale minimization of global energy functions in some visual recovery problems, CVGIP: Image Understanding 59 (1) (1994) 125-134.
[22] J.-M. Odobez, P. Bouthemy, Robust multiresolution estimation of parametric motion models, J. Visual Commun. Image Representation 6 (4) (1995) 348-365.
[23] Y.G. Leclerc, Constructing simple stable descriptions for image partitioning, Int. J. Comput. Vision 3 (1989) 73-102.
[24] J.-M. Odobez, P. Bouthemy, Separation of moving regions from background in an image sequence acquired with a mobile camera, in: H.H. Li, S. Sun, H. Derin (Eds.), Video Data Compression for Multimedia Computing, Kluwer Academic Publishers, Dordrecht, 1997, pp. 283-311.
[25] J.-M. Odobez, P. Bouthemy, Direct incremental model-based image motion segmentation for video analysis, Signal Processing 66 (3) (1998) 143-156.
[26] M. Gelgon, P. Bouthemy, J.-P. Le Cadre, Associating and estimating trajectories of multiple moving regions with a probabilistic multi-hypothesis tracking approach, First International Symposium on Physics in Image Processing, Paris, January 1999.
[27] M. Gelgon, P. Bouthemy, Determining a structured spatio-temporal representation of video content for efficient visualisation and indexing, Fifth European Conference on Computer Vision (ECCV'98), Freiburg, Germany, June 1998, pp. 595-609 (II).