THE DISSIMILARITY REPRESENTATION FOR PATTERN RECOGNITION Foundations and Applications
SERIES IN MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE* Editors:
H. Bunke (Univ. Bern, Switzerland) P. S. P. Wang (Northeastern Univ., USA)
Vol. 50: Empirical Evaluation Methods in Computer Vision (Eds. H, 1. Christensen and P. J. Phillips) Vol. 51: Automatic Diatom Identification (Eds. H. du Buf and M. M. Bayer) Vol. 52: Advances in Image Processing and Understanding A Festschrift for Thomas S. Huwang (Eds. A. C. Bovik, C. W. Chen and D. Goldgof) Vol. 53: Soft Computing Approach to Pattern Recognition and Image Processing (Eds. A. Ghosh and S. K. Pal) Vol. 54: Fundamentals of Robotics - Linking Perception to Action (M. Xie) Vol. 55: Web Document Analysis: Challenges and Opportunities (Eds. A. Antonacopoulos and J. Hu) Vol. 56: Artificial Intelligence Methods in Software Testing (Eds. M. Last, A. Kandel and H. Bunke) Vol. 57: Data Mining in Time Series Databases y (Eds. M. Last, A. Kandel and H. Bunke) Vol. 58: Computational Web Intelligence: Intelligent Technology for Web Applications (Eds. Y, Zhang, A. Kandel, T. Y. Lin and Y. Yao) Vol. 59: Fuzzy Neural Network Theory and Application (P.Liu and H. LI) Vol. 60: Robust Range Image Registration Using Genetic Algorithms and the Surface Interpenetration Measure (L. Silva, 0. R. P. Bellon and K, L. Boyer) Vol. 61 : Decomposition Methodology for Knowledge Discovery and Data Mining: Theory and Applications (0.Maimon and L. Rokach) Vol. 62: Graph-Theoretic Techniques for Web Content Mining (A. Schenker, H. Bunke, M. Last and A. Kandel) Vol. 63: Computational Intelligence in Software Quality Assurance (S. Dick and A. Kandel) Vol. 64: The Dissimilarity Representation for Pattern Recognition: Foundations and Applications (flzbieta Pekalska and Roberi P. W. Duin) Vol. 65: Fighting Terror in Cyberspace ( f d s . M. Last and A. Kandel)
*For the complete list of titles in this series, please write to the Publisher.
THE DISSIMILARITY REPRESENTATION FOR PATTERN RECOGNITION Foundations and Applications
Elibieta Pekalska
Robert P. W. Duin
Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology Delft, The Netherlands
vp World Scientific NELVJERSEY
*
LONDON
*
SINGAPORE
-
RElJlNG
-
ShAYGHAI
-
HONG K O N G
-
TAIPEI
-
CHEWKAI
Published b-y
World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA qflce: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
U K @ice: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
THE DISSIMILARITY REPRESENTATION FOR PATTERN RECOGNITION Foundations and Applications Copyright 02005 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or p u t s thereoj may not be reproduced in any,form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval s.ysrem now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-256-530-2
Printed by FuIsland Offset Printing (S) Pte Ltd, Singapore
To t h e ones who ask questions and look for answers
This page intentionally left blank
Preface
Progress has not followed a straight ascending line, h u t a spiral with rhythms of progress and retro,qression, o,f evolution and dissolution. JOHANN WOLFGANG
VON
GOETHE
Pattern recognition is both an art and a science. We are able to see structure and recognize patterns in our daily lives and would like to find out how we do this. We can perceive similarities between objects, between people, between cultures and between events. We are able to observe the world around us, to analyze existing phenomena and to discover new principles behind them by generalizing from a collection of bare facts. We are able to learn new patterns, either by ourselves or with the hclp of a teacher. If we will ever be able to build a machine that does the same, then we will have made a step towards an understanding of how we do it ourselves. The two tasks, the recognition of known patterns and the learning of new ones appear to be very similar, but are actually very different. The first one builds on existing knowledge, while the second one relies on observations and the discovery of underlying principles. These two opposites need to be combined, but will remain isolated if they are studied separately. Knowledge is formulated in rules and facts. Usually, knowledge is incomplete and uncertain, and modeling this uncertainty is a challenging task: who knows how certain his knowledge is, and how can we ever relate the uncertainty of two different experts? If we really want to learn something new from observations, then a t least we should use our existing knowledge for their analysis and interpretation. However, if this leads to destruction of all inherent organization of and relations within objects themselves, as happens when they are represented by isolated features, then all what is lost by (not incorporated in) the representation has to be learned again from the observations. These two closely related topics, learning new principles from observavii
...
Vlll
T h e dissinailarity representation for p a t t e r n recognition
tioris and applying existing knowledge in recognition, appear to be hard to combine if we concentrate on these opposites separately. There is a need for an integrated approach that starts in between. We think that the notion of proximity between objects might be a good candidate. It is an intuitive concept that may be quantified and analyzed by statistical means. It does not a priori tear the object into parts, or more formally, does not require neglecting the inherent object structure. Thereby, it offers experts a possibility to model their knowledge of object descriptions and their relations in a structural way. Proximity may be a natural concept, in which the two opposites, statistical and structural pattern recognition, meet. The statistical approach focuses on measuring characteristic numerical features and representing objects as points, usually in a Euclidean or Hilbert feature space. Objects are different if their point representations lie sufficiently far away from each other in this space, which means that the corresponding Euclidean distance between them is large. The difference between classes of objects is learned by finding a discrimination function in this feature space such that the classes, represented by sets of points, are separated as well as possible. The structural approach is applicable to objects with some identifiable structural organization. Basic descriptors or primitives, encoded as syntactic units, are then used to characterize objects. Classes of objects are either learned by suitable syntactic grammars, or the objects themselves are compared by the cost of some specified match procedure. Such a cost expresses the degree of difference between two objects. One of the basic questions in pattern recognition is how to tell the differencc between objects, phenomena or events. Note that only when the difference has been observed and characterized, similarity starts to play a role. It suggests that dissimilarity is more fundamental than similarity. Therefore, we decided to focus more on the concept of dissimilarity. This book is concerned with dissimilarity representations. These are riurncrical representations in which each value captures the degree of' conimonality between pairs of objects. Since a dissimilarity measure can be defined on arbitrary data given by collections of sensor measurements, shapes, strings, graphs, or vectors, the dissimilarity representation itself becomes very general. The advantages of statistical and structural approaches can now be integrated on the level of representation. As the goal is to develop and study statistical learning methods for dissimilarity representations, they have to be interpreted in suitable mathematical frameworks. These are various spaces in which discrimination
Preface
ix
functions can be defined. Since non-Euclidean dissimilarity measures are used in practical applications, a study beyond the traditional use of Euclidean spaces is necessary. This caused us to search for more general spaces. Our work is founded on both mathematical and experimental research. As a result, a trade-off had to be made to present both theory and practice. We realize that the discussions may be hard to follow due to a variety of' issues presented and the necessary brevity of explanations. Although some foundations are laid, the work is not complete, as it requires a lot of research to develop the ideas further. In many situations, we are only able to point to interesting problems or briefly sketch new ideas. We are optimistic that our use of dissimilarities as a starting point in statistical pattern recognition will pave the road for structural approaches to extend object descriptions with statistical learning. Consequently, observations will enrich the knowledgebased model in a very generic way with confidences and natural pattern classifications, which will yield improved recognition. This book may be useful for all researchers and students in pattern recognition, machine learning and related fields who are interested in the foundations and application of object representations that integrate structural expert knowledge with statistical learning procedures. Some understanding of pattern recognition as well as the familiarity with probability theory, linear algebra arid functional analysis will help one in the journey of finding a good representation. Important facts from algebra; probability theory and statistics are collected in Appendices A D. The reader may refer to [Fukunaga, 1990; Webb, 1995; Duda et al., 20011 for an introduction to statistical pattern recognition, and to [Bunke and Sanfeliu, 1990; Fii, 19821 for an introduction to structural pattern recognition. More theoretical issues are treated in [Cristianini and Shawe-Taylor, 2000; Devroye et al.: 1996; Hastie et al.; 2001; Vapnik, 19981, while a practical engineering approach is presented in [Nadler and Smith, 1993; van der Heiden et d., 20041. Concerning mathematical concepts, some online resources can be used, such as http: //www .probability.net/, http: //mathworld.wolfram. corn/, http : //planetmath. org/ and http : //en. wikipedia. org/. ~
Credits. This research was supported by a grant from the Dutch Organization for Scientific Research (NWO). The Pattern Recognition Group of the Faculty of Applied Sciences at Delft University of Technology was a fruitful arid inspiring research environment. After reorganization within the university, we could continue our work, again supported by NWO, in the
X
T h e dissamilarity r e p r e s e n t a t t o n for patterm. recognttton
Information and Communication Theory group of the Faculty of Electrical Engineering, Mathematics and Computer Science. We thank both groups and their leaders, prof. Ian T. Young and prof. Inald Lagendijk, all group members and especially our direct colleagues (in alphabetic order): Artsiom Harol, Piotr Juszczak, Carmen Lai, Thomas Landgrebe, Pavel Paclik, Dick de Ridder, Marina Skurichina, David Tax, Sergey Verzakov and Alexander Ypma for the open and stimulating atmosphere, which contributed to our scientific development. We gained an understanding of the issues presented here based on discussions, exercises in creative thinking and extensive experiments carried out in both groups. We are grateful for all the support. This work was finalized, while the first author was a visiting research associate in the Artificial Intelligerice group at University of Manchester. She wishes to thank for a friendly welcome. This book is an extended version of the PhD thesis of Elzbieta Pekalska and relies on work published or submitted before. All our co-authors are fully acknowledged. We also thank prof. Anil Jain and Douglas Zongker, prof. Horst Biinke arid Simon Gunter, Pavel Paclik and Thomas Landgrebe, and Volker Rotli for providing some dissimilarity data. All the data sets are described in Appendix E. The experiments were conducted using PRTools [Diiiri et ul., 2004b1, DD-tools [Tax, 20031 and own routines. To the Reader. Our main motivation is to bring the attention to the issue of representation and one of its basic ingredient: dissimilarity. If one wishes to describe classes of objects within our approach this requires a mental shift from logical and quantitative observations of separate features to an intuitive and possibly qualitative perceiving of the similarity betwcen objects. As the similarity judgement is always placed in some context, it can only be expressed after observing the differences. The 1110ment the (dis)similarity judgements are captured in some values, one may process from whole (dissimilarity) to parts (features, details or numerical descriptions). As a result, decision-theoretic methods can be used for learning. The representation used for the recognition of patterns should enable integration of both qualitative and quantitative approaches in a balanced manner. Only then, the process of learning will be enhanced. Let it be so. Dear Reader, be inspired! Wishing You an enjoyable journey,
Elibieta Pqkalska and Robert P.W. Duin, Chester / Delft, June 2005.
Notation and basic terminology
Latin symbols
matrices, vector spaces, sets or random variables scalars, vectors or object identifiers vectors in a finite-dimensional vector space basis vector estimated mean vectors Gram matrix or Gram operator estimated covariance matrix dissimilarity function, dissimilarity measure dissimilarity matrix functions identity matrix or identity operator centering matrix number of clusters space dimensions kernel number of objects or vectors, usually in learning neighborhood sets i-th object in the representation set R probability function projection operator, projection matrix or probability orthogonal matrix representation set R = { P I ,p 2 , . . . ,p,} similarity function, similarity measure similarity matrix or stress function i-th object in the training set T training set T = { t l , t a , . . , tiv} weight vectors
xi
The dissimilarity representation for pattern recognition
xii
Greek symbols scalars or parameters vectors of parameters Kronecker delta function or Dirac delta function evaluation functional dissimilarity matrix used in multidimensional scaling trade-off parameter in mathematical programming field, usually R or C, or a gamma function regularization parameter i-th eigenvalue diagonal matrix of eigenvalues mean, probability measure or a membership function mean vector mappings covariance matrix dissimilarity function set or a closed and bounded subset of Rm
Other symbols
A
c cm 2)
3
G, K: 'FI
z J
Q
w R+
% R"
ST I
u.v,x z
a-algebra set of complex numbers m-dimensional complex vector space domain of a mapping set of features Krein spaces Hilbert space indicator or characteristic function fundamental symmetry operator in Krein spaces set of rational numbers set of real numbers set of real positive numbers
R+ u (0) rn-dimensional real vector space m-dimensional spherical space, m+l 2 2 = {xERm+l: 2 , - 7- } transformation group subsets. subspaces or random variables set of integers
s,m
cz=l
xiii
Notation and basic terminology
Sets and pretopology A, D, . . . , Z
sets
( A i , A z , .. . , A n } cardinality of A generalized interior of A generalized closure of A set union of A and B set intersection of A and B set difference of A and B set symmetric difference, AAB = (A\B) U (B\A) Cartesian product,, A x B = { ( a ,b ) : ~ E AA b E B } power set, a collection of all subsets of X neighborhood system neighborhood basis neighborhood, pretopological or topological space defined by the neighborhoods n/ (pre)topological space defined by a neighborhood basis neighborhood, pretopological or topological space defined by the generalized closure algebraic dual space of a vector space X continuous dual space of a topological space X generalized metric space with a dissimilarity p metric space with a metric distance d &-ballin a generalized metric space (X, p ) , B E ( Z )= { y E X : p ( y 9 z )< F }
collection A of subsets of the set s2 satisfying: (1) Q E A , (2) A E A=+ (R\A)cA, (3) ( V k A k E A A A = U g I A k ) * A E A p : A 4 RO,is a measure on a a-algebra A if p ( @ )= 0 and p is additive, i.e. 1-1 Ak) = C kp ( A k ) for pairwise disjoint sets Ak measurable space; R is a set and A is a a-algebra measure space; /I, is a measure probability space normal dist,ribution with the mean vector p and the covariance matrix C probability of an event A E R
(uk
P(A)
The dissimilarity representation f o r p a t t e r n recognition
conditional probability of A given that B is observed likelihood function is a function of 6’ with a fixed A, such that L(0lA) = cP(AIB = 6’) for c>O expected value (mean) of a random variable X defined over ( 0 ,A, P ) ;E [ X ]= xdP variance of a random variable X , V ( X ) = E[(X-E[X])Z] standard deviation of a random variable X ,
s,
4x1 = d r n
k-th central moment of a random variable X ,
“-wl)kl
Pk(X) = cumulative distribution function probability deriisty function
Mappings and functions &:X+Y
4 is a mapping
(function) from
X
to
Y;
X is the domain of q5 and Y is the codomain of q5 range of 4 : X + Y ,R$= { ~ E YjZrx : y = 4(x)} 407 injection
surjectioii bijection homomorphism eridomorphism isomorphism automorphisrri monomorphism linear form functiorial irri ( 4 ) ker ( 4 ) concave function corivex function logistic function logarithmic function
composition of mappings 4 : X ---t Y such that (21# Z Z ) + (4(21)#4(xz)) holds for all ~ 1 ~ ExX2 ; 724 # Y 4 : X + Y ,X onto Y , such that 724 = Y injection which is also a surjection linear mapping froni one vector space to another linear mapping from a vector space to itself homomorphism which is a bijection endoniorphism which is an isomorphism homomorphism which is an injection homomorphism from a vector space X to the field r linear form iniage of a homomorphism 4 : X + Y ,724 kernel of a homomorphism 4: X + Y ker(q5) = { z E X :q5(x) = 0} f ( CY 2 1 (1- C Y ) Z Z ) 2 ~ f ( 2 2 ) (1- ~ ) f ( 2 2 holds ) for all 2 1 , xz E Df and all a: E [0,1] f is convex iff -f is concave f ( x ) = 1/(1+ e x p ( - c z ) ) f ( x ) = log(z); here log denotes a natural logarithm
+
+
Notation and basic terminology
sigmoid function gamma function
xv
f(x) = 2 / ( l + exp(-z/a)) - 1 st-ie-zds,t > o
r ( t )=
Jr
Vectors and vector spaces
u,v,x,y, z Z = X x Y Z=X$I'
Z=X@Y
V. W , X ,
Y,Z
{xt1I=1 0 1 ei
XT X+ X'Y
xi Y X* X' C(X7
L ( X ,y > CJX, r) LJX, Y )
vector spaces Cartesian product of vector spaces direct sum of vector spaces; each z E Z can be uniquely decomposed into z E X and y E Y such that z = .r y and X n Y = (0) tensor product of vector spaces; for any vector space U and any bilinear map F : X x Y + U , there exists a bilinear map H : Z 4 U such that F ( x ,y) = H ( x @ y) for all z E X and y E Y vectors in finite-dimensional or separable vector spaces {Xi x2 7 . . . , x n 1 column vector of zeros column vector of ones standard basis vector, e , = 1 and e3 = 0 for j # 1: transpose of a real vector conjugate transpose of a complex vector inner product of vectors in R'" inner product of vectors in Cm algebraic dual space of a vector space X cont,inuous dual space of a topological space X space of linear functiorials from X onto the field I?, equivalent to algebraic dual X * space of linear operators from X onto Y space of continuous linear functionals from X onto the field r, equivalent to continuous dual X ' space of continuous linear operators from X onto Y
+
1
Inner product spaces and normed spaces
L:
closed and bounded set, R c R" set of all functions on R set of all continuous functions on R set of function classes, Lebesgue measurable: on f 2 L; = { f E C ( R ) : (J, I f ( z ) l P dz)i/p< oo}, p 2 1
LF
LpM
0
an) C(R) M(R)
=
{ f E M ( R ) : (J, I f ( z ) l P p ( d ~ ) ) ~ < / Pm}, p 2 1
The dzssimilarity representatzon f o r p a t t e r n recognition
inner product norm &norm of X E I W ~ lIxIJP , = (Czl)x,)*)~/P, p 21 !,-norm of f E L F ; llf /I” = (J, If ()”.I dz)l’p, P L 1 space X equipped with the inner product (., .) space X equipped with the norm I / . I / space X equipped with the dissimilarity p orthogonal complement to X Hilbert space reproducing kernel Hilbert space with the kernel K Banach space (Rm,/ I . lip), p 2 1 Banach space (Rm, / I . l i p ) . p >. 1 Indefinite inner product spaces Hilbert spaces (Ic+. (., .)) and (Ic-, -(., .)) Krein space, Ic = Ic+ @ K - and Ic- = Ic; Hilbert space associated with a Krein space K IIcl = Ic+ @ IIc-1, where Ic- = K i and 1K-l = (L, (.. .)) pseudo-Euclidean space with the signature ( p ,q ) inner product in a K r e h space K inner product in a pseudo-Euclidean space E reproducing kernel Krein space with the kernel K fundamental projections identity operator in a Krein space; I = P+ + Pfundamental symmetry in a K r e h space; J = P+ - Pfundamental symmetry in Iw(P14) H-scalar product, [z, y] = (3%. y ) ~ H-norm, IlxlirL = [I.+ Operators in inner product spaces and normed spaces
( u t J ) matrix or an operator A with the elements atJ 2-th row of a matrix A j - t h column of a matrix A a I,A determinant of a matrix A det(A) A*B Hadaniard product, A * B = (aLII btJ) A*” Hadaniard power, A*P = ( a f J ) * .B Hadaniard power, a*B = ( n b 7 2 )where , UER AT transpose of a real matrix A
A
=
02. , A ,
Notation and basic terminology
At AX A hermitian A symmetric
A orthogonal A unitary A cnd
A cpd
A nd A nsd A Pd A psd
xvii
conjugate transpose of a complex niatrix A adjoint A in a Hilbert space; A X = AT or A X = At A = At A = AT A A T = I and ATA = I A A t = I and AtA = I A = At is conditionally negative definite if x t A x 5 0 and x t 1 = O for x # 0 A = At is conditionally positive definite if x t A x 2 0 and x t 1 = o for x # o A = At is negative definite if x t A x < 0 for x # 0 A = At is negative sernidefiriite if x t A x 5 0 for x # 0 A = At is positive definite if x t A x > 0 for x # 0 A = At is positive semidefinite if x t A x 2 0 for x # 0
Operators in indefinite inner product spaces
A*
A A A A A
J-self-adjoint 3-isomctric J-coisometric J-symmetric J-unitary
space of continuous linear functionals from a Kreiri space K into thc field I? space of continuous linear operators from a Krein space Ic into a Krein space G adjoint of an operator A, A t C(K,G ) is such that (A f ,g)B = ( f .A * ~ ) holds K for all f E K a11d 9 E
A = A* A E G ( K , G ) is isometric if A*A = Ic A E C(K, G ) is coisometric if AA* = 1, (4, g)lc = ( f >A S ) K for all f , g IC ( A f , A!dK = ( f , S ) K for all f , g K
Dissimilarities d
D D*2 D ( T ,R ) S d2
D E ,D2 dP DP dmax
dissimilarity measure dissimilarity matrix, D = ( d t J )
D*2 = ( d ; J ) dissimilarity representation similarity matrix, S = (sz7) Euclidean distance Euclidean distance matrix .$-distance &distance matrix &-distance
xviii
T h e dissimilarity representation for p a t t e r n recognition
&distance matrix Hausdorff distance modified-Hausdorff distance square Mahalanobis distance Levenhstein distance, normalized Levenhstein distance Kullback-Leibner divergence J-coefficient information radius divergence Bhat t acharayya distance Chernoff distance Hellinger coefficient Tversky dissimilarity and Tversky similarity cut semimetric based on the set V Graphs and geometry
cut on X G = (V,E) ad,jacent nodcs linear hull cone convex hull hyperprism
hypercylinder
partition of a set X into V arid X\V graph with a set of nodes V and a set of edges E = { ( u , w )u: , u E V } two nodes in a graph joined by an edge huiir(x)={C;=l p t z Z :X , E V A ~ , E I ' } , ~ C R { X : huIlR+( X ) = X} Pt J,: 2%EV, Pt 2 0 A Pt = I} figure generated by a flat region in Rm, moving parallel to itself along a straight line hyperprisni for which the flat region in Rm-' is a hypersphere
c;=1
{c:=l
rn
hypersphere hyperplane
{ x E R ~llx:ll; = R2}with the volume V = 2KkZ mr(F) and the area A = 2Rm-1T' r(?) m-dimensional hyperplane, {xERm+l:
parallelotope
polyhedron polyhedral cone polytope simplex
c : ; '
w,x,= wo}
collection of points in Rm bounded by m pairs of ( m- l)-dimensional hyperplanes (a generalization of a parallelogram) { x EA~ xP 5 b, : AER""" A bER"} {xER": A x 5 0, AEIW"~"} collection of points bounded by m-dimensional hyperplanes (a generalization of a triangle in 2D) polytope; a collection of points in Rm enclosed by (m+1) (m- 1)-dimensional hyperplanes
Abbreviations
’
iff cnd CPd nd nsd Pd Pdf psd k-CDD k-NN
NN k-NNDD AL CCA CH CL CNN
cs
CPS DS GNMC GMDD LLE LogC LP LPDD LSS MAP MDS ML MST NLC NMC NQC NN PCA
if and only if
conditionally negative definite conditionally positive definite negative definite negative semidefinite positive definite probability density function positive semidefinite &Centers Data Description &Nearest Neighbor rule Nearest Neighbors k-Nearest Neighbor Data Description Average Linkage Curvilinear Component Analysis Compactness Hypothesis Complete Linkage Condensed Nearest Neighbor Classical Scaling Classifier Projection Space Dissimilarity Space Generalized Nearest Mean Classifier Generalized Mean Data Description Locally Linear Embedding Logistic regression linear Classifier Linear Programming Linear Programming Dissimilarity data Description Least Square Scaling Maximum A Posteriori Multidimensional Scaling Maximum Likelihood Minimum Spanning Tree Normal density based Linear Classifier Nearest Mean Classifier Normal density based Quadratic Classifier Nearest Neighbor rule Principal Component Analysis
xix
xx
RKHS RKKS RNLC RNQC QC QP SL SOM SRQC
sv
SVM SVDD
so
WNMC
The dissamilarity representation for pattern recognition,
Reproducing Kernel Hilbert Space Reproducing Kernel Krein Space Reqularized Normal density based Linear Classifier Reqularized Normal density based Quadratic Classifier Quadratic Classifier Quadratic Programming Single Linkage Self-organizing Map Strongly Reqularized Quadratic Classifier Support Vector Support Vector Machine Support Vector Data Description Support Object Weighted Nearest Mean Classifier
Contents
Preface
vii
Notation and basic terminology
xi
A bbreuintions
xix
1. Introduction
1
1.1 Recognizing the pattern . . . . . . . . . . . . . . . . . . . . 1.2 Dissimilarities for representation . . . . . . . . . . . . . . . 1.3 Learning from examples . . . . . . . . . . . . . . . . . . . . 1.4 Motivation of the use of dissinlilarity representations . . . . 1.5 Relation to kernels . . . . . . . . . . . . . . . . . . . . . . . 1.6 Outline of the book . . . . . . . . . . . . . . . . . . . . . . . 1.7 In summary . . . . . . . . . . . . . . . . . . . . . . . . . . .
2 . Spaces
1 2 4 8 13 14 16 23
2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 A brief look a t spaces . . . . . . . . . . . . . . . . . . . . . 2.3 Generalized topological spaces . . . . . . . . . . . . . . . . . 2.4 Generalized metric spaces . . . . . . . . . . . . . . . . . . . 2.5 Vector spaces . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6 Normed and inner product spaces . . . . . . . . . . . . . . . 2.6.1 Reproducing kernel Hilbert spaces . . . . . . . . . . 2.7 Indefinite inner product spaces . . . . . . . . . . . . . . . . 2.7.1 Reproducing kernel Krein spaces . . . . . . . . . . . 2.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3 . Characterization of dissimilarities 3.1 Embeddings, tree models and transformations . . . . . . . . xxi
25 28 32 46 56 62 69 71
85 87 89
90
The dissimilarsty representation for p a t t e r n recognition
xxii
3.2 3.3
3.4
3.5
3.6
3.7
3.1.1 Embeddings . . . . . . . . . . . . . . . . . . . . . . . 3.1.2 Distorted metric embeddings . . . . . . . . . . . . . Tree models for dissimilarities . . . . . . . . . . . . . . . . . Useful transformations . . . . . . . . . . . . . . . . . . . . . 3.3.1 Transformations in sernimetric spaces . . . . . . . . . 3.3.2 Direct product spaces . . . . . . . . . . . . . . . . . . 3.3.3 Invariance and robustness . . . . . . . . . . . . . . . Properties of dissiniilarity matrices . . . . . . . . . . . . . . 3.4.1 Dissimilarity matriccs . . . . . . . . . . . . . . . . . 3.4.2 Square distances and inner products . . . . . . . . . Linear embeddings of dissimilarities . . . . . . . . . . . . . 3.5.1 Euclidean embedding . . . . . . . . . . . . . . . . . . 3.5.2 Correction of non-Euclidean dissimilarities . . . . . . 3.5.3 Pseudo-Euclidean embedding . . . . . . . . . . . . . 3.5.4 Generalized average variance . . . . . . . . . . . . . . 3.5.5 Projecting new vectors to a n embedded space . . . . 3.5.6 Reduction of dimension . . . . . . . . . . . . . . . . 3.5.7 Reduction of complexity . . . . . . . . . . . . . . . . 3.5.8 A general embedding . . . . . . . . . . . . . . . . . . 3.5.9 Spherical enibeddings . . . . . . . . . . . . . . . . . . Spatial representation of dissimilarities . . . . . . . . . . . . 3.6.1 FastMap . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.2 Multidiniensional scaling . . . . . . . . . . . . . . . . 3.6.3 Reduction of complexity . . . . . . . . . . . . . . . . Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4 . Learning approaches 4.1 Traditional learning . . . . . . . . . . . . . . . . . . . . . . 4.1.1 Data bias arid model bias . . . . . . . . . . . . . . . 4.1.2 Statistical learning . . . . . . . . . . . . . . . . . . . 4.1.3 Inductive principles . . . . . . . . . . . . . . . . . . . 4.1.3.1 Empirical risk minimization (ERM) . . . . . 4.1.3.2 Principles based on Occam’s razor . . . . . . 4.1.4 Why is the statistical approach not good enough for learning from objects? . . . . . . . . . . . . . . . . . 4.2 The role of dissimilarity representations . . . . . . . . . . . 4.2.1 Learned proximity representations . . . . . . . . . . 4.2.2 Dissimilarity representations: learning . . . . . . . . 4.3 Classification in generalized topological spaces . . . . . . . .
90 95 95 99 99 102 103 105 105 116 118 118 120 122 124 125 127 128 129 130 132 133 135 143 144 147 148 148 151 154 156 160 163 166 171 172 175
Contents
xxiii
4.4 Classification in dissimilarity spaces . . . . . . . . . . . . . 4.4.1 Characterization of dissimilarity spaces . . . . . . . . 4.4.2 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . 4.5 Classification in pseudo-Euclidean spaces . . . . . . . . . . 4.6 On generalized kernels and dissimilarity spaces . . . . . . . 4.6.1 Connection between dissimilarity spaces and psendoEuclidean spaces . . . . . . . . . . . . . . . . . . . . 4.7 Disciission . . . . . . . . . . . . . . . . . . . . . . . . . . . .
209 211 215
5. Dissimilarity measures 5.1 Measures depending on feature types . . . 5.2 Measures between populations . . . . . . 5.2.1 Normal distributions . . . . . . . . 5.2.2 Divergence measures . . . . . . . . 5.2.3 Discrete probability distributions . 5.3 Dissimilarity measures between sequences 5.4 Information-theorctic measures . . . . . . 5.5 Dissimilarity measures between sets . . . 5.6 Dissimilarity measures in applications . . 5.6.1 Invariance and robustness . . . . . 5.6.2 Example nieasures . . . . . . . . . 5.7 Discussion and conclusions . . . . . . . . .
180 180 185 196 205
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
6 . Visualization
216 228 228 229 233 234 237 238 242 242 242 250 255
6.1 WIultidimensional scaling . . . . . . . . . . . . . . . . . . . . 257 259 6.1.1 First examples . . . . . . . . . . . . . . . . . . . . . . 6.1.2 Linear and nonlinear methods: cxamples . . . . . . . 261 267 6.1.3 Implemeritation . . . . . . . . . . . . . . . . . . . . . 6.2 Other mappings . . . . . . . . . . . . . . . . . . . . . . . . . 268 6.3 Examples: getting insight into the data . . . . . . . . . . . 274 6.4 Tree models . . . . . . . . . . . . . . . . . . . . . . . . . . . 281 6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287 7 . Flirther da.ta exploration
7.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.1.1 Standard approaches . . . . . . . . . . . . . . . . . . 7.1.2 Clustering on dissimilarity representations . . . . . . 7.1.3 Clustering examples for dissimilarity representations
289 290 290 295 303
xxiv
T h e dissimilarity representation for pattern recognition
7.2 Intrinsic dimension . . . . . . . . . . . . . . . . . . . . . . . 309 7.3 Sampling density . . . . . . . . . . . . . . . . . . . . . . . . 319 7.3.1 Proposed criteria . . . . . . . . . . . . . . . . . . . . 320 7.3.2 Experiments with the NIST digits . . . . . . . . . . . 325 7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331 8 . One-class classifiers 8.1 General issues . . . . . . . . . . . . . . . . . . . . . . . . . . 8.1.1 Construction of one-class classifiers . . . . . . . . . . 8.1.2 Onc-class classifiers in feature spaces . . . . . . . . . 8.2 Domain descriptors for dissimilarity representations . . . . 8.2.1 Neighborhood-based OCCs . . . . . . . . . . . . . . 8.2.2 Generalized mean class descriptor . . . . . . . . . . . 8.2.3 Linear programming dissimilarity data description . 8.2.4 More issues on class descriptors . . . . . . . . . . . . 8.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.3.1 Experiment I: Condition monitoring . . . . . . . . . 8.3.2 Experiment 11: Diseased mucosa in the oral cavity . . 8.3.3 Experiment 111: Heart disease data . . . . . . . . . . 8.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 . Classification 9.1 Proof of principle . . . . . . . . . . . . . . . . . . . . . . . . 9.1.1 NN rule vs alternative dissimilarity-based classifiers . 9.1.2 Experiment I: square dissimilarity representations . . 9.1.3 Experiment 11: the dissiniilarity space approach . . . 9.1.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 9.2 Selection of t.he representation set: the dissimilarity space approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.2.1 Prototype selection methods . . . . . . . . . . . . . . 9.2.2 Experimental setup . . . . . . . . . . . . . . . . . . . 9.2.3 Results and discussion . . . . . . . . . . . . . . . . . 9.2.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . 9.3 Selection of the representation set: the embedding approach 9.3.1 Prototype selection methods . . . . . . . . . . . . . . 9.3.2 Experiments and results . . . . . . . . . . . . . . . . 9.3.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . .
333 336 337 341 346 348 350 353 359 366 366 374 377 379 383 384 384 388 389 395 396 398 401 404 416 417 418 421 428
Contents
XXV
9.4 On corrections of dissimilarity measures . . . . . . . . . 9.4.1 Going more Euclidean . . . . . . . . . . . . . . . 9.4.2 Experimental setup . . . . . . . . . . . . . . . . 9.4.3 R.esults and conclusions . . . . . . . . . . . . . . 9.5 A few remarks on a simulated missing value problem . . 9.6 Existence of zero-error dissimilarity-based classifiers . . 9.6.1 Asymptotic separability of classes . . . . . . . . 9.7 Final discussion . . . . . . . . . . . . . . . . . . . . . . .
. . . . . .
428 429 430 432 439 443 . 444 451
10. Combining
453
10.1 Combining for one-class classification . . . . . . . . . . . 10.1.1 Combining strategies . . . . . . . . . . . . . . . . 10.1.2 Data and experimental setup . . . . . . . . . . . 10.1.3 Results and discussion . . . . . . . . . . . . . . . 10.1.4 Summary and conclusions . . . . . . . . . . . . . 10.2 Combining for standard two-class c1assificat)ion . . . . . . 10.2.1 Combining strategies . . . . . . . . . . . . . . . . 10.2.2 Experiments on the handwritten digit set . . . . 10.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . 10.2.4 Conclusions . . . . . . . . . . . . . . . . . . . . . 10.3 Classifier projection space . . . . . . . . . . . . . . . . . . 10.3.1 Construction and the use of CPS . . . . . . . . . 10.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 11. Representation review and recommendations
11.1 Representation review . . . . . . . . . . . 11.1.1 Three generalization ways . . . 11.1.2 Representation formation . . . . 11.1.3 Generalization capabilities . . . 11.2 Practical considerations . . . . . . . . . 11.2.1 Clustering . . . . . . . . . . . . . 11.2.2 One-class classification . . . . . 11.2.3 Classification . . . . . . . . . . .
12. Conclusions and open problems
455 456 459 462 465 466 466 468 470 473 474 475 483 485
. . . . . . . . . 485 . . . . . . . . . . 486 . . . . . . . . . . 489 . . . . . . . . . . 492 . . . . . . . . . . 493 . . . . . . . . . 495 . . . . . . . . . . 496 . . . . . . . . . 497 503
12.1 Summary and contributions . . . . . . . . . . . . . . . . . 505 12.2 Extensions of dissimilarity representations . . . . . . . . . 508 12.3 Open questions . . . . . . . . . . . . . . . . . . . . . . . . 510
The disszrnilarity representation f o r pattern. recognition
xxvi
Appendix A
515
On convex arid concave functions
Appendix B Linear algebra in vector spaces
519
B . l Some facts on matrices in a Euclidean space . . . . . . . . . 519 B.2 Some facts on matrices in a pseudo-Euclidean space . . . . 523
Appendix C
Measure and probability
527
Appendix D
Statistical sidelines
533
D.l D.2 D.3 D.4
Likelihood arid parameter estimation . . . . . . . . . . . . Expectation-maximization (EM) algorithm . . . . . . . . Model selection . . . . . . . . . . . . . . . . . . . . . . . . . PCA and probabilistic models . . . . . . . . . . . . . . . . D.4.1 Gaussian model . . . . . . . . . . . . . . . . . . . . . D.4.2 A Gaussian mixture model . . . . . . . . . . . . . . D.4.3 PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . D.4.4 Probabilistic PCA . . . . . . . . . . . . . . . . . . D.4.5 A mixture of probabilistic PCA . . . . . . . . . . .
Appendix E
Data sets
E . l Artificial data sets . . . . . . . . . . . . . . . . . . . . . . . E.2 Real-world data sets . . . . . . . . . . . . . . . . . . . . . .
. 533 . 535 .
. . .
536 538 538 539 541 542 543 545 545 549
Bihliqqraphy
561
Index
599
Chapter 1
Introduction
T h u s all h u m a n knowledge begin,s with, intuitions, then goes t o concepts, and is completed in ideas. IMVIMANUEL KANT
1.1
Recognizing the pattern
We recognize many patterns’ while observing the world. Even in a country never visited before, we recognize buildings, streets, trees, flowers or animals. There are pattern characteristics, learned before, that can be applied in a new environment. Sometimes, we encounter a place with objects that are alien t o us, e.g. a garden with an unknown flower species or a market place with strange types of fish. How do we learn these patterns so that this place will look more familiar on our next visit‘? If we take the time, we are able to learn some patterns by ourselves. If somebody shows us around, points out and explains what is what, we may learn faster and group the observations according to the underlying concepts. What is the first step in this categorization process? Which principle is used in the observations to constitut,e the first grouping? Are these descriptive features like color, shape or weight? Or is it our basic perception that some objects are somehow different and others are similar? The ability to observe the differences and the similarities between objects seems to be very basic. Discriminating features can be found once we are familiar with Similarities. This book is written from the perspective that the most primary observation we can make when studying a group of objects or phenomena is that some are dissimilar and others are similar. From this starting point, we aim to define a theory for learning and recognizing patterns by automatic means: sensors and computers that try to imitate the human ability We will use the word ‘pattern’ exclusively to refer to quantitativelqualitative characteristics between objects. In the literature, however, ‘pattern’ is also used to refer to a single object for which such characteristics are studied. We will avoid this usage here.
The disszmzlarity representation for pattern recognition
2
A
B
C
D
E Figure 1.1
F
G
H
I
J
K
Fish contours.
of pattern recognition. We will develop a framework in which the initial
representation of objects is based on dissimilarities, assuming that a human expert can make explicit how to measure them from sensor data. We will develop techniques to generalize from dissimilarity-based representations of sets of objects to the concepts of the groups or classes that can be distinguished. This is in contrast to the traditional paradigm in automatic pattern recognition that starts from a set of numerical features. As stated above, features are defined after dissimilarities have been observed. In a featurebased approach, more hunian expertise may be included. Consequently, if this is done properly. a feature description should be preferred. If, however, this expertise is not available, then dissimilarities have to be preferred over arbitrarily selected features. There are already many applied studies in this area based on dissimilarities. They lack a foundation and, consequently, consistent ways for building a generalization. This book will contribute to these two. In the first part, Chapters 2 to 5, concepts and theory are developed for dissimilarity-based pattern recognition. In the second part, Chapters 6 to 10. they are used for analyzing dissimilarity data and for finding and classifying patterns. In this chapter, we will first introduce our concepts in an intuitive way.
1.2
Dissimilarities for representation
Human perception and inference skills allow us to recognize the common characteristics of a collection of objects. It is, however, difficult to formalize such observations. Imagine, for instance, the set of fish shape contours [Fish contours, site] as presented in Fig. 1.1. Is it possible to define a simple rule that divides them into two or three groups? If we look at the contours. wc firid that some of the fish are rather long without characteristic fins (shape C, H and I), whereas others have distinctive tails as well as fins, say a group
3
htroduction
A (a) Fish shapes
H
(b) Area difference
9
(c) By covers
(d) Between skeletons
-i B/
Figure 1.2 Various dissimilarity measures can be constructed for matching two fish shapes. (b) Area difference: the area of non-overlapping parts is computed. To avoid scale dependency, the measured difference can be expressed relative to the sum of the areas of the shapes. (c) Measure by covers: one shape is covered by identical balls (such that the ball centers belong to it), taking care that the other shape is covered as well. The shapes are exchanged and the radius of the minimal ball is the sought distance. 111 both cases above, B is covered such that either A or H are also covered. (d) Mcasure between skeletons: two shape skeletons are compared by summing up the differences between corresponding parts, weighting missing correspondences more heavily.
of fin-type fish. Judging shapes F and K in the context of all fish shapes presented here, they could be found similar to other fin-type fish: A, B, D, E, G and J. By visual inspection, they do riot really appear to be alike. as they seem to be thinner and somewhat larger. If the exairiples of C, H and I had been absent, the differences between F and K and other fin-type fish would havc been more pronounced. Furthermore. shape A could be considered similar to F and K , but also different due to the position arid shape of its tail and fins. This simple example shows that without any extra knowledge or a clear context, one cannot claim that the identification of two groups is better than the identification of three groups. This decision relies on a free interpretation of what makes objects similar to be considered as a group. For the purpose of automatic grouping or identification, it is difficult to determine proper features, i.e. mathematically encoded particular properties of the shapes that would precisely discriminate between different fish
4
The dissimilarity representation for pattern recognition
and at the same time emphasize the similarity between resembling examples. An alternative is to compare the shapes by matching them as well as possible and determining the remaining differences. Such a match is found with respect to a specified measure of dissimilarity. This measure should take on small values for objects that are alike and large values for distinct objects. There are many ways of comparing two objects, and hence there are many dissimilarity measures. In general, the suitability of a measure depends on the problem at hand and should rely on additional knowledge one has about this particular problem. Example measures are presented in Fig. 1.2, where two fish shapes are compared. Here, the dissimilarity between two similar fish, A and B, is much smaller than between two different fish, B and H. Which to choose depends on expert knowledge or problem characteristics. If there is no clear preference for one measure over the other, a number of measures can be studied and combined. This may be beneficial, especially when different measures focus on different aspects of patterns.
1.3
Learning from examples
The question how to extract essential knowledge and represent it in a formal way such that a machine can ‘learn’ a concept of a class, identify objects or discriminate between them, has intrigued and provoked many researchers. The growing interest inherently led to the establishment of the areas of pattern recognition, machine learning and artificial intelligence. Researchers in these disciplines try to find ways to mimic the human capacity of using knowledge in an intelligent way. In particular, they try t o provide mathematical foundations and develop models and methods that automate the process of recognition by learning from a set of examples. This attempt is irispircd by the human ability to recognize for example what a tree is, given just a few examples of trees. The idea is that a few examples of objects (and possible relations between them) might be sufficient for extracting suitable knowledge to characterize their class. After years of research, some practical problems can now be successfully treated in industrial processing tasks such as automatic recognition of damaged products on a conveyor belt, or to speed up data-handling procedures. or the automatic person identification by fingerprints. The algorithms developed so far are very task specific and, in general, they are
Introduction
5
still far from reaching the human recognition performance. Although the models designed are becoming more and more complex, it seems that to take them a step further, one will need to analyze t,heir basic underlying assumptions. An understanding of the recognition process is needed; not only the learning approaches (inductive or deductive principles) must be understood, but mainly the basic notions of class, measurenient, process and the representation of objects derived from these. The formalized representation of objects (usually in mathematical terms) and the definition of classes determine how the act of learning should be modeled. While many researchers are concerned wit,h various algorithmic procedures, we would like to focus on the issue of representation. This work is devoted t o part,ic.iilar representations, na.mely dissimilarity representations. Below and in the subsequent sections, we will give some insight into the nature of basic problenis in pattern recognition and machine learning and motivate the use of dissimilarity representations. While dealing with entities to be compared, we will always refer to them as to object,s, elements or instances, regardless of whether they are real or abstract. For instance, images, textures and shapes are called objects in the same way as apples and chairs. An appropriate representation of objects is based on data. These are usually obtained by a measurement device and encoded in a numerical way or given by a set of observations or dependencies; presented in a structural form, e.g. a relational graph. It is assumed that objects can, in general, be grouped together. Our aim then is to identify a number of groups (clusters) whose existence supports an understanding of not only the data, but also the problem itself. Such a process is often used to order information and to find suitable or efficient descriptions of the data. The challenge of automatic object recognition is to develop computer methods which learn to identify whether an object belongs to a specific class or learn to distinguish between a number of classes. Typically, the system is first presented with a set of labeled objects, the training set,, in some convenient representation. Learning consists of finding the class descriptions such that t,he system can correct,ly classify novel examples. In practice, the entire system is trained such that the given examples are (mostly) assigned to the correct class. The underlying assumption is that the training examples are representative and sufficient for the problem at hand. This implies that the system can extrapolate well to previously unseen examples, that is, it can generalize well. There are two principal directions in pattern recognition, statistical arid
6
T h e dissimilarity representation f o r pattern recognition
Table 1.1 Basic differences between statistical and structural Pattern Recognition [Nadler and Smith, 19931. Distances are a common factor used for discrimination in both approaches. __ Properties
Statistical
Structural
Foundation
Well-developed mathematical theory of vector spaces Quantitative Numerical features: vectors of a fixed length Element position in a vector Easily encoded Vector-based methods Metric, often Euclidean Relies on distances or inner products in a vcctor space Due to improper features and probabilistic models
Intuitively appealing: human cognition or perception Qualitative: structural/syntactic Morphological primitives of a variable size Encoding process of primitives Needs regular structures Graphs, decisions trees, grammars Defined in a matching process Grammars recognize valid objects; distances often used Due to improper primitives leading to ambiguity in the description
Approach Descriptors Syntax Noise Learning Dissimilarity Discrimination Class overlap
structural (or syntactic) pattern recognition [Jain et al., 2000; Nadler and Smith, 1993; Bunke and Sanfeliu, 19901. The basic differences are summarizcd in Table 1.1. Both approaches use features to describe objects, but these features are defined difEerently. In general, features are functions of (possibly preprocessed) measurements performed on objects, e.g. particular groups of bits in a binary image summarizing it in a discriminative way. The statistical, decision-theoretical approach is (usually) metric and quantitative, while the structural approach is qualitative [Bunke and Sanfeliu, 1990; Nadler and Smith, 19931. This means that in the statistical approach, features are encoded as purely numerical variables. Together, they constitute a feature vector space, usually Euclidean, in which each object is represented as a point2 of feature values. Learning is then inherently restricted to the rria,thexnatical methods that one can apply in a vector space, equipped with additional algebraic structures of an inner product, norm and the distance. In contrast, the structural approach tries to describe the structure of objects that intuitively reflects the human perception [Edelman et al., 1998; Edelnmn, 19991. The features become primitives (subpatterns), fundamental structural elements, like strokes, corners or other morphological elements. 21n this book, the words ‘points’ and ’vectors’ are used interchangeably. In the rigorous mathematical sense, points and vectors are not the same, as points are defined by fixed sets of coordinates in a vector space, while vectors are defined by differences between points. In statistical pattern recognition, objects are represented as points in a vector space, but for the sake of convenience, they are also treated as vectors, as only then they define the operations of vector addition and multiplication by a scalar.
Introduction
7
Characterization I Decision
I
t Generalization/ inference
I
Representation
Segmentation
1
..
..__
i
Measurements
...*
~ - -
I
t
Objects
Figure 1.3 Components of a general pattern recognition system. A representation is either a numerical description of objects and/or their relations (statistical pattern recognition) or their syntactical encoding by set of primitives together with a set of operations on objects (structural pattern recognition). Adaptation relies on a suitable change (simplification or enrichment) of a representation, e.g. by a reduction of the number of features, relations or primitives describing objects, or some nonlinear transformation of the features, to enhance the class or cluster descriptions. Generalization is a process of determining a statistical function which finds clusters, builds a class descriptor or constructs a classifier (decision function). Inference describes the process of a syntax analysis, resulting in a (stochastic) grammar. Characterization reflects the final decision (class label) or the data description (determined clusters). Arrows illustrate that a building of the complete system may not be sequential.
Next, the primitives are encoded as syntactic units from which objects are constructed. As a result, objects are represented by a set of primitives with specified syntactic opcrations. For instance, if the operation of concatenation is used, objects are described by strings of (concatenated) primitives. The strength of the statistical approach relies on well-developed concepts and learning techniques, while in the structural approach, it is much casier to encode existing knowledge on the objects. A general description of a pattern recognition system is illiistratcd in Fig. 1.3; see also [Duin et al., 20021 for a more elaborate discussion and [Nadler and Smith, 19931 for an engineering approach. The description starts from a set of measurements performed on a set of objects. These measurements may be subjected to various operations in order to extract the
8
T h e dissimdarity representation f o r p a t t e r n recognition
essential information ( e g . to segment an object from the image background arid identify a number of characteristic subpatterns), leading to some nunierical or structural representation. Such a representation has evolved from an initial description, derived from the original measurements. Usually, it is not directly the most appropriate one for realizing the task, such as identification or classification. It may be adapted by suitable transformations, e.g. a (nonlinear) rescaling of numerical features or an extension and redefinition of primitives. Then, in the generalization/inference stage, a classifier/identifier is trained, or a grammar3 is determined. These processes should include a careful treatment of unbalanced classes, non-representative data, handling of missing values, a rejection option, combining of inforrnatiori and combiriing of classifiers and a final evaluation. In the last stage, a class is assigned or the data. are characterized ( e g . in terms of clusters and their relations). The design of a complete pattern recognition system may require repetition of some stages to find a satisfactory trade-off between the final recognition accuracy or data description and the computational and storage resources required. Although this research is grounded in statistical pattern recognition, we recognize the necessity of combining numerical and structural information. Dissimilarity measures as the common factor used for discrimination, Table 1.1, seems to be the natural bridge between these two types of information. The integration is realized by a representation. A general discussion on the issue of representation can be found in [Duin et al., 2004al.
1.4
Motivation of the use of dissimilarity representations
The notion of similarity plays a pivotal role in class formation, since it might he seen as a natural link between observations on objects on the one hand arid a judgment on their shared properties on the other. In essence, similar objects can be grouped together to form a class, and consequently u class is a set of sim,ilar objects. However, there is no such thing as a general object similarity that car1 be universally measured or applied. A comparison of two objects is always with respect to a frame of reference, i.e. a particular point of view, a context, basic characteristics, a type of domain, or attributes considered (see also Fig. 1.1). This means that background information, or 3Primitives are interpreted as syntactic units or symbols. A grammar is a set of rules of syntax that enables the generation of sentences (structures) from the given symbols (units).
[Figure 1.4 flowchart: measurements or intermediate representation lead either to a feature-based representation (define a set of features; represent objects as points in a feature vector space; impose the geometry, e.g. of the Euclidean distance between the points) or to a dissimilarity-based representation (define a dissimilarity measure; interpret the dissimilarities in a suitable space to reflect the distance geometry).]

Figure 1.4 The difference with respect to the geometry between the traditional feature-based (absolute) representations and dissimilarity-based (relative) representations.
the existence of other classes, will influence the way objects are compared. For instance, two brothers may not appear to resemble each other. However, they may appear much more alike if compared in the presence of their parents. The degree of similarity between two objects should be determined relative to a given context or a procedure. Any measurement of similarity of objects will be based on certain assumptions concerning the properties of their relation. Such assumptions come from some model. Similarity can be modeled by a measure of similarity or dissimilarity. These are intimately connected; a small dissimilarity and a large similarity both imply a close resemblance of objects. There exist ways of changing a similarity value into a dissimilarity value and vice versa, but the interpretation of the measure might be affected. In this work, we mostly concentrate on dissimilarities, which, by their construction, focus on the class and object differences. The choice of dissimilarities is supported by the fact that they can be interpreted as distances in suitable vector spaces, and in many cases, they may be more intuitively appealing. In statistical pattern recognition, objects are usually encoded by feature values. A feature is a conjunction of measured values for a particular attribute. For instance, if weight is an attribute for the class of apples, then a feature consists of the measured weights for a number of apples. For a set T of N objects, a feature-based representation relying on a set F of m features is then encoded as an N x m matrix A(T, F), where each
row is a vector describing the feature values for a particular object. Features F are usually interpreted in a Euclidean vector space equipped with the Euclidean metric. This is motivated by the algebraic structure (defined by operations on vectors) being consistent with the geometric (topological) structure defined by the Euclidean distance (which is then defined by the norm). Then all traditional mathematical concepts and methods, such as continuity, convergence or differentiation, are applicable. The continuity of algebraic operations ensures that the local geometry (defined by the Euclidean distance) is preserved throughout the space [Munkres, 2000; Kothe, 1969]. Discrimination techniques operating in vector spaces make use of their homogeneity and other properties. Consequently, such spaces require that, up to scaling, all the features are treated in the same way. Moreover, there is no possibility to relate the learning to the geometry defined between the raw representations of the training examples. The geometry is simply imposed beforehand by the nature of the Euclidean distance between (reduced) descriptions of objects, i.e. between vectors in a Euclidean space; see also Fig. 1.4. The existence of a well-established theory for Euclidean metric spaces made researchers place the learning paradigm in that context. However, the severe restrictions of such spaces simply do not allow discovery of structures richer than affine subspaces. From this point of view, the act of learning is very limited. We argue here that the notion of proximity (similarity or dissimilarity) is more fundamental than that of a feature or a class. According to an intuitive definition of a class as a set of similar objects, proximity plays a crucial role for its constitution, and not features, which may (or may not) come later. From this point of view, features might be a superfluous step in the description of a class. Surely, proximity can be specified by features, such as their weighted linear combination, but the features should be meaningful with respect to the proximity. In other words, the chosen combination of features should reflect the (natural) proximity between the objects. On the other hand, proximity can be directly derived from raw or pre-processed measurements like images or spectra. Moreover, in the case of symbolic objects, graphs or grammars, the determination of numerical features might be an intractable problem, while proximity may be easier to define. This emphasizes that a class of objects is represented by individual examples which are judged to be similar according to a specified measure. A dissimilarity representation of objects is then based on pairwise comparisons and is expressed e.g. as an N x N dissimilarity matrix D(T, T). Each entry of D is a dissimilarity value computed between pairs of objects; see
[Figure 1.5: objects are subjected to sensor measurements and then represented either absolutely, as points in a feature vector space A(T, F), or relatively, via a dissimilarity measure.]

Figure 1.5 Feature-based (absolute) representation vs. dissimilarity-based (relative) representation. In the former description, objects are represented as points in a feature vector space, while in the latter description, objects are represented by a set of dissimilarity values.
also Fig. 1.5. Hence, each object z is represented by a vector of proximities
D(z, T) to the objects of T (precise definitions will be given in Chapter 4). For a number of years, Goldfarb and colleagues have been trying to establish a new mathematical formalism allowing one to describe objects from a metaphysical point of view, that is, to learn their structure and characteristics from the process of their construction. This aims at unifying the geometric learning models (statistical approach with the geometry imposed by a feature space) and symbolic ones (structural approach) using dissimilarity as a natural bridge. A dissimilarity measure is determined in a process of inductive learning realized by so-called evolving transformation systems [Goldfarb, 1990; Goldfarb and Deshpande, 1997; Goldfarb and Golubitsky, 2001]. Loosely speaking, such a system is composed of a set of primitive structures, basic operations that transform one object
into another (or which generate a particular object) and some composition rules which permit the construction of new operations from existing ones [Goldfarb et al., 1995, 1992, 2004; Goldfarb and Deshpande, 1997; Goldfarb and Golubitsky, 2001]. This is the symbolic component of the integrated model. The geometric component is defined by means of a dissimilarity. Since there is a cost associated with each operation, the dissimilarity is determined by the minimal sum of the costs of operations transforming one object into another (or generating this particular object). In this sense, the operations play the role of features, and the dissimilarity - dynamically learned in the training process - combines the objects into a class. In this book, the study of dissimilarity representations has mainly an epistemological character. It focuses on how we decide (how we make a model to decide) that an entity belongs to a particular class. Since such a decision builds on the dissimilarities, we come closer to the nature of what a class is, as we believe that it is proximity which defines the class. This approach is much more flexible than the one based on features, since now the geometry and the structure of a class are defined by the dissimilarity measure, which can reflect the structure of the objects in some space. Note that the reverse holds in a feature space, that is, a feature space determines the (Euclidean) distance measure, and hence the geometry; see also Fig. 1.4. Although dissimilarity information is further treated in a numerical way, the development of statistical methods dealing with general dissimilarities is the first necessary step towards a unified learning model, as the dissimilarity measure may be developed in a structural approach. Notwithstanding the fact that an integrated model may be constructed for objects containing an inherent, identifiable structure or organization, like apples, shapes, spectra, text excerpts etc., current research is far from being generally applicable [Korkin and Goldfarb, 2002; Goldfarb and Golubitsky, 2001; Goldfarb et al., 2000b, 2004]. On the other hand, there are a number of instances or events which are mainly characterized by discontinuous numerical or categorical information, e.g. gender or number of children. Therefore, we may have to consider heterogeneous types of information to support decisions in medicine, finance, etc. In such cases, the symbolic learning model cannot be directly utilized, but a dissimilarity can be defined. This emphasizes the importance of techniques operating on general dissimilarities. The study of proximity representations is the necessary starting point from which to depart on a journey into alternative inductive learning methodologies. These will learn the proximity measure, and hence a class description, from examples.
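To make the cost-of-operations idea concrete, the following minimal sketch (an illustration added here, not code or data from the book) computes a weighted edit distance between strings, i.e. the minimal total cost of insertions, deletions and substitutions transforming one string into another, and fills a pairwise dissimilarity matrix D(T, T) for a small, made-up set T of symbolic objects. Any other operation costs or transformation model could be plugged in without changing how the resulting matrix is used.

```python
import numpy as np

def edit_distance(a, b, c_ins=1.0, c_del=1.0, c_sub=1.0):
    """Minimal total cost of insertions, deletions and substitutions
    transforming string a into string b (dynamic programming)."""
    m, n = len(a), len(b)
    d = np.zeros((m + 1, n + 1))
    d[:, 0] = np.arange(m + 1) * c_del
    d[0, :] = np.arange(n + 1) * c_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else c_sub
            d[i, j] = min(d[i - 1, j] + c_del,     # delete a[i-1]
                          d[i, j - 1] + c_ins,     # insert b[j-1]
                          d[i - 1, j - 1] + cost)  # substitute or match
    return d[m, n]

# Hypothetical training set T of symbolic objects (strings).
T = ["abba", "abca", "bbbb", "acca"]
D = np.array([[edit_distance(s, t) for t in T] for s in T])  # N x N matrix D(T, T)
print(D)
```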
1.5 Relation to kernels

Kernel methods have become popular in statistical learning [Cristianini and Shawe-Taylor, 2000; Scholkopf and Smola, 2002]. Kernels are (conditionally) positive definite (cpd) functions of two variables, which can be thought to encode similarities between pairs of objects. They are originally defined in vector spaces, e.g. based on a feature representation of objects, and interpreted as generalized inner products in a reproducing kernel Hilbert space (RKHS). They offer a way to construct non-linear decision functions. In 1995, Vapnik proposed an elegant formulation of the largest margin classifier [Vapnik, 1998]. This support vector machine (SVM) is based on the reproducing property of kernels. Since then, many variants of the SVM have been applied to a wide range of learning problems. Before the start of our research project [Duin et al., 1997, 1998, 1999] it was already recognized that the class of cpd functions is restricted. It does not accommodate a number of useful proximity measures already developed in pattern recognition and computer vision. Many existing similarity measures are not positive definite and many existing dissimilarity measures are not Euclidean⁴ or even not metric. Examples are pairwise structural alignments of proteins, variants of the Hausdorff distance, and normalized edit-distances; see Chapter 5. The major limitation of using such kernels is that the original formulation of the SVM relies on a quadratic optimization. This problem is guaranteed to be convex for cpd kernels, and therefore uniquely solvable by standard algorithms. Kernel matrices disobeying these requirements are usually somehow regularized, e.g. by adding a suitable constant to their diagonal. Whether this is a beneficial strategy is an open question. Although our research was inspired by the concept of kernel, the line we followed heavily deviates from the usage of kernels in machine learning [Shawe-Taylor and Cristianini, 2004]. This is caused by the pattern-recognition background of the problems we aim to solve. Our starting point is a given set of dissimilarities, observed or determined during the development of a pattern recognition system. It is defined by a human expert and his/her insight into the problem. This set is, thereby, an alternative to the definition of features (which also have to originate from such expertise). A given Euclidean distance matrix may be transformed into a kernel and interpreted as a generalized Gram matrix in a proper Hilbert space.

⁴ The dissimilarity measure being Euclidean is inherently related to the corresponding kernel being positive definite; this is explained in Chapter 3.
[Figure 1.6 diagram: characterization of dissimilarity matrices (Chapter 3), learning aspects, and conclusions / open problems (Chapter 12).]

Figure 1.6 Conceptual outline of the book
However, many general dissimilarity measures used in pattern recognition give rise to indefinite kernels, which have only recently become of interest [Haasdonk, 2005; Laub and Müller, 2004; Ong et al., 2004], although we had already identified their importance before [Pekalska et al., 2002b]. How to handle these is an important issue in this book.
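As a rough illustration of the two operations mentioned above (a hypothetical sketch with made-up data, assuming NumPy; not code from the book): a dissimilarity matrix D can be turned into a candidate Gram matrix by the standard double centering of classical scaling, K = -1/2 J D°² J with J = I - (1/N)11ᵀ, and if the resulting K is indefinite it may be "regularized" by adding a constant to its diagonal so that its smallest eigenvalue becomes non-negative.

```python
import numpy as np

def gram_from_dissimilarity(D):
    """Double-centre the squared dissimilarities: K = -0.5 * J * (D**2) * J,
    with J = I - (1/N) 11^T. K is a proper Gram matrix iff D is Euclidean."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    return -0.5 * J @ (D ** 2) @ J

def regularize_indefinite(K):
    """If K has negative eigenvalues (indefinite kernel), add a constant to
    the diagonal so that the smallest eigenvalue becomes zero."""
    lam_min = np.linalg.eigvalsh(K).min()
    return K if lam_min >= 0 else K + (-lam_min) * np.eye(K.shape[0])

# Hypothetical non-metric dissimilarity matrix (symmetric, zero diagonal;
# it violates the triangle inequality, so it cannot be Euclidean).
D = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])
K = gram_from_dissimilarity(D)
print(np.linalg.eigvalsh(K))                       # a negative eigenvalue appears
print(np.linalg.eigvalsh(regularize_indefinite(K)))  # all eigenvalues >= 0
```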
1.6 Outline of the book
Dissimilarities play a key role in the quest for an integrated statistical-structural learning model, since they are a natural bridge between these two approaches, as explained in the previous sections. This is supported by the theory that (dis)similarity can be considered as a link between perception and higher-level knowledge, a crucial factor in the process of human recognition and categorization [Goldstone, 1999; Edelman et al., 1998; Wharton et al., 1992]. Throughout this book, the investigations are dedicated to dissimilarity (or similarity) representations. The goal is to study both methodology and approaches to learning from such representations. An outline of the book is presented in Fig. 1.6.
The concept of a vector space is fundamental to dissimilarity representations. The dissimilarity value captures the notion of closeness between two objects, which can be interpreted as a distance in a suitable space, or which can be used to build other spaces. Chapter 2 focuses on mathematical characteristics of various spaces, among others (generalized) metric spaces, normed spaces and inner product spaces. These spaces will later become the context in which the dissimilarities are interpreted and learning algorithms are designed. Familiarity with such spaces, their properties and their interrelations is needed for further understanding of learning processes. Chapter 3 discusses fundamental issues of dissimilarity measures and generalized metric spaces. Since a metric distance, particularly the Euclidean distance, is mainly used in statistical learning, its special role is explained and related theorems are given. The properties of dissimilarity matrices are studied, together with some embeddings, i.e. spatial representations (vectors in a vector space found such that the dissimilarities are preserved) of symmetric dissimilarity matrices. This supports the analysis of pairwise dissimilarity data D(T, T) based on a set of examples T. Chapter 4 starts with a brief introduction to traditional statistical learning, followed by a more detailed description of dissimilarity representations. Three different approaches to building classifiers for such representations are considered. The first one uses dissimilarity values directly by interpreting them as neighborhood relations. The second one interprets them in a space where each dimension is a dissimilarity to a particular object. Finally, the third approach relies on a distance-preserving embedding to a vector space, in which classifiers are built. In Chapter 5, various types of similarity and dissimilarity measures are described, together with their basic properties. The chapter ends with a brief overview of dissimilarity measures arising from various applications. Chapters 6 and 7 start from fundamental questions related to exploratory data analysis on dissimilarity data. Data visualization is one of the most basic ways to get insight into relations between data instances. This is discussed in Chapter 6. Other issues related to data exploration and understanding are presented in Chapter 7. They focus on methods of unsupervised learning by reflecting upon the intrinsic dimension of the dissimilarity data, the complexity of the description and the data structure in terms of clusters. A possible approach to outlier detection is analyzed in Chapter 8 by constructing one-class classifiers. These methods are designed to solve problems where mainly one of the classes, called the target class, is present.
Objects of the other, outlier, class occur rarely, cannot be well sampled, e.g. due to the measurement costs, or are untrustworthy. We introduce the problem and study a few one-class classifier methods built on dissimilarity representations. Chapter 9 deals with classification. It practically examines three approaches to learning. For recognition, a so-called representation set is used instead of a complete training set. This chapter explains how to select such a set out of a training set and discusses the advantages and drawbacks of the studied techniques. Chapter 10 investigates combining approaches. These either combine different dissimilarity representations or different types of classifiers. Additionally, it briefly discusses issues concerning meta-learning, i.e. conceptual dissimilarity representations resulting from combining classifiers, one-class classifiers or weak models in general. Chapter 11 discusses the issue of representation in pattern recognition and provides practical recommendations for the use of dissimilarity representations. Overall conclusions are given in Chapter 12. Appendices A-D provide additional information on algebra, probability and statistics. Appendix E describes the data sets used in the experiments.
1.7 In summary
Dissimilarity representations are advantageous for identification and recognition, especially in the following cases:

• sensory data, such as spectra, digital or hyperspectral images,
• data represented by histograms, contours or shapes,
• phenomena that can be described by probability density functions,
• binary files,
• text-related problems,
• when objects are encoded in a structural way by trees, graphs or strings,
• when objects are represented as vectors in a high-dimensional space,
• when the features describing objects are of mixed types,
• as a way of constructing nonlinear classifiers in given vector spaces.
Mathematical foundations for dissimilarity representations rely on:
(1) topology and general topology [Sierpinski, 1952; Cech, 1966; Kothe, 1969; Willard, 1970; Munkres, 2000; Stadler et al., 2001; Stadler and Stadler, 2001b],
(2) linear algebra [Greub, 1975; Bialynicki-Birula, 1976; Noble and Daniel, 1988; Leon, 1998; Lang, 2004],
(3) operator theory [Dunford and Schwarz, 1958; Sadovnichij, 1991],
(4) functional analysis [Kreyszig, 1978; Kurcyusz, 1982; Conway, 1990; Rudin, 1986, 1991],
(5) indefinite inner product spaces [Bognár, 1974; Alpay et al., 1997; Iohvidov et al., 1982; Dritschel and Rovnyak, 1996; Constantinescu and Gheondea, 2001],
(6) probability theory [Feller, 1968, 1971; Billingsley, 1995; Chung, 2001],
(7) statistical pattern recognition [Devijver and Kittler, 1982; Fukunaga, 1990; Webb, 1995; Devroye et al., 1996; Duda et al., 2001],
(8) statistical learning [Vapnik, 1998; Cherkassky and Mulier, 1998; Hastie et al., 2001],
(9) the work of Scholkopf and colleagues [Scholkopf, 1997, 2000; Scholkopf et al., 1999b, 1997a, 1999a, 2000b],
(10) the results of Goldfarb [Goldfarb, 1984, 1985, 1992],
and inspiration from many other researchers.

We will present a systematic approach to the study of dissimilarity representations and discuss some novel procedures for learning. These are inevitably compared to the nearest neighbor rule (NN) [Cover and Hart, 1967], the method traditionally applied in this context. Although many researchers have thoroughly studied the NN method and its variants, together with the design of perfect dissimilarity measures (appropriate to the character of the NN rule), to our knowledge little attention was dedicated to alternative approaches. An exception are the support vector machines. These rely on a relatively narrow class of (conditionally) positive definite kernels, which, in turn, are special cases of similarity representations [Duin et al., 1997, 1998]. Only recently has interest arisen in indefinite kernels [Haasdonk, 2005; Laub and Müller, 2004; Ong et al., 2004]. The methods presented here are applicable to general (dis)similarity representations, and this is where our main contribution lies. A more detailed description of the overall contributions is presented below.
Representation of objects. A proximity representation quantitatively encodes the proximity between pairs of objects. It relies on the
representation set, R, a relatively small collection of objects capturing the variability in the data. Each object is described by a vector of proximities to R. In the beginning, the representation set may consist of all training examples; it is reduced later in the process of instance selection. Here, a number of selection criteria are proposed and experimentally investigated for different learning frameworks. In this way, we extend the notion of a kernel to that of a proximity representation. If R is chosen to be the set of training examples, then this representation becomes a generalized kernel. When a suitable similarity measure is selected, a cpd kernel is obtained as a special case. Using a proximity representation, learning can be addressed in a more general way than by using the support vector machine. As such, we develop proximity representations as a first step towards bridging the statistical and structural approaches to pattern recognition. They are successfully used for solving object recognition problems.
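A minimal sketch of this idea (illustrative only; the made-up data, the Euclidean measure and the use of scikit-learn's logistic regression are assumptions, not the experimental setup of the book): each object is mapped to its vector of dissimilarities to a small representation set R, and an ordinary linear classifier is trained on these vectors, i.e. in the dissimilarity space defined by R.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical two-class training set T in some measurement space.
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

# Representation set R: here simply a few randomly chosen training objects.
R = X[rng.choice(len(X), size=5, replace=False)]

def dissim(A, B):
    """Pairwise dissimilarities (Euclidean here, but any measure may be used)."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

# Dissimilarity representation D(T, R): every object becomes a vector of
# dissimilarities to the representation objects, i.e. a point in a
# 5-dimensional dissimilarity space.
D_TR = dissim(X, R)
clf = LogisticRegression().fit(D_TR, y)

# A new object z is represented by D(z, R) and classified in the same space.
z = np.array([[1.5, 1.5]])
print(clf.predict(dissim(z, R)))
```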
Data understanding. Understanding data is a difficult task. The main consideration is whether the data sampling is sufficient to describe the problem domain well. Other important questions refer to the intrinsic dimension, the data structure, e.g. in terms of possible clusters, and the means of data visualization. Since there exist many algorithms for unsupervised learning, our primary interest lies in the former questions. In this book, three distinct approaches to learning from dissimilarity representations are proposed. The first one addresses the given dissimilarities directly. The second addresses a dissimilarity representation as a mapping based on the representation set R. As a result, the so-called dissimilarity space is considered, where each dimension corresponds to a dissimilarity to a particular object from R. The third one relies on an approximate embedding of dissimilarities into a (pseudo-)Euclidean space. The approaches are introduced, studied and applied in various situations.

Domain description. The problem of describing a class has gained a lot of attention, since it can be identified in many applications. The area of interest covers all problems where specified targets have to be recognized and anomalies or outlier situations have to be detected. These might be examples of any type of fault detection, abnormal behavior, or rare diseases. The basic assumption that an object belongs to a class is based on the idea that it is similar to other examples within this class. The identification procedure can be realized by a proximity function equipped with a threshold, determining whether or not an instance is a class member. This proximity function can be e.g. a distance to a set of selected prototypes.
Therefore, data represented by proximities are more natural for building concept descriptors, since the proximity function can directly be built on these proximities. To study this problem, we have not only adopted known algorithms for dissimilarity representations, but have also implemented and investigated new methods. Both in terms of efficiency and performance, our methods were found to perform well.
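A minimal sketch of such a one-class classifier built directly on proximities (hypothetical data and threshold choice; not one of the specific methods studied later in the book): the dissimilarity to the nearest of a few selected prototypes is thresholded at a quantile estimated from target examples only.

```python
import numpy as np

class NearestPrototypeOCC:
    """A simple one-class classifier: accept an object as a target if its
    dissimilarity to the nearest prototype is below a threshold learned
    from the target class only."""

    def __init__(self, quantile=0.95):
        self.quantile = quantile

    def fit(self, D_target_proto):
        # D_target_proto: dissimilarities from target training objects to prototypes.
        nearest = D_target_proto.min(axis=1)
        self.threshold_ = np.quantile(nearest, self.quantile)
        return self

    def predict(self, D_new_proto):
        # 1 = target, -1 = outlier.
        return np.where(D_new_proto.min(axis=1) <= self.threshold_, 1, -1)

# Hypothetical dissimilarities of 100 target objects to 5 prototypes,
# and of 3 new objects to the same prototypes.
rng = np.random.default_rng(1)
D_train = np.abs(rng.normal(1.0, 0.3, (100, 5)))
D_new = np.array([[0.8, 1.1, 0.9, 1.2, 1.0],   # target-like
                  [3.0, 2.8, 3.5, 3.1, 2.9],   # outlier-like
                  [1.0, 0.7, 1.3, 0.9, 1.1]])
occ = NearestPrototypeOCC().fit(D_train)
print(occ.predict(D_new))
```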
Classification. We propose new methodologies to deal with dissimilarity/similarity data. These rely either on an approximate embedding in a pseudo-Euclidean space and construction of the classifiers there, or on building decision rules in a dissimilarity space, or on designing neighborhood-based classifiers, e.g. the NN rule. In all cases, foundations are established that allow us to handle general dissimilarity measures. Our methods do not require metric constraints, so their applicability is quite universal.

Combining. The possibility to combine various types of information has proved to be useful in practical applications; see e.g. [MCS00, 2000; MCS02, 2002]. We argue that combining either significantly different dissimilarity representations or classifiers different in nature on the same representation can be beneficial for learning. This may be useful when there is a lack of expertise of how a well-discriminating dissimilarity measure should be designed. A few measures can be considered, taking into account different characteristics of the data. For instance, when scanned digits should be compared, one measure focuses on the contour information, while others on the area or on statistical properties.

Applications. The proximity measure plays an important role in many research problems. Proximity representations are widely used in many areas, although often indirectly. They are used for text or image retrieval, data visualization, the process of learning from partially labeled sets, etc. A number of applications are discussed where such measures are found to be advantageous.

In essence. The study of dissimilarity representations applies to all dissimilarities, independently of the way they have been derived, e.g. from raw data or from an initial representation by features, strings or graphs. Expert knowledge on the application can be used to formulate this initial representation and in the definition of the proximity measure. This makes the dissimilarity representations developed natural candidates for combining
the strengths of structural and statistical approaches in pattern recognition and machine learning. The advantage of the structural approach lies in encoding both domain knowledge and the structure of an object. The benefit of the statistical approach lies in a well-developed mathematical theory of vector spaces. First, a description of objects in the structural framework can be found. This can then be quantized to capture the dissimilarity relations between the objects. If necessary, other structurally and statistically derived measures can be designed and combined. The final dissimilarity representation is then used in statistical learning. The results in this work justify the use and further exploration of dissimilarity information for pattern recognition and machine learning.
PART 1
Concepts and theory
Budowałem na piasku
I zawaliło się.
Budowałem na skale
I zawaliło się.
Teraz budując
Zacznę od dymu z komina.

I built on the sand
And it tumbled down.
I built on a rock
And it tumbled down.
Now when I build, I shall begin
With the smoke from the chimney.

"PODWALINY", LEOPOLD STAFF
"FOUNDATIONS", LEOPOLD STAFF
Chapter 2
Spaces
Ring the bells that still can ring
Forget your perfect offering
There is a crack in everything
That's how the light gets in.
"ANTHEM", LEONARD COHEN
Many dissimilarity measures have been designed and are used in various ways in pattern recognition, machine learning, computer vision and related fields. What is missing, however, is a general and unified framework for learning from examples that are represented by their dissimilarities to a set of representation objects. Different aspects of the measures, such as Euclidean behavior, metric or asymmetric properties, may lead to different learning approaches. In the statistical approach to pattern recognition, objects are represented as points in a vector space, equipped with an additional algebraic structure of an inner product and the associated norm. This is usually a Euclidean or a Hilbert space. The distance between the points is then naturally measured by the Euclidean or Hilbert distance. If beneficial, other metric distances may be introduced, usually from the family of the ℓp-distances or Lp-distances. Classifiers are functions defined by finite vector representations in this vector space. Usually, they are designed based on the assumed model, applied probabilistic reasoning or used pairwise distances. The question we begin with is more difficult. How can a learning task be performed given a set of pairwise dissimilarities? Dissimilarities are measured according to a specified dissimilarity measure, which is not necessarily a metric and not necessarily a measure in the strict mathematical sense. It quantifies the similarity or commonality between two objects by taking small values for two similar objects and large values for two distinct objects. Additionally, when possible, sensor measurements or other intermediate descriptions of the set of examples may be given. The challenge is
to discover the structure in the data, identify objects of a particular class or learn to distinguish among the classes, knowing the procedure according to which the dissimilarity is computed and the dissimilarity values between a set of (training) examples. As no vectorial representation of the objects is provided, the challenge is now to use the dissimilarities in a meaningful way. To make use of statistical learning, we must find an appropriate framework for the interpretation of dissimilarities. The concept of a (vector) space is important for the development of a theoretical foundation, both from the representational and algorithmic point of view, since we will rely on numerical procedures and deal with numerical representations of the problems. Dissimilarities quantitatively express the differences between pairs of objects, while learning algorithms usually optimize some error or loss function for a chosen numerical model. Dissimilarities, therefore, have a particular meaning within a frame of specified assumptions and models. Spaces possessing different characteristics will allow different interpretations of the dissimilarity data, which will lead to different learning algorithms. Therefore, before discussing dissimilarity representations and learning methods, we need essential concepts and properties of various spaces. This chapter is motivated by the lack of a consistent and clearly identifiable mathematical theory on general dissimilarity measures, not only in the pattern recognition field, but also in mathematics. In its foundations, such a theory should rely on the notion of nearness between two objects. Therefore, the theory of spaces plays a key role, since such a nearness can easily be introduced here. Most of the existing theories deal with norms, which are often used to define metrics. Usually, Euclidean, city block or max-norm distances are considered. Other interesting contributions can be found in various subfields of mathematics, such as non-Euclidean geometries [Blumenthal, 1953; Coxeter, 1998], differential geometry [Kreyszig, 1991; Struik, 1988], algebras [Paulsen, 2002] and operator spaces [Effros and Ruan, 2000; Pisier, 2003]. Additional inspiration can be found in the fields of experimental psychology and artificial intelligence. These, however, remain of interest for future study. To our knowledge, no book yet exists that explains the theoretical background of general dissimilarity measures and studies learning problems from such a perspective (although a general study on pattern theory in this direction by Grenander is available [Grenander, 1976, 1978, 1981]). Therefore, this chapter is meant to fill this gap. It not only introduces spaces with their basic properties, but it also shows the relations between them.
Consequently, the concepts are presented from a mathematical point of view and supported, if possible, by examples from pattern recognition. The purpose of this chapter is to bring together and present a basic theory on spaces in the context of general dissimilarities, both metric and non-metric. The spaces described here will serve as interpretation frameworks of dissimilarity data. The connections will become clear in Chapters 3 and 4. We will start by recalling basic notions from set theory.
2.1 Preliminaries
Throughout this book, a set X is a collection of objects of any kind, both real and abstract, such as real-world objects, digital images, binary strings, points in a plane, real numbers or functions. These are called elements of X. In some cases, a set can be determined by means of a property of its elements, such as the set of convex pentagons, non-decreasing functions or scanned handwritten digits of '1'. The set of natural numbers is N. The sets of real, positive real and nonnegative real numbers are denoted by R, R+ and R0+, respectively. The set of complex numbers is denoted by C. If X and Y are two sets, then X ∪ Y is their union, X ∩ Y is their intersection, X\Y is their difference and X Δ Y = (X\Y) ∪ (Y\X) is their symmetric difference. X × Y = {(x, y) : x ∈ X ∧ y ∈ Y} denotes the Cartesian product. P(X) is the power set, which is the collection of all subsets of X. An index set I defines a correspondence between i ∈ I and either an element a_i of a set A or a subset A_i of A. A family of sets in A is denoted by A = {A_i : i ∈ I}. The union, intersection and Cartesian product can be extended to a family of sets as ∪_{i∈I} A_i, ∩_{i∈I} A_i and ∏_{i∈I} A_i, respectively.
Definition 2.1 (Mapping, function) Let X and Y be two sets. If with each element x ∈ X we associate a subset F(x) of Y, then the correspondence x → F(x) is a mapping of X into Y or a function from X to Y. If the set F(x) consists of a single element, then the mapping is single-valued, and multi-valued otherwise. Mapping, function and transformation will be used interchangeably.

Definition 2.2 (Basic facts on functions)
• Let f : X → Y be a function from X to Y. X is the domain of f and Y is the codomain of f. The range of f is R_f = {y ∈ Y : ∃ x ∈ X, y = f(x)}. The inverse function of f is a mapping f⁻¹ : Y → X that satisfies f⁻¹(f(x)) = x and f(f⁻¹(y)) = y for all x ∈ X and y ∈ Y.
• The image of x is f(x). The preimage of y consists of all x ∈ X whose image is y, i.e. f⁻¹(y) = {x ∈ X : f(x) = y}. The image of A ⊆ X is the set f(A) ⊆ Y consisting of all elements of Y which equal f(a) for some a ∈ A. The preimage (inverse image) of B ⊆ Y is the set f⁻¹(B) ⊆ X consisting of all elements x ∈ X such that f(x) ∈ B.
Definition 2.3 (Composition) Let f : X → Y and g : Y → Z be functions. Then g ∘ f : X → Z is a composition of mappings such that (g ∘ f)(x) = g(f(x)).
surjective if its rangc is cqual to its codonlain. A function f : X --f Y is bijective if it is both injective and surjective, i.c. if' for cvery Y ,there exists exactly one Z E X such that f ( x ) = y.
The composition of two injections (surjections, bijections) is again an injection (surjection, bijection).
Definition 2.5 (Binary relation) Let X and Y be two sets. A binary relation, R is a subset of the Cartesian product X X Y ,i.e. R C X X Y . A subsct of X x X is a binary relation on X. One writes zRg to indicate that :1:
is in relation with y.
Definition 2.6 (Equivalence relation) is a binary relation
N
An equivalence relation on X
which is
(1) reflexive: .7: z or all EX. (2) syrrirnetric: (x y) + ( y x) for all z . y ~ X . ( 3 ) transitive: (x y A y z) + (x z) for all 2 , y, zE X . N
-
- -
-
-
-
The set of all elements of X equivalent to z is an equivalen,ce class of n: and denoted hy This means that [z] = {y : y t X A y z}.
[XI.
Definition 2.7 (Partially ordered set) A partially ordered set is a pair ( X ,5 ) ;where X is a set arid 5 is a partial order on X , which is: (1) reflexive: x 5 z or all z t X. (2) antisyrnrnetric: (x 5 y A y 5 x) + x = y for all Z, EX. ( 3 ) transitive: (z 5 p A y 5 z)+ (z 5 z) for all z, y, Z E X .
Spaces
27
Definition 2.8 (Upper bound, lower bound ) Let ( X ,5)be a partially ordered set and Y c X.Y is partially ordered under Y is bounded from above (from below) if there exists z E X for which y 5 z (x 5 y) holds for all y E Y.:2: is an 'upper (lo,wer) bound of Y .
s.
Definition 2.9 (Directed set) A partially ordered set X is a directed set if for any z, y E X , there exists z E X such that x 5 z and y 5 z . X is inversely directed if for any x,y E X , there exists z E X such that z 5 x and
z
An iriiportaiit axiom in set theory is the axiom of choice. It is generally accepted as true. Theorem 2.1 (Zermelo/Zorn) A x i o m of ch,oice can, be equivalen,tly stated as: ( I ) Let A be a set of non-empty sets. T h e n there exists a cho%cefimction, f defined o n A, such that f o r each set A in A, f ( A ) E A. (2) Given a n y set of mutually disjoint non-empty sets, there exists at least one set that contains exactly one element in c o m n o n with each of the non-empty sets. (3) If every totally ordered subset of a partially ordered set X has un upper bound th,en. X h,us a. m.azimal element. Definition 2.11 (Finite, countable and uncountable sets) Let A be any set. 0 A is finite if it consists of a finite number of elements. Otherwise, it is infinite. 0 A is countable if there exists an injection f : A ---t N. A is in,finitely countable if there exists a bijection f : A + N. 0 A is uncountable iff A is not countable. All uncountable sets are, therefore, infinite.
T h e dzssimilarity representation for pattern recognztzon
28
Example 2.1 Every subset of a countable set is countable. Cartesian product of finitely many countable sets is countable. The set, of natural numbers is infinitely countable and its cardinality if defined as No (aleph-null). The set of prime numbers, the set of integers or the set of rational numbers arc also countable and have the cardinality of No. The set of real numbers is an uncountable set with the cardinality denoted by c or 21 (beth-one). An uncountable set has the cardinality strictly greater than N o . Example sets with the cardinality of c are [0,1],the set of irrational numbers, the power set of natural numbers, the set of all infinite sequences consisting of' zeros and ones (or of integers), the set of all open sets in R" or the set of all continuous functions from R to B. The set of all functions from R 4 R has tlie cardinality of 2 2 = 2'l, which is larger than XI.
2.2
A brief look at spaces
111general, a space is a set of elements augmented with a type of relation or an additional structure. From a pattern recognition point of view, a space shonld posscs some properties such that a finite representation of objects can be characterized for learning. This niearis that some of the characteristics may be induced or imposed on tlie data instances considered to ensure a generalization' for future examples. Intuitivcly, a space should possess a notion of closeness between its elements, which is compatible with the algebraic structure', whenever such a structure is available. A space is often considered to already posses a structure of a high degree, as those of' metric vector spaces. Euclidean vector spaces are the most widely used example of these, as they reflect our intuitive understanding of a space: vectors are elements that can be added, multiplied by a scalar or projected onto each other. The Euclidean metric is a distance experienced 'To generalize is 'to derive or induce (a general conception or principle) from particulars' [Webster dictionary, site]. 2For instance, the structure of a vector space is based on the algebraic operations of vector addition arid multiplication by a scalar. Thanks to them, a linear combination is defined and, consequently, a hyperplane. In general, the operations defined and permitted in a space, can possibly lead to a creation of constructs, which are different (more complex) than the basic elements themselves, but still reside in this space.
Spaces
29
neighborhood space pretopological space topological space Hausdotff space ‘ (=-space
Figure 2.1
Some generalized topological spaces
by us daily. Usually, more primitive spaces are explained using high-level concepts, such as a norm or metric. Note that a metric can be introduced not only in a vector space, but on an arbitrary set. An example is a set of binary strings or a set of proteins, where a distance measure ca.pt,urcsthe similarity (likeliness) betwecn two strings or t,wo proteins. In such cases. the most natural iiieasure, derived in agreement with the physical process (e.g. evolutionary changes) behind the creation of set examples, may iiot be metric. Therefore, we will use the word ‘dissimilarity’ to indicat,e that metric constraints, i.e. reflexivity, definiteness, syrnmetry and triangle inequality, Def. 2.38, are not necessarily satisfied. Practical problems ask for an investjigation of spaces that are more primitive than metric (or Euclidean). A bottom-up approach, starting from a notion of a neighborhood or of a convergence, is then needed. However, as a commoii approach to learning is to use metric (Euclidean) distances or to impose tlicrn by a suitable correction of‘ the given dissimilarity measure, we will discuss then1 well. Sections 2.3-2.7 will briefly introduce basic coriccpts of gerieralizcd topological spaces, generalized metric spaces and inner product spaces, as well as their essential properties. We will briefly mention the spaces to be introduced in the subsequent sections. The riotion of a neighborhood3 (or of a generalized closure) is the basis for the construction of more coinplex spaces, among others neighborhood spaces, pretopological spaces arid topological spaces. For a general illustration of the interrelations between generalized topological spaces, see Fig. 2.1. The idea of such a pictorial schema is to present how we can build from a very general space satisfying a few constraints, more specific spaces, possessing more structiire. ‘Even more primitive concepts can be used, like filter, convergence or nearness [Gastl and Hammer, 1967; Cech, 1966; Stadler and Stadler, 2001al.
30
The dissimilarity representation for pattern recognitzon
..........................
I/
inner product space
;
Hilbert space
,
\
\
Euclidean space
normed space
\ I\ 1
pseudo-Euclidean space I
space ..........Pontryagin .... Krein space ~
Krein mace
J
,
indefinite inner product space
\
Fiaure 2.2 Schematic relations between some classes of generalized inner product spaces. RKHS denot,es a reproducing kernel Hilbert space and RKKS stands for a reproducing kernel Krein space. I
Basically, if one pictorial space is ‘encapsulated’ by another, the former fulfils more requirements and possesses more properties than the latter, and, consequently, it is more specific and its structure is richer. A set with a neighborhood system creates a neighborhood space. If the intersection of two neighborhoods belongs to the neighborhood system, this leads to a pretopological space. Adding the concept of a ’proper’ boundary, i.e. an idempotent closure operator, gives rise to a neighborhood basis consisting of open scts. As a result, we get a topological space. Imposing the existence of disjoint neighborhoods for distinct elements (which implies that thc sequences of elements have at most one limit) yields a Hausdorff space. By requiring more and more separation axioms (by the means of topological operations) between disjoint sets and distinct elements, more advanced spaces arc obtained, finally leading to a metric space4. By providing the Euclidean distance to a vector space, a Euclidean space is obtained. This brief presentation shows that a Euclidean space is highly structured. Having int,roduced generalized topological spaces, we will consider the linear space as the foundation for more complex spaces. The following spaces are briefly discussed: normed and (indefinite) inner product spaces with their relations to a metric space. Our attention is specifically devoted to Euclidean (Hilbert) and pseudo-Euclidean (Krein ) spaces. Since the inner product and metric are essential concepts for describing relations between object representations, we consider the dependencies between 4The ordering of spaces from extremely general to very specific is by no means unique. One may arrive a t a metric from a uniform structure and a proximity structure [Kothe, 1969: Willard, 19701.
Spaces
'
31
hollow space (1)
- {
'd
~.~...
/f semimetric space (1),(2),(4) , ,
I
, :, ,, ,, ,, ,
......................................
if
,
, , ,, ,,
~
'
inner product space
,#
Hilbert space
/
j
i' 2 :a, 6
:'
jC
I
,, ,I .-.-
>,: m8 i%
normedspace
'
,0 , ,
h
I 7
metric space (1)-(4)
\
(Euclidean spa&
quasimetric space (1)-(3)
-..~
,j
Figure 2.3 Schematic relations between some classes of generalized metric spaces. The numbers correspond to the conditions of Def. 2.38.
some classes of spaces from these two perspectives; see Figs. 2.2 and 2.3 for schematic diagrams. In these pictorial schenias, if one space is 'embraced' by another, it is either more restricted than or a special case of the first. For instance, a (finite-dimensional) Euclidean space is a particular case of' a Hilbert space, which, in turn, is an inner product space and a special case of a Banach space. The latter is an example of nornied spaces, which, if metric is defined, becomes a metric space. If the metric requirements are weakened, then more general spaces, like quasimetric or premetric spaces are obtained. Scctioris 2.3-2.7 will mainly provide definitions, theorenis and basic characterizations of various spaces. The presentation is by no means complete. Only basic facts are chosen to show how spaces, rich in structure, are created from simple ones. Most of the proofs are omitted as they can be found in standard textbooks. The ones presented here are either new or have an educational value. We think that this chapter is essential to build an understanding of various spaces, even if the theory can be mostly found in standard books on algebra, functional analysis and linear spaces. The goal of our presentation is to prepare for the introduction of Krein spaces and dissiniilarity representations later on, and to make the reader aware of specific properties of a Euclidean space, which do not necessarily hold in other spaces. This is an important point as our scientific intuition often relies on early experiences with low-dimensional Euclidean spaces. This niay lead to hidden assumptions or expectations, which prevent us from investigating more general learning theories. Although Euclidean (Hilbert) spaces proved to be good frameworks for statistical learning paradigms, there are many practical problems that can-
32
The dzssimilarity representation foT pattern recognition
not be directly explained there. Such spaces have strong limitations, as they are rich in algebraic structure. One, therefore, needs to make compromises or to investigate other possibilities, which would hopefully lead to alternative approaches. 2.3
Generalized topological spaces
Stantlard t,extbooks on topology define a topology on a set X by the means of a collection of open sets. Open sets are the basic notion of topology. For instance, in application to digital image processing, they are used to construct a digital topology on the 'integer planei ZXZ; see [Khalimsky, 1987; Khalirnsky et al., 1990; Kong et al., 1991, 19921. When topology is discussed in norined vector spaces, the norm defines a metric distance, which is used to construct open ball neighborhoods, defined as B,(z)= {y E X : d ( z . y ) < E } , for positive E . These open sets determine the natural topology in metric spaces. The concept of neighborhood is, however, more fundarricntal than the concept of distance, since a metric (normed) space is already a high-level construction with a high degree of geometric structure; see Figs.2.1, 2.2 and 2 . 3 . Topology can be derived in a bottom-up way, whcrc the notion of a norm or a distance is not yet available. This can be achieved by the use of neighborhoods or generalized closure operators. For an introduction to standard topology, see e.g. the books of [Willard, 1970; Munkres, 20001. For more general topics, please refer to the following books [Sierpiiiski, 1952: G a d , 1964: Cech, 1966; Kothe, 1969; Kelley, 19751, as well as the articles [Day, 1944; Gastl and Hammer, 1967; Gnilka, 1994, 1995; Stadler et al., 2001, 2002; Stadler and Stadler, 2001bl. Our point is that neighborhoods or generalized closure operators can be considered as basic concepts t o build a (pre)topological space and t o express the relations between objects. This can be especially advantageous when one directly works with a represcntation domain of objects, such as a collection of strings. Since our analysis starts from dissimilarity relations between a set of' examples, the neighborhoods will be defined by the use of dissimilarities in generalized metric spaces. One of the most crucial characteristics a space should reflect is the notion of closeness (nearness), i.e. being able to tell whether two elements are near or not. Note that, at the most basic level it might be impossible to distinguish that an element n: is nearer to z than to y , although it can be judged that z is near to both y and z . So: the relation of nearness may be based on the relations between sets and may not be quantitative. It does
Spaces
33
not need tjo be symmetric, i.e. J: can be near to y, but not vicc versa. (The nearness can also be seen as an asymmetric resemblance relation, e.g. of a child to a parent.) A further study in this direction may lead to the so-called proximity spaces [Willard, 1970; Cech, 19661. A possible formalization of the notion of nearness for the set X can be made by defining for each element II: E X a collection of subsets of X , called neighborhoods of 2 . Intuitively, the basic properties of neighborhoods should be that each element, 5 is contained in all its neighborhoods, any set containing a neighborhood is a neighborhood, so consequently the entire set is the largest neighborhood of each of its points. Below, formal definitions are presented. Definition 2.12
(Generalized topology via neighborhoods)
Let
P ( X )be a power set of X , i.e. the set of all subsets of X . The neighborhood function hf: X + P ( P ( X ) )assigns to each x E X the collection n/(z)of all its neighborhoods of z such that:
(1) Every
:L'
belongs to all its neighborhoods:
Y 2 c x Y N E N ( ~2 )E N . (2) Any set containing a neighborhood is a neighborhood: =+ hfEN(2)). vINEN(Z)v'nrcx ( N c ( 3 ) The intersection of t,wo neighborhoods is a neighborhood: ~ I N ; M a v ( 2N ) nM E N ( Z ) . (4) For any neighborhood of IC, there exists a neighborhood of a neighborhood of each of its elements:
2
that is
Y N E N ( Z j 3M€N(Zj Yiy€,R.l ME"Y).
The pair ( X ,N ) with N satisfying the first two requirements is a neiyhborhood space [Gastl and Hammer, 19671. The pair ( X , N ) ,obeying conditions (1) ( 3 ) is called a pretopological space. If all conditions are satisfied, then ( X ,N ) becomes a topological space. ~
Definition 2.13 (Neighborhood basis) A subfamily &(x) of the neighborhood system N ( x )is a neighborhood basis (or a local basis) at :I: if the following conditions are fulfilled:
(1) V N E I v B ( 5 ) Z E N . (2) YN,"tND(Z) 3MENB(2)M
c N n N'.
A neighborhood basis uniquely defines a pretopological space. This follows, since a neighborhood system satisfying the conditions (1)-(3) of Def. 2.12 is built by taking all subsets of X larger than the basis neighborhoods, i.e. N(x) = {M ⊆ X : ∃ N ∈ N_B(x), N ⊆ M}. Therefore, instead
I
I
(b)
Figure 2.4 Illustration of neighborhoods. (a) Examples of neighborhoods of x from the set X. (b) A nested neighborhood basis of X. (c) A neighborhood N of a set Y ⊆ X; dashed ovals correspond to neighborhoods of elements y ∈ Y.
of considering a complete neighborhood system, only a neighborhood basis can be used for the definition of pretopology. Note that a pretopological space may have many bases, each of them capable of describing the entire space. Neighborhood systems are a general tool to represent knowledge on relations between the elements of a set X. A neighborhood of an element x is somewhat similar to x; however, since it is a set, its elements are not distinguishable from x. When additional structure is added, the elements in neighborhoods may become distinguishable. For instance, neighborhoods can be defined by the use of binary relations, similarity and dissimilarity measures or hierarchic systems. See also Fig. 2.4 for an illustration of neighborhoods.
Example 2.2 (Neighborhood bases) (1) Let X = {a, b, c, d, e}. The neighborhood basis emphasizes particular relations between the elements. For instance, for the relations on the right side below, it is defined as:
N_B(a) = {{a}, {a, b, c}}.          N_B(b) = {{a, b, c}}.
N_B(c) = {{c}, {a, b, c}, {c, d, e}}.   N_B(d) = {{c, d, e}}.
N_B(e) = {{e}, {c, d, e}}.
Extension of the above neighborhood relations to a set of integers is the Khalimsky line, used to define a digital topology [Khalimsky, 1987; Khalimsky et al., 1990]. (2) Let p : X × X → R0+ be a general dissimilarity measure as in Def. 2.45, such that p(x, x) = 0. Then B_δ(x) = {y ∈ X : p(x, y) < δ} is a neighborhood of x for a given δ > 0. The neighborhood basis is then defined as N_B(x) = {B_δ(x) : δ > 0}.
(3) Let X be a set. A hierarchical clustering (see Sec. 7.1) can be seen as a successive top-down decomposition of X, represented by a tree. The root describes the complete set and it is the largest cluster. Its children nodes point to a decomposition of X into a family of pairwise disjoint clusters. Each cluster can be further decomposed into smaller clusters, represented by nodes in the tree, down to the single elements in the leaves. In this way, sequences of nested clusters are created. A neighborhood of x is a cluster C_h at the level h in the subtree containing the leaf x. Then N_B(x) = {C_h : x ∈ C_h}. Note that the requirement of disjoint clusters at each level is not essential for the definition of N_B(x).
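Item (2) of the example above translates directly into code. The sketch below (an illustration added here, with a made-up dissimilarity matrix) computes the neighborhoods B_δ(x) = {y ∈ X : p(x, y) < δ} from a precomputed matrix and collects a nested neighborhood basis.

```python
import numpy as np

def delta_neighborhood(D, i, delta):
    """B_delta(x_i) = { x_j : p(x_i, x_j) < delta }, computed from a
    precomputed dissimilarity matrix D with zero diagonal."""
    return set(np.flatnonzero(D[i] < delta))

def neighborhood_basis(D, i, deltas):
    """A nested family of delta-neighborhoods serving as a basis at x_i."""
    return [delta_neighborhood(D, i, d) for d in sorted(deltas)]

# Hypothetical dissimilarity matrix on five objects (not necessarily metric).
D = np.array([[0.0, 0.3, 1.2, 2.0, 2.5],
              [0.3, 0.0, 0.9, 1.8, 2.2],
              [1.2, 0.9, 0.0, 0.7, 1.5],
              [2.0, 1.8, 0.7, 0.0, 0.6],
              [2.5, 2.2, 1.5, 0.6, 0.0]])
print(neighborhood_basis(D, 0, deltas=[0.5, 1.0, 3.0]))
# [{0, 1}, {0, 1}, {0, 1, 2, 3, 4}]
```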
Definition 2.14 (Neighborhood of a set) Let (X, N) be a pretopological space and let Y ⊆ X. Then N is a neighborhood of Y iff N contains a neighborhood N_y of each y ∈ Y. The neighborhood system for Y is then given by N(Y) = ∩_{y∈Y} N(y). See also Fig. 2.4(c).

Definition 2.15 (Open and closed sets via neighborhoods) Let X be a set. A ⊆ X is an open set if it is a neighborhood of each of its elements, i.e. ∀ x ∈ A, A ∈ N(x). A is a closed set if (X\A) is open.
A neighborhood function N defines a generalized topology on the set X, as presented in Def. 2.12. Neighborhoods can be used to define generalized interior and closure operators, which may further define open and closed sets, the basic concepts in a topological space. Since the properties of the neighborhood, closure and interior functions can be translated into each other, they are equivalent constructions on X. For instance, a generalized closure can be considered as a principal concept to define other operators on sets [Gastl and Hammer, 1967; Stadler et al., 2001; Stadler and Stadler, 2002].

Definition 2.16 (Generalized closure) Let P(X) be the power set of X. A generalized closure is a function ⁻ : P(X) → P(X) which for each A ⊆ X assigns A⁻ ⊆ X such that ∅⁻ = ∅ and A ⊆ A⁻.
The generalized closure is not idempotent. This means that for A ⊆ X, the condition A⁻⁻ = A⁻ does not necessarily hold, as required for the topological closure. The interior function and the neighborhood system N can now be defined by the generalized closure.

Definition 2.17 (Generalized interior) Let P(X) be the power set of X. A generalized interior is a function ° : P(X) → P(X) which for each subset A
T h e dissimilarity representation for pattern recognition
Table 2.1 Equivalent axioms for the neighborhood system and the generalized closure operator. X is a set and A, B, N, M represent any of its subsets. Axioms (1)-(3) describe neighborhood spaces, axioms (1)-(4) define pretopological spaces and axioms (1)-(5) define topological spaces. (The two columns of the table pair each property (1)-(5) in its neighborhood form, Def. 2.12, with its closure form, Def. 2.19.)
Propcrties
(5) Idempotent
of X assigns a subset A" of X such that A" one can write that A- = X\(X\A)".
=
X\(X\A)-.
Equivalently,
Definition 2.18 (Neighborhood system) The neighborhood N : X + P ( P ( X ) )is a function which for each Z E X assigns the collection of neighborhoods defined as N(z) = { N E P ( X ) : II: $ ( X \ N ) - } . Equivalently, one can write that Z E N (X\N)$N(z).
*
Definition 2.19 (Generalized topology via closure) Let P ( X ) be the power set of X . Consider a generalized closure - : P ( X ) + P ( X ) with the following properties: (1) 0- = 0. ( 2 ) Expansive: VACX A C: A-. (3) Monotonic: V A , n-c x A C B jA(4) Sublinear: VA.BCX ( AU B ) - C A(5) Idempotent: VACX A-- = A-.
B-
u B-
If axioms (1) (3) are fulfilled, then ( X . - ) is a neighborhood spacc. If axionis (1) (4) hold, then ( X ,-) is a pretopological space. If all conditions are satisfied, (X. -) defines a topological space; see also Table 2.1. -
~
Corollary 2.1 Axioms given in Table 2.1 are eyuzvalen,t. Proof. Let X be a set and let N . M be any subsets of X . Then the basic fact in set theory is that following equivalence N C:M @ ( X \ M ) C ( X \ N ) holds. In the proof, we will make use of this and Def. 2.18, in which the generalized closure is defined by the neighborhood system. The latter means that Z E N - H ( X \ N ) @N(z). The proof follows.
Spaces
37
(1) Assume @ = 0- holds for every zEX. From Def. 2.18, 'dZEx~9'0-H ' d z a z @ ( X \ W @ Y 7 x x X EN(Z). (2)
Assume that the generalized closure is expansive. Let
:1: E
X arid
N E N ( ~ By ) . Def. 2.18, this is equivalent to z @ ( X \ N ) - . Making use of the expansive property of the closure, on has (X\N) C: (X\N)-. I t follows that X\(X\N)C ( X \ ( X \ N ) ) = N. For any z E X the following equivalence z @ (X\N)- H z E X\(X\N)holds. Since X\(X\N)- C (X\(X\N)) = N, then z E X\(X\N)+ :I: E N . Therefore, z; @ (X\N)z E N . Hence, we have proved that N E N(z) X€N.
*
x E X and N E N ( z ) . Assume that N E N ( x ) + z E N holds for any N C X . By Def. 2.18, for any 1c one has 1c E N + :I:# (X\N) + (X\N) @ N ( x@ ) z E N-.As z E N + z E N - consequently, N C N-.
-+== Let
( 3 ) Let z E X . Assume that N EN(^) and N C M . The latter is equivalent to (X\n/l) C ( X \ N ) . Since the generalized closure is monotonic, N C A!! @ ( ( X \ M ) C ( X \ N ) ) + ( ( X \ M - 2 ( X \ N ) - ) holds for all N , M C X . The latter set relation is equivalent to stating that x 9' (X\N)- + n: @ ( X \ M ) - , which by Def. 2.18: is equivalent to N E N(z)+ A f ~ " ( z ) . Since ( N € N ( z )A N C &I), then MEN(^).
C N - U M - hold for all N ; Af i X . Assume that N , M E N ( z ) . Replacing N by (X\N) and M by ( X \ M ) , one gets: ( ( X \ N ) U ( X \ M ) ) - 2 (X\N)- U (X\A,f)-. Herice 12: E ( ( X \ N ) U (X\M))- + ( Z E (X\N)- V z E ( X \ M - ) , which is equivalent to { z # ( X \ N ) - A z g ( X \ M ) - + z $ ( ( X \ N ) U ( X \ M ) ) - } . Since N , MEN(rc) and from de Morgan's law (X\N)U(X\M)= X \ ( N n M ) , the latter implication is equivalent to (N E N ( z ) A M E N(x:))+ ( N Ti M) € N ( z )by Def. 2.18.
(4) Let ( N U A d -
(5) Let z E X and N E N ( z ) . Assume that the generalized closure is idenipotent for all subsets of X . Therefore, one can write ( X \ N ) - = ( X \ N ) ) . BasedonDef. 2 . 1 8 , o n e h a s N E N ( x ) w z ; @ ( X \ N ) - -s (X\N)-- ++ (X\(X\N)-) ~ n / ( z )Let . M = X\(X\N)-. Then M E n/(:c) by the reasoning above. For all y, the following holds y E Ail ++ y @
( X \ W @ Y e (X\N)- @ Y @ (X\N)-- ++Y @ X\(X\(X\N)-)'++ (X\Ad)- @ M E N ( y ) , by Def. 2.18. Hence, we have shown that Y N E N ( ~3~.r=(x\(x\,v-)~,v(~) ) 'dy~ns EN^). 0 y@
T h e dissimilarity representation f o r p a t t e r n recognition
38
N(z) A EN(z)
Neighborhood
A c X A is open
Closure A-
InteriorA'
A = X\(X\A)-
A = A'
The difference between pretopological and topological spaces lies in the notion of a closiirc operator. In a topological space, the closure of any set A is closed, A-- = A - , and the interior of any set is open, (A")" = A". In a pretopological space, this is not necessarily true: so the basis neighborhoods are not open. Here, the generalized closure operator expresses the growth phenomenon, where the cornposition of several closures results in successive augmentations, i.e. A 5 A- C A-- C . . . .
Example 2.3 (Pretopological and topological spaces) (1) Let X be any set and let S : X X X + P ( X ) be a symmetric relation, i.e. S ( x , y ) = S ( y , x ) . Assume a generalized closure of A C: X be S ( x ,y). Then (X, - ) is a neighborhood space, defined as A- = since the generalized closure obeys conditions (1)-(3) of Def. 2.19. (2) Let X be a finite set and ( X , E ) be a directed graph. Let F ( x ) be a set of the forward neighbors of x, i.e. F ( x ) = EX: ( T C , ~E) E } . Let A X . By axioms of Def. 2.19 it is straightforward to show that the closure A - = U Z E A ( F ( xU) {x}) defines a pretopological space ( X ,-). (3) Let N B ( ~ = ){ ~ E R Ix : - y1 < E A E > 0). Then (R,&) defines a topological space. (4) Let &(z) = { ( u . m ) : U E I W A X E ( a , ~ ) }Then . (R,NB) defines a topological space.
c
Corollary 2.2 (Open and closed sets) Let ( X ,- ) be n neighborhood space de,fined by t h e generulized closure, i.e. conditiom (1)-(3) of De,f. 2.19 hold. A 2 X i s an open set if A" = A. A is a closed set if A- = A; see also Table 2.2. Th,e followin,g holds:
(i)
AEN(TC H )A = X\(X\A)-.
Y Z E ~
(2) A
=
A"
A
= X\(X\A)-.
Pro0f. (1) Assiimc that Y r E ~A E N ( x ) holds. By Def. 2.18, Y r E ~A E N ( x ) @ Y l r g ~x $ ( X \ A ) - @ Y l s € ~xEX\(X\A)-. Hence A = X\(X\A)-. (2) A = A" = X\(X\A)- by Def. 2.17.
Spaces
39
Lemma 2.1 Let ( X , N ) be a neighborhood space. T h e assertions: (1) V I N E , V (3~A)4 C N ( z ) V y t &JEN(Y) ~ and (2) VN(NE N ( z ) @ No E N(2:)) are equivalent. The proof of Corollary 2.1, point (5) shows that VINE~u(a) VyEnr hf E N(y). Since M = N o by Def. 2.17, 3M=(X\(X\N)-)EN(zL.) then No E N ( x ) . 0
Proof.
A collection of open sets containing z constitutes a neighborhood basis in a topological space, which can be proved by Lemma 2.1. Equivalently, since the closure operator is dual to the interior operator, a neighborhood basis in a topological space can be built by a collection of closed sets coritaining x. Lemma 2.2 Let (X,&) be a pretopological space. If all neighborhoods oj = No} ,for all x E X , t h e n (X,NB) is a topological space.
NB are open sets, or NB(x) = {N C X : x E N A N
Corollary 2.3 (Closure on neighborhoods) Let ( X , - ) be a neighborhood space. T h e n a genm-alized closure operator i s a function, P ( X ) + P ( X ) , defined as gcl(A) = {x E X : V N t ~ ( r ) A n N # Moreover, gcl(A) = A-.
a}.
In order to prove that gcl(A) = A- holds for any A C X, we will equivalently show that (z$gcl(A)) H (x$A-) holds for all z t X .
Proof.
=+ z $gcl(A) + 3 ~ ~ , vN (n~A )= 0.By Def. 2.18, the latter is equivalent t o (x$(X\N)- A N n A = 0). Since (N n A = 0) + (A C X \ N ) , then by the monotonic property of -, A- C (X\N)- holds. Since z $ (X\N)-, then x#A-. +=
By Def. 2.18, then z@gcl(A).
(.$Ap) + ((X\A) @ N ( x )holds. )
Since (X\A) n A = 0,
0
Definition 2.20 (Limit point) Let ( X , N ) be a neighborhood space. An element y E X is a limit of A C X iff for every neighborhood N EN(^), N intersects A\{y}. The set of all limits points is called the derived set, der(A) = EX: V N c ~ ( y )(A\{y}) n N # 0} [Sierpinski, 19521. Corollary 2.4 In a neighborhood space, der(A) tains all its limit elements and conversely.
C A - . A closed set
con-
The notion of corivergencc is important in neighborhood (pretopological arid topological) spaces. Recall that a sequence x, in X is a function from N to X: hence f ( n ) = zn. The order of elements in z , is, thereby, important.
40
T h e dzssimilarity representataon f o r p a t t e r n recognatzon
The sequence x,, is different from a set { x : n } ~ ?which l, is simply indexed by W. One would say that a sequence xn converges to x E X in a neighborhood space ( X . N ) if for every neighborhood N c N ( z ) ,there exists k E N such that xn E N for all n 2 k . The problem with this definition is, however, that neighborhoods may have an uncountable number of elements arid countable sequences may not capture the idea of convergence well. In general, convergence is defined by the use of filters, which are generalization of sequcnces.
Definition 2.21 (Filter) A filter on a set X is a collection of X such that (1) (2)
YFE3
F
F of subsets
# @.
~ F" ~ C g (~F n F'). ( 3 ) b ' p ~v ~~F ( C F' + F ' E 3 . ' d p . p i e 3~ p
If 3 satisfies only the first two conditions, then
F defines a filter basis.
Note that given a filter 3 on a set X and a function f : X + Y , the set f ( 3 )= {f(A): A t 3 } forms a filter base for a filter of the function f .
Definition 2.22 (Convergence) Let ( X ,JV)be a neighborhood space. A filter 3 converges to z E X , 3 4x if V N E ~ / ( s ~) F F~ 2 FN . One may easily verify that a neighborhood system N(z) of an element X in a prctopological space ( X , A f ) ;compare t o Def. 2.12. One may, therefore, imagine a set of nested subsets (neighborhoods) of an element 2 that defines the convergence to x. If one is given a sequence of elements 2 , for n E W, then a filter basis can be defined as {PI,k EN},where Fk is a subsequence of x ,starting from the element x k , i.e. (xk,zk+l,. . .). :c is a filter on
Definition 2.23 (Hausdorff space) A neighborhood space ( X . N )is Hausdorfl or T2 if every two distinct elements of X havc disjoint neighborhoods. i.e. Yr,yEX 3 N z E ~ ( Nz )g ,E ~ ( y )Nzn & = 0. Lemma 2.3 Every convergent filter in a Hausdorfl space has u unique limit. Functions and, especially, continuous functions are basic tools in applications of various spaces. The basic intuition of a continuity is that small changcs in the input produce small changes in the corresponding function output. where 'small' is expressed by a chosen distance. In general neighborhood spaces, one can only work with sets.
41
Spaces
Definition 2.24 (Continuity by filters) Let f : ( X , N )+ ( Y . M ) be a function between two pretopological spaces. f is continuous at :I; E X if for all filters 3 on X if .F + x, then f ( F ) f (x). --f
Definition 2.25 (Continuity by neighborhoods) Let f : ( X , N ) + (Y.M ) be a function between two neighborhood spaces. f is continuous at z E X if for each neighborhood M of f ( z ) in Y , there exists a neighborhood N of z in X , whose image lies entirely in M . f is contiriiious on X if it is continuous at every x E X . Formally, f is continiloils if holds for all EX. Yin/ic,u(f(.)) ~ N E N ( . ) f ( N ) C Theorem 2.2 (On continuous functions) Let f : ( X , N )+ ( Y , M ) be a f u n c t i o n between two neigh,borhood spaces. T h e following assertions are equi.ualent [Gnilka, 1997; Munkres, 20001: 1. f i s continuous at x. 9. For all x E X , B E M ( f ( 2 ) + ) f - ' ( l ? ) ~ N ( z. ) 3. For every set A E P ( X ) ,f ( A - ) C ( f ( A ) ) -. 4. For eiie7-77 set B E P ( Y ) ,( f - l ( B ) ) -C f - l ( B - ) . 5. For every set B E P ( Y ) ,f - l ( B " )C ( f - l ( B ) ) " . Note that in topological spaces, continuity of a function translates to the fact that the preimage of an open (closed) set is an open (closed) set.
Remark 2.1 T h e composition of finitely many contin,uous mappin,gs i s a continuous mapping. Definition 2.26 (Regular space) 0
0
A neighborhood space ( X , N ) is regular if for each neighborhood N of z E X , there exists a smaller neighborhood M of z whose closure is contained in N ? i.e. Y N ~ N ( ~31\.lc~(.) ) M - C N. A topological space is regular if every neighborhood of n: contains a closed neighborhood of x. It means that the closed neighborhoods of' I(: forin a local basis at z. In fact. if the closed neighborhoods of each point in a topological space form a local basis at that point, then the space milst be regular.
Definition 2.27 (Normal space) 0
A pretopological space is normal if the separation of the closures of two sets imposes the existence of their disjoint neighborhoods, i.e. if for nonempty sets A and B , one has (A-nB-= 0) ( ~ N * , N( ~A C NA)A(BC N B ) A (NAn N B = 0)) [Cech, 1966: Stadler and Stadler, 20021.
*
42
The dissimilarity representation for p a t t e r n recognition
Table 2.3 Properties in Neighbood spaces
Regularity axioms
Separation axioms
A topological space is normal if the separation of two closed sets imposes the existence of their disjoint neighborhoods. Neighborhood and (pre)topological spaces can be classified with respect to the degree to which their points are separated, their compactness, overall size and connectedness. The separation axioms are the means to distinguish disjoint sets and distinct points. A few basic properties are presented in Table 2.3 and scheniatically illustrated in Fig. 2.5 [Cech, 1966: Stadler and Stadler, 2001b; Munkres, 20001.
Definition 2.28 (Completely within) A set A is completely within B in a neighborhood space ( X , N ) if there is a continuous function 4 : ( X . N )+ [O. 11 such that 4 ( A )C (0) and 4(X\B) C (1). Therefore, A C B. Different pretopological spaces can be distinguished by the way they 'split up into pieces'. The idea of connectedness becomes therefore useful.
Definition 2.29 (Connectedness) A space X which is a union of two disjoint non-empty open sets is disconmected, and connected, otherwise. Equivalently, a space X is connected if the only subsets of X which are both open and closed are the empty set arid X . Definition 2.30 (Cover) Let X be a set. A collection of subsets w C X is a coiier of' X if X = U w . A cover is finite if finitely many sets belong to it. If w and w' arc covers of X , then w' is a subcover if w' c w .
Spaces
REG
QN
43
--
TO
TI 1
T2
T2 t
T3
T,
t
0
Figure 2.5 A pictorial illustration of the regularity and separation properties; based on [Stadler and Stadler, 2001bl. Neighborhoods are drawn as ovals and closures are indicated as filled ovals. The regularity condition REG demands for each neighborhood N the existence of a smaller neighborhood whose closure is contained in N. The quasinormality axiom Q N requires that the separation of the closures of two sets iniposes the existence of their disjoint neighborhoods. To means that for two distinct elements, there exists a neighborhood of one of them such that it does not contain the other element. TI states that any two elements have neighborhoods with the property that the neighborhood of one element does not contain the other element. Tz imposes the existence of disjoint neighborhoods for any two elements. T' asks for the existence of neighborhoods such that their closures axe disjoint, for any two elements. T3 demands that for each neighborhood N, there is a set h!! which is completely within N.
Definition 2.31 (Compact space) A topological space X is compact if every open cover has a finite subcover. A topological space is locally compact if every element has a compact neighborhood. Theorem 2.3 Let f : X + Y be a continuous function betweesn topological spaces. If A is a compact subset of X , then f ( A ) i s a, compact subset of Y . Theorem 2.4 A closed subset of a compact set is compact. A compact subset of a Hausdorfl space i s closed. Definition 2.32 (Dense subset) Subset A of a topological space ( X .- ) is dense in X if A- = X . Equivalently, whenever N , is an open neighborhood of Z E X the ~ set NT n A is non-empty. Definition 2.33 (Size) 0
0
A topological space is
separable if it is a closure of a countable subset of itself, or in other words if contains a countable dense subset. first-countable if every element has a countable local basis; see also Def. 2.13.
44
T h e dissimilarzty representation f o r pattern recognition
second-countable if‘ it has a countable basis for its topology. Second-countable spaces are separable, first-countable and every open cover has a countable subcover. Example 2.4 1. Every topological spaces is dense in itself. 2. Let (R,Ng) be a topological space withNB(x) = ( ( ~ - - E , Z + E ) : E > O } . Then by Corollary 2.3, the set of rational numbers Q is dense5 in R, i.e. Q- = R. Consequently, Iw is separable, as Q is countable. More generally, R” is separable. 3. A discrete topological space ( X , N g ) is a space with Ng(x) = {z}, i.e. the basis consists of single elements. This means that every subset of X is both open and closed. Every discrete space is first-countable and second countable iff it is countable.
Definition 2.34 (Topological product space) Suppose X i , i = 1:2. . . . , *rL are given sets. The set X of all n-tuples ( X I ,2 2 , . . . , x T L )z ,iE X i is a Cartesian product X = X ~ X X ~ . X xX, . .= X i . Let ( X i , N i ) , i = 1 , 2 , . . . , n be (pre)topologicaI spaces. (X,n/) is a (pre)topological product space if n/(z)= hii(xi) is a neighborhood basis of Z E X .
n;=,
Remark 2.2 T h e definitions above can be extended t o a n y (countable or n o t ) ,family of topological spaces. T h e mapping 7ri : x + z i i s a projection o,f X onto X i . It is a continuous mapping and the topology defined o n X is the weakest topology for which all the projections ~i are continuous [Kothe, 19691. Topology (pretopology) can be introduced on a set in many ways. It can be defined by a collection of open sets, or generated by a neighborhood basis, a (generalized) closure or other operators. The way it is introduced specifies particular ‘closeness relations’. One should, however, remember , that new topologies can always be added to a set. Some topologies can be cornpa,red, however not all of them are comparable.
Definition 2.35 (Weaker and stronger topologies) Let X be a set arid let H , M be two neighborhood systems defined for every x E X . The topology defined by N , the N-topology is stronger (finer) than the topology defincd by M , the M-topology if for each x E X every neighborhood ‘Informally, one may think that a subset A is dense in X if the elements of A can ‘approximate’ the elements of X with the arbitrary precision with respect to X .
Spaces
45
M E M ( x ) is also a neighborhood of n/(z). It means that N has more neighborhoods than M . The M-topology is then ,weaker (coarser) than the N-topology. If' neighborhood bases NB arid M B are considered, then the N-topology is stronger than tlie M-topology if for each z E X and every basis neighborhood M B E M B ( Z ) there , is a basis neighborhood of N B E N B ( such ~ ) that N B c MB. If finitely or infinitely many topologies are defined by N, on a set X : there is tlie strongest, (finest,) topology specified by n/ among the topologies on X which are weaker (coarsest) than every n/,-topology. This nieaiis that every neighborhood of n/(x) is a neighborhood of N, for every 0.
Definition 2.36 (Homeomorphism) A bijective function6 f : X i Y between two topological spaces (X,n/) and (Y,M ) is a ho,meomorphism if both f and f - l are continuous. The spaces X arid Y are homeoniorph,ic. The homeomorphisms form an equivalence relation on the class of all topological spaces. Therefore, homeomorphic spaces are iridistinguishahlc as topological spaces; they belong to the same equivalence class. Two homeomorphic spaces share the same topological properties. e.g. if one is compact, connected or Hausdorff, then the other is as well. This also means that a set N g N ( z ) is open in X iff the set f ( N )~ M ( f ( x is ) )open in Y . Moreover, a sequence z, converges to z iff the sequence f ( ~ converges , ~ ) to
f (x). Remark 2.3 T h e identity m a p I : ( X , N ) ( X , N ) ,where I(%) = z is a homeomorphism when the same topology (neighborhood systems) are used over the domain and the range of the map. In general, it i s not true, if two di.fferent topologies are defined o n X . Let N B ( z )= X and M n ( x ) = {x} be the neigh,borhood bases for all 2 E X . T h e n N consists of X and M is u power set of X (without the empty set). B y Dgf. 2.25, I i s con,tinuous at z z f f o r all hl E M ( z ) there exists N E N ( : c ) such, that f ( N ) C A f . As N = X fm- all z and there exists hf = {x} such that f ( X ) @ {x}, then I : ( X ,N ) 4 ( X ,M ) i s discontinuous at each point IC.
Proposition 2.1 Let n/ and M be two neighborhood systems de,finined o n a topological space X . Th,e identity m a p I : ( X ,N ) 4 ( X ,M ) is continluous iff the N-topology is stronger than the M-topology. 6A bijective function f always has an inverse f-', even if f is.
but not necessarily continuous,
46
The dassemalaraty representataon f o r p a t t e r n recognition
Equivalence relation on a set is a binary relation between its elements, such that some of them become indistinguishable by belonging to the same class. In the study of spaces, a quotient space is the result of identifying such classes by an equivalence relation. This is usually done to construct new spaces from given ones.
-
-
Definition 2.37 (Quotient space) Let ( X , N )be a topological space and let be an equivalence relation on X. Denote by X/ the set of equivalence classes of X under -. Let i7 : X + X/ be the projection map which sends each element of X to its equivalence class. The quotient topology on X/ is the strongest topology (having the most open sets) for which 7r is continuous.
-
Remark 2.4 If X is a topological space and A c X , we denote by X / A a quotient space of the equivalence classes X/ under the relation x y zf x = y or 2 , y E A. So for x @ A , {x} is a n equivalence class and A is a single class.
-
2.4
-
Generalized metric spaces
A set can be augmented with a metric distance, or a structure weaker than metric, which leads to generalized metric spaces. A metric can also be introduced to vector spaces. They are, however, discussed in thc subsequent section. Most of the material presented here relies on the following books [Bialynicki-Birula, 1976; Blumenthal, 1953; Dunford and Schwarz, 1958; Kiithe, 1969; Kreyszig, 1978; Willard, 19701.
Definition 2.38 (Metric space) A metric space is a pair ( X ,d ) , where X is a set and d is a distance function d : X X X i R; such that thc following conditions are fulfilled for all x,y, z E X : (1) Reflexivity: d ( x : x )= 0. (2) Symmetry: d(x,y) = d ( y , z). ( 3 ) Definiteness: ( d ( z ,y) = 0) + (x = y). (4) Triangle inequality: d ( z , y ) d(y, z ) 2 d(x,z ) .
+
For instance, X can be R", Z", [a,bIm, or a collection of all (bounded) subsets of [ a ,b]". If X is a finite set, e.g. X = { X I ,5 2 , . . . , x n } >then d is specified by an n x n dissimilarity matrix D = (&), i,, j = 1,.. . , n such that di,i = d ( x i , zj).Consequently, the matrix D is nonnegative, symmetric and has a zero diagonal.
Spaces
47
Example 2.5 Examples of metric spaces: 1. Let X be any set. For x,y E X , the discrete distance metric on X is given by d(x,y) = Z ( x # y), where Z is the indicator (or characteristic) function. If X is a finite set, then all the pairwise distances can be realized by points lying on an equilateral polytope (extension of an equilateral triangle and of a tetrahedron). 2. Let X be a set of all binary sequences of the length 111. Given two binary strings s = ~ 1 ~ 2. s,, . and t = t l t 2 . . . t,, the Hamming distance is defined as d H a m ( s , t ) = Z(sk # t k ) . 3. Metrics in a vector space R". To emphasize that a vector x comes form a finite-dimensional vector space Rm,we will mark it in bold:
c;;r"=,
(a) d, (x,y)= (CEl (xi- yzi")h, p 2 1; a general Minkowski distance. (b) dl (x,y) = Ixi - yil,the city block distancc. (c) dz (x,y) = d~ (x,y) = (CE,(zz- yi)')+, the Euclidean distancc. (d) d, (x.y ) = d, ( x y~) = maxi (xi- gi(,the max-norm distance.
El"=,
4. Let F ( 0 ) be a set of real-valued functions defined on a bounded and closed set 0. Let M ( 0 ) c F ( 0 ) be a set of function cl which are Lebesgue measurable on R. Then L f = {.f E M(62) : (Jn If(x)l"dz)l/l-'
(hi
( f ( t )- d t ) I pd t ) (a) F = Lf, d,M(f,3) = (b) 7 = d 3 f ; d = "PtEn lf(t) - s(t)l.
Lg>
Similar metrics can be defined on the set of continuous fuiictiori C(n). 5. Let X be a set and (Y,d)be a metric space. The set of all bounded functions f : X i Y can be considered as a metric space with the distance d ( f , g ) = suparX d ( f ( z ) , g ( z ) ) for any bounded functions f and y.
Theorem 2.5 (Backward triangle inequality) In a m,etrzc space ( X ,d ) , th,e backward triangle inequality, (d(.lc,z ) - d ( y , z ) ( 5 d ( z . y), holds for a11 x,y . z r X . 7Note that i f f is zero nearly everywhere on R except for a countable number of points, then f is a non-zero function, but I f ( z ) I P d z ) ' / p = 0 for p 2 1. Hcnce, such f is indistinguishable from a zero-constant function. Thercfore, an equivalence relation is introduced such that f g if (,[, if(.) - g ( z ) l p d z ) ' / P = 0. One, therefore, considers function classes defined by the equivalence classes.Since now on, M(CLL)will denote such
-
function classes
(s,
-
T h e dissimilarity representation f o r pattern recognition
48
Proof. By triangle inequality, d ( z ,z ) d(y, z ) 5 d ( z ,y) and d ( y , z ) d ( x , z ) 5 d(y,x). Since d ( y , x ) = d ( z , y ) , the latter inequality becomes 0 - ( d ( z , 2 ) d ( y , z ) ) 5 d ( z , y ) . Hence, the inequality follows. ~
~
~
Theorem 2.6 (Natural topology in metric spaces) Every metric space (X, d ) with a n open bull neighborhood basis i s a topological Hausdorff space.
Proof. Let B E ( x ) {y E X : d ( z ,y) < E } be an open ball. To show that ( X ,d) is a topological space, it is sufficient to prove that the neighborhood basis NB(z)= { B E ( z:) E > 0) defines a topology on X. Below, we show that indeed NB(II;) is a neighborhood basis, i.e. the axioms of Def. 2.13 are fulfilled. ( I ) Obviously, d(x,x) = 0 < E , which means that z E B,(x)for any E > 0, so axiom ( I ) is satisfied.
(2) Consider any BE(x) and B,(z) for ~ , >q 0. Let C 5 min{E,q}. Then d ( x , y ) < < 5 rnin{&,q}, which means that if ~ E B then c , y~B,(x)flB,(z), hence the inclusion B<(z) C B E ( x )fl B,(z) holds. Therefore, axiom (2) is fulfilled and the &-ballsare the local basis. By Lemma 2.2 we need to prove that the &-ballsare open sets. By Def. 2.15, B E ( x )is an open set iff for all y E BE(x) + B E ( z )E N(y). We will first show that for each y ~ B , ( x )there exists q such that B,(y) C B,(x).Let y ~ B , ( x ) .Let q be such that 0 < q 5 &-dd(z,y). Let z ~ B , ( y ) .This means that d ( y , 2 ) < rl 5 E - d ( z ,y), which leads to d ( z ,y) +d(y, z ) < E . By the triangle inequality, we have d ( z , z ) < E , which stands for ,zEB,(z). Hence, we show that z E B , ( y ) =+ z E B , ( z ) , hence B,(y) C BE(x). By axiom (2) of Def. 2.12, any set enclosing B,(y) belongs to N(y),hence B E ( z )EN(^) for all y ~ B , ( z ) Therefore, . N B ( ~consists ) of open sets. Consequently, by Leinrna 2.2, &(z) defines a topological space. The fact that every metric space ( X ,d ) is Hausdorff (see also Def. 2.23) can be shown as follows. Let x , y E X and E = d(z,y)/2. Then the opcn balls 0 are disjoint, i.e. B E ( x )n B,(y) = 8 and z ~ B , ( z and ) yEB,(y). Since a metric space is Hausdorff, every sequence has at most one limit and every subsequence is convergent to the same limit. This has an impact on applications. Solutions to many practical problems can be expressed as iterated function systems in some metric space. These properties ensure that if such systems are convergent, they are convergent to a unique solution. In practice. however, an additional property of completeness,
Spaces
49
Def. 2.42, must be required, which takes care that the limit exists in the domain of interest.
Definition 2.39 (Dense subset in a metric space) Let ( X , d ) be a metric space. The subset Y of X is dense in X if it has a finite cover by &-ballsfor any positive E . Formally, V E x VzExSyty d(x,y) < E holds. Remark 2.5 X i s separable i f X contains a countable subset th,at i s dense
in X . Definition 2.40 (Distance between two sets) Assume a metric space ( X , d ) . Let A and B be two subsets of X . Then the distance between two sets is defined as & ( A , B) = infaEA,btB d(a, b ) . Note that & ( a , B) = inf/,,B d ( a , b ) . Theorem 2.7 A m e t r i c space ( X ,d ) is a nmrmal space and ,first-countable (with, a cowntable neighborhood base a t each poin,t).
Proof. We consider the topology defined by open hall neighborhoods, B,(n:) = {y E X : d ( z ,y) < E } , E > 0. Let A and B be two disjoint closed sets. Define U = { x E X : d s ( z , A ) < $ d s ( z , B ) and V = { ? / E X: & ( v , B ) < i d s ( y , A ) . Note that A c U and B c V . Moreover, U and V are disjoint open neighborhoods of U and V . Hence, by Def. 2.27, every metric space is normal. A countable neighborhood basis is formed as N B ( ~ = ){B;(z) : n = 1, 2 , . . .} and BL(x)= { y E X : d ( z , y) < k}. 0 Definition 2.41 (Convergence) Let ( X , d ) be a metric space. The z, = z if limTsioo d ( x n 33;) = 0. sequence z, converges to z E X , limTL+m Equivalently, 2 , converges to z if the open ball B E ( %=) { g EX: d(z,y) < E } contains a tail of 2 , . In a metric space, limit points of a set are always limits of convergent sequences of the elements of the set. As filters are needed for a convergence in a general topological space, in a metric spaces, sequences can be used.
Definition 2.42 (Cauchy sequence, complete space) Let (X,d) be a metric space. A sequence 2, E X is C a u c h y iff limn.m--tmd ( ~ znl,) , ~ , = 0, which is equivalent to stating that VEx, 3~ b ' n , 2 , m > ~d ( z n , 2 , )< E . A space (X,d ) is complete if every Cauchy sequence converges in X . Theorem 2.8 (On metric spaces) 1. I n a m e t r i c space every convergent sequence i s Cathy, but n,ot conversely. T h e latter is illustrated in E x a m p l e 2.6.
50
The dissimilarity representation for pattern recognition
2. In a m e t r i c space, t h e distance d i s continuous, i.e. t h e convergence of a n y t w o sequences x,, yn E X t o x and y, respectively, implies t h a t limn--iood ( z n , yn) = d ( z , y). 3. A closeds subset of a complete m e t r i c space i s complete u n d e r the i n duced metvic. 4. A complete subset of a m e t r i c space i s closed. 5. E v e r y m e t r i c space i s a subset of a complete space.
Example 2.6 Examples of complete and non-complete spaces: is Cauchy, but not 1. ((0.11, dz) is not complete. The sequence 5 , = convergerit in (0, 11, since limn-oo x, = 0. 2. (RTr'> d 2 ) is complete. 3. Let 12 be a closed and bounded set in Wm and C(R) be a set of continuous functions on 62. (C(R), d,) is complete. 4. Let R be a closed and bounded set in Rm. (C(fl), d p ) for 1 5 p < 00 is not cornplet,e,since some of the Cauchy sequences converge to discontinuous functions [Kurcyusz, 19821. If M(R) is a set of classes of functions measurable in the Lebesgue sense, then (M(R),d p ) is complete. 5 . Since every metric space is a subset, of a complete space [Sierpiriski. 1952; Munkres, 20001, it can be completed by adding the limits of all convergent sequences. For instance, ((0,1],dz) can be completed to ([0,I], dz).
Definition 2.43 (Bounded, totally bounded subset) Let ( X , d ) be a metric space. A subset Y of X is bounded if there exists M > 0 such that d(y1. yz) 5 M for all y l , yz E Y . Y of X is totally bounded if for all E > 0 there is a finite subset { ~ i } r =of~Y ( n depends on E ) for which Y C U ~ ~ " = , B , ( pholds. i) Note that if Y is a bounded subset of
W",then it is totally bounded.
Theorem 2.9 A m e t r i c spa,ce ( X ,d ) i s compact $ f i ti s com,plete a n d totally bomded. Theorem 2.10 (Heine-Borel) closed a n d bounded.
A subset Y of IWTL i s compact 2 8 it i s
An interesting example of a metric space of a practical importance in evolutionary biology and clustering problems is an ultrametric space; see also Sec. 3.2. The reason is their close relationship to trees (hierarchical organizations) [Hughes, 2004; Fiedler, 19981. s A closed subset of X is a closed subset in the natural topology of X.
Spaces
51
Definition 2.44 (Ultrametric space) An ultrametric space is a pair (X,d ) , where X is a set and d is an ultrametric distance function d : XXXi Rt.d is a metric satisfying the ultrameti-ic inequality, also called a strong trian,gle inequality: d ( z , z) I max { d ( z , y), d ( y , z ) } ,
zt X
2, y 3
Note that the inequalit,y above imposes the triangle inequality. Recall that a ball neighborhood in a metric space (X, d ) is defined ils BE(:c)= {y E X : d ( z , y ) < E } . A neighborhood basis NB(x) = { B E ( x:) E > 0) defines a natural topology on X ; Theorem 2.6. So, the ball neighborhoods, also called open balls, are open sets, while closed hails, B ; ( z ) = EX: d(x,y) I E } , are closed sets.
Theorem 2.11 (Properties of ultrametric spaces) The ,followin,g properties h,old in an ultrametric space ( X ,d ) [Hughes, 2004]. If two open, ball neighborhoods intersect in X , then o’ne contains the other. If’ t u o closed ball neighborhoods in,tersect in, X , then, on,e contains the other. Every point in an) open bull rLeighborh,ood i s i t s center, a.e.
Definition 2.45 (Generalized metric spaces) Let X be a set and p : X X X -R: be a dissimilarity function. If t,lie requirements of Def. 2.38 hold, then p is a distance function. If these requirements are weakened, spaces with less constraintsg are considered; see also Fig. 2.3: 1. hollow space - a space (X,d ) obeying the reflexivity condition. 2. premetric space - a hollow space ( X ,p) obeying the symmetry constraint. gTerminology is not unified; it varies between authors and contexts
T h e dissimilarity representation f o r p a t t e r n recognition
52
3. quasimetric space - a premetric space ( X , p ) obeying the definiteness constraint. 4. semimetric space - a prenietric space ( X , p ) satisfying the triangle inequality. 5. A hollow space ( X ,p ) satisfying the triangle inequality [Bonsangue et al., 19981.
Example 2.7 Examples of generalized metric spaces: 1. Let X = { N ( p ,D ) } be a space of one-dimensional normal distributions. The Mahalanobis distance, or the Fisher ratio [Duda et al., 20011, defined as dA/f(N(pl,Dl),H(pZ,0 2 ) ) = (o:+m,",f 'cL1-cLz' is premetric; see Fig. 2.6.
R" and k , 1 2 k 2 7 n is a fixed integer. Then the distance measuring the absolute difference along the k-th dimension, defined as dkPrank(x,y)= / z k- ykl, is semimetric. Let ( R , d ; p ) bc a measurable space, in which R is a set, A is a 0algebra of subsets on R and p is a measure. Then d,(A, B ) = p (AAB), AAB = (AU B ) \ ( Afl B ) , is semimetric [Willard, 19701. Let X be a sct of closed subsets of RmL.Similarly, as above, the mdimensional volume symmetric difference d,,l(A, B ) = vol (AAB)is semimetric. The definiteness condition is not fulfilled, since d,,~ (A, B ) = 0 for A and B being finite collections of points. In the pattern recognition area, this dissimilarity can be computed between two matched shapes as the area of non-overlapping parts; see also Fig. 1.2(b). Let. ( X , p ) be a semimetric space. Let the equivalence relation N be defined as z N y iff p ( z , y) = 0. If X" is the set of equivalence classes [x] in X under this relation, then p- defined on X" such that p"([z], [y]) = p(x,y) is a metric on X" [Willard, 19701. The space (R"', d p ) , where dp(x,y) = (Czl1~ ~ y i l " ) ; and p~ ( 0 , l ) is quasirnetric.
2. Let X
=
dkPrank
3.
4.
5.
6.
Proof. We will prove that the triangle inequality does not hold. Consider m = 2 and A = [O, 1IT,B = [O,0IT and C = [1,0IT. Then d p ( A , B )= d p ( B , C )= 1 and d p ( C ; A )= 2 ; . Finally, d,(A,B) d p (B, C ) = 2 < 2 = d p (C,A ) , since p < 1. Hence, the triangle inequality is violated. 0
+
In generalized metric spaces, the definition of convergence and of a Cauchy sequence are adopted from the metric case, Def. 2.41 and Def. 2.42.
Spaces
53
Figure 2.6 Mahalanobis distance between one-dimensional normal distributions is premetric. The reflexivity and symmetry conditions are satisfied, but the definiteness arid triangle inequality are not. Although A and B are different, d ( A , B ) = 0. Let u = U B = uc and lp~g- ~ c= / a. Then dn,r(B,C) = a / ( f i c ) and d n f ( C , A ) = ./(a2 + a : ) $ . Since U A >u,d n r ( C , A ) i d n f ( A ,B ) < d n f ( C ,B ) .
Definition 2.46 (Convergence) Let ( X ,p ) be a quasimetric space. An clement z E X is called a limit of an infinite sequence x,, limn+m x,, = 2 if limn-m p(z,, x) = 0. Definition 2.47 (Continuity of a dissimilarity) Let ( X ,p ) be a genis continueralized metric space. Dissimilarity function p : X X X + ous a t x and y if for any two sequences x,. y, E X , xn = z and limn+m y, = y implies that limn+m p(x,, y n ) = p ( z , y). Moreover. p is continuous in X if it is continuous for each pair from X . Note that all ‘nice’ properties of a dissimilarity measure, such as continuity. convergence of a sequence to one limit, Cauchy convergent seqiiencrs can be considered for the metric only. A generalized metric space may not fulfill these conditions.
Example 2.8 Let ( X ,p) be a quasimetric space. 1. X is not necessarily a Hausdorff space. A sequcnce may have more than one limit. Proof. Consider a quasimetric space ([0,1],p ) such that p ( z , y) = I y - d if z,yE [0, I ) , p(z. 1) = p(x,O) if n: E (0, I ) , p(1,O) = p(O,l) = 1 and p ( l . 1 ) = 0. Then the sequence T , = converges to both 0 and 1, since 0 both p ( : , 0) and ~($1) have the limit zero if n 00. 2. The dissimilarity p is not necessarily continuous. Proof. Consider a quasimetric space ([0, 11.p) such that p ( r , y ) = 2 if x , y E (0, l} and x # y, and p(x, y) = /z - yl, otherwise. Then p is we discontinuous for the pair (0, l ) , since for xn = and yT1 = 1 have limn-wx, = 0 and limTl+wyn = 1, but p(z,\y,) = 1, while p ( z , y ) = 2. --f
~
k,
54
T h e dissimilarity representation f o r p a t t e r n recognition
3 . An infinite sequence of elements from X might be convergent without being Cauchy. Proof. Consider a space ([0,l ] , p ) , such that p ( z , y ) = 1 if 5 = 1 y = and n # m,, and p ( z , y ) = Iz - yl, otherwise. Then for IT1 1 2 , = ,; limn + ooz, = 0. So, z, is convergent, but not Cauchy, since p(k, = 1. 0
i,
A)
Theorem 2.12 If ( X , p ) is a quasimetric space with a continuous dissirnilarity p, t h e n f o r all 2 E X and all E > 0 , &(z) = { y E X : p ( z , y) < E } is urL open set [Sierpi.riski, 19521. Proof. We will use Corollary 2.4 stating that a, closed set contains all its limit elements arid conversely. To prove that B,(z) is open, we will show that the complementary set Y = X\B,(z) = {y E X : p ( z , y) 3 E } is closed. Let z be a limit element of Y , which means that there exist elements z, E Y such that z, converges to z . From continuity of p, p(z,,z) + 0. Since z, E Y , then for any z E X , one has p ( z , z,) 2 E . From continuity of p, it follows that p ( z , z ) = lirrinioop(z3zn,) 2 E . This proves that z E Y . Consequently, Y is a closed set, as it contains its limit elements. Hence, BE(x) is open. 0
Theorem 2.13 (Pretopology in generalized metric spaces) ( X ,p ) be a space with a dissim,ilarity measuie p.
Let
1. Hollow space is pretopological. 2. Prenietric space is pretopological. 3. Quasimetric space with u, continuous dissimilarity p i s topological. 4. Semimetric space i s topological. 5. Hollow space with p satisfying the triangle inequality is topological.
Proof. To show that hollow, prernetric and quasimetric spaces are pretopological, one needs to prove that the &-ballsdefine a neighborhood basis. Such a proof directly follows the proof given in the metric case by Theorem 2.6. A continuous dissimilarity measure in a quasinietric space assures that thc &-halls are open sets by Theorem 2.12, hence the axioms of the topological space are fulfilled. The proof that a semimetric space is topological follows the same reasoning as in the metric case; see the proof of Theorem 2.6. The proof that a hollow space satisfying the triangle inequality is topological is given in [Bonsangue et al., 19981. 0
Spaces
55
Since generalized metric spaces are pretopological, continuous functions between such spaces can be defined adequately; see Dcf. 2.25 and also Corollary 2.2. Making use of neighborhood balls, we have:
Definition 2.48 (Continuity of a function) Let ( X , d ) and ( Y . p ) be generalized metric spaces. A function f : X + Y is continuous at 5 E X if vie>^ 36>0 y E Bs(x) f ( y ) E BE(f(x)), where neighborhood balls are defined as Bb(z)= ( 2 : d ( z , z ) < S } arid BE(f(x)) = ( f ( z ) : p ( f ( z ) , f(2:)) < E } , respectively. In the case of metric spaces, the E- and &balls are open sets. The function f is continuous if it is continuous at every Z E X .
*
Corollary 2.5 (On continuous functions) Let ( X . d ) and ( Y . p ) he metric spaces (or generalized metric spaces with continuous d issimilarity measures). The assertions below are equivalent: 1. ,f i s cont.in,uous at x. 2. For every neighborhood M of f ( z ) E Y , f - l ( M ) is a n,eighbor,hood of X E X , k t N ( f ( z ) )f-'(W M ( x ) . 2 , = z, th,eri limnioo f ( x n )= f(:c).
3. If limn-oo
Corollary 2.6 (Continuity of a composed mapping) Let ( X , d x ) > (Y,d y ) and (2, d ~ be) generalized m,etricspaces with continuous dissimilari t y measures a n d let f : X + Y , g : Y + Z and h,; X + Z be mappings. I f f a n d g are coritinuous, then the composed mappin,g h = go f , h,(x) = g ( . f ( z ) ) , i s continuous as well. Sketch of proof. The proof follows disrectly frosrri considering thp equirualence between the continuity arid the converge of a seqmnce based o n Corolla?-y 2.5. Direct product spaces can he used for the construction of a new spacc by combining two (or more) spaces. In the context of (finite) generalizcd metric spaces, if the measures refer. the same set of objects, a new dissimilarity measure can be created, e.g. by their summation.
Definition 2.49 (Product space) Let (X,dx) and ( Y , d y ) he generalized metric spaces. Then a product generalized niet'ric space X x Y with a dissimilarity d can be defined as ( X X Y d, x o d y ) , where 0 is thc sum or max operator. This means that (dxody)((x~,y~), ( 2 2 , ~ ~= ) )dx(:1:1:x2)+ dY (Yl,Y2) or ( d x o d y)((:El, Yl), ( 2 2 , Y2)) = max { d x ( Q .: c 2 ) ,dY (Yl>?In)> for ~ I , Z ~ and E Xy l , y 2 E Y . Extension of the concepts of neighborhoods, convergence arid continuity to a product space is straightforward. For instance, U is a neighborhood of
56
The dissimilarity representation f o r pattern recognition
the pair (x.y) if there exist a neighborhood N of x E X and a neighborhood M of y E Y such that N x M C: U . Also, the convergence of a sequence (xn.yYc)E X x Y is equivalent to the convergence of sequences 2 , E X and Yn E Y .
2.5
Vector spaces
Generalized topological spaces and generalized metric spaces defined on sets were described in the previous sections. The necessity, however: arises to consider sets on which meaningful binary operations are allowed. This leads to groups and fields. When the operations become the addition of elements and scalar multiplication, vector spaces can be defined. When, additionally, a topology or a metric is introduced to a vector space, its algebraic structure is enriched. The reader is referred to [Bialynicki-Birula, 1976; Dunford and Schwarz, 1958; Garrett, 2003; Greub, 1975; Kothe, 1969; Larig, 2004; Willard, 19701 for more details.
Definition 2.50 (Group) A group ( G , o ) is a nonempty set G with a binary operation 0 : GxG + G, satisfying the group axioms: (1) Associative law: V i n , b , c E ~ ( a o b ) o c= a o ( b o c ) . (2) Existence of a unique identity element: 3 i d E ~' d l a Eaoid ~ = idoa = a. ( 3 ) Existence of an inverse element: ' d a E ~3,- E~ n o aa- o a = id. If additionally the commutative law, a o b = boa, holds for all a, b E G , then the, group G is Abeliun.
+, +
Definition 2.51 (Field) A field (r, *) is a nonempty set I? together with the binary operations of addition and multiplication * satisfying the following conditions: (1) (I', +) is an Abeliari group with the 0 additive identity element. ( 2 ) (F\{O},*) is ari Abeliari group with the unit multiplicative identity clement. ( 3 ) Distributive laws: a*(b+c) = (n*b)+(n*c) and (a+b)*c = (n*c)+(b*c) hold for all a , b, c E I?.
Example 2.9 (Fields and groups) (1) Let Z be a set of integers. (Z, +) is a group, but (Z, *) is not. ( 2 ) Let R be a set of real numbers and C be a set of complex numbers. (R, +, *) and (C,+, *) are fields.
57
Spaces
Definition 2.52 (Vector space) A vector space (a linear space) X over the field r is a set of elements, called vectoi-s, with the following algebraic struc,ture:
+
(1) There is a function X x X 4 X , mapping (z,y) t o 2 y , such that ( X ,+) is an Abelian group with the zero additive identity. ( 2 ) There is a function r x X 4X , mapping ( X , z ) t o Xz, such that the following conditions are satisfied for all x, EX and all A, p € r : (a) Associative law: (A p ) 2 = X ( p z ) . (b) Distributive laws: X(z+u) = X x + X y , and (X+p)n: = Az+pz. (c) Existence of multiplicative identity element 1 E r: 1x = z. If the field r is not explicitly mentioned,
r is assumed to be either R or @.
Definition 2.53 (Linear combination, span and independence) Let X be a vector space. The vector x is a linear combination of vectors {x1,x:2:. . . ,xn} from X if there exist { a l , a 2 , .. . ,a,) E r such that z = C,"=,ajxJ. The span of ( 2 1 : ~ 2 ,. .. , x,} is a collection of all their linear E X is linearly independent combinations. A finite set of vectors if C,"=,ajx,? = 0 implies that) all aj = 0. Otherwise, the set is linearly dependent. An infinite set is linearly independent if every finite subset is liriearly independerit. Definition 2.54 (Basis and dimension of a vector space) Let X be a vector space. The set B of vectors b, E X forms a Hamel basis of' X if B is linearly independent and each vector x is in the span of V = { b 3 } for some finite subset V of B. The dimension of X , diniX, is the cardinality of B . Definition 2.55 (Subspace) A subspace V of a vector space X is a subset of X , closed for the operations of vector additions and scalar multiplication. Example 2.10 Examples of vector spaces: 1. Iw and C with usual operations of scalar addition and multiplication. 2 . Rrn and C", with the elements z = (zl, zz,.. . ,x,) and the elementwise addition and multiplication by a scalar, are m-dimensional vector spaces. 3. A set of nxm, matrices with the matrix addition and multiplication by a scalar. 4. The set 3 ( a )of all functions defined on a closed and bounded set 62, with the pointwise addition ( f g)(z) = f(x) g(z) and the scalar multiplication ( c f ) ( z ) = c f ( n : ) .
+
+
58
T h e dissimilarity representation f o r pattern recognition
5. The set Pn of all polynomials of the degree less than n is a vector space a,nd a,n n~-dimensiorialsubspace of F(52). 6. The set C(52) of continuous functions on R arid the set M(R) of classes of functions measurable in the Lebesgue sense" are infinite dimensional vect,or spaces and subspaces of F(i1). 7. = { f E M(12) : (<Jnl.f(x)I"dx); < m} for p 2 1 is an infinite dirriensional vector space and a subspace of F ( 0 ) .
C ,F
Definition 2.56 (Quotient vector space) Let X be a vector space over a field r and let Y be a subspace of X. Consider an equivalence relation o i l X such that :cl x2 if ( 2 1 x2) E Y . X / Y , X mod Y , defined by the relation is a quotient vector space. Let [:c] denote the equivalence class of 3;. The addition on the equivalent classes is defined as [XI] [ 2 2 ] = [xi 221 for all il:2 E X and the multiplication by a scalar is defined as a[.] 1 [ax] for all CY Er and Z E X .
-
N
~
N
+
+
~
If X is an n-dimensional space and Y is an m-dimensional space, then X / Y has the dimension n, rri. ~
Definition 2.57 (Linear map) Let X and Y be vector spaces over the field I'. A linear m,ap (linear operator) from one vector space to another is a function 4 : X 4 Y , also called homomorph,ism,, such that for all x 1 , 5 2 E X and all X t I?, the following conditions are fulfilled:
+
(1) Additivity: 4(x1 x2) = qh(il;X:l) ( 2 ) Honiogencity: q5(Xx) = @ ( m ) .
+ q5(x2).
Note that the above conditions are equivalent t o stating that f preserves linear combinations, i.e. $(CzlXixi) = Czl &4(zi) for all xi E X and all X i E r. If'Y = I',then 4 is called a linear functional. X arid Y can also be defined over different fields.
Remark 2.6 If X and Y are finite dimensional vector spaces with chosen hases, th,en any linear map can be represented by a m,atrix. For instanm, a h e a r transformation Rk + R" is represented by an k x m matrix A such that y = Ax for .?: ER'"and ' ~ E I W ' " ~ . Definition 2.58 (Kernel and image) Let f : X i Y be a linear trmsformation between two vectors spaces. The kernel or a null-space of f is a subspace of X consisting of vectors whose image is 0, i.e. ker(f) = "Two functions are in the same equivalence class if they agree almost everywhere, i.e. if they disagree on a set of a measure zero. From now on, M ( n )refers to such classes of functions measurable in the Lebesgue sense.
59
Spaces
{ x : X : f(x) = O}. The image of f is a subspace of Y consisting of images of vectors from X , i.e. i m ( f ) = { ~ E YI z:E X f ( x ) = y}.
+
Lemma 2.4 If X is finite dimensional, then dim(ker(f)) dim(irn(f)) = dim(X). If additionally Y is finite dimensional and the bases are chosen. then the linear map is represented by the matrix F . dim(im(F)) is the rank of F and dim(ker(F)) is the nullity of F . The notion of a dual vector space is important in niost applications. It is especially useful for inner product and nornied spaces; see also Sec. 2.6.
Definition 2.59 (Dual space) Let X be a vector space over the field I? (Ror C).The dual space, also called algebraic dual, X * = C ( X .I?) of X is a set of linear functions , f : X + I',also called lin,ear functionals. Remark 2.7 The collection X * of linear functionals on X over r is a vector space over r with the pointwise addition (.f g)(x) = f ( x ) g ( z ) and scalar niultiplicatio,n ( a . f ) ( x )= a f ( z ) for all f , g E X * , a ~ and r ZEX. The 0-vector in X* is the linear functional that maps every vector x E X to zero. The additive inverse ( - f ) is defin,ed by (-f)(x) = - f ( x ) . The associative law and distributive laws, De,f. 2.52, can easily be uerified b y struightforward computations.
+
+
As we will later deal with finite samples, our focus is on finitedimensional spaces. If X is finite-dimensional, then both X and X * havc the same dimension. Moreover, X is isomorphic'' to X * . The isomorphism depends on the basis B of X , which defines a dual basis B*of X* arid a bijection B + B*. So, given a basis of X , there exists a unique corresponding dual basis. Definition 2.60 (Dual basis) Let X be an n-dimensional vector space with a basis B = { b l , b 2 , . . . , bn}. A dual basis { f l , f 2 , . . . , f n } of X * with respect to B is a basis for X * with the property that 1, if i fj(b2) =
=j
,
0 , otherwise,
"An informal definition by Hofstadter [Hofstadter, 19791: T h e word 'isomorphism' applies when two complex structures can be mapped onto each other, in such a way that t o each part of one structure there is a corresponding part in the other structure, where 'corresponding' means that the two parts play similar roles in their respective structures. Formally, the isomorphism f is a bijective map (one-to-one and onto) such that both f and its inverse f p l are linear maps.
T h e dissimilarity representation f o r p a t t e r n recognition
60
{fz}r=l
The linear functionals of X* are formally defined as f t : X + r by fL(CT=l ~ . k b k )= 2,. Then fz are nonzero elements of X* and span X*. If X is infinite-dimensional, then the dimension of X* is strictly larger than that of X [Kothe, 19691. A simple illustration of this fact is a space X of infiriitc real sequences (21 zz, . . .) with a finite number of non-zero elements. The dual space X* consists of infinite real sequences (x?,x:, . . .) of a n y elements, hence its dimension niust be larger than this of X. ~
Definition 2.61 (Bilinear map) Let X. Y' and 2 be vector spaces over the field I?. A bilinear m a p (bilinear operator) is a function f :X X Y + 2 such that (1) For any fixed z E X the map y transformation from Y to 2. (2) For any fixed y E Y the map z transformatioil from X to 2.
4
f ( z ,y), f 3 ; ( g ) = f ( z ?y), is a linear
+
f ( z ,y), f,(x)
=
f ( z ,g ) , is a linear
If X = Y and f ( x , y) = f ( y , z) for all z, y E X , then f is s y m m e t r i c . If r = C and f(z, y) = f + ( y ,z) for all x,y E X: where t denotes complex c o n j u g a t i o ~ ithcn ~ ~ , f is H e r m i t i a n .
Definition 2.62 (Bilinear form) Let X be a vector space over the field I?. A bilinear f o r m , is a bilinear transformation f : X X X+ I?. Notc that any rcal n x m matrix A can be regarded as a matrix of a bilinear form X X Y + R such that X = R" and Y = R" and f ( x . y ) =
c:=,x;:1
AL1XZY.l.
Definition 2.63 (Non-degenerate bilinear form) Let f : X X X + r be a bilinear form over thc vector space X. f is non-degenerate when the following conditions hold: (1) If f ( ~ l , x z=) 0 for all (2) If f(xl,x2)= 0 for all
x1 E X , then .c2
E X . then
xz
= 0.
z 1 = 0.
Thc spaces X and X* are dual with respcct to a bilinear function X * x X + I?, called a scalar product or i n n e r product, and denoted as (., .), such that ( f ,x) = f ( x ) for :I:E X and f E X*. For instance, if X = R" = X* (R" is self-dual), then ( x * ,x ) = C711x:xi for x* E X " and x E X . This scalar product is linear in both arguments and its properties are analogous ''If
zE@ such that
t=
a+bz, then
z t = a-bi, i2 = -1.
Recall that IzI = z z t = a2+b2.
Spaces
61
to the properties of a well-known scalar product studied in analytical geonietry (that is, given two vectors x and y, their scalar product, is computed by multiplying their lengths and the cosinus of the angle betwecn thcrn). Note, however, a subtle difference. In general, the arguments of ( f ,:c) belong to diflerent spaces and they cannot be exchanged, which rrieaiis that this inner product is not symmetric. One, however, uscs the same notioii of inner product to strengthen the analogy to the traditional geoiiietric inner product. Formal definitions on inner product will follow in Sec. 2.6.
Definition 2.64 (Evaluation functional) Let X * = C(X, r)be a space of linear functionals. An evaluation ,functional 6, evaluates each function f € X * at a point Z E X as 6 , [ f ] = , f ( x ) . One can, therefore, write that 6,[f = ( . f , z ) for f e X * arid Z E X . Any isomorphism q5 :X + X* defines a unique non-degenerate bilinea function on a finite-dimensional vector space X by ( : x , g ) = 4 ( x ) ( g ) for x,!/EX such that for the fixed II:, q5(z): X 4I?. R.emind that X* consists of linear functionals X 4 r and Y * consists of linear functionals Y + r. Note that a bilinear map f : X X Y + r can be characterized by the left linear map f~ E C ( X , Y*). i.e. f~ : x f r arid the right linear map f~ E C(Y,X*),i.e. f ~g :+ f, such that f~(x)(y) = fz(y) = f ( z , y ) and f ~ ( y ) ( x = ) f,(x) = f ( z , y ) for all Z E X and ~ E Y . ---f
Theorem 2.14 (Dual map) 0 Consider a homomorphism l i, : X + I/ over th,e ,field r. T ~there L exists an associated (unique) dual m a p $* : Y *+ X* bekween the dual spaces Y * and X * such thmt $ * ( g ) ( : c ) = g(,dJ(x))for all g E Y * and X E X . 0 T h e dual m a p i s h e a r , hence ($I 4)* = $* 4* and (a$)*= ~ 4f o*r the h e a r maps $ and 4, and a ~ rAdditio,nally, . (4)o 4)* = 4* o $*.
+
+
Definition 2.65 (Second dual space) Let X be a vector space over the field I?. The second dual space X** of X is the dual of its dual spacc. This means that the elements of X** arc linear fimctionals f : X * 4I?. There exists a vector space homomorphism : X + X** defined by q(x)(f) = f ( x ) for all II: E X and f E X*. If X is finite-dimensional: then 17 is an isomorphism, called canonical isomorphism, arid dimX = diniX* = diniX**.
Definition 2.66 (Quadratic form) Let X be a vector space. A mapping q : X 4 R is a qu,adratic form if for all ~ 1 ~ ExX2 arid N E R,the following conditions hold:
62
The dissimilarity representataon for p a t t e r n recognition
(1) f ( z 1 , ~=) y ( z l (2) q(nz)= fY2y(z).
+2 2 )
-
y(x1)
-
y(z2) is bilinear in z1 and
22.
Note that f is a symmetric bilinear form.
Definition 2.67 (Continuous dual space) Continuous dual C,(X, r') of a topological vector space X is a subspace of the dual space X* = C(X,I?) consisting of all continuous linear functiona1sl3. Definition 2.68 (Topological vector space) A vector space X over the field r is a topological vector space if there exists a neighborhood system N such that ( X , N )is a topological space and the vector space operations of addition ( z , y ) + z + y of X X X + X and multiplication by a scalar (A, z) 4Xz of FxX + X are continuous in the topology. Note that in a topological vector space, the topology is determined by the neighborhoods of 0. The neighborhood base NB(0) is defined by open sets of 0 such that every neighborhood of 0 contains a base neighborhood from NB(O).All the neighborhoods of other points are unions of translatcd base neighborhoods: U,,o(za Bp, z, E X and Bp ~ N g ( 0 ) .
+
Definition 2.69 (Convex set) Let X be a set in a real vector space. X is co'ri'uex if QZ (1 - a ) y E X for all z, y E X and all a: E [0,1].
+
Definition 2.70 (Locally convex topological vector space) A topological vector space X is locally convex if every point has a local base consistirig of convex sets. X is locally com,pact if every point has a local base consisting of compact neighborhoods. These definitions can also be simplified tjo consider only a local base of 0. 2.6
Normed and inner product spaces
Metric spaces are already richer in stzructurethan topological spaces, still more structure can be introduced; see Fig. 2.1 and Fig. 2.2. Normed and irnier product spaces are special cases of metric vector spaces, where metric is dcfined either by a norm or an inner product. The algebraic and geometric structures of such spaces are richer than those of metric spaces only. Inner product spaces are important, since there exists a welldeveloped mathematical theory which places the pattern description and I3For any finite-dimensional normed vector space (to be defined in Sec. 2.6) or any t,opological vector spacc, such as a Euclidean space, the continuous dual and the algebraic dual coincide. L , - ( X ) is then a normed vector space, where the norm I l f I I of a continuous linear functional f on X is defined as IIfII = sup{lf(z)i:llxll 5 1).
Spaces
63
learning in their context. Details can be found in [Dunford and Schwarz, 1958; Garrett, 2003; Greub, 1975; Kreyszig, 1978; Kothe, 1969; Pryce, 1973; Sadovnichij, 19911. Definition 2.71 (Normed space) Let X be a vector space over the field r. A norm on X is a function I I . I I : X + eS: satisfying for all 2 , y E X and all a E F the following conditions: (1) Nonnegative definiteness: 11x1I 2 0. (2) Non-degeneration ((x(( = 0 iff J: is a zero vector. ( 3 ) Homogeneity: ( \ a x l (= ( a /( ( X I ( . (4) Triangle inequality: / ( z+ yl( _< (1z/( ( ( y ( ( .
+
A vector space with a norm ( X , I / . 11) is called a normed space. If only the axioms (1); (3) and (4) are satisfied, then ( 1 . ( 1 becomes seminorm and ( X , 1 I . 1 I) a seminormed space. Example 2.11 Examples of seminornied spaces: 1. (F([-l,1 1 , ) )(1). with ( I f 1 1 = ! f ( O ) ( is a seminormed space. p l . 2. (R", ~ ~ ~withp ~ ~ 2p 1,)where , llxilp = (C:Ll lxil ). is anormed space. 3. (Kim3 /I . llm), where l\xllm= maxi=1,..., lxil, is a normed space. 4. Let C(Q) be a set of continuous functions on a closed and bounded set, R c R". (C(Q), 11 . l i p ) , where l i f l i P = (J: l f ( x ) l p d z ) iand p 2 1, is a normed space.
Theorem 2.15 (Seminormed spaces are topological) A seminormed space is a topological vector space, where the open ball neighborhoods are defined asB,(z) = { ? / E X : ! ( x - y y J l < ~ }E,> O . Remark 2.8 I n a seminormed space with the topology induced b p a semin,orm, all neighborhood systems can be constructed by the translation of the neighborhood system for 0 , i.e. N ( x ) = N(0)+ x. This is also tr.ue wh,en the topology Is defined by a translution invuariant (serni)metric. Remark 2.9 A n,ormed space is a locally convex topological vector space, since a norm is a convex function, Example A . l . Therefore, the open ball neighborhoods BE(x) are convex sets. Lemma 2.5 (On seminormed spaces)
1. The (serna)norm is a continuous function, i.e.
The dissamalarity representation for pattern recognition
64
Figure 2.7
Example of a metric open ball in
R2,B l ( 0 )=
:
Jm+Jm
3. Not every metric space is a n,ormed space.
4.
Sketch of proof. Let X = R and d(x,y ) = I ( x # y). Suppose that d ( z , y) = I/x - yll i s true f o r some norm 1 1 . 11. T h e n f o r all a EIW and Z E R , llazll = IallIzII should hold. L e t z = 2-y, then llzll = 1. Consider cy = 2 . T h e n we have 2 = IcvII/zII= I/czI/ = 1, hence a contradiction. Conseguen,tly, thxre is n,o n,orm th,at generates this metric. If a metric distance d in a uector space X is translation invariant, i.e. d(n: z , y z ) = d ( z , y ) and d ( a x , a y ) = \ a \ d ( z , g )holds for all x , y , z c X and (WE&%,then ~~x~~ = d ( z ; O ) defines a norm.
+
+
Remark 2.10 W h e n metric distances are discussed in, vector spaces, th,eg are I L S I L ~ ~ ~ defined IJ by a 'rLorm. This m a y lead t o a false in,tuition that a n open ball BE(x) = {y E X d(x,y ) < E } defined by a metric d i s a convex set. Only if d(x,0 ) defines a norm, then, the metric space ( X ,d ) i s locally coniiex. Otherwise, it i s not true. An example is a vector space R2 with the metric d(x:y ) = An, open ball B l ( 0 ) is sho3wn in Fig. 2.6. It can ea,sily be checked that d is a metric (the triangle inequality holds thanks to the inequality Jusb L & & f o r all a , b 0 ) . Let E > 0. To see that any ball i s not convex, by Def. 2.69, it is suficient t o show that there exist y , z E B E ( x ) such that ( a y (1 - a).) $2 B E ( x ) for some a t [0,1]. Let x = ( ~ 1 . 2 2 )E R2.Define y = ( 5 1 & E ~ , x and ~ ) z = ( 2 1 , ~ &E~). T h e n d(x,y) = $ E < E and d ( x , z ) = $ E < E . Hence, y , z E B,(x). Bwt d(x;7jy 1 TZ) 1 = T3 E4 > & . So, ( a y (1 - Q)Z)$BE(X)f o r Q = +. Any metric space (R",dg),where dg(x,y) = C:, Izi - y#, p < 1, is not localky conuex. dg i s metric by Corollary 3.2.
Jm+
Jm.
+
>
+
+
+
+
+
Spaces
65
Definition 2.72 (Bounded operator) Let (X, 11 . 1 1 ~ ) and (Y.I ( . / l y ) be normed vector spaces. A linear operator A : X 4Y is bounded if there x all Z E X . A linear fiinctiorial exists aER+ such that llAzlly 5 a / ( x ( J for f : X + r is hounded if there exists a € R + such that If(s)l 5 (1 ) ~ I c for ))x all Z E X .
Iy)
Corollary 2.7 Let (X, I 1 . I 1 5 ) and (Y,1 1 . 1 be normed vector spaces. A linear m a p T : X + Y is bounded ifl it i s continuous. Definition 2.73 (Operator norm) Let (X, 11 . 115) arid (Y,/ I . I ?,) be normed vector spaces and A : X + Y a continuous linear map. Then a uniform norm of A is llAlj = sup^^^.^,r - llT~11~. Definition 2.74 (Continuous dual space of a normed space) Let (X, ( 1 11) be a normed vector space over the field I?. The continuoils dual space X' = C , ( X , r) consists of all continuous linear fiinctionals .f : X + r. X' is it,self a normed vector space with the uniform norm defined as l l f l l = suPII,((
r).
Definition 2.75 (Banach space) A normed s p x e for which the associated metric induced by the norm is complete, i.e. cvery Cauchy sequence converges in this space. is called a Ban,ach space. Example 2.12 Examples of Bariach spaces: 1. (R",I ( . 112) is a Banach space. 2. Let t?, p 2 I, be a vector space of real sequences II: =
(XI, 2 2 . .
(czl
. .)
such that Czl (zi(P
Example 2.13 (On Banach and continuous dual spaces) 1. Let (X, / I . 11) be a normed space. The continuous dual X ' = Cc(X,r) I ~ ( I c ) ~ for EX' and z E X is a Banach with the norm I l f l l = space.
T h e dissimilarity representatzon for pattern recognition
66
(Rn,. 1 I ,112) is a Banach space, sirice I I . I l2 induces the Euclidean distance. Its continuous dual space is also (Rm,/ I . 112). Hence the Euclidean space is self-dual. 3. Let ”, p 2 1, be a vector space of real sequences z = ( 2 1 . 2 2 , . . .) such t,hat Erll z i / P < c c with the norm given by lIzllP = (C,“=, Izilp)k. This norm induces the Minkowski metric d,. Consequently, and t:’,= (R”,d p ) are Banach spaces. The continuous dual of p > 1 is a space t? for q > 1 such that 1 + 1 = 1. Hence, ty is self-dual. P q 4. Let tz be a vector space of real sequences z = (zI,z2,. . .) with the norin given by ((z((, = supi lzi!.This norm induces the metric d,. Therefore. tz is a Banach space. Consequently, if finite sequences z = ( 2 1 , ~. ~ . . ., z m )are considered in R’”, with the norin 1 1 .,,I/ the space tz = (R””, d,) is Banach, as well. The continuous dual of lr is e;M. 2.
lr
lr,
Definition 2.76 (Inner product space) Let X be a vector space over @. An znner product (., .) is a bilinear function X X X + C satisfying the following axioms for all z, y, z E X and all a , PEC: (1) (2) (3) (4)
Nonnegative definiteness: (2, z) 2 0. Non-degeneration: (z, z) = 0 iff z is a zero vector. Hermitian symmetry: (z. y) = (y, z)+. Linearity in X and sesquilinearity over C: ( a x + P y , z ) = a t ( z ~ z ) f p(y,z) t and (z,ay+Pz) = a ( z . y ) + P ( z . z ) .
If X is a real vector space, then (., .) : X X X + R is a symmetric bilinear form. A vector space with an inner product ( X , (.)) is an inner product vector space. Lemma 2.6 (On inner products) 1. T h e i n n e r product in a n i n n e r product space i s a continuous function. 2. Eiiery irrner product space is a normed space with t h e norm defined as I l 5 j l = (z,z)k
3. E v e r y iririer product defines the associated m e t r i c d ( z ,y)
=
1 Ix - yl 1.
4 . Parallelogram law. ( ( z + U ( l 2 + j ( z - y [ ( ’= 2 ( ( 2 ( ( ’ + 2 ( ( y j ( ’holdsforthe ~ O T T T (I j
~ (= ( (z>~)i.
5. Polarization identity.
T h e real i n n e r product (., .) can be determined ‘from t h e corresponding n o r m as ( x , y ) = 2(11z+y1121 11~11’-11y11’). T h e complex i n n e r p r o d u c t
Spaces
67
can be determ*i.nedf r o m t h e cor,r-espon,ding norm as (x.y) 1 1 . ~- y/)I2- illz + i y ) I 2- ilia - i y l l ’ ) , where i 2 = -1.
=
( / ( l e t -IJ(
l2
+
Theorem 2.16 (Cauchy-Bunyakovski-Schwars inequality) Let (X, (., .)) be an inner product vector s p c e . The following inequality I ( 2 ,y) I 5 (x,z) 4 (y, y ) i holds for all 2 ,y E X . The equality holds if y = f o r some
~EC.
Definition 2.77 (Hilbert space, pre-Hilbert space) An inner product space for which the induced norm gives a complete metric space is a Hilbert space. A non-complete inner product space is a pr-e-Hdbert space. Example 2.14 (On Hilbert spaces) m
q y , is a Hilbert space. 1. (R7n,(., .)) with ( 2 ,y) = 2. ly is a Hilbert space with an inner product defined as (.~;,y) = C,“=, zi yi. The metric becomes d ( z . y ) = l ~ z ; - y ~=~ (xi- ? y j 2 ) 2 ) + . 3. The space L p defined on a set M(f2) of Lebesgue rneasurablc classes b of functions with (f.g) = ( J a f(x)y(x) p ( d x ) ) + is a Hilbert spacc. Note that L;, defined on a set of continuous functions, since riot complete, is only a pre-Hilbert space. 4. The space l; (and le,”) with p # 2 is riot an inner product space, hence not a Hilbert space.
(cEl
Sketch of proof. The proof i s based on the contradiction of the parallelogram law for z = (1,1,0, 0, . . .) and y = (-1, 1.0, 0, . . .). Then IIzllP= Ilyllp = 2; and 112 +yllp = 112 - yllp = 2 . So, 2 llzll;+ 2 1 1 ~ 1 ; = 4l+5 and /Iz + yilg + /Iz - yyl(p2 = 8, so the equality is satisfied only if p = 2. 0 5. The space L& on R = [u,b]with the norm l i f l l m = max,E[n,b][ f ( r )is l not an inner product space, hence not a Hilbert space.
Sketch of proof. The proof is ba,sed on the contradiction of the parallelogram law for the functions f(x) = a and g ( z ) = 2--a defined on [a,,b ] . Then = a , ~ ~ =g b -~a , ~Ilf+gllm m = b, and I l f - g i l m = a . Then 2 ilfll2,+2((gil& = 2 ~ ’ + 2 ( b - a ) ~and I l f + g l ( & + l ( f - g ( ( & = u 2 + b 2 , 0 so the equality is not satisfied for any a < b.
ilfiloo
Bounded linear functionals are defined in analogy to Def. 2.72. Contimiity of a fLinct,ionalis equivalent to its boundness, as stated in Corollary 2.7.
Definition 2.78 (Orthogonality, orthogonal complement) ( X , (., .)) be an inner product space.
Let
68
T h e dissimilarity representation for pattern recognition
(1) Vectors 1c and y are orthogonal in X , zly if ( x l y ) = 0. Hence, a zero vector is orthogonal to every vector. (2) A subspace V of X is orthogonal if all vectors of V are orthogonal. ( 3 ) Let X be a subspace of X . The set XI = {y E X : k f z E x ( y , ~ = ) 0} is the orthogonal complement of X .
Definition 2.79 (Orthonormal basis) Let 'Ft be a Hilbert space. The set {e,} of elements in 7-l is an orthonormal basis if ( e z , e j ) = bij for all . . z , . ~ and is the Kroiiecker delta, bij = Z(i = j ) , arid every II: € 3-1 can 00 be uniquely written as x = a i e i , which is equivalent t o stating that 00 x = limN-,m aieLfor some a , € @ and . IaiI2.
EL,
be a n orthonorTheorem 2.17 (Orthogonal expansions) Let {ei}zl ma1 basis in a Hilbert space IFI. The following dependencies hold ,for all x,y€R:
C,"=, 1 (x,e i ) j i lIx/12. This inequality holds also .for a pre-Hzlbert space. 2. Purseval formflula: /1x/I2= C,"=, I(x,ei)l. 3. Planch,erel inequality: (2, y) = ( x , e i ) ( e i , y). I . Bessel in,equality:
c,"=,
Definition 2.80 (Adjoint of a continuous linear map) Let ( X , (., .)x)arid (Y,(.; . ) y ) be pre-Hilbert spaces. Let AEC,(X, Y )be a continuous linear map. An adjoint A* € L,(Y*, X * ) , if exists, is a continuous linear map such that (Ax, y ) y = (2, A * y ) x . Theorem 2.18 If ( X , (., .)J is a Hilbert space and (Y,(., . ) u ) is a preHilbert space, then, A E C,(X. Y ) has a unique adjoint A*. Definition 2.81 (Self-adjoint, unitary operator) Let (El(., .)) be a Hilbert, space. A E C,(IFI,IFI) is self-adjoint or Hermitian if A* = A, ( A x ,y) = ( x , Ay) for all 2 , ~ E I F I . A is unitary if AA* = A*A = 1. Theorem 2.19 (Projection theorem) Let V be a closed subspace o,f x E 'Ft, there exist unique x?,E V and x l E V' such t h d z = 2 , : + 21, Define x, = P x, where P is the orihogonal projection, of x on,to V . P has the followin,g properties:
IFI. T h e n f o r every
1. P 2 = P (idempotent). 2. ( P x ,y ) = (x.P y ) (self-adjoint).
3. ( P z ,( I - P ) x) = 0. :I; = Px + ( I - P ) x and PI( I - P).
4.
Only the ,first two conditions are required for P to be a projection.
Spaces
2.6.1
89
Reproducing kernel Halbert spaces
Reproducing kernels are used in a variety of applications like function estimation, function approximation or model building. They uniqiicly define so-called reproducing kernel Hilbert spaces (RKHS). which are spaces of bounded linear functionals, see Def. 2.72 and Def. 2.74. Reproducing kernels are used in statistical learning theory [Vapnik, 19981 for the coristruction of support vector machines; see also Chapter 4. Here. we will provide basic definitions and facts. More details can be found in [Berg et al., 1984; Dunford and Schwarz, 1958; Schaback, 1999, 2000; Schaback and Wendland, 2001; Wahha, 19991.
Definition 2.82 (Positive definite function or kernel) [Berg ef al., 1984; Wahba, 19991 Let X be a set. A Hermitian function K : X x X + C is positive definite (pd) iff for all n c N , { x z , C X and {c2.}= :, C C,one has Cc,=,c,cb K ( J , ,x J ) > 0, where t denotes complex conjugation. Such a function is called a k n e l l 4 . Additionally, K is conditionally positive definite (cpd) iff thc above condition is satisfied only for { c ~ } , "such , ~ that n c3 = 0. Depending on the sign of c,ci K ( z , , z,), also (conditionally) negative, nonnegative and nonpositive functions can be defined.
)rxI
c:,=,
Note that if X is an n-element finite set, such as X = ( p 1 ,pz,. . . ,p n ) , then K is pd iff the nxn, matrix K ( X , X ) is pd. Moreover, if K is pd, then K ( p , , p , ) 2 0 for all p , E X .
Theorem 2.20 (Riesz representation theorem) [Rudin, 1986; Debnath and Makusinski, 19901 Let X be a pre-Hilbert space over the field I?. For every continuous linear functional $(z) : X + r (for a fixed z), there exists a uniqwe y in the com,plet%onX- of X s*uch that $(z)(g) = ( 2 ,y) f o r all z E X . be a Hilbert space over the field I?. For every continuou,s linear Let fun,ctional + ( x ) : IFI 4 r (for a fixed x), there exists a unique y E 'R such that $(z)(y) = (z.g) for all X E X . Definition 2.83 (Reproducing kernel Hilbert space) Let X be a set and CX denote a space of functions f : X 4 C. Let 'RK c C X be a Hilbert space of bounded (hence continuous) linear functionals. A 14Kernel K originates from the study of integral operators, where (LKf ) ( z ) = K ( z , y ) f ( y ) d g . K is called a kernel of the operator LK.
70
The dissimilarity representation for p a t t e r n recognition
Hermit,ian function K : X X X i
C is a reproducing kernel for ‘HK if
(1) K ( ~ , . ) E N for K all Z E X and ( 2 ) K ( x , . ) is the representer of evaluation at z in ‘ H K , that is f ( z ) = (f,K(z,.))xK for all f E N K and all (fixed) Z G X .
NK equipped with K is called the reproducing kernel Hilbert space (RKHS). Example 2.15 1. Every finite-dimensional Hilbert space is a RKHS for some kernel K . 2. The space Lp defined on a set M ( R ) of Lebesgiie measurable classes of functions with ( f ,9 ) = f ( z ) g(z) d z ) is a Hilbert space, but not a RKHS. The reason is that the elements of L F are defined over equivalence classes of functions they agree almost everywhere and not the individual functions, herice the evaluation is not defined. Although the Dirac delta functionl5 6(x) is the representer of evaluation in as f ( z ) = S f ( t ) S ( z - t)& but 6 @ L p . The reason is that S should be in the equivalence class of functions h(z),which take zero for all :c # 0 arid some non-zero value for x = 0. However, S f ( z ) d x = 0, but f S(x)dx = 1, hence contradiction.
Ly
The reproducing kernel map is realized by a linear map li/: z 4 K ( z ,.) such that $(y) = K ( z ,y). Since K ( y , .) is the representer of evaluation a t y, then $(y) = (+, K ( y ,. ) ) x K = ( K ( z ,.), K(y, . ) ) x KAs . a result, one gets K ( z ,y ) = ( K ( z ,.): K ( y , . ) ) x K This . means that a pd kernel K can be seen as a Gram opcrator in ‘ H K , i.e. there exists a function li/ in a Hilbert space 3 - t ~such that the evaluation of the kernel at x and y is equivalent to taking the inner product between $(z) and +(y). If X is a set of a finite cardinality, say n, then the functions are cvaluated only at a finite number of points. Consequently, the RKHS becomes an n,-dimensional space, where the linear functions become n-dimensional vectors. As a result, the reproducing kernel K simplifies to an n x n Hermitian (or symmetric) pd matrix. Corollary 2.8 Let XK = .C,(X, I?) be a Hilbert space of bounded functionals defined over the donialn X . If the evaluation functional S,, 6,[ f ] = f ( x ) is defined and continuous for every x E X and f E ‘ H K , then NK is a RKHS. Hence, there exists K ( z ,.) ~ 7 - such l ~ that S,[f] = f ( z ) = ( K ( z ,.), f ( . ) ) x K . ‘“The Dirac delta function 6 is defined as 6(z) = 0 for z # 0 and J’ 6 ( z ) d z = 1. For any continuous function f one has the following property .I’6(z - t ) f ( t ) d t = f ( z ) .
Spaces
71
Theorem 2.21 (Mercer theorem) Let ' H K be a Hilbert space of functions f : X 4 C and let K : X X X + G be a Hermitian kerne1l6. If ( K ( x ,.), K ( x ,.))xK 5 03, then K can be expanded b y a countable sequence of orthonormal eigenfunctions $i and real positive eigenvalues X i such that the bilinear series K ( z ,y) = &$i(x)& (y)t converges uniformly and ab~olutely'~.
c,e"_,
The theorem above means that the eigenfunctions and eigenvalues are found as a solution to the eigen-equation ( K ( z .),+i(.))x, , = Xi&(z) or, in the integral form, JX K ( x ,y)$%(y)dy = Xi&(z), if K corresponds to an inner product defined by the integral. In practice this requires that X is a compact subset of R" or an index set. As the eigenfunctions {&}Elare linearly independent functions (an orthonormal basis of XK);then any function f in the space 'HK can be written as f ( z ) = C,"=, a i & ( x ) . The inner product between f and g in the Hilbert space 'HK is defined as (f(x),g(x)).~~~ = * ( a , i h j ) , where g(z) = C,"=1bZ$2(z). Such a space of functions with the kernel K is indeed a RKHS, since ( f ,K ( x , = ( . f ( y ) ,K ( z ,Y ) ) R ~= ( f k )K(Y, , z ) ' ) x K= C,"=, ((&&(x))t)t = C,"=, ~ g ) ~ (=z )f ( z ) , because K is Hermitian,
cEl
2
.)hK
i.e. K(z, y) = K(y,z)?. Note that = ( f ( x )f,( z ) ) & , = C"z = 1 (u?(2 A, and IIKil&, = ( K ( z ,.), K ( z ,.))&, = Xi. There is an equivalence between choosing a specific ' H K , reproducing kernel K and defining the set of X i and $ i .
, : c
Theorem 2.22 (Moore-Aronszajn theorem) [Wahba, 19991 For every p d kernel K on X x X (X is a compact set), there exists a unique RKHS ? f ~over X for which K is the reproducing kernel and vice versa.
2.7
Indefinite inner product spaces
Indefinite inner product is a generalization of a (positive definite) inner product (., .), Def. 2.76, by requiring that only the (Hermitian) symmetry and (sesqui)linearity conditions hold. The facts presented here are based on the books [Alpay et al., 1997; BognBr, 1974; Iohvidov et al., 19821 and 161n the integral form the positive-definiteness means that (Kf, f ) ~ , = Jxxx wx>Y)f(x)f(Y)+dxdY2 0. I7Let {u,} be a set of functions X + C. A series C & u,(z), converges uniformly to u ( z ) iff for every .E > 0, there exists a natural number N , such that for all z t X arid all n 2 N , Iun(z)- u(x)l < E . For a fixed z,a series C ,u z ( z )converges absolutely if the series C 1 u,(z) I converges.
72
T h e dissimilarity representation for p a t t e r n recognition
the following articles [Constantinescu and Gheondea, 2001; Dritschel and Rovnyak, 1996; Rovnyak, 1999; Goldfarb, 1984, 19851.
Definition 2.84 (Indefinite inner product space) Let V be a vector space over C. An indefinite i n n e r product (., .)v is a map V X V ---f CC such that for all II:, y , z EV and a , P E C , one has: (1) Hermitian symmetry: (z,y ) = ~ (y, x)t. (2) Linearity in X and sesquilinearity over C:( a2
P+(Y,.)V
and
( 5 ,a
y
+PZ)V =
(II:, Y)v
+ /3g, z
) =~at
(x,Z ) V
+ P (z, " ) V .
If V is a real vector space, then (., .)v : V x V form; see Def. 2.61.
4
R
+
a symmetric bilinear
Since (x,I C ) V can have any sign, there is a distinction among positive, negative and neutral vectors and the corresponding subspaces. For the material presented below, V is assumed to be an indefinite inner product space equipped with the inner product (., .)v. We will write (., .) only if the traditional positive definite inner product Def. 2.76 is meant.
Definition 2.85 (Positive, negative and neutral vectors) A vector 5 E V is positive if (IC, z)V > 0 , negative if ( 2 ,z ) <~0 or neutral if (IC, x)v = 0. A subspace X c V is called positive, negative or neutral if all its elements are so, respectively. Every indefinite inner product space contains at least one non-zero neutral vector [Bognk, 19741
Definition 2.86 (Orthogonality, orthogonal complement) ( V , (., .)) be an indefinite inner product space.
Let
(1) Vectors II: and y are orthogonal in V if (z, y )= ~ 0. (2) A subspace X of V is orthogonal if all vectors of X are orthogonal. (3) Lct X be a subspace of V . The set ' X = { y E V : VZEx( y , z ) v = O} is the orthogonal complement of X .
Definition 2.87 (Isotropic subspace, degenerate subspace) ( V ,(.; .)) be an indefinite inner product space.
Let
(1) A vector %r E V is isotropic if it is a non-zero vector orthogonal to every vector in V . ( 2 ) Let, 0 be the zero vector. Let X C V . The isotropic subspace Xo of X consists of isotropic vectors, i.e. Xo = X n XI. If Xo # 0, then X is degenerate and (., .)v is degenerate on X. The entire space V is degenerat,e if V' # 0.
Spaces
73
Example 2.16 Inner product spaces: Let V be a vector space of pairs of real numbers. Let (x,y ) = ~ 5 1 y1 2 2 y 2 for 2 = (zl,z2) and y = ( y l , y z ) . Then ( V ,(.,.),,,) is indefinite. Note also that if X = { (21,za) E V : z1 + x2 = 0}, then XI = X. Hence X is a degenerate subspace of V . Let V be a vector space of number sequences (ul, v 2 , .. .) satisfying 1 ~ ilvi12 l < 00. Then (x,y ) =~ E~ xi y j defines an inner product. Depending on the signs of ~ i ( x, , y ) V may be positive, negative or indefinite. If ~i are of different signs, then (z, y ) is~ indefinite. Moreover, if there exists at least one zero E ~ then , (x,y ) is~ degenerate. Let L( [a,b ] ) be a vector space of real valued functions that are measurable and square-summable with respect to some function p . Depending b on the function p , ( f , g ) = f ( x ) g ( z ) d p ( x ) defines an indefinite or definite inner product; see also [Halnios, 19743. ~
Cpl
Cpl
s,
Definition 2.88 (Fundamental decomposition) Let (V, (., .)v)be an indefinite inner product space. If V is represented as a direct orthogonal decomposition'* V = V+ @ V- @ Vo such that V+, V - and Vo are positive, negative and neutral subspaces, respectively, then such a decomposition is called a fundamental decomposition and V is decomposable. Not every space V admits a fundamental decomposition, but every finitedimensional inner product space does [Bognk, 19741. Spaces which yield a fundamental decomposition are called Krein spaces and are of our interest. Pseudo-Euclidean spaces are the simplest examples of these. See also Fig. 2.2.
Definition 2.89 (Pseudo-Euclidean space) A pseudo-Euclidean space & = IW(P>q)is a real vector space equipped with a non-degenerate, indefinite inner product (., . ) E [Greub, 19751. & admits a direct orthogonal decornposition & = E+ @ E - , where &+ = IWP and E- = IW4 and the inner product is positive definite on €+ and negative definite on E-. The space & is, therefore, characterized by the signature ( p ,q ) [Goldfarb, 19841. '*A direct sum V = X @ Y @ Z means that every v t V can be uniquely decomposed into z E X , y E Y and t E 2 such that v = z y z and X n Y = {0}, Y n 2 = ( 0 ) and X n 2 = (0). An orthogonal sum of X , Y and 2 is their direct sum such that they are pairwisc orthogonal. Here, a direct orthogonal decomposition V = V+ V - @ Vo means that V - = V$ and Vo = (V+ n V:)', i.e. Vo = V+ n V $ consists of neutral vectors orthogonal t o all other vectors in V .
+ +
The dissimilarity representation for pattern recognition
74
Definition 2.90 (Orthonormal basis) Let € = Iw(P,Q)be a pseudoEuclidean space. An orthonormal basis {el,e2, . . . , eP+q}in € is defined as
i
1, f o r i = j = 1 , 2 , . . . ,p , ( e i , e , ) = -1, f o r i = j = p + l , . . . l p + q , 0; for i # .I.
The inner product between two vectors x and y in R(P.Q)can be expressed by the standard inner product (., .) in a Euclidean space.
Lemma 2.7 (Pseudo-Euclidean inner product via the standard inner product) L e t € = R(Piq) be a pseudo-Euclidean space. T h e n (., .)& can be expressed by t h e traditional (., .) in n Euclidean space RP+q as P t9
where
and I,,,
and I,,, are t h e adeiitzty matrrces.
If x+ and x- stand for the orthogonal projections of x onto RP arid Rq, respectively, then (xl y ) = ~ ( x t ly + ) - (x-,y-).The indefinite Liior~n' of a non-zero vector x becomes llxll; = ( X , X ) E = xTJpqx,which can have any sign. Based on the inner product, the pseudo-Euclidean distance is defined analogous to tlie Euclidean case. Definition 2.91 (Pseudo-Euclidean square distance) Let € be a pseudo-Euclidean space. Then
d2X.Y)
=
llx - YII; = (x - Y1 x - Y) & = (x - Y)
T
Jpq
(x - Y ) ,
= Iw(P,Q)
(2.2)
is a yseiido-Euclidean square distance. It can be positive. negative or zero.
am,
The distance d is either real or in the form of where i2 = -1. Note that the square distance between distinct vectors x and y may equal zero. Note that an orthonormal basis of the pseudo-Euclidean space is chosen as it is convenient for the representation. The rcason is that Jpq has a siniplr form arid it is both symmetric and orthogonal in tlie Euclidean space RP+q and in the pseudo-Euclidean space PQ(p,q) (this will be explained
Spaces
75
Figure 2.8 Left: a pseudo-Euclidean space & = R ( l > l ) = R1 x iR1 with d2(x,y) = (~-y)~J11(x-y).Orthogonal vectors are mirrored versus the lines 5 2 = 2 1 or xz = -1, for instance ( O A , O C ) E= 0. Vector v defines the plane 0 = ( v , x )=~ vTJ1lx. Note that the vector w = J ~ I va,‘flipped’ version of v, describes the plane as if in a Euclidean space 8’. Therefore, in any pseudo-Euclidean space, the inner product can be interpreted as a Euclidean operation, where one vector is ‘flipped’ by Jpg. The square distances can have any sign, e.g. d 2 ( A , C )= 0, d 2 ( A , B )= 1, d 2 ( B , C )= -1, d 2 ( D , A ) = -8, d 2 ( F ,E ) = -24 and d 2 ( E ,D ) = 32. Right: A pseudo-sphere 11x11; = x: - x; = 0. From the Euclidean point of view, this is an open set between two conjugated hyperbolas. Consequently, the rotation of a point is carried out along them.
later on). One may, however, consider another basis. Let V = IWTL be a vector space and let { v ~ } ? =be~ any basis. Consider two vectors of V ; x = C:k,zivi and y = Cr=lyivi, as expressed with respect the basis vectors. Let 4 : V X V + IR be a symmetric bilinear form in V. Then 4(x,y ) = C:=‘=, Cy=lziyi 4(vi,vj) = xTMy, where A,f = M ( 4 ) such that Mij = 4(vi, vj) for all i ,j = I, . . . , n is a matrix of the form 4 with respect to the basis { V ~ } F = ~ .Assume that 4 is non-degenerate, which means that the rank of M is n. If M is positivc (negative) definite, i.e. if 4(x,x) > 0 ( d ( x , x )< 0) for all x E V, then qh (4) defines a traditional inner product in V. If A f is indefinite, i.e. $ ( x , x ) is either positive or negative for x E V, then 4 defines an indefinite inner product in V. We will denote it as (x,y)bf = xTMy. If M is chosen to be J P q , then { V ~ } Y = ~ is an orthonormal basis in R(”q), p g = n. This means that any symmetric non-degeneratc bilinear form 4 defines a specific pseudo-Euclidean space. Any other such form $ will define either the same or different pseudo-Euclidean space, depending on the signature, i.e. the number of positive and negative eigenvalues of M ( $ ) . If the signatures of M(qh) and AT($) are identical, then the same pseudo-Euclidean space is obtained.
+
76
T h e dissimilarity representation f o r p a t t e r n recognition
Note that if the basis of R" is changed, then the matrix of the bilinear form changes as well. If T is a transformation matrix of the basis { V ~ } F = ~ to the basis { W ~ } F = ~ ,then M"(4) = TTM"(q5)T is the matrix of q5 with respect to the new basis. This directly follows by substituting x by (Tx) and Y by (TY) in ( X , Y ) M . By introducing algebraical structures to a vector space V = R",specific vector spaces are obtained, depending on a form of a bilinear map or of a metric. One may introduce both an inner product (., .) and an indefinite inner product (., .)E to the same vector space. Such inner products are naturally associated with the (indefinite) norm and the (indefinite) distance. Additional metrics or norms can also be introduced. In this way, a vector space may be explored more fully by equipping it with various structures. A pseudo-Euclidean space R(P>q)can also be represented as a Cartesian product IWP x i Rq. It is, thereby, a ( p q)-dimensional real subspace of the ( p y)-dimensional complex space CP+q, obtained by taking the real parts of the first p coordinates and the imaginary parts of the remaining q coordinates. This justifies Eqs. (2.1) and (2.2), and allows one to express the square distance as d2(x,y ) = d i p (x,y) - d& (x,y ) , where the distances on the right side are square Euclidean. A Euclidean space is a special case of the pseudo-Euclidean space as RP = IR(P3').
+
+
Definition 2.92 (Isometry between pseudo-Euclidean spaces) Let ( X ,(., and (Y,(., . ) E z ) be pseudo-Euclidean spaces. A mapping q5 : X + Y is an isometry if (4(z), 4(y))e, = (x,y ) ~ ~ . The notions of symmetric and orthogonal matrices should be now properly redefined. Sirice the matrix JPq plays a key role in the definitions below, we will denote them as 3-symmetric and 3-orthogonal matrices to make a distinction between matrices in indefinite and traditional inner product spaces.
Definition 2.93 (3-symmetric, 3-orthogonal matrices) an n x n matrix in IW(P>'J), n = p y. Then
+
1. A is J-sym,metric or 3-self-udjoint if 2. A is J-orthogon,al if J&AT&, A = I .
JPq
Let A be
ATJpq= A.
A 3-symmetric or 3-orthogonal matrix in a pseudo-Euclidean sense is neither symmetric nor orthogonal in the Euclidean sense. If, however, pQ(P>q) coincides with a Euclidean space, i.e. q = 0, then the above definitions simplify to the traditional ones, as Jpq becomes the identity operator I . For instance, by straightforward operations one can check that the matrix
Spaces
[
I : ]
is 3-symmetric in
IR(l.l)
with
77
3
=
[
and that
5 [ 2 1;l i s
3-orthogonal in IR(lil). If we denote A* = JP4 ATJp4,then the conditions above can be reformulated as A* = A for a J-symmetric matrix A and as A*A = I for a J-orthogonal matrix A . This already suggests that A* plays a special role of the adjoint operator, which will be discussed below. An extension of a pseudo-Euclidean space leads to a K r e h space, which is a generalization of a Hilbert space as a pseudo-Euclidean space is a generalization of a Euclidean space.
Definition 2.94 (Kreh and Pontryagin spaces) a vector space K over C such that
A K r e k space is
(1) There exists a Hermitian form, an indefinite inner product (.; . ) X on K , such that the following holds for all z, y; Z E K and Q, [ ~ E C :
(a) Hermitian symmetry: (z, y ) =~ (y, x ) i , (b) Linearity over IC and sesquilinearity over a!i (z, .)X
+ P'
@I:
(ax+ /j y 3z ) =~
(y, Z ) X .
(2) Ic admits a direct orthogonal decomposition K: = K+ @ K - such that ( K + , (., .)) and ( L-(.,.)) , are Hilbert spaces'' and ( Z + , L ) ~ = 0 for any z+ E IC+ and 2- E Ic-. The space IC- is also called an antispace with respect to (., .). If K is a vector space over also Def. 2.61.
R,then (., -)K is a
symmetric bilinear form; see
It follows that K admits a fundamental decomposition with a positive subspace Ic+ and a negative subspace K - . Therefore, Ic+ = ( K - ) l . Let dimIc+ = K,+ and dimIC- = 6- be the ranks of positivity and negativity, respectively. Krein spaces with a finite rank of negativity are called Pontrgagin spaces (in other sources, e.g. [Bognar, 19741, the rank of positivity is assumed to be finite). A Pontryagin space with a finite K - is denoted by II,. Note that if (.; .)Ic is positive definite or zero for zero vectors only. then K is a Hilbert space. Example 2.17 (Pseudo-Euclidean, Kre'in and Pontryagin spaces) Let V be a vector space of real sequences ( u l , v 2 , . . .) satisfying C,"=,I ~ i jv,l j 2 < 00. Then ( 2 ,y ) = ~ Czl ~i z i yi defines an inner product. If ~1 = 1 and ~j = -1 for all j> 1, then the inner product is given ans 03 ( xy)v ~ = z1y1 zi yyi and V becomes a Pontryagin space. If ~ 2 >j 0
xi=*
IgAll Hilbert spaces discussed here are assumed to be separable, i.e. they admit countable bases.
The dissimilarity representation for pattern recognitaon
78
and & z J - 1 < O for all j,then V equipped with (z,.y)vdefines a Kre'in space. If V is a vector space of finite sequences (v1, vz, . . . , v,) and all E~ # 0, then V with (2, y)v = ~i zi yi is a pseudo-Euclidean space.
c;:,
Definition 2.95 (Fundamental projections and fundamental symmetry) Let Ic = Ic+ @ Ic-. The orthogonal projections P+ and P- onto Ic+ and Ic- , respectively, are called f u n d a m e n t a l projections. Therefore, any x E K can be represented as x = P+ x + -'J x where I K = P+ P- is the identity operator in K . The linear operator 3 = P+ - P- is called t>he fundam,ental s y m m e t r y .
+
Corollary 2.9 (Indefinite inner product by the traditional one)
(x,w)lC
=
(x,3 , ) .
In Hilbert spaces, the cla.sses of symmetric, self-adjoint, isometric and unitary operators are well known [Dunford and Schwarz, 19581. Linear operators. carrying the same names can also be defined in Krein spaces. The dcfinitions are analogous and many results from Hilbert spaces can be generalized to KreYn spaces. However, due to indefiniteness of the inner product, the classes of special properties with respect to the inner product are larger. We will only present the most important results; see [Bognjr, 1974; Iohvidov e t al., 1982; Pyatkov, 2002; Goldfarb, 1984. 19851 for details. Definition 2.96 (H-scalar product, H-norm) Let z, y t IC. The H scalar product is defined as [x,y] = (2, J ~ ) Kand the H - n o r m is 11x1i~ = [x;:x] 4.
+
Let x E Ic be represented as x = x+ x-, where x+ E Ic+ and x- E Ic-. Since [x,y] = (2, Jy)x, we can write [z, y] = (z+, y + ) K - (z-, y - ) ~= (:x+,yj+) - (-(z-,y-)) = ( z , ~ ) This . means that [x,g]is equivalent to the traditional (Hilbert) inner product and Ic+ and Ic- are orthogonal with rcspect to [z, y]. Moreover, the associated Hilbert space IFI is then such that IFI = IIcI = Ic+ @ IIc-1, where 1Ic-I stands for (K-,(., .)). Formally, there is a close 'bound' between a Krein space arid its associated Hilbert space: Lemma 2.8 A decomposable, non-degenerate i n n e r product space Ic i s u K r e k space ifl f o r every f u n d a m e n t a l s y m m e t r y J ,t h e H-scalar product t u r n s it into a Hilbert space [Bognur, 19741. H-scalar product is a Hilbert inner product, therefore Ic can be regarded as a complete Hilbert space (Banach space) with the H-scalar product (Hnorm). As a result, the (strong) topology of K: is the norm topology of the associated Banach space, i.e. the H-norm topology. This topology is
79
Spaces
simply defined by the nornis in the associated Hilbert space 1KI. does not depend on the choice of fundamental symmetry2'. Therefore. continuity, convergence and other notions can be defined for K with respect to the H-norm.
Definition 2.97 (Convergence, Cauchy sequence) (1) The sequence x, in K: converges to x E K with respect to thc H-norm iff lim7L-m(z7h,y)K= (x,y)~ for all y E K: and limn400(xfl,,.x,)K: = (x,Z ) K . (2) The sequence x ,in K is Cauchy with respect to the H-norm iff (xT1x,,xn - x,)~ 40 and ( z T L , y form ) ~ a Cauchy sequence for y € K . Corollary 2.10 Since (x,y ) =~ [x+,y+] - [ L , g-1, then (z, y ) is~ continUOTLS with respect t o the H-norm in both x aad y. Theorem 2.23 (Schwarz inequality) /1x11~ l l v l l ~holds for all r c , y E K . Proof.
(1 .+1 2
2
The inequality l(z, y ) ~ l5
l [ ~ + ? ~ + 1 F [ ~ - ~ Y2 -5l l(1 .+1 I / ~ + I / + I I ~ ~ - ll lIY -l l )2 + 11~-/12)(llY+l/2 + IlY-112) = I l ~ l l ~ l l Y l l ~ . I(.,V)Kcl
5
Definition 2.98 (Orthonormal basis in a KreYn space) Krein space. If K+ and K - are separable Hilbert spaces, then a countable orthonormal basis {ei}g1 in K such that any z uniquely written as IC = a i e i for some E r and means that
czl
(ei,e,j)lc =
{
I
Let K be a there exists E K can be lat12. This
c,"=,
1, if i = j and P+ei is an orthonormal vcctor in K+, -1, if i = j and P-e, is an orthonormal vector in IC-. 0, otherwise.
Theorem 2.24 (Orthogonal expansions) I f K + and Ic- ure sepurable Hilbert spaces, then there exists a countable orthonmrmal basis in K . For x,y E IC, one has [Boyna'r, 19741:
{e,}zl
(1)
CF1 Ib,+ I 2
< 0.
(21 -Cz(e,.e,)KC=-l I(X, 4 K 1 2 5 (3)
(Z.?/)K: 5
c : ,
(Z.Z)K
5
Ci(e,,e,)K=l I(x7 +I3.
(ez, e i ) K ( x ,e i ) K ( e t ,Y ) K .
201na Krein space, there are infinitely many fundamental decompositions, hence fundamental symmetries and, consequently, infinitely many associated Hilbert spaces. However, the decompositions yield the same ranks of positivity and negativity, the same H-norm topologies; simply, they are isomorphic.
80
T h e dissimilarity representation f o r p a t t e r n recognition
Definition 2.99 (Adjoint operator) Let C,(IC, G ) be a space of continuous linear operators from the Krein space K onto the Krein space G. If G is K:, then C,(K) will be used. Note that C,(K:) is a dual space of K .
1. A* E C,(G,K) is a unique 3 - a d j o i n t of A E C,-(K,G) if ( A x , y ) g = ( x , A * y ) ~for : all z t K and all ~ E G . 2. A E &(K) is 3-self-adjoint ( 3 - s y m m e t r i c ) if A* = A , ( A z , y ) ~= (x,Ay)x for all rc, y E K . Definition 2.100 (Isometric and unitary operators) [Alpay et al., 1997; BognBr, 19741 Let A E L,(K,G) be a continuous linear operator K: + 4. A is 3 - i s o m e t r i c if A*A = I , and 3-coisometric if AA* = I,. A E L c ( K ) is 3 - u n i t a r y if (Arc, Ay), = (rc, y ) for ~ all rc, y E K , or in other words. if it is both 3-isometric and J-coisometric. Remark 2.12 Th,e fundamental s y m m e t r y 3 fulfills 3 Hence, 3 is J-symm,etric and J-unitary.
=
J* =
3-l.
Theorem 2.25 (Factorization) [Bogncir, 19'74l Every 3 - s y m m e t r i c operator A E C,(K:) can be expressed as A = T T * , where T E C,(V, K ) f o r some K r e k space V and ker(T) = 0 . Since a Krein space is inherently connected t o its associated Hilbert space, both the 3-adjoint and 3-unitary operators can be expressed through operators in this Hilbert space. Hence, the condition ( A x ,y)g = (z, A * ~ ) Kis: equivalent to stating that ( A x ,J g ) = (z,JA*y). This is further equivalent to ( J A z ,g) = (z, JA*g), since J is self-adjoint (symmetric) with respect t o (., .) in the associated Hilbert space \ K \ . This means that in IKl, the adjoint of ( J A ) is (JA*). Let A X be a Hilbert adjoint of A. (This means that A X = AT or A X = At, depending whether the Hilbert space is over W or C.)Then (JA)X = A X J = JA* and finally
A* = J A X J . For a 3-unitary operator in a Krein space K , we have (Arc,Ay)~:= (rc, y ) ~which , is equivalent to stating that (Ax,JAy) = (x,Jy) in the associated Hilbert space. Since J is self-adjoint in 1x1, then ( ( J A ) z ,(JA)y) = ( x ?y). So, ( J A ) is a unitary operator in IKI, which means that ( J A ) " = ( J A ) - l . Then Apl = J A X J ,which is equivalent to A-l = A*. Formally, we have: Theorem 2.26 Let A E & ( K , G ) , then A E Cc,(lKl,IGl) f o r the associated If A X is a Halbert adjoint of A, t h e n A* = Hilbert spaces 1x1 and IS/. Jx A XJ G , where JK:an,d JG are the fundamental symmetries. Moreover,
IIA*IIH = / I A X / I H= IlAIIH.
Spaces
81
Definition 2.101 (Krein regular subspace) Let K be a Krein space. A Krein regular subspace of K is a subspace X which is a Krein space in the inner product of K , i.e. (x,y ) x = ( x ,y ) for ~ all 2 , Y E X . Definition 2.102 (Positive, uniformly positive subspaces) A closed or non-closed subspace V E K is positive if ( x ,x ) >~0 for all x E V and V is ,unifownly positzue if it is positive and (x,Z)K >a(/xil&for a positive cv depending on X and the associated H-norm. Similar definitions can be made for negative, uniformly negative, nonnegative etc. subspaces. The term maximal, if added, stands for a subspace which is not properly contained in another subspace with the same property. Every maximal positive (negative) subspace of a Krein space is closed. If K = K+ 6? K - is the fundamental decomposition, then the subspaces K+ and K - are maximal uniformly positive or negative, respectively. Any maximal unifornily positive or negative subspace arises in this way [Bognk, 19741.
Definition 2.103 (Positive definite operator) A J-self-adjoint operator A E C , ( K ) is positive definite ( 3 - p d ) in a Kreiri space if (x,A ~ )>K0 for all x E K . The negative definiteness ( 3 - n d ) or semi-definiteness is defined accordingly. The above condition is equivalent to 0 < (2, AZ)K = (x,JAz). This means that A is J-pd if ( J A ) is pd in the associated Hilbert space 1 K 1. For instance, the fundamental symmetry J is J-pd, since it is J-symmetric and JJ = I.
Theorem 2.27 (Projection theorem) Let V be a closed, non-degenerate subspace of a K r e k space K . Th,en f o r every z E K , there exist unique zvE V and x 1 E V' such that z = x, 21, where 2 , = Px and P is the orthogonal projection of x onto V [Bogncir, 1974; Iohvadov et al., 19821. P h,us th,e following properties:
+
1. 2.
P2
=P.
(Px, y ) ~= : (x,Py)x. (J-self-adjoint)
c?,
( P z ,( I K
4.
z = Pz
~
P)Z ) K
+ (I;c
~
= 0.
P ) z and PI( I x - P ) .
Only the first two conditions are required f o r P t o be u projection. Definition 2.104 (Gram and cross-Gram operators) Let V be a linear subspace of K: spanned by linearly independent vectors { u l , v 2 , . The Gram operator, or the inner product operator, is defined as G,,,, =
82
T h e dassimilarity representataon for pat t er n recognition
( ( ~ ~ , v , ~ ) x ),..., i ,., j = 1 Assume further that a subspace ZA C IC, spanned by ( 7 ~ 1 , I L Z , . . . , u t } . is given. Then G,[,, = ( ( u i , u j ) ~ ) i , l , . _t :_j =_l ,..., is the cross- G r a m operator.
Theorem 2.28 (Projection onto a subspace) Let V be a linear subspace of a Krein space IC spanned by the vectors (w1,112, . . . , un}. Hence, V = [ ~ q + * u 2. ., .,ti,] is the basis of V . If the Gmrn operator G,, = ( ( v i , v j ) )i,,i=l....,n ~ is nonsingular, then the orthogonal projection of x E I C on,to V is unique and given by 5,
= V G:,
g,,
(2.3)
where g , is a n n x 1 vector of the elements (x,w i ) ~ ,i = 1 , 2 , . . . , ri,. If the G r a m operator G,, is singular, then either the projection does not exist or x ,= Vz, w h ~ r ez i s a solution t o the linear system, G,, z = g,. Proof. Let J be the fundamental symmetry of K . Let x, be the projection of x onto V . Based on Theorem 2.27, X can be uniquely decomposed as z = x,, ZL, such that z, E V and x i E VL. Moreover. ( x l ~ u z=) 0.~ Hence, ( : c , ’ u , ) ~= ( x t , > t i i ) xwhich > are the elements of g , . Since the vectors ( 7 i i } are linearly independent (as the span of V ) , then there exists a , siich t,hat x,,= C/4,a p ~= , V a , where a is a colurnn vector. The elements of g, become (x,,wi)x = (Va,’ui)x,‘i = 1,.. . , 72. This gives rise to g , = V t J V a = G,, a. If G,, is nonsingular. then a can be determined uniquely as G :; g , , hence x,, = V Gzf g , . If G,, is singular then either there is no solution to the cquation g , = G,,,, a or there are many solutions. 0
+
Remark 2.13 T h e same formulation as Eq. ( 2 . 3 ) holds for a projection onto a subspace in a Hilbert space, provided that the indefinite i r i n w product (.>.)x %sreplaced b y the usual inner product (., .). In a Hilbert space, the singularity of the Gram operator G,, means that { . u ~ } are ~ = linearly ~ dependent. In the case of a Krein space, this means that V contains an isotropic vector, i.e. there exists a linear combination of {ZI~}:=~which is orthogonal to every vector in V . In other words, to avoid the singularity of the Gram operator, the subspace V niust be nondegenerate.
Remark 2.14 Since ( x , ~ i i ) x= ( x > J u i )= d J w i , then by th,e use of the Hilbert operations only, we can write that g, = V i J x and also G,, = V t J V . As a result, x, = V(VtJ’V)plVtJ’xand the projection operator P onto the subspace V i s expressed as P = V ( V t J V ) - l V t J .
Spaces
83
Corollary 2.11 Let V = span(v1, va,. . . , v,} and Li = s p a n { u l ,u ~. ... , u,} be linear subspaces of K . Assume the Gram operator G,,, and the crossGra.m operator G,,,, = ( ( u i , u,)lc)i=l..t.j=l..,. If G,, is nonsingular, th,en by Theorem 2.28 the orthogonal projections of the elements f r o m K: onto V are given by Qv = G,, G,: V. Theorem 2.29 (Indefinite least-square problem from a Hilbert perspective)'l. Let V be a linear non-degenerate subspace of a Krefn space K spanned by the vectors { u l , v2,.. . , vn}. T h e n for the basis V = [2/1.v2,. . . ,un] of V and f o r u E K , the function, F ( x ) = IIu - Vxli$ reuch,es its m i n i m u m iff G,,, = V t J V is positive de,fin,ite in, a Hilbert sense in the uicinity of xS2'. x, is the sought solution such that 5 , = G;Jg, and g , = VtJu. Otherwise, n o solution exists.
+
2 x t V t J u z t V tJVz . From mathematProof. Jju- Vxllg = utJu ical analysis [Birkholc, 1986; Fichtenholz, 19971, .z', is a stationary point of F ( x ) if the gradient VFlz=zs = 0. By a straightforward differentiation of F , one gets 2VtJVx - 2 V t J u = 0, hence VtJVx, = V t J u . Since V is non-degenerate, then G :; exists. Therefore, by Remark 2.14, the potential solution is given as 2, = (VtJV)plVtJu = Gzfg,. Traditionally, the stationary point x, is a unique minimum iff the nxn>Hessian is positive definite in a Hilbcrt sense. The Hessian H = a2F H = 2 Vt JV = 2 Gu,. Since the matrix of indefinite inner products equals G,, is generally not positive definite, H 2 G,, is also not. Consequently, 2 , cannot be a global minimum. However, H is positive definite at the point 5 , . Observe that zLHz, = u ~ J V ( V ~ J V ) - ~ = V u~tJJ P Uu , where P is the projection matrix onto the space spanned by the column vectors of V ;see Remark 2.14. By Theorem 2.27, P is J-self-adjoint, hence JP is ~
(-)&=lz5
21For comparison, an equivalent formulation is given for the Hilbert case: (Least-square problem in a Hilbert space) Let V = span(v1, vz,. . . , w,,} be a linear subspace of a Hilbert space 'h and let V = [v]vz . . . v,]. Then for u E 7-1, the norm F ( z ) = IIu V t ~ 1 1is~minimized for z such t h a t , V z = u,, i.e. z is the orthogonal projection of 1~ onto V . The unique solution is zs = Gr:gu, where G,, if the Gram matrix (in a Hilbert space) and g, is a nxl vector of the elements ( u , v z ) , for i = 1 , 2 , . . . , n. roof. 1 ( u- V Z 1'~ = 1 / u - un u, - Vzl1' = ( /u- uu1 j 2 I (u, - V Z /1;' since (u - u v , u o V z ) = 0. From Theorem 2.28, we know that the projection of u onto V is unique and it is given by uz,= VGL: g,. F ( z ) is then minimized for llu?,- V z / I 2 = IlVGr; g , - V s / J 2being equal t o zero, if the sought solution is x g = G,;,'g,. 22From a Hilbert point of view, the minimum of F cannot be found for a n arbitrary indefinite space. Assume, for instance a K r e h space K = W(',') with the indefinite norm = z: - 'c;. Then for a particular z = [l z2], the minimum of 110 - ziig = 1 - xg is reached at --oo ~
+
~
Ilzllk
+
84
T h e dissimilarity representation f o r pattern recognition
positive definite in the Hilbert space lKI. Therefore, x i H x , = u t J P u holds for any U E K ,which means that H is positive definite at 2 , .
>0 0
Below, we present an interpretation of the indefinite least-square problem, but from the indefinite point of view. The solution does not change, however, the interpretation does:
Proposition 2.2 (Minimum in the Kre'in sense) L e t K be a K r e h space over t h e field F and let f (x)= Ilb - Axil: be a, quadratic f u n c t i o n in K . T h e minimum o,f f in K i s a special saddle p o i n t xs in t h e associated Hllbert space JKl. T h i s space i s specified by t h e indefiniteness of J. T h i s m e a n s t h a t f Ix+ takes t h e minimum a t x , ~ , + a n d f 1 ~ - takes t h e m a x i m u m a t x,,_, iihere x,,+ and x,,_ are t h e f u n d a m e n t a l projections of x, E K o n t o K+ and K - , respectively.
+
+
Proof. Givcn that J = P+ (-P-), we have: f ( x ) = f + ( x ) f - ( x ) , whcre f+(x) = ztAtP+Ax - 2xtAtP+b b t P + b and f _ ( x ) = -(xtAtP_Ax 2xtAtP-b btP_b) are the restrictions of .f to Ic+ and K - , respectively. As f+ and f - are defined in the complementary subspaces (L is the orthogonal coniplement of K + ) , the minimum of f is realized by determining x,,+ for which f+ reaches its minimurn and finding :I;,,- for which f - reaches its maximum. The final solution is then 2, = z,,+ x , ~ , -(this is due to K being the direct orthogonal sum of K+ and K - ) . The critical points are the ones for which the gradients of f+ and f - are zero. This leads to x,,+ = (iltP+A)_lAtP+b and x;,,_ = (AtP-A)-lAtP_b. The Hessian matrices become, correspondingly, H+ = 2AtP+A and H - = -2AtP-A. Thanks to the properties of projection operators, P+ = Pip+ and P_ = Pip-, Theorem 2.27, one has H+ = 2 (P+A)t(P+A),which is positive definite by the construction, and H - = -2 (P_A)t(P-A), which is negative definite. Hence, f+ reaches its 0 rnininium for x,>+ and f- reaches its maximum for zs,-. ~
+
+
+
Theorem 2.30 (Indefinite least-square problem) L e t V be a linear non-degenerate su.bspace of a K r e i n space K spanned by th,e vectors ,vn] as t h e basis of V . Th,en for 712, (711, 112,. . . , u , , ~ } .D e n o t e V = [UI, the f u n c t i o n F ( x ) = liu, - Vxllg i s m i n i m i z e d in t h e Kreiiz sense for x, being t h e orthogonal projection o f 'u on,to V . T h e u n i q u e solution i s f o u n d 0,s x, = G;Jgu.
Proof. Similarly as in the proof above, we have: IIu - Vxllc = u + J u2 ztVtJu+xtVtJVz. x, is a stationary point of F ( z ) if the V7Fl,=,3 = 0. This leads to the equation VtJVz, = VtJu. By Remark 2.14, the solution
Spaces
85
is then given as x, = G,;i,lg,. We require that the Hessian, equal to 2 V t J V , is indefinite with the indefiniteness specified by J.This holds as VtP+V is positive definite in a Hilbert space K+, hence z , , ~ +yields a niiriimurn there and -VtP-V is negative definite in a Hilbert space IK-1, hence x S , ~ 0 yields a maximum there; see Proposition 2 . 2 . Remark 2.15 Note that the system of linear eguation,s V t J V x = V t J u solved in an inde5nite least-square problem can be expressed as Q'Qz = Q*u, where Q = V and Q* = V t J . This can be interpreted as a system of normal equations i n a Krefn space. Consequently, G;JVtJ is a pseudoinverse o f V in this space. 2.7.1
Reproducing kernel Krez'n spaces
Reproducing kernel Krein spaces (RKKS) are natural extensions of rcproducing kernel Hilbert spaces (RKHS). The basic intuition here relies on the fact that a Krein space is composed as a direct orthogonal siini of two Hilbcrt spaces. hence the reproducing property of the Hilbert kernels can he extended to a KreYn space, basically by constructing two reproducing Hilbert kernels and combining them in a usual way. We will present facts on reproducing kernel Pontryagin spaces (RKPS), which are Krein spaces with a finite rank of negativity (in other sources, e.g. [BognAr, 19741: a rank of positivity is assumed to be finite). Here, we will only present the most important issues, for details and proofs, see [Alpay et al., 19971 and also the articles [Constantinescu and Gheondea, 2001; Dritschel and Rovnyak, 1996; Rovnyak, 19991. All Hilbert spaces associated to Krein spaces are considered to be separable. Definition 2.105 (Hermitian kernel) Let X be a Krein space. A function K defined on X x X 4 CC of coritinuous linear operators in a Krein space X , is called a Hermitian kernel if K ( z ,y) = K ( z ,y)* for all 2 , y E X . K(z,y) has K negative squares, where K is a nonnegative integer, if every matrix { K ( x i , ~ ~ ) }based ? ~ =on ~ { Q , x ~ ., . . , xn} E X and n = 1 , 2 , . . . has at most K negative eigenvalues and there exists at least one such a matrix that has exactly K negative eigenvalues. Lemma 2.9 Let IIK be a Pontryagin space and let 2 1 , 2 2 , . . . 2, E IT,. The Gram operator G = ( ( ~ i , x j ) n J & = cwn ~ have n o more than K negative eigenvalues. Every total set in ITK contains a finite subset whose Gram matrix has exactly K negative eigenvalues [Alpav ct al., 19971.
86
T h e dissimilarity representation f o r p a t t e r n recognition
Lemma 2.10 Let 5 1 , 5 2 , ,zn,belong t o a n inner product space ( K . (.. . ) K c ) . T h e n the n,umber of negative eigenvalues of the G r a m operator G = ((xi,z j ) ~ ) ~ j coincides ,l with the dimension of the maximal negative subspace of s p a n ( x 1 , . . . , z T L[Alpay } et al., 19971. Definition 2.106 (Reproducing kernel Kreln space) Let X be a KreTn space arid let C X be a space of functions f : X + CC. Assunie K K c C X is a Kreiri space of continuous linear functionals on X . A Hermitian fiinctiori K : X X X + C is a reproducing kernel K K if (1) K ( X ; ) E K K for all Z E X and (2) K ( z . .) is the representer of evaluation at z in K K : ( f , K ( z , . ) ) K ,for all ~ E K and K all (fixed) Z E X .
,f(x) =
K K equipped with K is a reproducing kernel Krein space (RKKS). If K K is a Poritryagin space, then the resulting space of functions is called a reproducin,g kxrnel Poritryagin space (RKPS).
Corollary 2.12 Let K = Lc(X,CC) be a K r e f n space of continuous linear f u n c t i o n d s defined over the dom,ain X. If the eiJaluation functional 6,, 6 , r [ f ]= f (z) is defined and continuous for every z E X , t h e n K is a RKKS. Hence, there exists K ( z , . )E K such that 6, : z K ( z ; . )or 6 J f ] = ,f (x) = ( K ( x ,.), ,f ( . ) ) K . Therefore, the reproducing kernel is unique arid can he written as K ( z :y ) = 6, 6;, where 6; EL,-(@,K K ) is the 3-adjoint of the evaluation mapping E ( x ) for any fixed z E X . Similarly to the Hilbert case, m e has ( K ( z ,.), K(y, . ) ) K ~= K ( z ,y ) . In the case of the Pontryagin space, K ( z , y ) has at most /c. negative squares, Def. 2.105, where m is the rank of negativity. --f
Theorem 2.31 (On reproducing kernels) [Rovnyak, 19991 Let K ( z ,y) be a Hermitian kernel X X X+ C. T h e following assertions are equivalent:
1. K ( x ;y ) is a reproducing kernel for some K r e f n space K K consisting of functions over the domaisn X . 2. K ( z ,y) has a nonnegative majorane3 L ( s ,y) o n X X X . 3. K ( z .y) = K+(x,y ) - lip(.,y) for some nonnegative definite kernels K+ and lipo n X X X .
If the above holds, then for a given nonnegative majorant L ( z , y ) for K ( x ,y), there exists a K r e f n space K K with a reproducing kernel K ( x ,y ) , 23A nonnegative majorant L for K is a nonnegative definite kernel L such that L - K and &K are nonnegative definite kernels in the ‘Hilbert’sense, i.e. according to Def. 2.82.
Spaces
87
which as continuously contained in the Hi,lbert space XL with, the reprodli~cin.9 kernel L ( x . y).
+
Note that L(z,y) can be chosen as K+(s,y) K-(z.y). Note also that the consequence of this theorem is that the decomposition K ( x ,y ) = K + ( z ,y)-K-(x, y) can be realized such that K+ is a reproducing (Hilbert) kernel for ( K K ) + and K- is a reproducing (Hilbert) kernel for I(ICI<)-I in a fundamental deconiposition K K = ( K K ) + @ l ( K ~ ) - l . Practically, this means that the K* can be chosen as reproducing kerncls for the spaces in a fundamental decomposition.
Theorem 2.32 (On reproducing kernels in RKPS) Supjmse that K l ( x ,y) and K ~ ( xy), are reproducing kernels f o r Pontryagin spaccs n,, and 6 2 , a n d nK2of linear ,functions o n X with the ranks of negativity respectively. Then K ( z ,y) = Kl(x, y) K2(x, 9) is the reproducing kernel for a Pontryayin space nKwith K 5 K I ~ 2 Eqimlity . holds iff 7t = rlK,n l l K 2is a Hilbert space with the inner product: ( 2 ,y ) =~ (x.y ) n K l (2. y)n,., f o r z;y ~ 7 - t .
+ +
2.8
+
Discussion
Some classes of spaces have briefly been described. These are pretopological and topological spaces, generalized metric spaces, norrned and inner product, spaces. Normed and metric spaccs are topological. As a norm can be associated to an inner product, hence also a topology can be associated to it. Euclidean, Hilbert and Banacli spaces arc the usual examples of inner product and normed spaces, respectively. Most of the learning methodology developed deals with vectorial (feature-based) representations of objects either in Euclidean or Hilbert spaces. The reason is that the iriner product, the norin (defined by the inner product), t,he metric (defined by the norm) and topology (tlrfincd by the metric balls) coincide in these spaces. Since the probabilistic framework and other learning approaches are well developed, a natural requirerncnt for dissimilarity data seems to be their metric behavior. As a result, dissirnilarity measures used in statistical learning are either constructed or corrected to obey this requirement. In addition to the Euclidean distance. other tp metrics are considered in vector spaces. usually the city block or inax-norm distance. However, marly general dissimilarity measures have been derived for object comparison or matching in computer vision, pattern recognition
88
T h e dissimilarity representation f o r pattern recognition
and related fields, as briefly described in Chapter 5. Therefore, there is a need for learning paradigms applicable to general dissimilarity measures. Only if' a proper mathematical foundation is established for metric and non-metric dissimilarities, more general measures may be used and developed further. It is our aim to present general learning methods and to apply them to a number of problems. They will be constructed in a mathematical framework relying on generalized metric spaces and Krein spaces, which reduce to pseudo-Euclidean spaces for finite data representations. Since K r e h spaces are extensions of Hilbert spaces, Krein spaces accommodate a more general interpretation of dissimilarity data than Hilbert spaces. Since our starting point is a dissimilarity representation, all the relations and close links between generalized metric spaces and (indefinite) inner product spaces, as well as those between generalized metric spaces and generalized topological spaces are important. The most essential properties have ,just been introduced. How these spaces are used for learning is the topic of Chapters 3 and 4.
Chapter 3
Characterization of dissimilarities
A rock pile ceases t o be a rock pile the moment a single man co.ntemplates it, bearing within him the image of a cathedral. “FLIGHT TO ARRAS” ANTOINEDE: SAINT-EXUPERY ~
Various spaces in the context of generalized metric spaces were introduced in Chapter 2 . These are pretopological spaces, normed and (indefinite) inner product spaces. This chapter focuses on theoretical aspects of dissimilarities and relations between generalized metric spaces and inner product spaces. Isometric embeddings, as well as semimetric and metric transformations are described, as they are basic means for characterizing dissimilarities. A theory is also presented that deals with transformations preserving metric properties or allows one to test, whether a, pa,rticula,rdistance is Euclidean. Shortly, this chapter introduces some tools that check or enhance particular properties of dissimilarity matrices. It prepares the ground for the data exploration techniques and learning algorithms discussed in Chapters 6 10. In practice, we will always deal with finite samples, i.e. a finite collection of numerically represented data entities. Such a finite representation is used to define a space or a more general framework having suitable properties, in which learning techniques will be applied. Given a set R of n objects, this representation becomes an n x n dissiniilarity matrix D ( R ,R ) whose elements are denoted by d i j ) . Each entry d i j is a dissimilarity value bctweeri the i-th and j - t h objects. Consequently, the properties of dissimilarity measures and possible interpretation spaces are mainly discussed in the context of such finite collections. Metric distances have advantageous properties, and therefore many methods work in (Euclidean) metric spaces. Section 3.1 briefly introduces basic aspects of city block and Euclidean embeddings. Such isometric mappings find correspondences between an abstract space defined by a (fit1it.e) representation of distances and a chosen metric space. Seniinietric and metric transformations are also considered. Next, tree models for t,he rep~
~
90
T h e dissimilaraty representation for p a t t e r n recognition
resentation of dissimilarity relations are introduced. Section 3.4 presents basic relations and properties of dissimilarity matrices, especially with respect, to metric arid Euclidean behavior. Many traditional learning methods are designed in a Hilbert space or in a Euclidean space equipped with a Euclidean distance. Therefore, given a distance measurc, it is important to know whether it shows Euclidean behavior. For the Euclidean distance, every finite representation D can perfectly be embedded in a Euclidean space. This means that a configuration in a Euclidean space can be found such that the original distances are preserved. If the measure is non-Euclidean, then either it is corrected to become Euclidean or it is used as such. Any premetric non-Euclidean measure, i.e. any measure satisfying the definiteness arid symmetry constraints, Def. 2.45, can be interpreted as a indefinite distance in a pseudoEuclidean space (KrcYn space). Section 3.5 explains how both isometric and approximate embeddings in a pseudo-Euclidean space can be realized. Such mappings are examples of spatial models of the dissimilarity data. A few ot,lier projection techniques are presented in See. 3.6. Additionally, spherical embeddings are discussed in Sec. 3.5.9. 3.1
Embeddings, tree models and transformations
Both enibeddirigs arid tree models are means to represent generalized metric spaces in spatial organizations. The main purpose of (isometric) embeddings is to detcrrniric whether a given space ( X , d ) with a dissimilarity measure d is isometrically equivalent to a predefined space possessing some ~isefiillproperties. 3.1.1
Embeddings
Ernbeddings are B useful tool in practical problems where finite dissimilarity representations are considered, that is, firiitc (generalized) metric spaces ( X ,d ) defined by the corresponding dissimilarity matrix D . If an equivalence between such spaces arid other known spaces is established, the latter. if possessing favorable properties, can be used to set learning paradigms there. Although many spaces can be considered in this context, Euclidean and Hilbert spaces are the ones most extensively investigated. The reason of their applicability is that they are simultaneously topological, inner product, norrned, arid metric spaces, where the inner product is used to define
Characterization
of
dissimilaritres
91
the norm, which further defines the metric and topology. Because of these properties, many theoretical models and a probabilistic framework exist, which are used to solve pattern recognition problems formulated in such spaces. Studying the questions related to embeddings allows one to better characterize commonly used metric spaces. The essential work on Euclidean and Hilbert embeddirigs was done in [Cayley, 1841; Menger, 1931; Schoenberg, l935,1938a,b;Blurnenthal, 19531. Since Euclidean embeddings require a thorough treatment, they are the subject of Sec. 3.4 and Sec. 3.5. Here, we will briefly describe a few aspects of the e,-embeddings, with el-embeddings in particular. The el-embeddings rely on the additive property of the el metric, which. in turn. can be represented by x i additive tree; see also Def. 3.10. Let us recall some basic definitions; see also Example 2.5. Assiinie that M ( 0 )is a set of function classes on a closed and bounded set 52 measurable in the Lebesgue sense. Formally, one has: 0
! : = (RW1rL,dp), where d,(x,y)
!z = (RvL,d,), where d , ( x , 0
=
(C’,”=, Izi - y$’);
and p > O .
y ) = dmax(x,y) = max, )xi - y,J.
(s,
( M ( [ u , b ] ) , d pwhere ), d,(f,g) = If(.) - g ( z ) l ” d z ) b , p > O . L g = ( M ( [ u , b ] ) , d , ) whered,(f,g)=sup,If(x) , -g(x)ldz.
Lf
=
tp defines an sra-dirneIisiona1 space, while lr describes an infinite dirneiisional space. For simplicity, we will also write ep instead of, ; ! where the dimension m is fixed. I f p 2 1, then tpand L r are metric spaces, othcrwise, they are quasinietric spaces. el stands for the city block metric; while !22 is the Euclidean metric. Let ( X ,dx) arid (Y,d y ) be Definition 3.1 (Isometric embedding) metric spaces. ( X ,d x ) is isometrically enibeddable into (Y,d y ) if there exists an isometry 4 : X + Y , i.e. , a mapping 4 siicli that d x ( z l . : x : 2 ) = d ~ ’ ( 4 5 ( a 4) ,5 ( ~ ) )for all x 1 , n EX.
Definition 3.2 (&embedding) A metric space ( X ,d ) is !,-embeddable if ( X , d ) is isometrically embeddable into the space e; for some intcger m 2 1. The smallest such integer is called the !,-dimension. Isometrics are injective’. Two spaces are isometrically isomorphic (see footnote 11 on page 59) if there exists a bijective2 isometry between them. ‘An injective function maps distinct input values to distinct output values. 2 A bijective function f : X + Y is both injective and surjective. A surjective function has its range equal to its codomain, i.e. for every y t Y there exists z E X with f(z)= y.
92
The dzssirnilarity representation for pattern recognition
Example of a distance matrix D ( X , X ) describing a metric space ( X , d ) , X = { I , J , K , L } that, cannot be embedded in a Hilbert space.
Figure 3.1
In this case, the two spaces are essentially identical. Every metric space is isometrically isomorphic to a subset of some normed vector space. Every complete metric space is isometrically isomorphic to a closed subset of some Banach space.
Definition 3.3 (Lipschitz continuous mapping and contraction) Lct (X,dx) and ( Y , d y ) be metric spaces. A mapping 4 : X + Y is Lipschitz continuous if there exists a constant K such that dy ( q ! ~ ( z4(z2)) ~), < n dx ( X I , x 2 ) holds for all 21, x2 E X . If n < 1, then 4 is a contraction. Theorem 3.1 (On Lipschitz continuous mappings) ( I ) E ? i e ~ yLipschitz con,tinuous mapping is continuous. The reverse is not true. Sketch of proof. To see th,ut a continuous mupping may not be L%pschitz,assume X = R a.nd d(z,y) = lx-yl. Th,e function f ( z )= x2 on X i s continuous. However, no n exists such that I z2- y2 1 5 n /x-yl for all x , y E X . Consider y = 0 . Then for 15/ 5 1, x 2 5 1x1, but for 1x1 > 1, x 2 > 2 , hence a contradiction,. f is continuous, but not Lipschitz. (2) Let ( X , d ) be a metric space. The mapping x + d ( z . z ) , z E X , i s Lipsch,itz with = 1. Theorem 3.2 A Euclidean space IWvL can be embedded an a Hilbert space. Every finite subset of rri, elements in a Hilbert space can be embedded in ~ m - 1[BIunienth,al, 19531.
Not every metric space (X,d) can be embedded in a Hilbert space. A counterexample is a metric space consisting of four elements X = {I,,I, I(,L} and represented by a distance matrix in Fig. 3.1. From the definition of d , it follows that there are two points J and L which should be considered as the middle points between I and K . In a Hilbert space, howevcr, every pair 3: and y determines a unique middle point z = i ( x + y) between them such that d ( z , z ) = d ( z ,y) = d(x.y) [Blumenthal, 19531.
4
Hcnce, a contradiction.
ChaTaCteTZZatzOn
of dissamilarities
93
Theorem 3.3 (Schoenberg) Let y E (0.21 and r E (0, t ] . Then, th,e spaces (Rm,d;) and ( M ([a,b ] ) ,d;), where d; i s t h e r-th power o,f th,e P,- or' C,- distance, are isometrically embeddable in a Hilbert space [Schoen,berg, 19371. This theorem covers classes of both nori-metric and metric spaces with the dissimilarity d,, whose distances can be transformed by an appropriate power function such that they become enibeddable in a Hilhcrt space. If a finite collection of points is considered, then Theorem 3.3 refers to a Euclidean ernbedding. This theorem justifies the validity of a common sense approach, where non-metric or non-Euclidean finite spaces ( X ,d ) are transformed by a power transformation with the power r E ( 0 , l ) . This is done in practice, sincc such a transformation may be capable of making the space metric or even Euclidean; see Sec. 3.3. After such a transforrriatioii, a prenietric space may also become semirnetric. Lemma 3.1 (On embeddability of metric spaces into max-norm spaces) Any finite m e t r i c space ( X ,d ) is e,-embeddable. Assume that a collection of ri objects is given as X = Then a metric space ( X ,d) can be ernbedded in 8&. Define a data-dependcnt mapping 4 : X 4 R" such that $(:I;) = [d(x,x1),d(z,z2), . . . d ( x , z n ) l T . Denote zi = q5(xZ)for i = 1,2 , . . . , n . Then one has dm(4(zi),4(zj)) = dm(zi,z3) = rriaxl
{ X I ,5 2 . . . . , x n } .
~
Theorem 3.4 Let ( X ; d ) be a metric space.
The following implicatlon,s hold [Deza and Laurent, 1997; Ball, 1990; Bretagnolle et al., 1966; W e l l s and Williams, 197'51:
94
The dissimilarity representatton for pattern recognition
d is ta-embeddable
4
U for 2
d i s el-embeddable
4
U
E d is !?,-embeddable
for 15 p < m d i is t,-embeddable
l
Definition 3.4 (Cut semimetric) A partition of a set X into two subsets V c X and X\V is called a c u t . Such a cut defines a cut s e m i m e t r i c as 6 v ( z , y ) = Z(1V n {z,v}\ = l ) ,where 1 . 1 denotes the cardinality of tkie set and Z is the indicator function. A cut serriiriietric (metric without the definiteness condition required) is used for a further characterization of the !I-distance. A distance d is lgernbeddable if d can be decomposed as a nonnegative linear combination of nr cut metrics. Moreover, there exists a nonnegative measure space3 such that d is the measure of the symmetric difference. Formally, one has [Deza and Laurent, 19941: Theorem 3.5 (!I-distance characterized) L e t X = { X I , 2 2 , . . . , x,} a n d dz, = d ( z z , z , ) . Assume ( X , d ) i s a f i n i t e m e t r i c space. L e t i , j = 1;2 , . . . . n . T h e ,following assertions are equivalent:
Cv:vcx
d(xi,z j ) = XV 6 ~ ( z i , z j )where , XV .:%LE (1) d i j (2) There exists a nonnegative measure (probability) space (n,A, p ) and A l , A 2 , . . . , A , E A s u c h t h a t d,, = p (Ai AA,) = p((Ai\Aj) U
( 4 7 \A)).
() ( X ,d ) is -P;'L-embeddable, i. e. there exist vectors {vl, v2, . . . , v,} E Rm f o r ,some integer 'ru 2 1 s u c h that d i j = lvik - v j k 1.
cy=l
(a,
'A measure space is defined as a triple A, p ) , where R is a set, A is a 0-algebra on (a collection of subsets A such that E A and if A E:A, then a\A E: A, and a union of any number of subsets of A belongs to A, as well).
Characterazation of dassamilarities
95
Although the el-distance admits a decomposition by cut seniimetrics. it
does not, simplify the process of deciding whether a particular distance is isometrically el-embeddable or not. The difficulty of devising a polynomial algoritlim for an (;"-embedding is due to the non-uniqueness of such a dccomposition. Given n points, there are 2n-1 possible cuts. This problem is known to be NP-complete [Karzanov, 19851.
3.1.2
Distorted metric embeddings
Not every metric space is isometrically embeddable into the or Y, spaces. This is, however, possible with distortion. Let ( X ,dx) and (Y,d y ) be mctric spaces. Then a mapping 4 : X 4 Y is an embedding with a distortion c 2 1, called c-embedding, or bi-Lipschitz embedding, if there exists T > 0 such that r d x ( x , y ) 5 d y ( 4 ( z ) , $ ( y ) ) 5 c r d ~ ( x , yholds ) for all x , y ~ X . There has been a lot of research devoted to the problems of distorted embeddings into the &I, 2! and too spaces. For instance, a classical result is that any finite semimetric space ( X , d ) of n points, = n, can be embedded into L1 with distortion c = O(log, n ) [Bourgain, 19851. Anotlier result is that a finite n-element metric space ( X , d ) can be embedded into $ for p 2 1 and k = U((log,n)') dimensions with distortion c = U(log, s n ) . c-embeddings are not treated in this book. Nevertheless, they remain of high interest for further study as such embeddings to low-dimensional vector spaces may be useful for learning. An overview of import,ant results concerning low-distortion embeddings with the algorithmic emphasis can be found e.g. in the following works [Indyk, 2001; Linial, 2002; MatouSek, 2002; Krauthgamer et al., 20041.
1x1
3.2
Tree models for dissimilarities
Representing dissimilarities by trees is an important issue in many scientific fields, like data analysis, mathematical psychology, historical linguistics, bioinforrnatics and evolutionary biology; see e.g. [Sneath arid Sokal, 1973; Barthdemy and Guknoche. 1991; Kim and Warnow, 19991. A tree structure of the dissimilarity matrix allows one for a natural interpretation of relations between the objects. It is a useful tool for understanding the data structure, especially for a smaller number of objects. where the resiilts can be presented visually. Further on, tree models support the hierarchical clustering scheme based 011 proxirnities; see Chapter 6. In brief. tree mod-
96
T h e dassimilarity representation f o r pattern recognition
els order the dissimilarity information in ternls of organizational aspects, hierarchical and nested structures. Formal definitions follow.
Definition 3.5 (Graph) A graph, (V, E ) consists of a set of nodes, V = { v l , va;. . . . w,} arid a set of edges E { ( v i ,wj) : vJ,vJ E V} connecting two nodes. In a directed graph, the pairs (w,,w3) in E are ordered, in an undirected graph, they are not. The degree of a node vi is the number of edges incident in 7 ) i . Nodes with degree one arc called leaves. Nodes with larger degrrc are called inkrnal. A weighted graph (V,E , W ) is a graph (V,E ) with nonnegative weights wij E W associated to the edges (vi,vj). Definition 3.6 (Path and minimum path distance in a graph) Let (V,E ) be a graph. A p a t h between two nodcs 'oil and vtk is a sequence of connected edges (vil, u i 2 ) , (wi2, (vikpl, uik).A graph is connected if there is a path between any of the vertices. The length of a path is then a sum of the weights of the connected edges between the nodes vil and < q kThe . minimum p a t h distance between any two nodes is defined as the minimum length of the paths connecting them. Definition 3.7 (Tree) A tree is a connected graph, where each pair of nodes is connected by a unique path.
A key model is the additive tree model, which represents objects by nodes of a tree and defines dissimilarities as path lengths between two nodes, computed by the sum of the weights of the edges on the path. A special case of a rootcd additive tree is an ultrametric tree; in which the distance from the root to every leaf is identical. The formal definitions follow. Definition 3.8 Let ( X ,d ) be a metric space. If X = { X I , 2 2 , . . . ,z,}, then d is described by an rixn, distance matrix D = ( d t 3 ) such that dij = d (xi:x3).Additional constraints can be considered for all IC, y, z , EX: ultrametric neyuuliky: d(z,z ) 5 max { d( z, y ) , d ( y , z ) } .
A metric space satisfying the above inequality is called ultrametric or ri,ori,-Archirr~.edean; see also Def. 2.44 and fact 2.11. ( 2 ) four-point property: d ( z ,y) d ( z , u)I max { d ( z , u ) d ( y , z ) ,d ( z , z) d(Y. u)1. (3) hypermetric inequality: y T D y 5 0 and y E Z n such that yT1 = 0; where D is an ' n x n , distance matrix. A metric space obeying the above inequality is called hypermetric. An infinite metric space X is hypermetric if the inequality holds for every finite subspace of X .
+
+
+
Characterrzation
of
drssrmrlarrties
97
(4) neya,tiwe t.y;ne: y T D y 5 0 for y E Rn such that
yT1= 0, where D is an nxn distance matrix. A metric space obeying the above inequality is called of ne,qati,ue trjpe. An infinite metric space X is of negative type if the inequality holds for every finite subspace of X .
The ultrametric and four-point properties can also be considered for prcmetric spaces. The four-point property and ultrametric inequality are iriliereritly connected to a tree structure of the distance data. Definition 3.9 (Additive, ultrametric trees) An additive tree is a connected, undirected graph where each pair of nodcs is connected by a unique path. An ultrametric tree is an additive tree in which each leaf is equidistant (along the path) from the root. Let ( X ,d ) be a finite metric space and let D be the corresponding distance matrix. D defines a unique additive tree iff the four-point inequality holds for any quadruple from X. D defines a unique ultrametric tree iff the ultrametric inequality holds for any triplet from X . See Fig. 3.2 for examples. Formally, one has [Barthdemy arid Gui.noche, 19911: Definition 3.10 (Additive distance tree) Let D = ( d t J )be it11 n x n symmetric distance matrix between the elements of X . Let To be an edgc weighted tree with at least n nodes, where n distinct nodes of To arc labelccl by the elements of X. To is an additive tree for the matrix D if for every pair of the labeled nodes ( i , j ) ,the path from the node i to the node j has the total weight equal to d,, . In an additive tree the root is not determined and choosing differcrit roots may suggest different interpretations or relations iri the clata. As the root will likely distinguish a few significant groups in the data, it should be chosen to enhance the interpretability of the data. This, however. requires some prior knowledge on cluster tendencies of some objects. Another possibility is to place the root a t the midpoint between two most distant objects in thc tree or at a node which minimizes the variance of the distances from the root to the leaf nodes, so it splits the data into homogeneous groups. In a weighted graph, the path metric defines the shortest path, judged by the total sum of weights, between two nodes. A metric satisfying the four-point property is the path metric of nonnegative weighted trees.
98
T h e dissimilarity representation for p a t t e r n recognition
7 j
D I - J
(a) Additive distance tree
K
L
D I
J
K
L
(b) Ultrametric distance tree
Figure 3 . 2 Tree examples. (a) Additive distance tree for the given D satisfying the four-point property. (b) Ultrametric distance tree for the given ultrametric distance matrix D .
Theorem 3.6 E v e r y p a t h distance, i e . t h e shortest-path distance in a tree, i sI! -embeddable Proof. Let T = (V, E ) be a tree, where '11 is a set of vertices arid E is a set of edges. Every edge e = (vk,vl)introduces a partition of V into two sets V k and Kc = V\Vk such that v k E V k and YL E 4". The path metric of T caii be decomposed as d T ( u i , v j ) = C ( V k , V ~ ) E E b V ~ ( ~ where k , , u l ) 6vk, , is the cut metric, Def. 3.4. In the case of a weighted tree, the values of 6v,,( i i k , ,ul) are multiplied by the weights ~ i k l .Since the path metric can be decomposed as a linear combination of cut metrics, then it is I! -embeddable by Theorem 3.5. 0 This statement makes a connection between the embeddability into a l1space and a path metric of an additive tree. Consequently, any distance matrix D which is !,-embeddable can be represented by a path metric of an additive tree T . NIoreover, if a path metric is embeddable in !?, then the tree has at most 2m leaves [Hadlock and Hoffman, 19781. This establishes a discrete model of distance relations between the objects.
Definition 3.11 (Ultrametric distance tree) Lct D = ( d i j ) be an n,xn symmetric distance matrix between the elements of X . An ultram,etric distance tree for the matrix D , also called dendrogram, is a rooted tree TD with the following properties: 1. TD contains n. leaves, each labeled by a unique element of X . 2. Each internal node is labeled by distance values from D such that d i j is the label of the least coninion ancestor of the leaves i and j . 3 . Along any path from the root to a leaf, the numbers labeling internal nodes strictly decrease. 4. Each internal node has at least two children.
Characterization of dissimilarities
99
In practice, there might be no additive tree to represent the distance data D , since there might be no path metric coinciding exactly with D . A solution can be offered by finding a tree which models the givcn dissimilarities as well as possible by the path distances. Such a tree imposes a metric D T , which should provide the best approximation of D iinder some criterion, defined e.g. by the !I, e, or ! , norms. This is a formulation of a numerical taxonomy problem, which has received a great deal of attention over the years; see e.g. [Barth4lemy and Guknocl-ie, 1991: Kim and Warnow, 19991. The additive or ultrametric tree fitting problems are known to be NP-hard under the (1 arid t 2 norms [Sneath and Sokal, 1973; Kim and Warnow, 1999; Cliepoi and Fichet, 20001. In the case of tjhe t m norm, the same holds for an additive tree [Agarwala e t al., 19991, however the optimal ultrametric tree can be computed in a polynomial time [Farach et al., 19951. There exists a number of other methods trying to construct either an ultrametric or additive tree such that the path distance approximates the given distance as well as possible. Refer t o [Barth4lemy and Guknoche, 1991; Gordon, 1996; Kim and Warnow, 1999; Sneatli and Sokal, 19731 for general descriptions or to [Agarwala et al.: 1999: Cohen and Farach, 1997; Farach et al., 1995; Gascucl, 1997, 2000; de Soete, 1 9 8 4 ~ ; de Soete and Caroll, 19961 for more specific algorithms. See also Sec. 6.4.
3.3
Useful transformations
Some transformations are considered, which either preserve the (semi)metric properties or, in particular cases, change a dissimilarity rneasure into a metric. Properties of convex and concave functions are introduced in Appendix A. 3.3.1
Transformations in semimetric spaces
Theorem 3.7 ((Semi)metric transformation) Let (X,d) be a s e m i m e t r i c space. Then the composition of mappings, f o d , i s also sem%metrici f f :Rt+ Rt is a non-decreasing a n d concave function such t h a t f ( 0 ) = 0. If ( X ,d ) is a m e t r i c space, th.en f o d is m,etric, if additionally f i s positive o n R+.
Proof. We focus on a metric space, as the proof for a semimetric space is analogous. Let ( X ,d ) be a metric space. Since f(x) > 0 for positivc x and
100
The dissimilarity representation for pattern recognition
f(0) = 0, then f o d directly fulfills the positivity, reflexivity and symmetry constraints: see Def. 2.38. It suffices to prove the triangle inequality. Let d,, = d ( z .y), d,, = d ( y , z ) arid d,, = d ( z , 2 ) for any 5 , y, z E X . Assume that d,, + d,, 2 d,, holds for every triplet ( d z y ,d,,, d Z z ) . Since f(d,, + d,,) 2 f ( d Z Z )and ,f is non-decreasing, one suffices to show that f(d,,) f(d,,) 2 f ( d , , d,,). The inequality f ( d x , ) f(d,,) 2 f(d,,) is then straightforward. Based on the concavity of f , one has that f ( a t (1 Q ) U ) ~ o l f ( t ) + ( l - o l ) f ( u ) f o r a l l a ~ [ O , l ] a n d a l l t0. , uL~e t o l = *, u = d,, d,, arid t = 0. We have:
+
+
+
+
+
(3.1)
Also, . f ( d y Z )2 f(&,+d,z). which finishes the proof.
Hence, f ( d x , )
+ f ( & ) 2 f(dxy+dyz), 0
f is required to be non-decreasing and concave. In fact, instead of concavity, the subadditive property of f is needed, i.e. f ( x 1 x2) 5 f ( q ) f ( x 2 ) for all x1 and z2 iri the donlain o f f .
+
+
Corollary 3.1 L e t ( X ,d ) be a m e t r i c space. T h e n ( X ,f o d ) i s also m e t r i c f o r t h e following f u n c t i o n s f :R: +: R ( 1 ) f(x) = c z , c > 0 . (2) f(.) = z C Z ( ~ > O ) , c>o. (3) f ( z ) = rniii{c,x}, c>O. (4) f(x) = z7‘>O
0 .
+
(G) f ( z ) = sigm ( x ) = 2 / ( 1 +exp($))
(7)f(.)
-
1, a>O.
log(l+z).
Proof. It is easy to verify that all the functions above are rrionotoriically growing and f ( 0 ) = 0. Concavity of f ’ s are easily proved by showing that f ” are n e p t i v e for x > 0 ; see Theorem A.l. 0 Theorem 3.8 (Blumenthal) L e t ( X , d ) be a m e t r i c space and let f T ( z ) = xT be a m e t r i c t r a n s f o r m w i t h r E ( 0 , $1. T h e n d‘ = f T o d is metric and any f o u r p o i n t s of (X,d‘) can be isometrically embedded in a Euclidean s p c e [Blumenthal, 19361.
Characterization of dissim,ilarities
101
Note that for any metric, every three points can be isometrically embedded in a Euclidean space; which follows from the triangle inequality: see also Sec. 3.4. The above theorem explains that the power transformation f T ( x ) = xT with 0 < T 5 $ makes the metric ‘inore’ Euclidean, sirice the &-embeddability holds for any four points. Corollary 3.2 Let d, be the !,-distance
and p E ( 0 , l ) . T h e n the space
(Rm,dF) is metric. Proof. Assume that x, y , z E R“. Note that dE(x,y) = Czl I:ci - z / i ( ” . It suffices to show that the triangle inequality holds, since (R”,d,) with p~ ( 0 , l ) is quasimetric, as shown in Example 2.7. Based on the Minkowski m inequality4, lai bilp 5 (ail?’ (bilJ’,we have:
+
+
+
The latter inequality is equivalent to stating that dF(x,z ) 5 d:(x, y) 0 dg(y,z). Hence, d; is metric. Remark 3.1 Since ( R m dg), , with p E ( 0 , l ) is a metric spece, then based o n Corollary 3.1, ( R m , d g ) with r E ( 0 , p ] is a metric space as well. Definition 3.12 Let
G denote a set
of functions g ( z ; p ) of one real nonnegative variable z and one real parameter p > 0 such that g ( 0 ; p ) = 0 and g ( x ; p ) is a continuous strictly increasing function of 2 . Moreover. g is such that for m y x > 0 and any real E > 0, there exists a such that p 5 a + jg2(x;p)- I ] 5 E [Courrieu, 20021.
4 are [Courrieu, 20021: with s u p s < 00. Let E < 1. Then a > 0 for
Example 3.1 Examples of the functions from 0
0
Power function: g ( z ; p ) = ICJ’ z = 1 and a = [log (1 sign(z - I ) ~ ) ] / ( 2 l o(x)), g otherwise. Weibull function: g ( z ; p ) = 1 - exp ( -:cr/p) with T > 0. Let & < I . Then a = - z T / l o g ( l - fi).
+
Theorem 3.9 (Courrieu) Let ( X ,p ) be a n n-element finite quasirnetr.ic space. Let be a set of functions as in Def. 3.12. Consider a funxtaon g E 8. T h e following statements hold: 4The proof of this inequality is as follows. Let a 2 b 2 0 and p E (0,1]. Let -1 g(z) = (1 + cf). with c E (0,1] and z 2 1. g’(z) = -5 log(1 + c 5 ) log(c)g(z) 2 0, so g is non-decreasing. g takes the minimum value of (1 + c ) a t z = 1. For z =
I and
c = $ < 1 , ( a p + b P ) ~ = a g ( ~P ) >- a g ( l ) = a + b h o l d s . H e n c e , ( a + b ) p < u p + b ~ ’ .
102
T h e dassamilarity representation f o r p a t t e r n recognition
( I ) There exists a real
n(X)> 0 s u c h that f o r a n y positive
p
2 a ( X ) , the
space ( X .g ( p ; p ) ) is %sometricallyembeddable in a Euclidean space Rk of t h e d i m e n s i o n I; 5 n- 1. (2) T h e r e exists a real p(X)s u c h t h a t 0 < p ( X ) 5 a(X)and for a n y posit%uep 5 p(X)the space ( X , g ( p ; p ) )i s isometrically embeddable in a Euclidean space Rk of a dim,ension k = n- 1 [Courrieu, 20021. Thc above theorem explains that any finite quasimetric space can be transformed into a Eiiclideari space by a suitable function g ( . ; p ) . It does not: however, explain how a proper parameter p can be determined. This depends on the set X and cannot be captured by a general formula. In practice, p < 1. For p approaching zero, the quasimetric space resembles more arid more a discrete metric space, defined as d ( z , y) = Z(x # y). This means that thc structure in the data is weakencd in the embedded Euclidean space, since the points move towards the corners of an equilateral polytope. Still for any p > 0, some structural information is present.
3.3.2
Direct product spaces
Direct product spaces allow one for a construction of' a new space by combining two (or more) spaccs; see also Sec. 2.4. In the context of generalized metric spaces, this means that given two (or more) such finite spaces describing the same objects, a new dissimilarity measure can be created by their summation or by inaxinium operator. Now: some conclusions can be drawn for the combined spaces.
Theorem 3.10 (Metric direct product space) L e t (X,dx)and (Y.d y ) be m e t r i c spaces. The direct product space ( X x Y .d x d y ) , defined as (dx.dY)((lC1:y1),(xzrY2)) = dx(~llm).dY(Yl,Y2),w h e r e ~ 1 , x z E X and y1 , y2 E Y , and i s either t h e s u m o r m a x operator, i s m e t r i c .
Proof. The proof is straightforward by checking the conditions of 0 Def. 2.38. Theorem 3.11 L e t (X,px) and ( Y , p y ) be m e t r i c spaces. Suppose that t h e direct product space (XxY, px @ p y ) i s defined s u c h that ( p x @ ~ ~ ) ( ( ~ 1 ~(YI,Y~)) 1 ~ ~ 2 = ) , p x ( ~ i , x 2 ) + p y ( y iy2) 1 x1,m E X and y1,y2 E Y . T h m t h e space ( X X Ypx , @ p y ) is
(1) !,-embeddable (f ( X ,px) and (Y, py) are !I-embeddable. (2) i s &permetric $7' ( X ,px) and (Y, p y ) are hypermetric. (3) i s of n,egative type if ( X ,p x ) and (Y, p y ) are both of negative type.
103
Characteritatton of dissimilarities
Proof. Let Z = X X Yand p = px @ p y . (1) ==+ Assume that (2, p ) is (1-erribeddable. Then p ( ( x l ,yl), (xz,;y2)) = p x ( s 1 , x z ) 0 PY(Y1,YZ) 0 = px(z1,xz) p x ( G ! > x 2 ) pY(Yl>:Y1) PY ( ~ 1W,Z ) = P ((xi,N ) , (W , ~ ) ) + p ( ( yi, m ) ,(xz,yz)). Consequently, (2, p) is ( X X { ~ z } , p xx )({yi}xY,py), which is equivalent to ( X , p x ) x (Y,py). Hence, ( X ,px) and (Y,p y ) are (1-embeddable.
+ +
+
+
+
+
Assume that ( X , p x ) and ( Y , p y ) are el-embeddable. Let, 4x and denote the L1-embeddings of the spaces ( X , p x ) and (Y,py), correspondingly. Then the embedding 4 of ( 2 , p ) into el can be obtairied by d x , Y) = [4x(x) 4~ ( v ) ] . Since p = px @ p y , then p ( ( ~ 1 ~ x 2(yl ) . y2)) = +==
4y
Px(Z1.22)
+ II4Y(Y1) [ # ) x ( x 2 )4y(!h)/Ii = ll4(a,w) ~(z~.Y~)III.
+ PY('Y1,YZ)
iI[d)x(xi) b ( Y i ) ]
-
= II#)x(Z1) - 4x(22)111
-
$y(y2)I11 =
Hence,
-
( X X Y p, ) is ti-enibeddable. (2) ==+ Let (2, p ) be hypermetric. This means that for y1 E Y , (Xx{yl}, p ) is hypermetric as well. Then p ( ( x l , y l ) ,(z2,yz)) = ~ X ( : ~ . ' I , J : Z )so , (X,px) is hypermetric. The same reasoning holds for (Y,p y ) . +== Let (X,px)and ( Y , p y ) be hypermetric spaces. Define z E Zz as a function of (x,y) E X X Y to satisfy C(zl,yl)EZ z(x1,yl) = 1. Define u E Zx arid u E Zy such that u(z1) = C y l t Y z ( ~ l , y land )
4Yl) =
C z l t X z(x1,Yl). Consewntly, CZIEX 4 x 1 ) = CYIEY 'C'(V1)
Then
C ~ ( z l ; z 2 ) E X A ( y 1 . Y 2 ) E Y4}Z 1 , Y l )
4 x 2 . Y2) P((Z1, za), ( Y l .
y2))
C(z,,z2)€X 4 x 1 ) U ( Y l ) P X ( Q , Y 1 ) + C ( Y l , y 2 ) E4yY 1 ) 4 Y 2 ) P Y ( ! l 1 , Y 2 ) Hence, be Def. 3.8, ( 2 , p ) is hypermetric. (3) The proof is similar to the one above.
= =
5 0.
0
If X = Y, then the above theorem states that p x @ p y , the summation of distances, preserves the tl-embeddability, hypermetric property and the propcrty of being of negative type. Let 2 = X X Y . We also know that if (2, p ) is of negative type, then (2, p 4 ) is tz-embcddable by Theorem 3.13. This means that if (2, p x ) and (2.p y ) are tz-embeddable, then thc space (2, (p; &)a) is also tz-embeddable.
+
3.3.3
Invariance and robustness
Invariance is an important issue for designing informative dissimilarity measures. It may be studied in the context of invariant pattern recognition or for matching purposes [Rodrigues, 2001; Hagedoorn and Vcltkamp. 1999a,b]. Here, we will briefly discuss it in terms of (semi)metric properties.
104
The dissimilarity representation for pattern recognition
Definition 3.13 (Invariance) A semimetric space ( X ,d ) is invariant under the transformation group 7 if it is invariant for each element t from 7, i.e. d ( t ( z )t,( y ) ) = d ( z , y) holds for all z, y E X . By the definition of a group, Def. 2.50, the identity transformation id7 belongs to 7 . Also, for each t E 7 ,there exists the inverse transformation t-l E 7 such that t o t-' = t-' o i = id7. Consequently, d ( t ( z ) , y ) = d ( z l t p ' ( y ) ) and d ( z ,t ( y ) ) = d ( t - ' ( z ) , y ) . For example, if 7 is a group of rotations and translations, then Euclidean distance is invariant under 7. The identity is the zero rotation (or the zero translation). The city block distance is not rotation invariant, although it is still translation invariant. Theorem 3.12 (On invariant measures) Let 7 be a transformation group in a semimetric space ( X ,d ) such that ( X ,d ) is invariant under 7. Then
(x,d T ) is a semimetric space, where d 7 ( z , y) = inftGT d ( t ( z )y) , (2) vitl,tzE7 d7(t1(z),t2(u)) = d T ( z , y ) . (1)
Proof. (1) Note that d7(z,y) 2 0. Then 0 5 d7(z,z) 5 d(idl(z),z) = 0 and the reflexivity constraint is fulfilled. Since d is symmetric and invariant under 7 ,then d7(z, y) = i n f t E i d ( t ( z )y) , = i n f t c l d(y, t ( z ) )= infit,7 d ( t - ' ( y ) , z ) = inft,,7 d ( t ' ( y ) , z ) = d7(y,x). Hence, d7 is symmetric. Let t1,tz E 7. Since d is semimetric and invariant under 7, one can write d((t1 o t 2 ) ( z ) , z ) = d(tl(z),t;l(z))5 d ( t l ( z ) , ~ ) d(g,t,'(z)) = d ( t l ( z ) , y ) f d ( t 2 ( y ) , z ) .Denote t = t l o t z . Then d 7 ( z , z ) = i n f t l , t 2 a -d((t1 0 t2)(z),z ) 5 inftl,t2Eid ( t l ( z ) ,y ) + d ( t ~ ( y ) z: ) 5 i n f t l E i d ( t ~ ( z ) , y ) + i n f 'd(ta(y), t ~ ~ ~ z ) = d*(z,.y)+d7(;y, z ) , which proves that the triangle inequality is satisfied. Hence, d7 is a semimetric.
+
(2) Consider any t l ? t 2€ 7 .d 7 ( t l ( z ) , t 2 ( y ) )= inft,gd((t o tl)(z):t2(y)) = i n f t c 7 d ( ( t z 1 0t o t l ) ( z ) , y ) = inft!€.rd(t'(z),y) = d7(z,y), wlicre t' = t;1otot1.
0
This theorem has a practical implication when objects defined by a set of points or by contours need to be compared. For instance, this is useful for objects in digitized images. It justifies the common sense reasoning of computing the distance between two objects as the value of a smallest mismatch. One of the objects is usually transformed to match the other in the best possible way. If there are invariant transformations, then the semimetric properties of the distance measure are preserved; see also Sec. 5.5.
Characterization of dassamilarities
3.4
105
Properties of dissimilarity matrices
This section discusses some properties of dissimilarity matrices with respect to metric behavior, metric transformations and Euclidean erribeddings. Also corrections of dissimilarities, which impose the metric or Euclidean constraints, are discussed. This is done explicitly for dissimilarity matrices, since our learning algorithms will be later based on such finite representations. A substantial part of the presented theory comes from the work of Gowcr [Gower, 1986, 19821. Basic notions and facts of traditional matrix algebra, as well as of pseudo-Euclidean algebra (which is relatively new)? are collected in Appendix B. The reader should consult books on linear algebra, such as [Bialynicki-Birula. 1976; Greub, 1975; Lang, 2004; Noble and Daniel, 19881, if more details are needed. 3.4.1
Dissimilarity matrices
Here, we will only discuss matrices over the field of real numbers. Concerning the notation, e3 E is a standard basis vector, i.e. e j = 1 and ei = 0 for i # j , 1 denotes a vector of all ones and 0 stands for a vector consisting of all zeros. Moreover, if a vector representation X is mentioned, we will follow the convention from pattern recognition, where n vectors in
RnLare placed in rows of an nxm matrix X , i.e. X
=
[
xq . The consex:
quence of this convention is that the Gram matrix G, Def. B.3, becomes now G = XXT. This notation is a bit unfortunate for matrices in pseudoEuclidean spaces, as the adjoint operator is not the transpose. However, to be consistent with the description of learning techniques, in both Euclidean and pseudo-Euclidean spaces, a vector representation X will be meant as an n x n i matrix of 71, transposed vectors in R”’. The Gram matrix in a pseudo-Euclidean space R(P.4) becomes G = XJp,XT. Consider an n x n dissimilarity matrix D = ( d i j ) and an n x ’ n similarity matrix S = ( s t 3 )for i , j = 1 , .. . , n . In all discussions below, we assume that D is nonnegative and has a zero diagonal. D = D ( X , X ) is now undcrstood as corresponding to a finite generalized metric space ( X ,d ) , X = { z i , z 2 , . . . , z n } , such that the elements of D are d,, = d ( q , z j ) , i, j = 1,2 , . . . , n. In this book, an R X R distance matrix D is called Euclidean if the elements d i j of D are Euclidean distances between pairs of vectors in some real vector space. Note that in the literature, other authors may call D*’ Euclidean.
T h e dissimilarity representation for p a t t e r n recognition
106
Recall from Def. 2.45 that premetric obeys the reflexivity and symmetry constraints, and quasimetric additionally fulfills the definiteness axiom. If the triangle inequality is satisfied, then a quasimetric becomes metric. Semimetric is a metric without the definiteness axiom.
Definition 3.14 (Metric for D) Let D be a symmetric dissimilarity matrix with positive off-diagonal elements. D is metric if the triangle inequality dij d j k dik holds for all triplets ( i , j , k ) .
+
>
Remark 3.2 L e t D be a s e m i m e t r i c . (1)
I l f d i j = E,
(2) If
d,ij = 0 ,
then then
ldik -
d j k / 5 E for a n y k . for a n y k .
dik = d j k
Pro0f.
+
+
(1) By the triangle inequality, d i k d i j 2 d k j and d j k d j i 2 d k i hold for any k . Based on the symmetry condition and dij = E , one obtains: dik E 2 d j k and djk E 2 d i k , which after a simple transformation, gives dik E 2 d j k 2 d i k - E and finally ldik - d j k / 5 E .
+ +
+
(2) Trivial, by the same reasoning as in (1).
0
These properties of metric dissimilarities are important from a practical point of view. If two objects are similar, which means that the dissimilarity between them is small (close to zero or equal to zero), then any other object will have a similar relation to them both. As a result, one of them niay become a prototype to represent the information of both of them. This property is used for approximate nearest neighbor searches in a Euclidean space; see e.g. [Moreno-Seco e t al., 20031.
Remark 3.3 D i s a m e t r i c i f every triplet is Euclidean. Any metric triplet ( d i j , d i k , d k j ) is Euclidean, as it constitutes a Euclidean triangle. However, if n > 3 , not every n x n metric distance matrix D has a Euclidean representation. A counterexample is a 4 x 4 matrix D shown in Fig. 3.3. D can be isometrically embedded in 15: but not in l 2 , so no Euclidean representation of D exists.
Corollary 3.3 I f D i s quasimetric (the triangulation inequality does n o t hold), t h e n t h e m a t r i x D' = D f c (llT-I ) , c 2 maxp,q.rId,, d,, - d,,l i s m>etric.
+
Proof. It suffices to show that D' fulfills the triangle inequality, since other properties can easily be checked. Let ( i , j ,k ) be a triplet for which the triangulation inequality does not hold, i.e. dij + d j k < d i k . Since c 2
Characterization of dissimilarities
107
I
I
0
3
3 1.6
J
3
0
3 1.6
K 3
3
0 1.6
1’5/J -0.1
L 1.61.6 1.6 0
(c) Il representation
(b) No Euclidean representation
Figure 3 . 3 Example of (a) metric distances with (b) no Euclidean representation and (c) a possible el representation. In order to get a 2 0 or 3 0 Euclidean representation, = &f. the distances from the point L to other points should be at least equal to They are smaller, so no Euclidean embedding exists.
59
+
rriaxp,q,rl d p q d,, - d,J, then (djk c) prove that (dij + c ) dij djk 1di.j d j k - d i k l = c is nonnegative, then ( d i j +c) proof.
+
+
+
+
+
C
2
2
(dik
+ djk + c). Note
NOW,we should that d i j djk c 2 d i k , since Iz/ = -z for z < 0. Because ( d j k + c ) 2 ( d i k + C ) , which finishes the ldij
-
+
+
+
If c is relatively small, then the dissimilarity matrix D is only slightly non-metric. If? however, c is large, then the analysis should take into account its non-metric properties. The triangle inequality is the most burdensome to check; in the worst case scenario, all the triplets need to be investigated. In practice, c is not the smallest possible value that will make D metric. Having estimated c, one may follow a sort of bisection rnethod with a specified precision to determine a smaller value C’E (0, c).
Remark 3.4 An i m p o r t a n t question refers t o t r a n s f o r m a t i o n s of a m e t r i c dissimilarity s u c h t h a t t h e m e t r i c properties are preserved. F r o m Theor e m 3.7, w e kn,oiii t h a t if D i s m e t r i c , th,en so i s D f = ( f ( d i g ) ),for f being a non-decreasing and concave f u n c t i o n s u c h t h a t f ( 0 ) = 0 and f (x)> 0 f o r 2 > 0 . Consequently, if D = ( d i j ) is m e t r i c , t h e n f o r c > 0 t h e follouiing dissimilarity matrices also are: ( c d i j ) , ( d i j + c ( I - & j ) ) , (min(1, d i j } ) , ( &23. ) with r E ( 0 , I], ( d i j / ( d i j c ) ? (sigm ( d i j ) ) , (log(l+ d i j ) ) ; see Corollary 3.1.
+
Below we present some results, mostly related to the Euclidean behavior of a distance matrix and its vector representation. A more thorough explanation can be found in Sec. 3 . 5 .
Definition 3.15 (Euclidean behavior) An R X R distance matrix D = ( d i j ) is Euclidean if it can be embedded in a Euclidean space (IFn:d 2 ) : where 7n. 5 n. In other words, a configuration {XI x2,.. . , xn} m n be determined in IwVL such that da(xi,xj) = llxi - xjllz = di,?for all :‘.;j.
108
T h e dissimilarity representation f o r p a t t e r n recognition
Theorem 3.13 (Test I for Euclidean behavior) A symmetric n x n niatrix D with a, zero diagonal is Euclidean a# D*’ = (d:j) is conditionally negative definite (cnd). This means that zTD*’z 5 0 holds for all vectors z E R” such that zT1 = 0 . Equivalently, a symmetric n,xn matrix D with a zero diagonal is Euclidean i;rf -D*2 is conditionally positive definite (cpd). Proof. Let 0”’ be a square Euclidean distance matrix. This means that there exist vectors XI,x2,. . . , x, in an m-dimensional Euclidean space R” such that d:j = I/xi-xJ’. Let gEPSTLconsists of the elements gi = I(xi11’, i = 1 , 2 , . . . n. Then,
=
2zTlgTz - 2 IIzTXI1’.
Note that IIzTXI1’ 2 0 and zT1gTz2 0, as l g T is a psd matrix. Since zTD*’z = 2(zT1gTz- I/zTX1l’), then to assure that z ~ D * ~ 5 z0, one should require that zT1 = 0. Hence, D*’ should be cnd. This finishes the proof. El
Theorem 3.14 (Test I1 for Euclidean behavior) [Gower, 19861 An distance matrix D is Euclidean iff the matrix D:’ = J,D*’J,: with J , = ( I - I s T ) is negative semidefinite (nsd) for sT1 = 1. Equivalently, D i s Euclidean ifl the matrix S;’ = - $ D , is positive semidefinite (psd) for sT1 = 1. 11x71
Proof. ==+ For any x E R”,the vector z = ( I - IsT)x is orthogonal to 1 , i.e. zTl = xT(I- slT)1 xT1 - xT1sT1= xT1 - ( ~ ~1 1= )0. Then by Thcorem 3.13, zTD*’z 5 0; which yields xT[(I- IsT)D*’(I- s l T ) ] x 5 0. This proves that 0:’is negative semidefinite.
+=
Assume be nsd. This means that 0 2 zTDi2z for zT1 = 0. One has zTD;’ z = zT[(1- I s T ) D*’(I- slT)]z= z ~ D * -~ 2z zT1sTD*’+ zTlsTD*2s1Tz = zTD*’z. Hence is cnd and by Theorem 3.13, D is Euclidcan. 0
Remark 3.5 D is Euclidean ifl 0,“’ = J D*’,J is nsd. J = ( I - 11’) is knoiun as the centering matrix. This is a special case of Theorem ,?.14 .for
Characterization of dissimilarities
109
il.
s= Another special case holds for s = ei, where ei ER" is a standard basis vector. Remark 3.6 Let D be an r i x n distance matrix. If Theorem 3.14 i s true f o r a particular vector s , e.g. s = $1, then it is true for any s s,uch that sTl = 1. Some further intuition, is presented in Sec. 3.4.2. Theorem 3.15 (Vector representation) Let D be an nx'ri Euclidean distance matrix. Then there exists m 5 n and a vector representation of the distances d,,j in Rm, defined b y the rows of an r i x m matrix X , such that X X T = -$(I- I s ~ ) D * ~ (slT)and IsT1 = 1.
-+
Proof. Indirectly, the goal is to prove that S,? = ( I - 1sT)D*'(I- slT) is a matrix of inner products (Gram matrix) of some vector representation X in Kim. Let h = 0 " ' s - $ lsTD*2s. It can easily be check that h is the diagonal of S,;see also Sec. 3.4.2. After straightforward calculations, we get S, = -;(I- I S ~ ) O * ~ (slT) I - = -L(D*' 2 - h l T - lhT). Since (e, - e,)Tl = 0, then hlT(ei ej) = 0 as well. Consequently, -
T 1 T *2 (ei - e,j) S,(ei - ej) = -; (e( - ej) D (e, - e J )
by the fact that D*' has a zero diagonal (& = 0 for any i ) . On the other hand, according to Def. 3.15, there exist a vector configuration X = [x:; x 2 . . . ; ] : x such that d:? = JIx,- x,llz. So, we ca.n w r i k d2. ag = ((XTei - X'ejiI; = ((XT(e, - e,)IJ," = (ei - e,)TXXT(e, - e3).
Since, d:j = (e,-ej)TS3(ei-ej), then X can be related to S,as S, = XXT. Note that the dimension m of X is determined by the rank of S,s. 0 Theorem 3.16 (Test 111 for Euclidean behavior) Let D be an n x n , non-zero symmetric matrix with a zero diagonml. D is Euclidean ifl has exactly one negative eigenvalue.
[
-:;'I',
Proof. This theorem and its proof follows directly from the consideratioris of on seniidefiriiteness of quadratic forms in [Chabrillac and Crouzeix, 19841. Given an n x n real symmetric matrix A, they show that requiring that zTAz 2 0 for all z such that BTz = 0 is equivalent to stating that the matrix [ has exactly r = rank negative eigenvalues. By Theorem 3.13; D is Euclidean iff D*2 is cnd, that is zT(-D*')z 2 0 for
tT t]
(a)
110
The dissimilarity representation for pattern recognition
all 0 = zT1 = l T z . By substituting A = -D*'
[ -:;'t] has exactly one negative eigenvahie.
and
I?
= 1, we get that
0
Lemma 3.2 If a n nxn symmetric matrix A is cnd, then A has at most one positive eigenvalue. Proof. A contrario. By eigendecomposition, A = Q AQT, where A = diag(Xi) is a diagonal matrix of eigenvalues in a non-decreasing order, XI 2 A 2 2 . . . 2 X, and Q = [ql q, . . . q,] is an orthogonal matrix of the corresponding eigenvectors. Suppose now that XI and Xz are both positive. Let Y c R" such that Y = span{q,, qz}. Then dim(Y) = 2, as the two eigenvectors constitute a basis. Note that y T A y > 0 for all non-zero y E Y ,since there exist a1 and 0 2 , where at least one is rion-zero, such that y = a1 q, azq, and y T A y = [a1q1+ mqJTQAQT[al q1+ azqzl = [DI CYZ 0 . . . 01 A [a1a2 0 . . . 0lT = a: A1 a; A2 > 0, since the set of eigenvec-
+
+
tors {qi}i=l.."is orthonormal. Let 2 = {z E R" : zT1 = O}. Then d i m ( 2 ) = n - 1. A is cnd if for all z E 2:zTA z .< 0. Y and 2 are subspaces of R" and there exist a I I U I I - X ~ ~ O X E Y n 2,since dirri(k') = 2 and dini(2) = n - 1. However, one has that xTA x > 0 holds, since x E Y and also xTA x 5 0 holds, since x E 2 and A is 0 cnd. Hence. a contradiction.
Remark 3.7 If an nxn symmetric matrix D with a zero diagonal is Euclidean, then, -D*2 ha.s exactly one negative eigenvalue. T h e rewe'we does riot hold. Proof. Assume that {Xi}:=.=, are the eigenvalues of -D*'. Since the trace is, on the one hand, a sum of the diagonal elements and, on the of -0"' other hand, the sum of eigenvalues, then t r (-D*2) = 0 = C:='=, Xi. It follows that -D*2 must have at least one negative eigenvalue. By Lemma 3.2 a n x n , cpd matrix has at most one negative eigenvalue. Hence, -D*2 has exactly one negative eigenvalue.
To provc that the reverse does not hold, consider a nonnegative symmetric matrix A = 3 o 1 . The cigerivalues of -A*2 are {9,0.217, -9.217}, so
[::I:
+
+
Since a13 a32 = 1 1< 3 = a12, then A cannot be Euclidean as the triangle inequality is not fulfilled. 0
-A*, ha,s exactly one negative eigenvalue.
Theorem 3.17 (Test IV for Euclidean behavior) [Crouzeix and Ferland7 i982] Let D be an n x n non-zero symmetric matrix with a zero diag-
Characterzzation of disszmilarities
111
onal. D i s Euclidean iff D*2 has one positive eigenvalue and there exists a vector z such that D*’ z = 1 and zT1 2 0. Theorem 3.18 (Constructing D from S) inite similarity matrix.
Let S be a positive serraidef-
(1) If S = ( s i j ) such that 0 5 s z j 5 1 and sii = 1, th,en, the dissimilarity 5’)”;is Euclidean. Also the matrix 0 2 = ( l l T - S ) matrix D1 = (llTis Euclidean. (2) The dissimilarity matrix D = ( d z 3 ) with, d i j = ( s i i ,?j3 - 2s,,)4 is Euclidean,.
+
If S i s not psd, then the corresponding ciissimilarity matrices shown aboii~ are not Euclidean, yet they can still be constructed. Proof. By Theorem 3.13, it is sufficient to prove that DT2 and D;2 are cnd for all z such that zT1 = 0. (1) Let zT1 = 0. Since DT2 = l l T -S . then zTDT2z= zT1lTz- zTSz = -zTSz 5 0. The latter inequality holds since S is psd, which means that zTSz 2 0 for any z. Hence, 0;’ is cnd. Consequently, D1 is Euclidean. Thanks t o 0;’ = l l T -2s S*’, we get z ~ D , *=~ -zT(2S z - S*’) z. Now. we need to require that zT(2S-S*’)z 2 0. which is equivalent t o (2S-S*2) being psd. See Gower [1986] for proof.
+
( 2 ) Let s = diag (S). Then D*’ = slT+IsT - 2s.Lct zT1 = 0. Consequently, zTD*’z = zTslTz zTlsTz- 2zTSz = 0 0 - 2zTSz< 0, since S is psd. Hence, D*’ is cnd. As a result, D is Euclidean. 0
+
+
Corollary 3.4 Let S = ( s i j ) be an n x n similarity matrix with the elements si3 satisfying 0 5 si3 5 1 and sit = 1 for i , j = I , 2,.. . , n. If the matrix D = ( l l T -S)*+is either non-metric or non-Euclidean metric, then S i s not psd. Moreover, if D = (llT-S ) is eith,er non-metric o r nm-Euclidean metric, then 2 s - S*2 i s not psd [Gower, 19861.
Theorem 3.19 (Correcting D to make it Euclidean) Let D be a nonEuclidean symmetric dissimilarity matrix and let S ( D ) = . I D J , where J = (IDenote Xmin as the smallest eigenvalue ofS(D*’) and A,
illT).
as the largest eigenvalue of the matrix
-+
where
is the zero matrix and I n x n is the identity matrix. Then D can be corrected
112
The dissimilarzty representation for pattern recognitzon
I&K
I5
d
K
J
Figure 3.4 Any non-Euclidean distances can be corrected to become Euclidean. Let D be the dissimilarity matrix from Fig. 3.3. Then the matrix = [ D * 2 + 2(llT-1)]*3 ~ is Euclidean for T 2 0.33 and the matrix D f ) = D IE. (llT I ) is Euclidean for n 2 0.3124; see Theorem 3.19. The plots present Euclidean embeddings of the corrected distances. Note that by using 7 = 0.33 and K = 0.3124, 2-dimensional representations are obtained; see plots (a) and (b), while for T = IE. = 0.5 the number of dimension increases; see plots (c) and (d).
+
~
such that the matrices D:') and DL') are Euclidean5 [Gower, 19861':
(1) D.I-" = [D*2+ 2 r ( 1 1 T - I ) ] * b . (2) D L 2 ) = D + ~ ( l l T - I ) , K ,
r
2 -Amin,
> > ~ ~ ~ .
Proof. (1) Assume that D is a non-Euclidean symmetric dissimilarity matrix. We will use Remark 3.5 to prove tjhat is Euclidean.
-+
illT)
02'
Let S, = (ID*2(I- ;1lT)and h = i D * 2 1 Then we find t h a t S , = -$, (D*z - h l T- lhT).Note that 1
& 1 lTD*'l.
1
.
diag (S,) = - - [diag (0"') 2
~
-
diag (hlT)- diag (lhT)]= ~-[0 -h -h] = h. 2
5There are mistakes (misprints?) in the formulation of this theorem in [Gower, 19861.
113
Characterization of dissimilarities
Therefore, D*2 can be expressed as
D*2 = h l T + l h T- 2 S, = diag ( S c )IT -t 1diag (S,)T
-
2 S,
+
Let S, = ( s Z j ) , then d:j = sii s j j - 2 s,] holds for all i ,j = 1, . . . , ri. From the latter equation follows that adding a constant ‘T to the diagonal of S, is equivalent to adding 27 to the off-diagonal elements of D*2. An eigendecomposition of Sc is given as S, = Q A&’, where A = diag ( A i ) is a diagonal matrix of the eigenvalues in a non-increasing order and Q is an orthogonal matrix of the corresponding eigenvectors. If S, is psd, then all eigenvalues are nonnegative, hence Amin 2 0. However, since D is nonEuclidean, then S, is not psd by Theorem 3.14. This means that there exist some negative eigenvalues, thereby Xlllirl < 0. Let ‘T 2 -Amin, where ‘ T > O if S, is not psd (note that 7 2 0 if S,: is pstl). Then A 71 is a nonnegative diagonal matrix. As a result, S, = Q [A 7 1 ] Q T i s psd. Note furthcr that S, = Q h Q T + Q r 1 Q T = S+7QQT= S, ‘TIand by the observation above, S, = S, 71 = D*2 2 ‘T (llT I). Since S, is psd, then by Remark 3.5, [D*’ 2 7 (ll -T I ) ] * $is Euclidean.
+
+
+
+
+
+
+
(2) Here, the smallest yi > 0 is sought such that D f ) = D 6 (llT - I ) is Euclidean. By Remark 3.5 this means that /c. should be chosen such that the smallest eigerivalue of S, = - a J ( D p ) ) * 2 J , where J = ( I is zero. Let q be the eigenvector of S, corresponding to the zero eigenvalue. Then S, q = 0. Since = D*2+2n0+rc2(11T-I) and J J = J , then after simple transformations, we obtain -$ ( J D * 2 J + 2 y i J D J - ~ ~g =50. ) Let p be a vector such that p = - k D * 2 J q or, equivalently, -6Jp = 2n.JDJq - K 2 J q = 0. After the diviJD*2Jq. One gets -nJp sion by 6 (remember that n > 0), -Jp + 2 J D J q = yiJq is obtained. Consequently, thanks to J J = J , one needs to solve the following sys-JD*’J ( J q ) = K, (Jp) which is equivalent t o tern of equations -Jp 2 J D J (Jq) = yi ( J q ) ,
illT).
+
+
O,,., -JD*2J -In X n 2 J D J
][ d [ ,]);?;:
value of the matrix
Jp
= yi
( : : :
. This means that
K
is thc largest eigen-
which finishes the proof.
Note that both corrections defined above yield different solutions, as illustrated in Fig. 3.4. In the first case, the correction of dissimilarities is linearly related to the corresponding matrix of inner products S, (see also
Th,e dissimilarity representation f o r pattern recognition
114
Theorem 3.15), such that S, case.
=
S,
+ T I . This does not hold in the latter
Theorem 3.20 Let ( X ,d ) be a finite metric space with, the associated dissinrilarity matrix D . Consider the assertions [Barthe'lerny and Gue'noche, 1991; Kelly, 1970; Hjort et al., 1998; Everitt and Rabe-Hesketh, 19971:
( I ) ( X ,d ) is ultrametric. (2) ( X ,d ) possesses the four-point property. (3) ( X ,d ) is Pa-embeddable. (4) ( X ,d ) is (1 -embeddable. (5) ( X , d ) is hypermetric. (6) ( X , d ) is of negative type. (7) ( X ,d i ) is (2-embeddable. The follo.uiin,y implications: ( 1 ) +
(3)
+ (4) + (5) + (6) + (7)hold.
Proof. (1) + (2) Ultranietric space is realized by an ultrametric tree, which is additive by Def. 3.9. Hence, the four point property is fulfilled. (1) + (3) The proof that every finite ultrametric space of n distinct points can be isometrically embedded into an (n- 1)-dimensional Euclidean space can be found in [Leniin, 19851 and the effective construction in [Fiedler, 19981.
(2)
+ (4) See Theorem 3.6 and also [Barthdemy and Guknoche, 19911.
(3) + (4) See [Bretagnolle et al., 1966; Critchley and Fichet, 19971 for proofs.
(4) + (5) Let X = {zI,z2, , z n } . Based on Theorem 3.5, it suffices to show that every cut metric, Def. 3.4, satisfies the hypermetric inequalities, Def. 3.8. Let V and V" = X\V define the cut. Then the cut metric &,(x7,zj) equals 1 if J V n { z i , z j } J = 1 and 0, otherwise. Let y E Zn such that yT1= 1. Then Ci C, yi yi & ( x i , x j ) = 2 CJgV yi y j = 2 ( C 7 , E VY * ; ) ( C j @ VY j ) = 2 ( C Z E V Y i ) ( l - CiEV3%) I 0. The latter inequality holds, since CzEv yi is an integer and, therefore, either both CzEV yi and (1 yi) have opposite signs or one of them is zero. Hence, the cut metric is hypermetric.
xiEv
xzEV
+
(5) (6) Let D be hypermetric. Then by Def. 3.8, y T D y 5 0 for y E 2" such that yT1= 1. Let x E R n be any vector. Then z = (I - lyT)X E W . Moreover, it is easy to check that zT1 = 0. Hence, by Def. 3.8, D is of negative type.
Characterization of dissimilaraties
115
(6) + (7) There is an equivalence of d being of negative type and D*2 being cnd. This implication is true based on Theorem 3.13. 0 Many dissiniilarity measures are constructed by combining the measure applied to all the attributes separately. Given m features (attributes), the dissimilarity can be expressed in the form of d(z,y) = CyLl f ( x l r y Y 7 ) . where f ( z J , z J=) 0 a nd f ( z , , yj) = f ( y J , x J 2 ) 0 for all ,j. Then we have:
Corollary 3.5 Let f i s metric in R.
2,
~ E B P .Then d ( z , y ) = Cy=lf ( z j ,y j ) is m,etr.%ci#
Proof. + Since f is nonnegative, symmetric and f (u, u)= 0 for 71 ER. then the axioms of reflexivity, symmetry and definiteness are fulfilled. Since d is metric, then d(.x. y) d(y, z ) 2 d ( z .z ) for all z, y, z. Consider z’. y. z such that xj = c,, yl = cy, zl = c, for all J and some constants c,,cy and ri. The triangle inequality for d reduces to f ( x C cyC) , + f(yc, z,) 2 j”(rr> zC)> hence f is metric. +=== Trivial. 0
+
EL,
Theorem 3.21 Let f : X x Y Rt be a function. d ( z , y) = f ( x f .y r ) is metric zflp(~:,y)= [ f ( ~ ~ , y i ) ] is ~ )metric + [Gower, 14861. --f
(cF=i
Remark 3.8 Direct product spaces allow us for a construction o,f a ne’ii) space b y combining two (or more) spaces; see also See. c3.3.2. Given a number of square dissimilarity matrices, a new dissimilariky matrix can be creded b y applying an elernmt-wise operator, szLch as sum or m,ax, t o them. For instance, D = D1 + Dz. I n the light of Sec. 3.3.2, this m,eans that finite generalized metric spaces are combined into a new one. The spaces are assumed to be defined on th,e same finite set X , yet they ure distinguished by the dissimilarities measures used. Now,the consequences from Th,eorems 3.10 and 3.11 and the mathematical induction are: The ma2 and sum operators preserve the metric properties. A square dissimilarity matrix, resulting from the summation of dissirnilaritg matrices preserves the t ,-embeddability, hypermetric a d negatiue type properties. If DT2 and Dg2 are o,f negative type, then the matrix (DT2+ D;’)*k is t2-embeddable. This follows from the preservation of the negative type property by summation and Theorem 3.20(7).
116
3.4.2
The dissimilarity representation for p a t t e r n recognition
Square distances and inner products
Assume a configuration X of n vectors {XI,x2,.. . , x,~}in a Euclidean space. d2(x,,xj)= (xi - xj,xi- xj) holds thanks to the definition of R Euclidean distance and an inner product. Therefore, &(Xi, X,j)= (Xi, Xi)
+ (Xj,Xj)
-
+ d2(Xj,0)
= d2(Xz:0)
2(Xi,Xj) -
2 (Xi,Xj),
(3.2)
where 0 is the origin in this space. Conscquently, 1
( X i ,Xj) = -2 (&Xi,
Xj) - &(Xi,
0) ~
d2(Xj;0)).
(3.3)
Based on the well known properties of inner products arid the above formula, the square distance &(xi, X) of xi to the mean sf of the configuration X can be expressed by the square distances as follows: d 2 ( X i , X ) = I/Xi
= (Xi-
~
= d 2. %.
x:xi
-
X) = (Xi,Xi) + (x,x) - 2 (Xi,%)
- -d2 1
(3.4)
2"'
where, abusing somewhat the notation, d;, stands for the mean computed over the 1:-th row of the matrix D*2 and d2, is the overall mean [Torgerson, 1967; Goldfarb; 19851. Let 11s assume, without loss of generality, that the mean vector coincides with the origin, i.e. X = 0. This implies that d2(xi,0)= d2(xz,%). By conibiiiirig this with Eqs. (3.3) and (3.4), one gets
1 rr s = l "'
d2(Xi:Xj) ~
- Cd2(Xz,Xs)
(3.5) n
s=l
p,s=l
Characterization of dissimilarities
117
valid for all i , j = 1 , 2 . . . .,n. Let X E R n x k be a representation of all vectors (x: is the i-th row of X ) and let G be the matrix of inner products, i.e. G = X X T . such that gtJ = (x,,xJ).Eq. (3.5) simplifies to
(3.6) Let D*2 be an n x n square Euclidean distance matrix. By substituting:
1 dz, = - 1D*21T
n2
and after straightforward calculations, G becomes
Alternatively, by Eq. ( 3 . 2 ) ,D*2 can be defined by the Gram matrix G as D*'
= glT+lgT-
2G,
(3.8)
where g is a vector of the diagonal elements of G, i.e. g = diag(G), or g = (G * 1)1,where * is the Hadamard product. In this way, an explicit linear relation between the Gram matrix G and the matrix of square Eiiclidean distances 0*' is expressed. The assumption on a zero mean of the configuration X is not essential, since the configuration can be shifted such that the origin coincides with any other vector lying in a convex hull of X . This means that instead of ST = XTkl = 0. one requires that XTs = 0 with sT1 = 1. As a result, J from Eq. (3.7) becomes J = I - 1sT arid in a bottom-up way. we have reached Theorems 3.14 and 3.15, and Remark 3.6. The same reasoning holds for a pseudo-Euclidean space R(F',q) , because of the linear relation between square pseudo-Euclidean distances and the corresponding indefinite inner products. Therefore, Eq. ( 3 . 5 ) is valid for a pseudo-Euclidean space, where an indefinite inner product . ) E , defined by Eq. (2.1), is used instead of (., .), and a pseudo-Euclidean distance Eq. (2.2) is used instead of the Euclidean distance. As a consequence, the matrix of indefinite inner products G becomes G = X [ - 7 ~] XT. This leads to the conclusion that Eqs. ( 3 . 7 ) and (3.8) remain true for a pseudo-Euclidean space, as well. Further discussion follows in Sec. 3.5.3. ( 3 ,
118
3.5
The dissimilarity representation f o r pattern recognition
Linear embeddings of dissimilarities
Dissimilarity data can be embedded into a Euclidean space in a number of' ways. Since we are interested in a faithful configuration, an embedding is found such that the distances are preserved as well as possible. Linear embeddings are considered, the isometric ones first and then their approximate variants. Since it is not always possible to isometrically embed the data in a Euclidean space, a pseudo-Euclidean space will be used as well. From such a perspective, any finite prenietric space can be isometrically embedded in a pseudo-Euclidean space.
3.5.1
Euclidean embedding
Consider a set R of n objects, R = { p l , p 2 , . . . , p n } . These objects may not be yet represented for the use of computer algorithms, therefore, you think of R as of' an index set. X , on the contrary, is a representation of the objects froni R in a Euclidean (pseudo-Euclidean) space R k . Hence, it is a set of vectors {XI,x2,. . . , xn} in this space. Given a Euclidean pairwise distance matrix D = D ( R ,R ) E I R n x n between the objects of R, a distance preserving linear mapping into a Euclidean space can be found. Such a projcction is known as classical scaling (CS) [Young and Householder, 1938; Cox and Cox, 1995; Borg and Groenen, 19971. The dimension k , k 5 n and a configuration X E R n x k have to be determined such that the (squared) Euclidean distances are preserved. Note that when one configuration is found. any other can be created by a rotation or a translation, as the Euclidean distance is translation- and rotation- invariant. Without loss of generality, the projection is constructed such that the origin coincides with the mean vector o f t h e configuration X . To determine X , the relation between the Euclidean distances and inner products are used. We know from Sec. 3.4.2 that D*' = g l T l g T 2 G , where G is the Gram matrix of the underlying configuration X, G = XX', arid g = diag (G). G can also be expressed as G = JD*'J, where J is the centering matrix J = I - ~ l l T E R n X nJ . projects6 the data such that the final configuration has a zero mean vector. Then the factorization of G by its eigendecomposition is found as
+
~
-;
G=QAQ~,
(3.9)
6A more general projection can be achieved imposing that a weighted mean of X becomes zero; see also Sec. 3.4.2. Then J = I - 1sT, where s is such that sT1 = 1 and G = - J D*' JT. By choosing a proper s , any arbitrary vector of X can be projected at the origin, as well.
Characterrzation of dassamilai~ties
119
where R is a diagonal matrix with the diagonal consisting of nonncgative eigenvalues (G is positive definite by Theorem 3.14) ranked in descending order arid followed by the zero values. and Q is an orthogonal matrix of the corresponding cigenvectors; see Theorem 3.15. G is a niatrix of inner products, so G = XXT. As a result. given k , k 5 n,non-zero cigerivalucs. a k-dimensional representation X is determined as
x = Q~ A;,
(3.10)
where Q k E W L Xisk the rnatrixlof k leading eigerivectors (i.c. corresponding to k largest eigenvalues) and A$ € p S k x k contains the square roots of the coiresponding eigenvalues. This is the result of classical scaling. Note that X determined in this way is unique up to rotation (the centroid is now fixed). since for any orthogonal matrix U , XXT = X U UTX = ( X U )(XU)T. Also the features of X are uncorrelated. since the columns of Q k are orthonormal. The estimated covariance matrix of X becomes then
Sincc C is a diagonal matrix, the vector configuration X is equivalent to the Principal Component Analysis7 (PCA) result as shown below [Fukunaga, 1990; Duda e t al., 2001]. Moreover, the eigcnvalues of G play a key role here as they linearly scale the resulting features (basis eigenvectors) and, therefore, they decide which of them are significant and which not. Note t,liat an uncorrelated vector representation X is obtaincd only for J = (Iwhich corresponds t o s = i l in Theorem 3.14 for J = ( I - IsT). Only then the vector mean of X is set to the origin. This justifies why this particular s is in favor. Note also that G is a reproducing kernel for the Euclidean space R k ;see also Sec. 2.6.1. This means that the erribeddirig procedure can be directly performed on a positive serriidefinite matrix G, hence a kernel, treated as a similarity matrix.
illT),
Proposition 3.1 (On the equivalence between the PCA and CS results) Assume u vector configumtion Y in a Euclidean space Rk and a Euclidean distance m a t r i x D ( Y , Y ) . Let t h e m e a n vector of Z lie a t the origin. T h e n t h e classical scaling resudt, Eq. (3.10), i s equivalent to t h e PCA projection based o n th,e estimated covariance m a t r i x of 2 , c o i i ( 2 ) =
12%. n-1 7More details on PCA can be found in Appendix D.4.3.
120
The dissimilarity representation for pattern recognition
o Adding2.r
0.12 0.1 ** ~
o.08-o * 0.061, f;
i
O 04/. 0.02
1
Figure 3.5 Eigenvalues resulting from the embedding of a 100x100 modified Hausdorff dissimilarity representation D of the NIST-38 digit data (described in Appendix E.2) and the corrected representations Dzr and D,. The eigenvalues are sorted according to their magnitudes, so the order of eigenvalues coming from different embedding is different. Remember that 7 =,I,A,I, where, , ,A is the smallest eigenvalue. Adding 27 t o the off-diagonal elements of D*2 is equivalent t o adding 7 t o all the original eigenvalues (hence the smallest one becomes now zero). Eigenvectors remain the same. The relation between original eigenvalues and eigenvalues of D , is nonlinear.
Proof. We start from an nxn distance matrix D and the (Gram) matrix G= JD"2.J. Let (Az, 9,) be an eigen-pair of G. Then YYTq, = A,q, and further by the multiplication by &YT, one obtains &(YTY) (YTq,) = "-(YTq,), n- 1 which is further equivalent to cov (Y) (YTq,) = &(YTq,). Drfine qrCA = YTq,/&, z = 1 , 2 . . . . , n. It is straightforward to check that the vectors qfCA4 are orthonormal. This means that qrcn)is an cigen-pair of cov(Y). The solution of the PCA projection, xrCAis given as x, = Y q y A = YYTq,/&. which is equivalent to x, = Aq,.In the
-+
(5,
1
matrix iiotation. X = Q k A f , which is the classical scaling result. 3.5.2
0
Correction of non-Euclidean dissimilarities
The Gram matrix, G = -$ JD*2J , is positive (semi)definite (pd or psd) if the dissimilarity matrix D E EXnxn is Euclidean; see Theorem 3.14 and Remark 3.5. Therefore, a non-Euclidean dissimilarity matrix D gives rise to an indefinite Gram matrix (which is not psd). Since G has negative eigenvalues, then D cannot be isometrically embedded into a Euclidean space. Algebraically, the Euclidean representation X cannot be constructed by Eq. (3.10) as it relies on square roots of the eigenvalues. However, D can be corrected such that the corresponding G becomes psd. Possible approaches to address this issue are:
Characterization
0
0
of
dissimilarities
121
Only p positive eigenvalues are taken into a y o u n t , resulting in a pdimensional Euclidean configuration X = Q , A;, p < k . Since the actual dissimilarities D are nonnegative, the magnitude of the smallest negative eigenvalue of G is smaller than the largest positive one. Also the slim of the positive eigenvalues is larger than the sum of magnitudes of the riega,tive ones. Hence, after neglecting the negative contributions, the resulting Euclidean distances are overestimated. This might be a justified approach if the negative eigenvalues are relatively small with respect to the positive ones; see Sec. 3.5.6, where the issue of noise influence is discussed. We argue that the distances which are directly measured may be noisy and, therefore, not perfectly Euclidean. This will result in small negative eigenvalues of G. Therefore, noise is diminished, when they are disregarded. By Theorem 3.19, there exists a positive constant T 2 - A m i n , where Amin is the smallest (negative) eigenvalue of G, such that the distance matrix Dz, = [D*2+ 2 7 (llT - I ) ] * $ is Euclidean. This means that the corresponding Gram matrix G, is pd. The eigenvectors of G and G, are identical, but the value 7 is added to the non-zero eigenvalues, giving rise to a corrected diagonal eigenvalue matrix A, = A k TI. This is equivalent t o ‘regularizing’ the covariance matrix of our final configuration X by C = (A, 71) and changing X , respectively. Note that the original dissimilarities are distorted significantly if T is a large value. See also Fig. 3.5. There exists a positive constant 6 2, , , ,A max, where, , A max is defined in Theorem 3.19, such that D , = D K (llT - I ) is Euclidean. After the correction of D , the corresponding Gram matrix G, yields eigenvalues and eigenvectors which are different than those of G. See also Fig. 3.5. By Theorem 3.9, there exists a parameter p for a function g defined as in Def. 3.12 such that the matrix D, = ( g ( d i , j ; p ) )is Euclidean. In practice, p is determined only by trial-and-error, although p < 1, in general. In principle, an indication of the value of p is given by (1- T ) , where T is the ratio of the absolute value of the smallest negative eigenvalue to the largest positive one. An algorithm to determine p is also proposed in [Courrieu, 20021.
+
+
0
0
+
These approaches transform the original dissimilarity data such that a Euclidean configurat’ioncan be found. This is especially useful when the negative eigenvalues are relatively small in magnitude, which suggests that the original distance measure is nearly Euclidean. In such cases, the negative eigenvalues can be interpreted as noise contributions. If the negative
122
The dasszmalaraty representataon f o r p a t t e r n recognataon
eigeiivalues are relatively large (in magnitude), then by neglecting them, important information might be disregarded [Laiib and Muller, 2004; Pqkalska rt nl., 2002bI. There is still an open question whether it is beneficial for learning to transform the dissiniilarity data D to exhibit the Euclidean behavior, either by neglecting the negative eigenvalues or by directly enlarging D*2 by a constant. In general, a hollow dissimilarity matrix D, Def. 2.45, can be corrected to have the Euclidean behavior. First, to make it definite, any zero dissimilarity between two different objects should change into a small fixed value E , depending on the overall distances, e.g. 0.01. Alternatively, if the zero dissiniilarity is obtained for different but nearly identical objects of the same class only, one may consider them as belonging to the same equivalerice class with the equivalence relation z y iff d ( z , y ) = 0; see also Example 2.7. Next, to make the dissimilarity symmetric, an operation like avcraging of di,i and d j 2 or taking their niaxirnurri value can be performed. Sirice D has become quasirnetric, any of the corrections described above will make it Euclidean. It is also possible that the corrections applied are less than required for guaranteeing the Euclidean behavior (i.e. by adding a constant to the offdiagonal elements of D*2 smaller than the necessary one). In such cases, the measure is simply rnade ‘more’ Euclidean (hence, also ‘more’ metric), since the infliicncc of negative eigenvaliies will become smaller after the discussed transformations; hence it may become negligible.
-
3.5.3
Pseudo-Euclidean embedding
When a Euclidean space is not ‘largc enough’ to cnibed the dissimilarity data (due to negative eigenvalues of the corresponding Gram matrix), Goldfarb proposed to embed D into a pseudo-Euclidean space [Goldfarb. 1984, 19851. Such a procedure can be applied to any premetric finite dissimilarity representation. A pseudo-Euclidean space is a direct orthogonal ciecorriposition of two vector spaces IWP and Rq,for which the indefinite inner product is positive definite on IWP and negative definite on IWq; see See. 2.7 for details. To find this embedding, the same reasoning as in the Euclidt.an case is applied. The essential differencc rcfers to the notion of an inner product arid a pseudo-Euclidean distance. Now G = - JD*’J is the Gram matrix, but cxpressed as (3.12)
Characterization of dissimilarities
123
where JPq is the fundamental symmetry matrix in a pseudo-Euclidean space IR(P.q). The eigendecomposition of G becomes then (compare also to Eq. (3.9)):
where p f q = k . A is a diagonal matrix of p positive and q negative eigenvalues, presented in the following order: first the positive eigenvalues with decreasing values, then the negative ones with decreasing magnitudes followed by zeros. Note that this is a proper eigendecornpositiori in a pseudoEuclidean space. Remark 3.9 (Eigendecomposition of the Gram matrix) In a pseudo-Euclidean .space, t h e G r a m m a t r i x G i s positive de,finite in t h e Euclidean sense, but it i s not J-positive defin,ite. However, GJPpcl is J-pd. T h e eigendecomposition of G& c a n be therefore performed in a psewdoEuclidean space. T h i s gives GJpq = Q h Q * , which leads t o t h e traditional eigendecomposition of G as G = Q(AJPq)QT,,where Q i s th,e m a t r i x of eigenualues of G and (hJpq) is th,e diagonal m a t r i x o,f corresponding eigenvalues; see A p p e n d i x B.2 f o r details. The configuration X can be therefore determined in a pseudo-Euclidean space IW' = I[R(p>q)of the signature ( p , 4 ) as (3.14) where only k non-zero eigenvalues in A h are taken into account. Q k is the matrix of k leading eigenvectors. Otherwise. additional zero eigenvalucs would describe X in a degenerate indefinite inner product space as explained in Sec. 2.7.) Additional explanation on eigendecomposition in a pseudo-Euclidean space can be found in Appendix B.2. The estimated pseudo-Euclidean covariance matrix is defined by (see Proposition B.3) (3.15)
Consequently, X is an uncorrelated representation. Although C is not positive definite in the Euclidean sense, it is positive definite in the pseudoEuclidean sense, i.e. J-pd by Def. 2.82. This means that X is can be interpreted in the light of a PCA projection. hence the whole embedding procedure becomes a non-degenerate indefinite-kernel-PCA approach [Schiilkopf
124
The dissimilarity representation for p a t t e r n recognition
et al., 1999b; Schiilkopf, 19971, where the kernel G is a reproducing kernel of the pseudo-Euclidean space R(P,q); see also Sec. 2.7.1. Similarly to the Euclidean space, such an uncorrelated vector representation is obtained only for J = ( I or in other words, for s = il, when the following Gram matrix G = -iJ,D*’J; with J , = ( I - I s T ) is considered. Corriputing the square distance in a pseudo-Euclidean space R(P,y)can be realized by computing the square Euclidean distance in a ‘positive’ space R P and subtracting tlie square Euclidean distance found in a ‘negative’ space Rq. The distances derived in the ‘positive’ space only are overestimated, therefore, the purpose of the ‘negative’ space is to correct them, i.e. makc them non-Euclidean. Since in this work, pseudo-Euclidean spaces will result from an embedding of nonnegative dissimilarities, the ‘negative’ contributions to the overall distances are smaller than the ‘positive’ contributions.Hence, by the construction of the configuration X , the projected vectors of X take smaller (in magnitude) values in Rq than in RP.Practice confirms that many summation-based measures arc close to the Euclidean distance, giving risc to relatively small negative eigenvalues in the embedding. On the contrary, the dissimilarity measures based on minimum or rnsxirnum operations, such as variants of Hausdorff distance, Sec. 5.5, are vcry different and give rise to large (in magnitude) negative eigenvalues. Note that the proposed embedding is very general. Any symmetric dissimilarity matrix can be embedded in a pseudo-Euclidean space. And any asymmetric dissirnilarity matrix D can easily be made symmetric
illT),
3.5.4
Generalized average variance
Assiinic a set of vectors {xl,xz.. . . , xn} in a pseudo-Euclidean space E = R(”>‘J). p + y = k , determined in the pseudo-Euclidean embedding of D . Let t h y he stored as row vectors in the matrix X . The generalized average variance of the configuration X is the trace oftlie covariance matrix of X , i.c. tlie s l i m of variances, as
Remember that 1 ) . I I E = ( . , . ) E . Since X reflects the same geometry as imposed by the square distancc matrix D * 2 ( R , R ) based , on the set R = ( ~ 1 . ~ 2 ., .., p n } , it is possible to express V d ( X ) only in terms of such distances. Formally. one has:
Characterization o.f d i s s i m i l a r i t i e s
125
Corollary 3.6 Given the dissimilarity matrix: D = D(R, R ) , the generalized average variance of the embedded pseudo-Euclidean corifigurution X is equivalent to the average square dissimilarity: (3.17)
Proof. We will now show the equivalence between Eq. (3.17) and Eq. (3.16). Making use of Eq. (3.8) and the facts that lT1= rI and lTg = tr(G) = Cy=,llx,ll$,one gets 1 rL Vd(T)= 7 2n
=
1 C C d2(pj,;pl,) = 1TD*21 j=1 k=l
2n2
1
1
1
27L2
2n,
2n
- [lTglT1+ lT1gT1 - 2 lTG 11 = - ITg + -gTI
-
I rL2
- lTG 1
as lTG 1 = lTX Jpq XT1 = ETJpq X = jI%j,:I for Jpq being the fundamental symmetry in the space €. The above reasoning remains valid for an Euclidean distance matrix D as 0 one uses JPq = I .
3.5.5
Projecting new vectors t o an embedded space
Let X E I t n x k , k = p + q , be a configuration in a pscudo-Euclidean space R'" = R(p,q) that preserves all pairwise distances expressed by D ( R .R). Remember that the Euclidean case is included for y = 0. Given a matrix D, E R t X Tof L the dissimilarities between t new objects arid tlie objrcts of R. new objects can be projected to an cnibcdded space. Let X, be tlie new configuration to be determined. First, the cross-Gram matrix G,, relating all new objects to the objects from R should be found.
Corollary 3.7 Let D ( R ,R ) be isometrically em,bedded into ( X .R'")), where R'"= R(P,q)). Let D,,be a dissimilarity matrix between t 'new objects and the objects of R. The cross-Gram matrix G, sf indefinite inner products is given as G, = - i [ D i 2 J - U D * 2 J ] ,wh,ere J = ( I E I W " " ~ and u = 111TERtxa. t
illT)
T h e dissimilarity representation for p a t t e r n recognition
126
Proof. Assume that {yl, y 2 , .. . ,yt} is a vector representation of new objects projected into the space Rk. It follows from Eq. (3.3) that the inner product between a new vector and the original vectors is given by ( y z , x 3 )=~ -i[d;(y,,x,) -d:(y,,O) -d2(x3,0)]. Making use of Eq. (3.4) and the fact that the mean coincides with the origin, the indefinite inner product becomes then
(3.18) 1 (di(Y,, X j ) 2
= --
-
(di,i.
-
d2j
+ d2. ),
where ( d i ) i , stands for the mean computed on the i-th row of the dissimilarity matrix D i 2 , d2j stands for the mean computed over the j - t h row of the matrix D*2 and d2, is the overall mean. Let G,, €iRtXnbe the matrix of inner products between t new vectors arid n original ones. Using elementary matrix operat,ions, Eq. (3.18) can be rewritten as: G, = -$(D:2 J - U D * 2 J ) ,where J is the centering matrix arid u = 1 1 T ~ ~ f ~ 7 ~ . 0
+
As a result, the cross-Gram matrix G,, is G, = - ; ( D i 2 J-U D * 2 J ) . On the other hand, G, is the matrix of indefinite inner products and, thereby, it can be expressed as:
Gn = X,, JPqXT,
41q =
{[
I ERkXk, I,,, o
-IqxjE I W " ~ ,
if Rk is Euclidean. if Rk is pseudo-Euclidean
(3.19) Therefore, X,, is determined as the solution of an indefinite least square problem X , J p q XT= G,, i.e. X , = G, X ( X T X ) - l J p q ; see Theorem 2.28, Remark 2.14 and Corollary 2.11 for justification. Knowing that XTX = ( A ( and X = Q k IRI,14, X,, is alternatively expressed as
xn= G,
X 1Al-l
Jpq
or
x, = Gn Q k
1A1,l-t
Jpq.
(3.20)
Assuming that U D*2J is pre-computed, the computational complexity of determining the cross-Gram values of G, for a single object is O ( n ) . Since X (Alp' Jpq can be pre-computed as well, then O ( n k ) operations are required for a projection of a new object.
Characterization of dissimilarzties
127
Dissimilarities D(R, Ri
Vector representations R(2.1)
Figure 3.6 By adding one object to the set R, the dimension k of the vector represcntation of D ( R , R ) in a pseudo-Euclidean space might increase by more than one. The points of R = { I , J , K , L } lie on a line, but the point M does not (two upper plots). The embedding of D ( R , R ) reveals a 1-dimensional configuration (the leftmost bottom plot). After enlarging R by M , a pseudo-Euclidean configuration in R(2x1) is obtained (the rightmost bottom plot, where the z-axis describes the ‘negative’ contribution), increasing the dimension by 2. The circles in the rightmost plot correspond t o the points I L projected into a plane, parallel to the xy-plane, on which iZ.1 lies. ~
3.5.6
Reduction of dimension
By enlarging the set R by one object, one vector is added to a finite pseudo-Euclidean space, but the dimension k of the vector representation resulting from the enlarged D might increase by more than one, contrary to the Euclidean case; see Fig. 3.6 for an illustration. This happens if the pseudo-Euclidean space does not yet fully reflect the variability in the dissimilarity data. In practice, when new vectors are added, they are projected into the space determined by the starting configuration X . Note also that both outliers and noise can significantly contribute to the resultirlg dimension k . Tlierefore: the reliability of X , i.e. whether D ( R jR ) is sufficiently well sampled, plays an essential role in the process of representing new data, and consequently, the performance of learning algorithms applied later on. Originally, the pseudo-Euclidean configuration X is found such that the distances are preserved exactly and the dimension of X is determined by the number of non-zero eigenvalues of G. However? there might be many relatively small non-zero eigenvalues as compared to the large ones. Knowing
128
The dissimilarity representation for p a t t e r n recognition
that dissimilarities are noisy measurements, the small eigenvalues, as they scale the features of X , reflect non-significant directions in this vectorial representation. Therefore, neglecting small eigenvalues stands for noise reduction (see Fig. 3.7 for an illustration) or for determining a representation with the intrinsic dimension. In both cases, the distances will be preserved only approzirnately. One has, however, a control over the dimension of the reduced vector representation. Basically, the dimension reduction can be achieved by an $orthogonal projection, governed by the indefinite PCA. Recall that X = Q k l A k / + is constructed such that it is an uncorrelated vector representation, i.c. the covariance matrix C = --&& is a diagonal matrix. This guarantees that X has already the form of the 3-orthogonal PCA projection; see Eq. (3.15). It means that the reduction of dimension is performed in a simple way by neglecting directions corresponding to eigenvalues, which are small in magnitude. The reduced representation is then determined by p’ dominant positive eigenvalues and q’ dominant (in magnitude) negative eigenvalues8. Therefore, X ( m )E R”’”, rn < k , is found as X ( m ) = Qm lAmli, where m = p ’ f q ’ and A,, is a diagonal matrix of first, decreasing positive eigenvalues and then increasing negative eigenvalues, m d Qm is the matrix of the corresponding eigenvectors. How to choose p’ and q’ is an important point. In practice, the number of significant dimensions can be determined by detecting the position when the eigenvalue curve, (interpolating the eigenvalues decreasing in magnitude), flattens. Some theoretical analysis of eigenvalue spectra of covariance and Gram matrices can be found in [Hoylc and Rattray, 2003a, 2004a,b].
3.5.7
Reduction of complexity
Reduction of dimension described above is useful for data representation, since both noise and non-significant information are neglected. Still, the reduced configuration in EXm = R(P’>q’)is determined by all n objects. However, for the definition of an m-dimensional pseudo-Euclidean space only m 1 objects are in principle necessary: one to define the origin and m objects to be the basis vectors. Given a reduced configuration X ( n L )with respect to the principal axes, the question arises how to choose a reduced
+
8Remember that X is an uncorrelated representation only if the origin coincides with the mean vector of X , i.e. obtained by using the centering matrix J = ( I - I s T ) with s = i 1 in Eq. ( 3 . 7 ) :see also Theorem 3.14. If some other s is used, then the indefinite PCA should be performed in a pseudo-Euclidean space. See also Sec. 3.5.8
Characterization of dissimilarities
W.R)
Theoretical data R
-
129
O
O
,
D(R,R)+noise
o,
-5
D(R+noise, R+noise)
5
5 0
1
+ +
0
-5
-9b A: -54283,1462,506,467 0 5
-1
3
Figure 3.7 Noise influence on the eigenvalues of G. The leftmost upper plot presents a theoretical banana data of 160 points, for which the Euclidean distance matrix D has been computed. The rightmost upper plot shows the embedding of D into a 2D space (note that the retrieved configuration is exact up to a rotation). The leftmost bottom plot presents the projection into the first 2 dimensions of the 159D data obtained via embedding of the distorted distances D, (where & j = d i j ~ i for j i # j and ~ i j N ( 0 , 1)) (taking care that min+j l ~ i j l< mini#j dij, i.e. no negative distances arise), which become non-Euclidean. The average distortion is 0.8, while the average Euclidean distance is 7.57. The rightmost bottom plot presents the projection into the first 2 dimensions of the 4D data obtained via embedding of D ( k ) , where R consists of the theoretical data R to which 2 noisy features were added giving rise to the average distortion of 0.9. Note that the first 2 largest eigenvalues, as presented in the plots, arc relatively the same for the non-distorted as well as distorted data, which practically gives the same results in all the cases. Therefore, by neglecting relatively small eigenvalues, noise is diminished.
+
+.
set Rrrd of T = m + l (or more) objects such that the projection defined by R r e d gives a good approximation of the configuration X ( T ( l . To avoid an intractable search over all possible subsets, an error nieasiire bctween the reduced and approximated configurations can he defined and then rninimized, e.g. in a greedy approach. Such criteria are proposed and analyzed in an experimental study in Sec. 9.3. 3.5.8
A general embedding
An uncorrelated vector representation X (with a diagonal covariance matrix) is obtained by an embedding of D which relies on the eigendeconipo-
130
T h e dassamilarity representation f o r p a t t e r n recognition
-;
illT).
sition of G = JD*2J for J = ( I Another vector representation X,? can be obtained from the Gram matrix G = ( I - I s ~ ) D * ~ (sIl-T ) , wherc sT1 = 1, following the steps as described in Sec. 3.5.3. In general, the determined representation X , will not be uncorrelated. In such cases, finding the principal components may be helpful.
-+
Proposition 3.2 (Pseudo-Euclidean PCA) Let R(p14) be a pseudoEuclidean space. A s s u m e that a configuration of n vectors is shifled s3uch th,at the X lies at the origin. Hence the pseudo-Euclidean covarG see also Proposition 13.3. BY Propoance ni,atriz is C E = &xTxzPq; sition, B.2, the eigendecomposition CE = QRQ*. A J-orthogonal m,atrix Q and the diagonal matrix of eigenvalues A can be ,found as suggested in Remark B.l. Let P be a projection t o a subspace R(p'.g'), p' 5 p a n d q' 5 q spanned by the dominant k = p' + q' eigenvectors, correspondi n g t o the f i r s t p' largest positive eigenvalues followed by the q' largest an magnitude negative eigenvalues. Let t h e m be described by a matrix Q k = [qlq 2 . .. qP,+,,]. By Proposition B.1,such a projection i s defined as P = Qk(Q~Jp,Qk)-'QzJP, = Q k J p ~ q / Q ~ J p qas , the vectors in Q k are orthonormal, Q I J p q Q k = J p ~ q ~ T. h e resulting matrix of principal components Y (rernem,ber th,at the transposed vectors are placed n o w in th,e rows of the matrices) is Y = X & , Q & I ~ , Q ~ . 3.5.9
Spherical embeddings
Nonlinear projection methods can rely on the geodesic distance, i.e. the shortest path between two points on a manifold; see e.g. [Tenenbaum et al., 2000; Lee et al., 2000. 20021. Euclidean distance is the geodesic distance on a hyperplane. Since there exists a natural connection between the spherical geodesic distance and a linear Euclidean embedding, also spherical embeddings are bricfly considered.
Definition 3.16 (Spherical distance) Let SF c IWTn+' be an rrdirrierisional spherical space, such that ~ ~ ~ = ~r 2 . 'Wex assume f that the renter of the sphere lies at the origin. The spherical distance d, between two vectors x,y E Sy is d , ( ~ y, ) = r arccos ( 5 x ' y ) . This is the geodesic distance on the sphere, which coincides with the angle between the vectors. Given an nxn dissimilarity matrix D and a positive r , a question arises whether there exist points { X I , x 2 >... , x n } on a sphere such that the
SF
Characterization of dissimilarzties
131
spherical distance d,5(xi,xj) = d i j . This can bc t>ransforniedinto the problem of embedding a suitable distance matrix in a Euclidean space:
Theorem 3.22 (Schoenberg) L e t D be a n n x n dissimila D can be embedded into u spherical space SF for a posikiiie r ifl di,i 5 r r f o r all i, j = 1 , 2 , . . . n and t h e m a t r i x G = (cos( is p s d [Schoenberg, 19371. T h e n the smallest nz s u c h that D embeds into a: spherical space Sp i s m = r a n k ( G ) - 1. T h e solution i s undefiried if mnk( G ) = 1.
F))
Proof. Although the proof can be found in [Schocnberg, 19371,we present it here since it has an educational value. A problem of tlie ernbedding of D into the spherical space Sp can be transformed into thc problem of embedding of some transformed matrix into a Euclidean space. The requirement of d,, 5 r T is obvious, since no distance on the sphere with the radius T can exceed 7r T . Suppose that {XI , x2,. . . , xrL} lie on a sphere SF. Let, xo be t,he center of Sp.Then all xi form an n-simplex in RT1,+lwhose edges have the lengthsg: poi = = r arid ptJ = = 2 T (sin $) for i , j == 1,.. . , n. Consequently, we have to prove that the distance niatrix D , = ( p i 3 ) is Euclidean, or equivalently, that D , ernbeds into a Euclidean space. Based on Theorems 3.14 and 3.15, D, is Eiiclideari if S = - L 2 (I-IsT) D i 2 (I-slT) for sT1 1 is psd. Since in our case, x g shoidd become the origin (i.e. the center of the sphere), then we choose s = el. As D;2el = r 2 ( 1- el), then after straightforward basic transformations we get S = r 2 ( ( 1- e l ) l T- ;Di2),. Given that xo = 0, it is sufficient to consider a matrix G which is tlie matrix S without the first row and the first column. Then G = !r2(llT-2 sin2(dzj/(2r))= r2 (cos(d7,J/r)). Thc condition that S is psd imposes that G should be psd, which firiislics the proof. 0 ~
Consequently, we have also proved the following:
Theorem 3.23 An, n x n , dissimilarity matrix D embeds into a spherical space iff D, = ( p i j ) , i , j = 0 , 1 , . . . , n,, with pol = fr a n d p i j = 2 r (sin(dz3/(2r)),i = 1,.. . n is Euclidean. Note that the embedded points are found by applying Theorem 3.15 (with s = e l ) to D,. Corollary 3.8 Spherical distances D e z a and Laurent [1997].
(SF,d,)
are !l-embeddable; see also
'From geometry pz", = p&+pi3 -2 paz pol cos ( $ d z 3 )which , after basic transformation gives ptJ = 2 sin($dZ3).
132
The dissimilarity representation f o r pattern recognition
Figure 3.8
Illustration of the spherical distance
Sketch of proof. Based on Theorem 3.5, it is sufficient to show the existence of a nonnegative measure space such that d, is the measure of the symmctric difference. Let S y be an m-dimensional sphere with the radius r = 1. Define tlie measure p on S;" as the fraction of rn-dimensional wol(A) hypervolunie such that p ( A ) = v .10 E [O, 11 for A 2 S;". Let A be a collection of subsets of ST. Consequently, (S;",A,p) is a probability space. Consider further a hemisphere H;"(x) = {y E ST : d,(x.y) 5 $ } centered at 2 . Then the measure of volume symmetric difference between 1
arccos(XTY)
2.rr vol(ST) = two hemispheres is p ( H ; " ( x )AH;"(y)) = 2 $ d,(x,y). On the pictorial illustration, Fig. 3.8, p (N;"(x) A H y ( y ) )is the volume of the shaded regions. Based on Theorem 3.5, the space (S;",d,) is Y1-embeddable.
3.6
Spatial representation of dissimilarities
Assunie that we have a set of raw data examples R, e.g. typed words, shapes. digitized voice cxcerpts, and a dissimilarity measure provided by an exptrt. Since the computation of dissimilarities is usually costly, in practical applications, R is a rclatively small sct of objects c.g. chosen from a larger set T . Given tlie dissimilarity data D ( R ,R), a spatial representation of D is a configuration of points representing the objects in a space. Usually. it Euclidean (pseudo-Euclidean) space is considered or, alternatively, R" equipped with an t p metric. Spatial representations are in fact approximate ernbeddings into suitable low-dimensional vector spaces, which should reflect the dissimilarity relations between the objects. Hence, they are often used as a visualization tool. Such spatial representations are visually appealing and often enhance interpretability of the relations in the data. The configurations are believed to reflect significant characteristics,
Characterazation of dissimilarities
133
as well as ‘hidden structures’ of the data. Therefore, objects judged to he similar result in points being close to each other in such a space. Thc larger the dissimilarity between two objects, the furt,her apart they should be in the resulting map of points. More generally, spatial representations are interpreted as (possibly reduced in complexity) feature-space configurations of the overall dissimilarity structure in the data. They are often used in clustering, classification or data-mining techniques. From the previous section we already know that (approximate) linear embeddings of the dissimilarities are methods for obtaining spatial representations. In this section, we will discuss two more techniques, namely a linear projection FastMap [Faloutsos and Lin, 19951 and nonlinear multidimensional scaling (MDS); see e.g. [Kruskal and Wish. 1978; Cox and Cox, 1995; Borg and Groenen, 19971. It is important to emphasize that the resulting axes in such spatial maps are, in themselves, meaningless. What is important is the relative positions of the points, representing objects. In the case of a Euclidean space, additionally, the orientation of the projection is arbitrary, since any rotation of the configuration does not change the distances (the same is valid for pseudo-Euclidean spaces, however, in terms of an appropriate rot,ation, represented by a J-orthogonal matrix, defined there). See Fig. 3.10 for an illustration of the basic spatial models on a theoretical banana data. A number of other techniques exists for obtaining spatial representat,ions. These will be briefly introduced in Ckiapter 7 )in which also practical aspects of spatial models are considered. 3.6.1
FastMap
FastMap was introduced in [Faloutsos and Lin, 19951 in the Data Mining conirniinity and it is originally meant for vectorial data accompanied by a distance measure. Assume that a set R = { P I , . . . , p r L } and an n x n
134
The d i s s i m i l a r i t y representation for pattern recognition
Euclidean distance matrix D ( R , R ) are given. Then there exists an mdinierisional Euclidean space, rn 5 n such that the distances are preserved perfectly; see Sec. 3.5.1. The idea is to project the data on m mutually orthogonal directions. This is realized in an incremented way, starting from the first dimension. The basic principle is to orthogonally project p i into a line in IwTlL determined by two pivot objects, T I and 7'2. Pivot objects should bc the ones which yield the 1a.rgest distance. The projection xi of the object pi into this line can be determined from the cosine law of the Euclidean geometry (as illustrated in Fig. 3.9) as:
(3.21) Since thc objects will lie in a Euclidean space RnL1 the projection method is extended as follows [Faloutsos and Lin, 19951. Let H be an ( m- 1)dimensional hyperplane perpendicular to the line defined by T I and r2 (or the remaining EXmp1 space). Then after mapping all objects on this hyperplane (or in fact updating the distances appropriately), the problem to be solved is identical to the original one, but with the dimension ( m - l ) , instead. Hence, the solution can be found recursively using Eq. (3.21) to determine the coordinates of the dimension of interest. Since the square Euclidean distances are additive, the distances of the objects projected into the hyperplane H becorne then [Faloutsos and Lin, 19951:
In t,he next step, D = D H , defining the same problem, but for the space . Although the dimension m should he specified beforehand, the
~ 7 r i1
algorithm may stop when the distances d H become practically zero. New points can be added to the existing map in the same recursive manner, based on the distances to the pivot objects. The algorithm requires the computation of 2m distances, so the complexity is C?(m). Note that the cosine law captures the same relation as the one given by Eq. (3.2). The Euclidean embedding realized by classical scaling, as described in Sec. 3.5.1, makes also use of the cosine law, however, the projection is optimized for all the triplets (defining Euclidean triangles) simultaneously, instead of in an incremental way as FastMap does. Note that for non-Euclidean dissimilarities the derived configuration approxirnat,cs the original distances, since the cosine law (which is the founda,tion of FastMap) is valid for the Euclidean distances only. In the mapping process, at some point the distances d H of the objects projected into the
Characterization of dissimilarities
L... 1 1 _ 1
Theoretical data
,
-I
Classical Scaling --I
,
135
. Fast Map
'- +
Sammon map S-1
Sammon map 5'1
Figure 3.10 Spatial 2D maps of 200 x 200 dissimilarity matrix D for theoretical banana data (leftmost subplot). D is defined by the city block distance. Since D is nonEuclidean, Theorem 3.20, the 2D maps are only approximate embeddings. The scale is preserved in all subplots.
hyperplane H may become negative, which indicates that H exists in a pseudo-Euclidean space. Yet, the projection is always done to a Euclidcan space. Formally. if the dissimilarity data D can be embedded into the R(piq) space, then thc dimension m used in FastMap should be such that m 5 p , since FastMap preserves, in fact, the Euclidean distances corresponding to the embedding into BP.In summary, FastMap is less optimal than classical scaling with respect to the preservation of distances. However, it is fast, as an incremental mapping, which has a possibility to an early stopping. 3.6.2
Multidimensional scaling
Multidimensional scaling (MDS) refers to a group of' linear and nonlinear projection methods of the dissimilarities. Although the theory of MDS was developed in behavioral and social sciences [Kruskal and Wish, 1978; Cox and Cox. 1995; Borg and Groenen, 19971, its applications were extended to pattern recognition and other related fields. The reason is that the MDS methods facilitate data visualization and exploration. These projection techniques aim to preserve all pairwise, symmetric dissimilarities between data objects, resulting in a low-dimensional representation of the geornetrical relations between the points. Such a configuration is usually found in a Euclidean space. although any other 8, space, p 2 1, can also be considered [Cox and Cox, 1995; Borg and Groenen, 19971. The MDS output is a spatial
136
The dissimilarity representation for pattern recognition
representation of the data. Most of the concepts presented here as well as the discussion on the MDS algorithms can be found in the books of [Borg and Groenen, 19971 and [Cox and Cox, 19951. The latter book provides a good, concise introduction into the subject, while the former book is a thorough compendium. Our work is concerned with Sammon mapping a,nd it relies on [Pekalska et ul., 1998a,c,b, 19991. Metric MDS is a description of methods which assume that both the input data and the output configuration are metric, or rather that the dissimilarities are described by quantitative values. Suppose an n x n dissimilarity matrix is given. The aim is to find a possibly low-dimensional space such that the discrepancy between the original dissimilarities arid the est,ima.teddist,a,ncesis minimized. Intuitively, each pairwise distance corresponds to a ‘spring’ between two anchors (points) in this low-dimensional space. Then the MDS technique tries to rearrange the points such that the overall ‘stress’ of a fully connected spring system is minimized. The dissimilarities can describe the relations between objects represented originally in a high-dimensional space, measured (e.g. matching costs of image patterns, road distances) or given (human judgments). When the observed or measured dissimilarities convey qualitative instead of quantitative information, they give rise to non-metric MDS methods. In essence, they are solved in a similar way as metric MDS methods with the exception that the nature of dissimilarities is different, such as preferences or ranks [Borg and Groenen, 1997; Cox and Cox, 1995; Kruskal, 19771. These methods are not discussed here. Sirice now on, MDS will stand for metric MDS. There are different ways of preserving the structure of the data, giving rise to somewhat different techniques of MDS. Traditional classical scaling (CS) is the most simple, linear MDS algorithm. It has been already int,roduced in Sec. 3.5, where the embeddings of pseudo-Euclidean distances are discussed. Also that FastMap can be considered as a linear MDS example.
Nonlinear MDS. Nonlinear MDS projections rely on the minimization of an appropriate nonlinear function. This is a loss function. called stress (acronym for standard residual sum of squares), which measures the diflerelice between the Euclidean distances (or P,-distances) of the present configuration of n points in RnLand the actual (given) dissimilarities. Hence, the problem of finding the right spatial configuration resolves itself into an optimization problem. where a Configuration yielding the minimum of the strcss is sought. Here, for convenience, we will adopt the notation used
Characterization of dissimilarities
137
in MDS. Let A be the actual nxn dissimilarity matrix and let D he the estimated distance matrix for the projected configuration. We will write d,, ( X ) to indicate that the distances are computed for a retrieved configuration X . The most elementary MDS loss function is the raw stress, defined as [Kruskal, 1964; Borg and Groenen, 19971:
(3.23) It yields a square badness-of-fit measure for the entire representation. .f is a continuous parametric monotonic function, a transformation applicd to the given dissimilarities 6 i j . In many cases, f is the identity function, but it may be some other function such as polynomial or logarithmic. Usually, = f ( S i j ) , called disparity, is adopted. Although the raw the notation of &, stress is used in practice e.g. [Borg and Groenen, 19971, in our opinion, it is not an informative function to be rnininiized iteratively as it reflects the absolute error. The differences between actual and estimated dissiniilarities should rather be expressed in relative terms to avoid that large absolute differences contribute significantly to the error function, while small differences do not. Large differences do not necessarily indicate a bad approximation. Therefore, the stress should be normalized in a way that avoids a scale dependency. This leads to a least squa’res scaling (LSS) loss function [Cox and Cox, 1995; Kruskal and Wish, 1978; Kruskal, 19771: n,-1
TL
where fwrj are appropriately chosen weights. For instance, the weights can be used to shift the emphasis to small dissimilarities by choosing w,,~= l/&j for non-zero 6,. Concerning disparities, straightforward choices arc c.g. a linear or logarithmic function, i.e. & j = Q BSij or & j = Q log(&j), where Q and /3 are estimated in the least square sense by modeling the perfect relation d i j ( X ) = S,,. The normalization by estimated distances makes the error measure invariant under rigid transformations, like shifts and rotations, arid non-rigid transformations, like uniform stretching or shrinking, of the derived configuration. The stress is optimal when all original disparities 8i.j are equal to the estimated distances d r J ( X ) . Since this is unlikely to happen, d i j ( X ) will be a distorted representation of the relations within the data. The larger the stress, the greater the distortion. The optimization procedure for the
+
+
138
T h e dissimilarity representation f o r p a t t e r n recognition
LSS is an iterative process of two alternating stages: fitting & j to d i j for a present configuration X (hence dij are considered as fixed for that moment) and minimization of the stress function, i.e. updating X , given 6 i j . As from an application point of view, one is interested in the relative positions of objects in the spatial map, a general suggestion in the MDS area is to consider a ratio MDS [Borg and Groenen, 19971, where & j = pbij for > 0. It means that the ratio of two disparities should be equal to the corresponding dissimilarities: & j / i k ~ = bij/bkl. For the SLss stress, the optimal pFSs can be derived analytically as the one minimizing SLss, provided that D ( X ) is fixed. By setting up the derivative of SLss over /3 to zero, its optimal value is found as pcSs= Cj Cdi: j ( X ) / CjCi &jdij(X). Alternating the computation of pcSswith an iterative improvement to the stress provides an efficient procedure for finding the solution to the ratio MDS . Most of the minimization algorithms are based on gradient methods [Kruskal arid Wish, 1978; Kruskal, 1977; Borg and Groenen, 19971, but also other techniques have been especially adopted for tlie MDS purposes, such as iterative majorization [Cox and Cox, 1995; Borg and Groenen, 19971. In our experience [Pckalska et al., 1998a1, this algorithm has a slow convergence. An interesting modification for vectorial representations is studied by Webb [Webb, 1995, 19971. He looks for a nonlinear transformation in tlie reduced space Rm in which the approximated &distances are close to the actual !,-distances in terms of the weighted raw stress. The transformation is defined by radial basis functions, hence the iterative majorization technique determines its parameters. This results in a mapping that is applicable to new data. Another way to normalize the raw stress is to use the original dissimilarities instead of the approximated ones. This lea,ds to loss functions being variants of the Sammon mapping. Sammon mapping. The original Sarnmon mapping was proposed in pattern recognition by Sammon [Sammon Jr., 19691 as a tool for a nonlinear projection from a high-dimensional Euclidean space to a low-dimensional space. To our knowledge, it is not mentioned in books and articles devoted to the MDS research. However, it may be considered as a method in this area, if interpreted as a projection technique which tries to preserve the original dissimilarities. For the sake of simplicity, we will account the variants of Sammon mappings as the MDS examples in this book. Samnion mapping is a nonlinear projection realized by the minimization of the
Characterization of dissimilarities
139
following loss function:
In general, the stress can be defined in a number of ways, e.g. as studied by us in [Pekalska et al., 1998a,c,b]:
for t = . . , . -2, - l , O , 1 , 2 , . . ., which results in the following measures for the identity function f,i.e. 6,, = f ( & ) = dt3:
S-I(X)
=
S(X) n-1
n
We will refer to all of them as (variants of) Samrnon mappings. Each of’ the loss functions mentioned above emphasizes a different aspect of the geometric relations between points, i.e. it emphasizes. to some extent, either smaller or larger distances, which directly influences either local or global aspect of the method. For instance, S-2 emphasizes very small distances. i.e. it penalizes the error in representing small dissimilarities more than the same error for large ones. Therefore, S-2 focuses on local details, hence it is very nonlinear. On the other hand, 5’2 emphasizes larger distances, hence it tends to present more global map of relations. SO provides a balance between large and small distances, i.e. errors in representing s r d l and large dissiniilarities are penalized equally. Depending on the application requirements, the loss function can be chosen appropriately.
By applying the ratio approach to the Sammon stresses, i.e. the disparities $\hat{\delta}_{ij} = \beta\,\delta_{ij}$, one gets

$$S_t(X, \beta) \;=\; \frac{1}{\sum_{i<j} (\beta\delta_{ij})^{t+2}} \sum_{i<j} (\beta\delta_{ij})^{t}\,\bigl(\beta\delta_{ij} - d_{ij}(X)\bigr)^2$$

for $t = \ldots, -2, -1, 0, 1, 2, \ldots$. Note also that scaling of $\delta_{ij}$ by $\beta$ is equivalent to scaling of $d_{ij}(X)$ by $\tfrac{1}{\beta}$, which is further equivalent to scaling of $X$ by $\tfrac{1}{\beta}$, i.e. $S_t(X, \beta) = S_t(\tfrac{1}{\beta}X, 1)$. The optimal $\beta^*$ can be determined as the point yielding the minimum of $S_t$ for the present configuration $X$ (hence also $D$). By setting the first derivative of $S_t(X, \beta)$ with respect to $\beta$ to zero, after straightforward calculations, one obtains $\beta^* = \bigl(\sum_{i<j} \delta_{ij}^{\,t}\, d^2_{ij}(X)\bigr) / \bigl(\sum_{i<j} \delta_{ij}^{\,t+1}\, d_{ij}(X)\bigr)$. After simplifications, inserting $\beta^*$ into $S_t(X, \beta)$ yields

$$S_t(X, \beta^*) \;=\; 1 \;-\; \frac{\bigl(\sum_{i<j} \delta_{ij}^{\,t+1}\, d_{ij}(X)\bigr)^2}{\bigl(\sum_{i<j} \delta_{ij}^{\,t+2}\bigr)\,\bigl(\sum_{i<j} \delta_{ij}^{\,t}\, d^2_{ij}(X)\bigr)}$$

for $t = \ldots, -2, -1, 0, 1, 2, \ldots$. Note that $0 \le S_t(X, \beta^*) \le 1$ by the nonnegativity of the dissimilarities and the Schwarz inequality, Theorem 2.16, since $\sum_{i<j} \delta_{ij}^{\,t+1}\, d_{ij}(X) \le \bigl(\sum_{i<j} \delta_{ij}^{\,t+2}\bigr)^{\frac{1}{2}} \bigl(\sum_{i<j} \delta_{ij}^{\,t}\, d^2_{ij}(X)\bigr)^{\frac{1}{2}}$. In order to compare the Sammon stress functions to the LSS loss functions, Eq. (3.24), let us introduce the variants of $S_{LSS}$, similarly to the variants of the Sammon mapping. So, we can introduce a general LSS loss function as

$$S^t_{LSS}(X) \;=\; \frac{1}{\sum_{i<j} d_{ij}^{\,t+2}(X)} \sum_{i<j} d_{ij}^{\,t}(X)\,\bigl(\hat{\delta}_{ij} - d_{ij}(X)\bigr)^2$$

for $t = \ldots, -2, -1, 0, 1, 2, \ldots$. Then, by considering the ratio MDS, $\hat{\delta}_{ij} = \beta\,\delta_{ij}$, one can express the optimal $\beta$ (minimizing $S^t_{LSS}(X, \beta)$) as $\beta^*_{LSS} = \bigl(\sum_{i<j} \delta_{ij}\, d^{\,t+1}_{ij}(X)\bigr) / \bigl(\sum_{i<j} \delta^2_{ij}\, d^{\,t}_{ij}(X)\bigr)$. The substitution of $\beta^*_{LSS}$ into $S^t_{LSS}$ then gives

$$S^t_{LSS}(X, \beta^*_{LSS}) \;=\; 1 \;-\; \frac{\bigl(\sum_{i<j} \delta_{ij}\, d^{\,t+1}_{ij}(X)\bigr)^2}{\bigl(\sum_{i<j} d_{ij}^{\,t+2}(X)\bigr)\,\bigl(\sum_{i<j} \delta^2_{ij}\, d^{\,t}_{ij}(X)\bigr)}$$

for $t = \ldots, -2, -1, 0, 1, 2, \ldots$. If $X^*$ is the optimal configuration (corresponding to a local minimum) of the Sammon error $S_t$, then the LSS stress $S^t_{LSS}(X^*, \beta^*_{LSS})$ is equal to the Sammon stress $S_t(X^*, \beta^*)$ for $t = 0$. This does not hold for other $t$.
Figure 3.11  2D spatial maps of the Euclidean distance representation of 400 points uniformly distributed in a 10-dimensional space, obtained by classical scaling and by the Sammon map $S_0$. The scale is preserved.
For $t < 0$, the Sammon stress $S_t$ at the local minimum $X^*$ would be smaller than the corresponding $S^t_{LSS}$, and the other way around for $t > 0$. This can be directly deduced from the formulations of Eq. (3.32) and Eq. (3.30), taking into account that the MDS distances $d_{ij}(X)$ underestimate the actual dissimilarities, which leads to the inequalities $\sum_{i<j}\delta_{ij} > \sum_{i<j} d_{ij}(X)$, hence $\sum_{i<j}\delta_{ij} / \sum_{i<j} d_{ij}(X) > 1$. Note that, except for the raw stress, both $S^0_{LSS}$ and $S_0$ are the loss functions traditionally applied in MDS. Practically, they give the same (up to scaling and rotation) results. In general, due to the normalization by the actual dissimilarities, $S_t$ will emphasize smaller dissimilarities than $S^t_{LSS}$ for $t < 0$, and the other way around for $t > 0$. This is not a problem, since, when needed, additional weights can be used. This means that the Sammon stress functions can be generalized in order to incorporate nonnegative weights $w_{ij}$ for individual pairs of objects as

$$S^w_t(X) \;=\; \frac{1}{\sum_{i<j} w_{ij}\,\delta_{ij}^{\,t+2}} \sum_{i<j} w_{ij}\,\delta_{ij}^{\,t}\,\bigl(\delta_{ij} - d_{ij}(X)\bigr)^2.$$

Usually, the weights are chosen to be either 0 or 1, where 0 is used to accommodate for missing values (here: dissimilarities). However, the weights can also be set to $\tfrac{1}{\delta_{ij}}$ or $\tfrac{1}{\delta_{ij}^2}$ for non-zero $\delta_{ij}$. For instance, in the latter case, the weighted stress $S^w_0$ becomes the unweighted stress $S_{-2}$.
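This relation between the weighted and unweighted variants is easy to check numerically. The short sketch below (our own illustration, reusing the pairwise_distances and sammon_stress helpers introduced earlier) verifies that the weighted $S_0$ with weights $1/\delta_{ij}^2$ coincides with the unweighted $S_{-2}$.

```python
import numpy as np

def weighted_stress(D, X, W, t=0):
    """Weighted Sammon stress S_t^w with nonnegative pair weights W."""
    iu = np.triu_indices_from(D, k=1)
    delta, w = D[iu], W[iu]
    d = pairwise_distances(X)[iu]
    return np.sum(w * delta ** t * (delta - d) ** 2) / np.sum(w * delta ** (t + 2))

rng = np.random.default_rng(0)
Y = rng.random((20, 10))                  # 20 objects in a 10-dimensional hypercube
D = pairwise_distances(Y)                 # Euclidean dissimilarity matrix
X = rng.random((20, 2))                   # some 2D configuration
W = np.where(D > 0, 1.0 / np.maximum(D, 1e-12) ** 2, 0.0)
assert np.isclose(weighted_stress(D, X, W, t=0), sammon_stress(D, X, t=-2))
```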
Implementation. Since the optimization of the Sammon stress functions is more easily defined in gradient terms, we give preference to Sammon mappings. To find a Sammon representation, one starts from an initial configuration of points $X$ (e.g. randomly chosen or taken from the classical scaling result) in $\mathbb{R}^m$, for which all the pairwise distances are computed. Next, the points are adjusted so that the stress decreases. In an iterative process, the configuration is improved by shifting around all points to approximate
better and better the model relation $\hat{\delta}_{ij} = d_{ij}$ for $i, j = 1, 2, \ldots, n$, until a (local) minimum of the stress is reached. In such a procedure, a steepest descent, Newton-Raphson algorithm [Press et al., 1992], iterative majorization [Borg and Groenen, 1997; Heiser, 1991], conjugate gradients [Shewchuk, 1994] or scaled conjugate gradients (SCG) [Møller, 1993] can be used to search for the minimum. In our experiments with artificial and real data [Pekalska et al., 1998a,b,c], we found out that, concerning the convergence rate, the scaled conjugate gradients and Newton-Raphson techniques are preferable. The former technique is characterized by large improvements in the first iterations, and a slow convergence to a minimum later on. Therefore, a hybrid algorithm can be considered, which switches to the Newton-Raphson minimization after the first few iterations. The found minimum depends on the initialization. Usually, the output of classical scaling is a good starting point, since it is the global minimizer of the raw stress (in a linear way). However, it is useful to compare its result to the Sammon output obtained from a random initialization, since the optimization algorithm may get stuck in a local minimum close to the initial configuration. A 'better' initial configuration, i.e. a scaled version of the classical scaling result $X_{CS}$, that is $t^* X_{CS}$, was suggested in [Trosset and Mathar, 2000; Malone et al., 2002]. It is, however, not new, since their $t^*$ equals $\sum_{i<j}\delta_{ij}\, d_{ij}(X) / \sum_{i<j} d^2_{ij}(X)$, which is equivalent to the optimal scaling obtained in the ratio MDS with the stress $S^0_{LSS}$ (and also $S_0$). A good initialization is still an open problem. From our experience it follows that Sammon mappings are less sensitive to the starting configuration than the LSS mappings. In summary, it is important to emphasize that the MDS techniques based on the minimization of the normalized square differences will produce maps in which the projected points tend to be enclosed in circular or ellipsoidal shapes. This can be clearly observed for a Euclidean distance matrix $D$ computed for an artificial example of 400 points uniformly distributed in a 10-dimensional hypercube. The MDS result can be seen in Fig. 3.11. See [Hughes and Lowe, 2003] for the formal proofs referring to the raw stress based on the square Euclidean distances. To avoid this artifact, other types of error measure can be considered, for instance in the form of
$E(X) = \sum_{i<j} |\delta_{ij} - d_{ij}(X)| \,/\, \sum_{i<j}\delta_{ij}$ or $E(X) = \sum_{i<j} |\delta_{ij} - d_{ij}(X)| / \delta_{ij}$. The measures based on absolute values are, however, difficult to optimize (due to discontinuous derivatives). Another MDS technique, which is more robust against outliers, can also be designed by considering the fit $F(X) = \mathrm{median}_{i<j}\, |\delta_{ij} - d_{ij}(X)|$ [Cox and Cox, 1995].
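As a rough illustration of the iterative procedure described in the Implementation paragraph above, the following sketch (our own, not code from the book) minimizes the classical Sammon stress $S_{-1}$ by plain gradient descent; in practice scaled conjugate gradients or a Newton-Raphson step, as discussed above, converge much faster.

```python
import numpy as np

def sammon_gradient_descent(D, m=2, n_iter=500, lr=0.1, seed=0):
    """Gradient descent on the classical Sammon stress S_{-1} (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(scale=1e-2, size=(n, m))      # random initial configuration
    c = D[np.triu_indices(n, k=1)].sum()         # normalization constant of S_{-1}
    for _ in range(n_iter):
        diff = X[:, None, :] - X[None, :, :]
        d = np.sqrt((diff ** 2).sum(-1))
        np.fill_diagonal(d, 1.0)                 # avoid division by zero
        Dsafe = D + np.eye(n)
        w = (D - d) / (Dsafe * d)                # per-pair weight in the gradient
        np.fill_diagonal(w, 0.0)
        grad = (-2.0 / c) * (w[:, :, None] * diff).sum(axis=1)
        X -= lr * grad                           # steepest-descent update
    return X
```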
Two different spatial configurations can be matched by the use of Procrustes analysis. This might be useful to compare two configurations derived from optimizations of different loss functions, or to indicate how a configuration changes when the similarity between objects changes over time (as e.g. human preferences for some products). Basically, the configurations are matched by determining the optimal translations, rotations and scalings; see [Borg and Groenen, 1997; Cox and Cox, 1995] for details.
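A compact sketch of such a matching (our own illustration of the standard orthogonal Procrustes solution via the SVD, not code reproduced from the cited books):

```python
import numpy as np

def procrustes_align(A, B):
    """Translate, rotate and scale B so that it best matches A in the least-squares sense."""
    A0 = A - A.mean(axis=0)                    # remove translations
    B0 = B - B.mean(axis=0)
    U, s, Vt = np.linalg.svd(B0.T @ A0)        # optimal rotation from the SVD
    R = U @ Vt
    scale = s.sum() / (B0 ** 2).sum()          # optimal isotropic scaling
    B_aligned = scale * B0 @ R + A.mean(axis=0)
    residual = ((A - B_aligned) ** 2).sum()    # Procrustes goodness of fit
    return B_aligned, residual
```

The residual returned here can be used directly as a measure of how much a configuration has changed between two optimizations.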
3.6.3 Reduction of complexity
Given $n$ objects, a nonlinear MDS method requires the computation of $O(n^2)$ distances in each iteration step and the same memory storage. However, for a low, $m$-dimensional representation, only $mn$ values should be determined. This suggests that the total number of constraints on distances is redundant, so some of them can be neglected. This leads to the idea that only distances to a subset of all objects are preserved, for which a modified version of the MDS mapping is considered. Although $X$, derived from MDS, has the dimension $m$, it is determined by $n > m$ objects. In general, a linear space can be defined by $m+1$ linearly independent objects. If they were placed such that one lies in the origin and the others lie on the axes, they would determine the space exactly. Since this is unlikely, the space retrieved will be an approximation of the original one. When more objects are used, the space is better defined. Following [Cho and Miller, 2002], objects having relatively many close neighbors (lying in the areas of high density) can be selected for the representation set $R \subset T$ of the size $r > m$, on which the (non-)linear mapping could be based. For a dissimilarity representation $D(T, T)$, a natural way to proceed is the $k$-centers algorithm [Ypma and Duin, 1998]; a short illustrative sketch is given at the end of this subsection. It looks for $k$ center objects, i.e. examples that minimize the maximum of the distances over all objects to their nearest neighbors; see also Sec. 7.1.2. It uses a forward search strategy, starting from a random initialization. Note that the $k$-means algorithm [Duda et al., 2001] cannot be used, since no feature representation is assumed, only the dissimilarities $D$. For a chosen set $R$, the linear mapping of $D(R, R)$ into an $m$-dimensional space is defined by Eqs. (3.7)-(3.14). The remaining objects $D(T\backslash R, R)$ can then be added to the map by the use of Corollary 3.7 and Eq. (3.20). In the case of the Sammon mapping, a modified version should be defined, which generalizes to new objects. Following [Cho and Miller, 2002], first the Sammon mapping of $D(R, R)$ into the space $\mathbb{R}^m$ is performed, yielding
the configuration $X_R$. The remaining objects can be mapped to this space, while preserving the dissimilarities to the set $R$, i.e. $D' = D(T\backslash R, R)$. This can be done via an iterative minimization procedure of the modified stress $M_t$, using the found representation $X_R$, as

$$M_t(X) \;=\; \frac{1}{\sum_{i}\sum_{j=1}^{r} \delta_{ij}^{\,t+2}} \sum_{i}\sum_{j=1}^{r} \delta_{ij}^{\,t}\,\bigl(\delta_{ij} - d_{ij}(X, X_R)\bigr)^2 \qquad (3.34)$$

for $t = \ldots, -2, -1, 0, 1, 2, \ldots$, where $\delta_{ij}$ are the elements of $D'$, $i$ runs over the objects of $T\backslash R$, $j$ runs over the objects of $R$, and $d_{ij}(X, X_R)$ is the distance between the sought position of a new object and the fixed position of the $j$-th object of $R$. Equivalently, the modified loss function of the LSS (and, analogously, of the raw MDS stress) can be defined as:

$$M^t_{LSS}(X) \;=\; \frac{1}{\sum_{i}\sum_{j=1}^{r} d_{ij}^{\,t+2}(X, X_R)} \sum_{i}\sum_{j=1}^{r} d_{ij}^{\,t}(X, X_R)\,\bigl(\delta_{ij} - d_{ij}(X, X_R)\bigr)^2. \qquad (3.35)$$

Thanks to these procedures, new objects can be added to an existing map. The complexity reduces from $O(mn^2)$, computing $O(n^2)$ distances in the $\mathbb{R}^m$ space, to $O(nmr + nr^2)$ in each iteration step. Another possibility to define a Sammon mapping is by the use of neural networks, as studied in [Mao and Jain, 1995; de Ridder and Duin, 1997].
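For choosing the representation set $R$ mentioned above, the sketch below gives a greedy farthest-first selection on a dissimilarity matrix, in the spirit of the forward search used by the $k$-centers procedure (our own simplification, not the exact algorithm of [Ypma and Duin, 1998]):

```python
import numpy as np

def k_centers(D, k, seed=0):
    """Greedy forward selection of k center objects from a dissimilarity matrix D.

    The center set is grown so that the maximum distance from any object to its
    nearest center stays small (an illustrative variant without refinement steps).
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    centers = [int(rng.integers(n))]             # random initialization
    while len(centers) < k:
        d_nearest = D[:, centers].min(axis=1)    # distance to the nearest center
        centers.append(int(d_nearest.argmax()))  # add the worst-covered object
    return centers
```

The returned indices can serve as the representation set $R$, after which $D(T\backslash R, R)$ is mapped with the modified stress of Eq. (3.34).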
3.7 Summary
This chapter presents ways of characterizing dissimilarity measures, especially those represented as finite generalized metric spaces. The basic concern was whether a dissimilarity measure is metric or not, which can easily be checked for a finite representation. Also transformations preserving the metric properties are considered. Usually, such transformations are able to make a non-Euclidean dissimilarity measure 'more' Euclidean. An essential question, however, is whether a given distance measure is Euclidean or city block. The Euclidean distance is important because a Euclidean space is both a metric and an inner product space. Hence, there exists a natural connection between the traditional inner product and the Euclidean distance, which allows one to embed any Euclidean distance matrix in a finite-dimensional Euclidean space. Both isometric and approximate linear and nonlinear embeddings are presented here, as well as their generalizations for the projection of new examples. These are the multidimensional scaling techniques. If a measure is non-Euclidean, no isometric projection into a Euclidean space exists. Some solutions are presented, where either the dissimilar-
ity is corrected such that it becomes Euclidean, or it is projected into a pseudo-Euclidean space. Any premetric non-Euclidean measure (satisfying the definiteness and symmetry constraints) can be formalized in such an indefinite inner product space. This builds a general framework in which any symmetric dissimilarity representation can be explained. The city block distance, on the other hand, is important due to its additivity property. Finite generalized metric spaces can also be represented by weighted, fully connected graphs, where the weights correspond to the given dissimilarity values. A city block distance is perfectly structured by an additive tree model, where the distance is understood in terms of the shortest path in this tree. Other dissimilarity measures can also be interpreted via such tree models, however, only approximately. See Chapter 7 for more discussion. In short, this chapter deals with the characterization of generalized metric spaces, especially finite spaces represented by $n \times n$ dissimilarity matrices. It introduces useful tools for checking metric or Euclidean properties and for finding the dependencies in the family of $\ell_p$-distances. In particular, it discusses the issue of (approximate) embeddings into pseudo-Euclidean spaces, which can be carried out for any symmetric dissimilarity measure. In this way, it establishes the foundation for designing learning algorithms on spatial representations in Euclidean and pseudo-Euclidean spaces, as will be seen in Chapter 4. Also the process of data exploration is supported, either by visualization of two-dimensional spatial maps or by visualization of the organization of objects and their underlying structure as given by a tree model.
Chapter 4
Learning approaches
The learning and knowledge that we have, is, at the most, but little compared with that of which we are ignorant. PLATO
Although objects may have different initial or intermediate representations, in the form of sensor measurements, shapes, relational graphs or numerical features, we will ultimately describe them by pairwise dissimilarities. By now, the ground for learning methodologies for dissimilarity representations has been established. First of all, various spaces and the relations between them were characterized in Chapter 2. They prepare mathematical frameworks in which dissimilarities are explained. Next, Chapter 3 discussed basic properties and transformations of dissimilarity matrices as representations of finite generalized metric spaces, especially in the context of metric or Euclidean distances. Finally, also isometric and approximate embeddings into pseudo-Euclidean spaces were presented. General statistical learning aspects for vectorial representations are briefly summarized in Sec. 4.1. Statistical learning is described to establish the context in which learning from dissimilarities will take place. Section 4.2 formally introduces dissimilarity representations and explains their role in unifying the statistical and structural approaches to learning. Also an extension of dissimilarity representations is mentioned, based on the 'true' inductive learning, as illuminated in [Goldfarb, 1990; Goldfarb and Golubitsky, 2001; Goldfarb et al., 1995, 2004]. Next, three main dissimilarity-based learning approaches are presented. They refer to three interpretations of such representations in some spaces, for which particular statistical learning methodologies can be adapted. In the first approach, the dissimilarity values are interpreted directly, hence they are characterized in (pre)topological spaces. The second approach focuses on dissimilarity spaces in which each dimension corresponds to a dissimilarity to a chosen object. The third approach finds a spatial representation, i.e. an embedded (pseudo-)Euclidean configuration such that the dissimilarities are preserved as well as possible.
More details on these methodologies can be found in Secs. 4.3-4.5. This chapter ends with additional remarks on generalized kernels, as well as some insights into the connections between dissimilarity spaces and the underlying pseudo-Euclidean spaces, as given in Sec. 4.6. The purpose of this chapter is not only educational; it also presents the ideas behind the learning from dissimilarity representations and explains the basic methods. Although this material relies on our publications [Pekalska and Duin, 2002a; Pekalska et al., 2002a,b, 2004a], there are many new insights and observations presented here. Also new is the perspective from which it is discussed.
4.1 Traditional learning
Learning from examples is the process of discovering, distinguishing, detecting or describing patterns present in the data. It relies on both extraction and representation of information from the measurements collected in order to understand the process (phenomenon) that created them. The result of learning is that the knowledge already captured in mathematical terms is used to describe the present dependencies such that the relations between patterns are better understood or used for generalization. The latter means that a certain concept, e.g. that of a class, is formalized such that it can be applied to unseen examples of the same domain, inducing new information, e.g. of a class label. In this process, new data objects should obey the same deduction procedure and follow the same reasoning as the original examples. Note that the word 'pattern' refers to both a property or a characteristic of an individual object (i.e. its structural or mathematical representation) and a property of the entire set of objects given by their characteristics.
4.1.1 Data bias and model bias
Pattern recognition is usually concerned with the learning of a concept from a set of examples. Here, a concept is a general notion of an entity serving to designate a class of instances or another type of relations. More practically, an abstract or real set of all possible examples of the concept to be learned is a domain. For instance, if one wants to learn the concept of a Dutch tulip (in fact of a tulip class), then this domain consists of all types of tulips ever grown in the Netherlands. So a domain is a complete representation of the
concept considered. In practical applications, domains cannot be studied in their entirety due to their complexity, the costs of the collection process, the physical limitations of both measuring and storing devices, and the measurement costs. Consequently, domains are sampled. This means that only some examples are provided to represent a domain, and, as a result, only a limited amount of data is available for learning purposes. So, data represent information and knowledge available for a particular domain. A (concept of a) class¹ is represented by a finite collection of instances, but it is not yet described by this. The description of a class has to be based on the description of each single instance in the measurement process, where each instance is characterized by a set of measurements and additional knowledge about the class. Measurements in general refer to the outputs of measuring tools, algorithms, or procedures, and they can be performed directly on objects or inferred from raw measurements. Raw measurements refer to the raw outputs of sensors or devices which record signals, images, hyper-spectral images etc. All such outputs can be used for a definition of relational descriptions, features or a proximity measure. On the other hand, an abstract domain might be represented by some example structures, order in the data or inference rules, provided from outside. In such cases, the standard measurement process may not play a direct role; instead, the support is given by structural representations. In those cases, it is often very difficult to define numerical features. Still, a proximity measure can usually be constructed. Data introduce a bias ('a systematic error introduced in sampling or testing by selecting or encouraging one outcome over others' [Webster dictionary]) of the domain we wish to learn. We have a bias with respect to the chosen representation and to the chosen model, such as a learning approach. The latter is caused by a dissonance between the learning procedure imposed on the data and the validity of the assumptions. Such a model bias is related to some error measuring the discrepancy between the assumed and learned values, so it is related to a bias of an estimator of an ideal model; see also Sec. 4.1.2. Data bias refers to both domain and data description. The first one, a sampling bias, is caused by assuming that data examples are representative for the domain. Since it is often impossible to supply instances describing the whole domain variety, a finite sample gives
¹A class is either a natural category, i.e. present in reality, like a class of tomatoes or mugs, or an abstract category consisting of objects or instances sharing common properties considered for the application's need, e.g. articles on sport, human silhouettes, people with a particular disease, etc.
rise to a sampling bias. The representation bias results from a selection of characteristic features, a proximity measure or a structural representation. Taking into account the efficiency of both data, representation and learning algorithms, as well as the resolution of measuring devices, it is impossible to consider an infinite set of features, an infinitesimally precise proximity measure or complex and detailed structural information. The necessary simplification or redundancy of the data representation introduces a representation bias. Data bias has important implications for the learning algorithms. It strongly contributes to the model bias; if the data examples are a poor representation of the domain, then the selected model, optimized by using the given examples, does riot describe the reality well. Data are well described if siniilar objects are close in their representations (e.g. if two similar objects are represented by two vectors which lie close together in a vector space), the so-called compactness hypothesis [Arkadiev and Braverman, 1964; Duin, 1999; Duin and Pekalska, 20011 and if two close descriptions correspond t o the objects that resemble each other, the so-called true representation. The basic principle is that the objects do not posses random descriptions; on the contrary, the neighbors of a particular object in the representation are similar to it in reality. Note that true representation implies that distinct objects lie far away (with respect to the chosen dissimilarity) in the representation. This means that the measurements contain sufficient information not only to support the resembling objects, but also to tell them apart from distinct objects. Moreover, data are well sampled if all instances in the domain are somehow described in data or, in other words, if adding new instances will not change this description significantly. Given a lot of data relevant to the problem at hand (actually with respect to a chosen model; see also footnote 4 on page 155), the learning task becomes relatively easy (in the methodological sense; the computational cost may increase), since the data bias becomes smaller. Consequently, if data are representative and well sampled, there is enough support and information in the data to model their functional dependencies, hence the model bias becomes smaller as well. Only such data will assure a good generalization of a learning algorithm. The problematic situations are those where the amount of data is small or where there are many unlabeled examples (sometimes the collection of data can be automated, while the labeling process is slow and expensive since it should be done by humans). Conventionally, data are described by features. For instance, the class (concept) of apples (domain) can be represented by features (obtained in
the measurement process) such as weight, size and color. A feature-based representation of a concept relies on selecting $n$ instances to represent the domain and on defining, say, $m$ features for the description. We can think of vertical and horizontal samplings, where these samplings coincide with the choice of objects and features, respectively. Such data are often expressed as an $n \times m$ matrix $A$, where $A$ is interpreted as a configuration of $n$ points (feature vectors) in an $m$-dimensional feature space $\mathbb{R}^m$, usually Euclidean. This representation is mainly used in statistical pattern recognition [Fukunaga, 1990; Duda et al., 2001], where it is assumed that the distribution of pattern classes can be derived from a representative set of such points (a training set) with sufficient accuracy. This often requires (strict) additional assumptions on the distribution characteristics.
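As a small illustration of the two kinds of representation discussed in this book (a sketch of our own, with invented toy numbers): the same objects can be given either as an $n \times m$ feature matrix, or, as pursued here, as an $n \times n$ dissimilarity matrix derived from the features or measured directly instead of them.

```python
import numpy as np

# Toy feature-based representation: n = 4 apples, m = 3 features
# (weight in grams, size in cm, redness on a 0-1 scale).
A = np.array([[150.0, 7.0, 0.8],
              [160.0, 7.5, 0.9],
              [120.0, 6.0, 0.2],
              [125.0, 6.2, 0.3]])

# Dissimilarity representation: n x n matrix of pairwise Euclidean distances.
D = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
print(D.shape)        # (4, 4); D is symmetric with a zero diagonal
```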
4.1.2 Statistical learning
Statistical learning is usually understood as the process of determining an unknown dependency between inputs and outputs given a limited number of observations, i.e. training examples. A probabilistic framework is often considered for this task, as it is mathematically appealing for handling uncertainty. Input vectors $x \in X$, usually $X = \mathbb{R}^m$, are assumed to be drawn independently from a fixed, but unknown, probability density function $p(x)$. The functional dependency between outputs $y \in Y$ and inputs $x$ is given as a fixed conditional density $p(y|x)$, which is also unknown. Depending on the domain of $Y$, different learning problems can be presented. If $Y$ is discrete, then it is a classification problem, while if it is continuous, it is a regression problem. If no $Y$ is present, then the learning problem becomes a density estimation. The training examples $T_n = \{(x_i, y_i) : i = 1, \ldots, n\}$ are considered to be iid (independent and identically distributed) according to the joint probability density² $p(x, y) = p(x)\,p(y|x)$. The difficulty relies on determining the relationship between $X$ and $Y$, based on $T_n$ only. For a more elaborate introduction to statistical learning, see the books [Hastie et al., 2001; Cherkassky and Mulier, 1998; Vapnik, 1995, 1998] and [Schölkopf and Smola, 2002]. We are often interested in prediction and, therefore, in modeling the conditional probability of observing a particular $y$ given a specific $x$. By
²All these assumptions, although general, are in fact strong. They actually assume a fixed (stationary) distribution from which the examples are sampled. This is often violated in practical applications, e.g. when the data are collected in various conditions or even by differently calibrated sensors.
Bayes formula, one can write $p(y|x) = p(x|y)\,p(y)/p(x)$. Assuming that the quantity $p(y|x)$ can be computed, the most appealing approach for assigning an output to a new $x$ is the value of $y$ which yields the maximum a posteriori probability $p(y|x)$; see Appendix D.1. This is known as the theoretical optimal Bayes rule. In practice, since the true distributions are unknown, the Bayes optimal rule cannot be found. One, therefore, tries to estimate this ideal by a function $g(x)$ coming from a general hypothesis space of functions $G = \{g : X \rightarrow Y'\}$, where $Y'$ is e.g. $\mathbb{R}$, $\{0, 1\}$ or $\{-1, 1\}$, depending on the task. The goal of learning is then formulated as a selection of $g^* \in G$ which best approximates the outputs $y$, given a finite set of examples. To measure the discrepancy (hence define the best fit) between the estimated outcome $g(x)$ and the original output $y$ for a given $x$, a loss function $L : Y \times Y' \rightarrow [0, \infty)$ is needed. A single output $L(y, g(x))$, however, is not very informative about a particular function $g$. Rather, the overall expected loss should be used to infer about $g$. This is the true loss of the hypothesis $g$, given by the error or risk functional as:

$$\mathcal{E}(g) \;=\; \int L(y, g(x))\; p(x, y)\; dx\, dy. \qquad (4.1)$$
Ideally, the learning is a process of estimating $g^* \in G$ which minimizes the error $\mathcal{E}(g)$. This requires the integration over the complete probability distribution of all possible inputs $x$ and outputs $y$. Since $p(x, y)$ is unknown and the only information is a set of available training examples $T_n$, the learning problem is ill-posed. To make the learning task feasible, one usually considers a specified class of functions $\{g_\alpha\}$ (e.g. polynomials), where $\alpha$ are parameters indexing the functions (e.g. polynomial degrees). Then $g_{\alpha^*} \in G$ minimizes the error $\mathcal{E}(g_\alpha)$. Note that the true, optimal Bayes solution does not necessarily belong to $\{g_\alpha\}$. To tackle such a learning problem, an empirical error (or risk) is minimized. It is expressed as:

$$\mathcal{E}_{emp}(g_\alpha, T_n) \;=\; \frac{1}{n} \sum_{i=1}^{n} L(y_i, g_\alpha(x_i)). \qquad (4.2)$$
For a given finite training set $T_n$, there might be infinitely many functions minimizing the empirical error, since they need to behave identically only for the training examples. Therefore, by the selection of a class of functions $\{g_\alpha\}$, i.e. narrowing the scope of interest, the learning task is better formulated. Note, however, that this is purely a choice made to be able to tackle the learning problem, unless some other prior knowledge exists.
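A tiny sketch of Eq. (4.2) for the two most common loss functions (our own illustration; the data and the predictions of some hypothesis are invented):

```python
import numpy as np

def empirical_risk(y, y_pred, loss="zero_one"):
    """Empirical risk (4.2): the average loss over the training set."""
    y, y_pred = np.asarray(y), np.asarray(y_pred)
    if loss == "zero_one":                 # classification: I(y != g(x))
        return np.mean(y != y_pred)
    if loss == "squared":                  # regression: (y - g(x))^2
        return np.mean((y - y_pred) ** 2)
    raise ValueError(loss)

print(empirical_risk([1, 0, 1, 1], [1, 1, 1, 0]))          # 0.5
print(empirical_risk([0.2, 1.3], [0.0, 1.0], "squared"))   # 0.065
```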
Depending on the loss function, basic learning problems such as classification, regression, density estimation and clustering can be set up in this statistical framework. Since knowing $p(x, y)$ would allow one to solve any learning problem expressed by the minimization of the risk, the density estimation is the most general (hence most difficult) problem. Later, we will focus on predictive learning such as classification and regression.
Classification. A general multi-class classification problem can be decomposed into a number of two-class problems [Fukunaga, 1990]. Hence, a two-class problem is considered as the basic one. Assume a set of training examples $\{(x_i, y_i)\}_{i=1}^{n}$ with the corresponding labels $y_i \in \{0, 1\}$ (sometimes also $y_i \in \{-1, 1\}$). The hypothesis space becomes then a set of indicator functions. The most common loss function is then $L(y, g_\alpha(x)) = I(y \neq g_\alpha(x))$. The corresponding risk or the true error $E_T = \mathcal{E}(g_\alpha)$ denotes the probability of misclassification (given equal costs). The empirical error, also called the training or apparent error, becomes then $E_A = \mathcal{E}_{emp}(g_\alpha, T_n) = \frac{1}{n}\sum_{i=1}^{n} I(y_i \neq g_\alpha(x_i))$. As a result, the learning can be simplified to finding a classifier $g_{\hat{\alpha}^*}(x)$ that minimizes the empirical error. Regression. Regression is based on estimating a functional dependence $f$ between inputs $x$ and outputs $y$ in the form of $y = f(x) + \varepsilon$, where $\varepsilon$ is such that $E(\varepsilon|x) = \int \varepsilon\, p(\varepsilon|x)\, d\varepsilon = 0$, i.e. random noise with a zero mean. $f$ is then seen as the expectation of the output conditional probability, $f(x) = \int y\, p(y|x)\, dy$. The risk measures the dissonance between the actual outputs and the expected predictions, with the common loss function being $L(y, g_\alpha(x)) = (y - g_\alpha(x))^2$. Under the assumption of a zero-mean noise and based on the fact that $y - g_\alpha(x) = y - f(x) + f(x) - g_\alpha(x)$, the risk Eq. (4.1) can be decomposed³ as a sum of two contributions, the noise variance and the approximation accuracy, as:
$$\mathcal{E}(g_\alpha) \;=\; \int (y - f(x))^2\, p(x, y)\; dx\, dy \;+\; \int (f(x) - g_\alpha(x))^2\, p(x)\; dx \;=\; e_n \;+\; \int (f(x) - g_\alpha(x))^2\, p(x)\; dx, \qquad (4.3)$$
where $e_n$ is a fixed value, since it does not depend on $g_\alpha$.
³Note that the cross-term $\int (y - f(x))(f(x) - g_\alpha(x))\, p(x, y)\, dx\, dy$ equals zero, since $\varepsilon$ is a random noise with a zero mean. Note also that the first term, $e_n = \int (y - f(x))^2\, p(x, y)\, dx\, dy$, is the variance of the noise.
So, learning can be
now stated as determining $g_{\alpha^*} \in G$ that best approximates the (unknown) $f$. The empirical risk with respect to the set of functions $\{g_\alpha\}$ is expressed as $\mathcal{E}_{emp}(g_\alpha, T_n) = \frac{1}{n}\sum_{i=1}^{n} (y_i - g_\alpha(x_i))^2$. Note also that the true error $\mathcal{E}(g_\alpha)$, averaged over training sets $T_n$, can be expressed as

$$E_n[\mathcal{E}(g_\alpha)] \;=\; e_n \;+\; \int \bigl(f(x) - E_n[g_\alpha(x)]\bigr)^2 p(x)\, dx \;+\; \int E_n\bigl[(g_\alpha(x) - E_n[g_\alpha(x)])^2\bigr]\, p(x)\, dx, \qquad (4.4)$$
where the last two contributions are due to the squared bias and the variance, respectively. Since classification can be considered as a special case of regression (with $y$ being discrete), the decomposition above is an important phenomenon.
Density estimation. The process of estimating densities is concerned with the input vectors $T = \{x_i\}_{i=1}^{n}$ only. The output represents the density itself. The loss function is usually given as $L(g(x)) = g(x)$ or $L(g(x)) = -\log g(x)$, yielding in the latter case the following risk: $\mathcal{E}(g) = -\int \log g(x)\, p(x)\, dx$, which in a finite case simplifies to $\mathcal{E}_{emp}(g_\alpha, T) = -\frac{1}{n}\sum_{i=1}^{n} \log g_\alpha(x_i)$.
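For instance, fitting a single Gaussian by minimizing this empirical risk reduces to the usual maximum-likelihood estimates of its mean and standard deviation; a minimal sketch (our own, with invented data):

```python
import numpy as np

x = np.array([1.2, 0.7, 1.9, 1.4, 0.9])        # invented 1D sample
mu, sigma = x.mean(), x.std()                   # ML estimates minimize the risk

def neg_log_likelihood(x, mu, sigma):
    """Empirical risk with L(g(x)) = -log g(x) for a Gaussian density g."""
    g = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.mean(np.log(g))

print(neg_log_likelihood(x, mu, sigma))
```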
4.1.3 Inductive principles
Predictive learning such as regression or classification consists of two steps: thc process of learning, i.e. the estimation of an (unknown) dependence between inputs and outputs, and tlie process of generalization, i.e. prediction of outcomes for newly coming examples based on the discovered concept. In practice, the first step is closely related to induction (‘inference of a generalized coriclusion from particular instances’ [Webster dictionary, site]), while the second step rcfers to deduction (‘derivation of a conclusion by reasoning’ [Webstcr dictionary, site]). In a general form, however, the deduction is much simplified; it involves only the computation of outcomes based on thc derived parameters in the learning stage. Probably, that is why such a process is called an inductive learning paradigm. The minimization of the expected risk relies on this principle. Hence, the entire problem is put in the framework of a global function estimation. Another approach is based on estimating the risk functional by using the training set at the moment, when a new example appears. This requires a reformulation of the learning problem such that additional unlabeled examples are treated in the context of the given training set. So, the dependence between the training data and test examples is estimated when required and may differ from instance to instance. This approach is called trunsductive
Figure 4.1 Inductive (left) and transductive (right) learning paradigms. A priori assumptions are here understood in terms of the specified assumptions on a set of learning algorithms and related parameters.
inference [Vapnik, 19981. Such an inference might be applied locally (the unknown examples are related to the objects in local neighborhoods), but not necessarily. If it is applied globally, the comput)ational burden might become high (under the inductive paradigm, only one final funct,ional dependence is estimated). Examples of this approach are the cases of learning from partially labeled sets or designing linear classifiers in local neighborhoods. This inference can also be reduced to the deduction step only, like in the k-NN rule for a fixed k . A schematic illustration of the inductive and transductive learning principles is shown in Fig. 4.1 Statistical learning theory is mostly developed for inductive principles. This is somewhat surprising, since in a general context of inference from srnall sarriple size training data4, only restricted information is available. Vapnik [Vapnik, 19981 formulated the main learning principle as: ‘‘lfyou posses a restricted amount of information f o r solving some problem,, t r y to solve the problem directly and never solve a m,ore general problem, as an, intermediate step. It is possible that the available information i s suficientf o r a direct solution, but’it is i n s u f i c i e n t f o r solving a more general intermediate problem. ” Following this rule, we conclude that, not only a predictive learning problem should be approached directly (instead of e.g. estimating the probability density function first as usual parametric methods do), but, more importantly, that it should be solved only f o r the points of interest, instead of estimating a single function globally at the entire domain. This 41n the classical sense, a small sample size problem is understood as an inference from n data vectors for an estimation of M free parameters of the approximating function, wherc n / M is small, e.g. 2 or even 5 1. Vapnik defines it with respect to a class of approximating functions of the VC dimension h,, as a problem, where n / h v c is small, such as 10 [Vapnik, 19981. See Sec. 4.1.3.2 for more details.
Figure 4.2 Overtraining. (a) Training set: a zero-error classifier. (b) Test set: the classifier is overtrained, since it yields a high error on an independent test set and its boundary is too complex for such a small training set.
means that the learning problem is solved ‘at the spot’. Consequently, the application of this principle naturally leads to a transductive learning. This type of learning has not yet evoked sufficient interest of researches, probably due to the expected computational cost in a testing stage. Yet, it becomes one of the open issues for further research. We will focus on inductive learning methods. These provide a general prescription for handling the data vectors and the assumptions on the approximating functions in the learning process. Here, the empirical risk minimization and a few paradigms based on the Occam’s razor principle are considered within the framework of inductive principles.
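As a concrete instance of an inference reduced to the deduction step only, the k-NN rule mentioned above can be written directly on a dissimilarity representation; a minimal 1-NN sketch (our own illustration, with invented toy values):

```python
import numpy as np

def nn_classify(D_test_train, train_labels):
    """1-NN rule: each test object gets the label of its nearest training object.

    D_test_train holds the dissimilarities between test objects (rows) and
    training objects (columns); no feature representation is needed.
    """
    nearest = np.argmin(D_test_train, axis=1)
    return np.asarray(train_labels)[nearest]

# invented toy dissimilarities: 2 test objects vs. 3 training objects
D = np.array([[0.3, 1.2, 0.9],
              [1.5, 0.2, 0.8]])
print(nn_classify(D, ["apple", "pear", "apple"]))   # ['apple' 'pear']
```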
4.1.3.1 Empirical risk minimization (ERM)
In this paradigm, a function $g_{\hat{\alpha}^*}$ is sought such that the empirical error, i.e. the training error $E_A = \mathcal{E}_{emp}(g_\alpha, T_n) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, g_\alpha(x_i))$, is minimized. The training error is a rough (biased) approximation of the (unknown) true error. We assume that an optimal function $g^*$ (hence the Bayes rule) exists in $G$, although the class of functions $\{g_\alpha\} \subset G$ might not contain it. In classification, the actual risk $\mathcal{E}(g^*)$ is the minimal risk ever possible, called also the Bayes error. The Bayes error is the error of a theoretical optimal Bayes rule, which assigns a vector $x$ to the class yielding the highest posterior probability $p(y|x)$. In practice, since the true distributions are unknown, the Bayes error cannot be computed. Depending on the loss function and a set of chosen $g_\alpha$, the ERM can be employed in a number of ways, e.g. based on the maximum likelihood estimators, Appendix D.1, or linear regression. Usually, this principle is used in the parametric methods, where a model is specified first (e.g. a normal density-based linear classifier assuming normal distributions) and
then the parameters are estimated from the training data. This works well, provided that t,he number of training examples is large wit,h respect, to the model complexity (Vapnik's approach: related to the VC dimension, see Sec. 4.1.3.2; classical approach: related to the number of free parameters, which agrees with the Vapnik's approach for polynomial chssifiers). Such models do not have enough flexibility, hence they can result in a largc bias (see below). The difficult'y of applying the ERM for limited t,raining chta is tha.t, it does not yet guarantee a small expected risk, i.e. the true error. In the classification case it rneans that a small error on the training set docs riot imply a small error on an independent test set. The phenoinerion that y&, yields a small empirical risk, but still shows a large true error on an independent test set is called o.ciertraining or ouerfitting . A siifficiently flexible function can perfectly fit the training data, completely adapting to all the information available there, reaching a zero empirical error. As a result, this function can describe structures (due to the noise) which in fact are not present in the data; see Fig. 4.2. Hence, to avoid overtraining for fixed and srriall sample sizes, simple models are preferred to the complex ones. The problem is much more pronounced when the riiirnber of features, hence the dimension m, is very large with respect to the nuinbcr of data vectors. ASSUKIEa fixed number of objects. From a classical point of view, adding of new features may give worse results on an independent test set. This is caused by a poor estimation of the function parameters due to insufficient amount of the data vectors, and called the m r s e of dimensionality (Jain and Chandrasekaran, 1987; Jain et al., 20001. See also Fig. 4.3. A number of solutions is proposed to treat the curse of dimensionality, such as feature selection [Devijver and Kittler, 19821 or feature extraction [Duda et al., 20011 techniques. The first wies find the hest few features, while the latter construct new features functionally depending on the old ones, e.g. as their linear combination. Such procedures still might not be sufficient to guarantee a good generalization for complex functions.
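The overtraining effect of Fig. 4.2 is easy to reproduce numerically. The sketch below (our own, with synthetic data) fits polynomials of increasing degree by ERM with the squared loss: the training error keeps decreasing, while the error on an independent test set eventually grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Synthetic regression data: a sine with additive Gaussian noise."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)

x_tr, y_tr = sample(15)          # small training set
x_te, y_te = sample(200)         # independent test set

for degree in (1, 3, 9, 14):
    coef = np.polyfit(x_tr, y_tr, degree)               # ERM for the squared loss
    err_tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    err_te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    print(degree, round(err_tr, 3), round(err_te, 3))    # test error blows up
```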
Bias-variance dilemma. The empirical risk depends on training examples, hence different training sets will yield different models $g_{\hat{\alpha}^*}(T_n)$. Consequently, the loss function is also a function of a training set. This dependency can be removed by averaging over training sets of a fixed size. Then the expected empirical risk with respect to all the training sets of cardinality $n$ becomes $E_n[\mathcal{E}_{emp}(g_\alpha, T_n)]$, where $E_n[\cdot]$ denotes the expectation. In the case of regression, there is a clear decomposition of the latter
Figure 4.3 Curse of dimensionality: the error as a function of the number of features $m$ (or parameters $M$), for different numbers of training examples.
quantity into a (squared) bias term⁵, measuring 'the accuracy or a quality of the match' of the learning algorithm to the problem [Duda et al., 2001], and a variance term, measuring 'the precision or specificity of the match' [Duda et al., 2001]. Additionally, there is an irreducible term $e_0$, independent of the training sets, as derived from Eq. (4.3) by using a summation instead of an integral. Hence, we have that $E_n[\mathcal{E}_{emp}(g_\alpha, T_n)] = e_0 + \frac{1}{n}\sum_i \bigl(E_n[g_\alpha(x_i)] - f(x_i)\bigr)^2 + \frac{1}{n}\sum_i E_n\bigl[(g_\alpha(x_i) - E_n[g_\alpha(x_i)])^2\bigr]$. This decomposition indicates that there exists a bias-variance trade-off, which is a fundamental problem while fitting a model to the data [Geman et al., 1992]. The practical implication of such a trade-off is that a flexible function $g_\alpha$, i.e. a function which is able to model the irregularities well, will have a high variance, since it will tend to fit the desired outputs well (yielding a smaller bias). Consequently, it will vary dramatically between various training sets. Conversely, an inflexible model will tend to behave similarly with respect to the training sets, yielding a small variance, but its inflexibility might cause a high bias [Hand, 1997]. A bias-variance decomposition becomes more complicated for the zero-one loss function in the classification case. Although it is possible to extend the reasoning behind the square loss to a classification problem by assuming that the interaction between the bias and the variance is multiplicative [Geman et al., 1992; Duda et al., 2001], there is no clear interpretation. A recent unified bias-variance decomposition is proposed in [Domingos, 2000b], of which the zero-one loss is a special case. There, the variance contribution is additive for unbiased examples (i.e. examples wrongly classified) and subtractive, otherwise. This means that the zero-one loss allows for a larger tolerance of a learning algorithm with respect to variance than in the
⁵Here, bias is understood as the bias of an estimator. If $\hat{\theta}$ is a random variable serving as an estimator of $\theta$, then $\hat{\theta}$ is biased if the bias $b = E[\hat{\theta}] - \theta$, where $E[\cdot]$ is the expectation, is non-zero.
Figure 4.4 Consistency of the empirical risk minimization principle (true error and expected risk versus the number of training examples n).
case of'the square loss. This follows from the offset contribution to the averaged loss (i.e. empirical error) by the biased examples. This explanation is logical, since in the end, the classification problem directly focuses on a proper assignment of objects to classes and not on a proper estimation of probability functions. See [Domingos, 2000b,a] for details. Another general framework for the additive bias-variance decomposition with different loss functions is proposed in [James, 20031. Additional insights can be found in [Heskes, 1998; Hansen and Heskes, 20001.
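The decomposition can also be estimated empirically by resampling training sets. The small simulation sketch below (our own, with synthetic data) approximates the squared bias and the variance of a polynomial estimator at fixed evaluation points.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)              # true regression function
x_eval = np.linspace(0, 1, 50)                   # points at which we evaluate

def fit_once(degree, n=20):
    x = rng.uniform(0, 1, n)
    y = f(x) + 0.3 * rng.normal(size=n)          # one noisy training set T_n
    return np.polyval(np.polyfit(x, y, degree), x_eval)

for degree in (1, 9):
    preds = np.array([fit_once(degree) for _ in range(200)])   # many T_n
    mean_pred = preds.mean(axis=0)               # estimate of E_n[g_alpha(x)]
    bias2 = np.mean((mean_pred - f(x_eval)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(degree, round(bias2, 3), round(variance, 3))
# the flexible model (degree 9) shows a smaller bias but a larger variance
```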
Consistency. To assure that a small empirical error guarantees a small true error, a consistency between the true and empirical risks is needed. For a fixed function $g_\alpha$, the empirical error, Eq. (4.2), will converge to the true risk $\mathcal{E}(g_\alpha)$, Eq. (4.1), by the law of large numbers. But this is not enough, since it should hold for any $g_\alpha$. Let $g_{\alpha^*}$ minimize the true error, i.e. $g_{\alpha^*} = \arg\min_{g_\alpha \in G} \mathcal{E}(g_\alpha)$, and let $g_{\hat{\alpha}^*}$ minimize the empirical risk (hence it depends on $T_n$). Then the consistency of the ERM principle requires that $\lim_{n\rightarrow\infty} \mathcal{E}(g_{\hat{\alpha}^*}) = \lim_{n\rightarrow\infty} \mathcal{E}_{emp}(g_{\hat{\alpha}^*}, T_n) = \mathcal{E}(g_{\alpha^*})$ holds in probability. This requires a one-sided uniform convergence of the empirical risk to the actual risk in probability [Vapnik, 1998]:
Theorem 4.1 (Vapnik and Chervonenkis) The necessary and sufficient condition for the consistency of the empirical risk to the actual risk is the one-sided uniform convergence in probability:

$$\lim_{n\rightarrow\infty} P\Bigl(\sup_{g_\alpha \in G}\, \bigl(\mathcal{E}(g_\alpha) - \mathcal{E}_{emp}(g_\alpha, T_n)\bigr) > \epsilon\Bigr) \;=\; 0 \quad \text{for all } \epsilon > 0. \qquad (4.5)$$
This is illustrated in Fig. 4.4. Note that the asymptotic error might differ from the Bayes error.
4.1.3.2 Principles based on Occam's razor
Statistical approaches have been developed to assure the convergence (4.5). These often rely on the Occam's razor principle. Assume a learning problem and a set of functions $\{g_\alpha\}$, depending on the parameters $\alpha$, analyzed to find a solution. The learning problem is now complex, since it relies on the estimation of both the model structure or complexity (e.g. the degree of a polynomial), called model selection, and the parameters (coefficients) in some optimization procedure. Such methods are put in paradigms more general than the ERM. It is assumed that the best prediction is achieved for a model of the right complexity, found by applying the Occam's razor principle. This principle states that one should not presume more things than the required minimum; in the selection process, among otherwise equivalent models, it advocates to choose the simplest one. The Occam's razor principle can be implemented in a number of ways, taking into account that there is a trade-off between the model complexity (e.g. the number of free parameters) and the model fit to the training data. The most typical examples are: structural risk minimization, the regularization principle, Bayesian inference and minimum description length. We will focus on the first two principles.
Structural Risk Minimization (SRM). The approximating functions are ordered according to their complexity (like ordering polynomials by the degrce) such that a nested structure is formed. The complexity of functions linear in parameters is related to the number of parameters. I Kgeneral, ~ it is cstirnated by the so-called Vapnik-Chervonenkis (VC) dimension h,, [Vapnik, 19981, which describes the capacity of a set of functions { g a } EG. In case of a binary classification, h,, is equivalent to the maximal number of points N which can be separated into two classes in all 2N ways by using functions from the considered set {gCY}.It means that for each possible labeling of N points into two classes, there exists a function from { g a } which takes 1 for examples coming from one class and -1 (or 0) for examples from tlie other class. An analytic upper-bound based on the VC dimension is provided by Vapnik [Vapnik, 19981 to estimate the expected risk. Given n training points, with the probability at least 1-q, the bound below remains t,rue:
$$\mathcal{E}(g_\alpha) \;\le\; \mathcal{E}_{emp}(g_\alpha, T_n) \;+\; \sqrt{\frac{h_{vc}\bigl(\ln\tfrac{2n}{h_{vc}} + 1\bigr) - \ln\tfrac{\eta}{4}}{n}}.$$
The estimate above is used for the model selection of the optimal complexity in the following way. For n training examples, the expected risk is controlled by two quantities: the cnipirical risk, which depends on the choseii function for particular Q and the VC dimension h,,, of the considered set of functions. Therefore, in order to control h,,, the approximating functions are ordered according to their complexity such that if G k = { g o : Q E A k } , where Ak: is a set of parameters, and GI c G2 c G3 c . . ., the corresponding VC dimensions fulfill h$ 5 hzz 5 h!: 5 . . .. Tlie SRM principle chooscs the function from a subset G k for which the bound yields minimum. Note that the bound derivation is based on the worst-case scenario, since the VC dimension considers all possible labelings of an arbitrary configuration of points. The importance of this bound, however, is that it guarantees the uniform convergence of €e7rLp to the actual risk, Eq. (4.5), for a finite h,, (which is a necessary and sufficient condition) [Vapnik, 1998; Evgeniou et al., 20001 and for thc indicator functions (hence a classification problem). It might not be true for other functions [Evgeniou et al., 20001. Regularization principle. This principle assumes a flexible set of approximating functions { g o } , but the restriction in the solution result,s from an additional term capturing the complexity of the function y,. Comequently, a penalized risk is minimized:
where 4 is a rioiiriegative functional and the nonnegative A, indepertderit of the training data, controls the strength of the regularization. For X = 0: the penalized risk reduces to the empirical risk, while for a large A: a simple solution is obtained, mostly ignoring the training examples. Hence, the model estimate is described as a trade-off between fitting the data and a priori knowledge on the function’s complexity (regularization term). The q5 functional can be selected in many ways. The simplest method counts the number of free parameters in the function, while a more sophisticated method uses the l2 norm of the parameters cr or the curvature estimator of g a . See also [Girosi, 19981 for the relation between the regularization principle and the SRM. In statistical learning, data are assumed to be represented by vectors in bounded regions in a feature space, so the learned function slioiild change smoothly over the space, avoiding high oscillations. Tlie smoother the function? the lower its complexity. So, functions of lower coniplexity are preferred for finite sample sizes (the regularization t,erm is niea,nt,to penal-
Figure 4.5 Complexity of classifiers vs. cardinality of the training set: the true error as a function of the number of training examples n for classifiers of low and high complexity; the asymptotic errors and the Bayes error are indicated.
ize complex functions more). In the limit (when the number of training examples grows to infinity) complex functions offer better solutions (the bias is small). In the case of classification, this phenomenon is illustrated in Fig. 4.5. Bayesian inference. This principle assumes that a model $M$, such as a function $g_\alpha$, is adequately selected to describe the problem. The vector of parameters $\alpha$ of the model $M$ is assumed to be drawn from a theoretical parameter distribution, which means that $\alpha$ is considered as a random variable. A prior distribution over this unknown $\alpha$ should then be specified to capture the beliefs about the problem before seeing the data. For instance, one may require that $\alpha$ is smooth and takes reasonably small values. In practice, a Gaussian prior distribution is frequently used. In the simplest case it can be chosen to be a spherical Gaussian, for which the prior distribution is $p(\alpha) = N(0, \sigma^2 I)$. The hyperparameter $\sigma^2$ can now be treated again as a random variable, or assumed fixed. The Bayesian inference is then based on the Bayes formula for updating the priors given the evidence from the data as $p(M|\mathrm{data}) = \frac{p(\mathrm{data}|M)\, p(M)}{p(\mathrm{data})}$, where $p(M) = p(\alpha)$, $p(\mathrm{data})$ is the probability of observing the data, $p(\mathrm{data}|M)$ is the likelihood, i.e. the probability that the data are generated by the model $M$, and $p(M|\mathrm{data})$ is the posterior probability of a model $M$ given the data. Hence, one tries to find a complete density function for the vector of parameters $\alpha$. As a result, one never actually estimates or chooses any value for $\alpha$; all possible parameter values, although to a different degree, play some role. A preference for simpler models is encoded by encouraging particular prior distributions. More details can be found in [Jensen, 1996; Robert, 2001; MacKay, 2003].
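A small sketch tying the two preceding paradigms together (our own illustration, with invented data): minimizing a penalized risk with the squared loss and the penalty $\phi(g_\alpha) = \|\alpha\|^2$, i.e. ridge regression, corresponds to the MAP solution under the spherical Gaussian prior $p(\alpha) = N(0, \sigma^2 I)$ mentioned above (for Gaussian noise, with $\lambda$ determined by the noise and prior variances).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize the penalized risk  (1/n)||y - X a||^2 + lam * ||a||^2."""
    n, m = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(m), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=30)
print(ridge_fit(X, y, lam=0.0))    # plain empirical risk minimization
print(ridge_fit(X, y, lam=1.0))    # the penalty shrinks the coefficients
```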
Minimum description length (MDL). This principle is based on the information-theoretic basis and the concept of algorithmic coniplexity characterizing t,he randomness of t,he datja [MacKay, 20031. Its basic insight is that statistical learning is related to firiding regularities in the data, which can be used to compress the data, i.e. describe them by using fewer symbols than originally needed. Learning is, therefore, related to data compression [Griinwald, 20051. Roughly speaking, given a number of hypotheses (models) and a data set, the hypothesis H is chosen which compresses the data most. Models are assumed to describe tlie regularities in the data and should contain a few easily encoded parameters. N is identified as the model which minirriizes L ( H ) +L(datulH), where L ( H ) is the shortest binary code describing H and L(data1H)is the shortest binary code describing D when encoded with the help of H . For an introduction to MDL, the reader is referred to [Griinwald, 20051. Relations between the MDL arid the SRM can be found in [Vapnik, 19951. 4.1.4
Why is the statistical approach not good enough for learning from objects?
Even after such a brief review on statistical learning theory, the reader can be convinced that. the methods developed in a proper framework guararitee good solutions. The a n s w e r is afirmative, provided t h a t t h e prohlems to be solved are oiiginally generated a s p o i n t s in, a suitable vector spacc, such as Euclidean. There i s a missing link between a collection of real osr abstract objects to be learned from and th,eir proper represen,tat%on,sto be used as a basis in a learning paradigm. In the statistical approach (except for the MDL priiiciple), the description of objects is often drarnatically reduced to points in a vector space. The analysis starts at this level. often neglecting the (one-way) correspondence between the objects and the points. Also such a simplification of an object to its riunierical description (i.e. without any structural information) precludes any inverse mapping to be able to retrieve the object itself (this is partly possible in tlie structural representation of objects, e.g. from the skeleton of an object in an irnsge, its shape can be retrieved well enough). The objects are simply treated as if they are already generated as points in the space. Note that these assumptions are very strong. Since the connection between the points and the objects is forgotten, any learning in such a framework is in fact learning with respect to the assumed distributions realized by a sample in the space R". Hence, this learning is in a purely mathematical sense. Besides the
guarantees on providing good learning solutions, one should be concerned with the guarantees that the representation of objects by points enables to achieve that. For all these reasons, Goldfarb and his colleagues [Goldfarb, 1990; Goldfarb et al., 1995, 1992; Goldfarb and Deshpande, 1997; Goldfarb et nl., 2004; Goldfarb and Golubitsky, 20011 strongly oppose the statistical approach to learning, separated from the ‘real’ objects themselves. Real objects possess their internal structure, organization or ‘interconnectivity’, as observed in their shapes, for instance. This property can be reflected by the connectivity of neighboring samples in the sensory data such as an image. However, in the traditional feature space representation, all the continuity of an object, all the structure is lost [Chan and Goldfarb, 1992; Goldfarb, 1992; Goldfarb et al., 19951. The structure information may partially be encoded in some feature values, e.g. when features are defined as the responses of several image filters, but in the representation itself it is not available anymore. This also holds for the vectorial pixel representation of images, in which each pixel defines a separate dimension in a feature space. The complete image resides there, but the fact that some pixels are neighboring and others are remote is not expressed in the representation6. The Euclidean vector space assumes independent feature contributions and, therefore, it precludes the possibility of reflecting the structure of an image. The structure may be rediscovered to some extent by computing correlations between pixel-features from a set of images or by trying to find a low-dimensional manifold on which the set of images, represented as points, lies. Still, this is not the original structure. Moreover, the necessity of learning them in such a way is disputable if the primary structure of an image, reflected by the connectivity of neighboring pixels, is already given. Structural representations, on the other hand, are specified in terms of instance’s components and their interconnections. For a real object, it might be its structure. The structure, however, should be regular enough to be described by a relatively small number of primitives, i.e. fundamental structural element>s,such as strokes, corners or other shape elements. For instance, shapes can be represented as their skeletons, contours by their string rcpresentatioris, where each character in the string corresponds to some kind of a stroke. Also more abstract instances, such as articles in 6The fact that consecutive features correspond to neighboring pixels in an image cannot be used in the feature-based representation. Features can be permuted and, in a Euclidean space, the configuration of points (representing the images) is the same up to the rotation, while the structure of an image is completely lost by shuffling the pixels.
a database can be represented structurally. An article might be organized in a hierarchical way, expressing the fact that it is composed of a title, an introduction, body, conclusions and references. The body might be e.g. an interview, a comment, a letter, a speech or a general piece of writing. In turn, an interview can be conducted with a famous artist, actor, writer, scientist, etc. In this way, the detailed information on articles can be represented by trees. Other types of phenomena, e.g. the financial condition of a family, might be captured by graphs, expressing the relations between all the important factors, such as the incomes of the family members, mortgage, loans, the number of children and their ages, etc. In general, structural pattern recognition [Fu, 1982] assumes that there exists sufficient and suitably formulated knowledge to build a structural description of objects and classes. This knowledge is defined and encoded either explicitly by an expert or implicitly by a set of (training) examples. In order to relate new objects to the described classes, a (dis)similarity measure between objects and the structural descriptions of objects and/or classes is needed. As in the statistical approach, the demands here are strong: suitable knowledge should be available to build the structural model, and an informative dissimilarity measure should be defined between the model and real-world observations. Research in pattern recognition is meant to establish a link between object representations, derived from their (sensor) measurements or structural descriptions, and a learning algorithm; see also [Duin et al., 2002] for perspectives. In statistical learning theory, the bounds ensuring a good generalization (in classification: a small test error, given a small training error) are based on the notion of classifier complexity, which can be related to the VC dimension. For a binary classification problem, this notion is derived for the worst configuration of n (training) points, considering all 2^n labelings of them. This is a possible scenario to consider if the association with the (real or abstract) objects one started from is neglected. From the pattern recognition point of view, it is unrealistic to postulate a class of (similar) objects being described by arbitrarily labeled points. If this had been the case, one would have chosen another representation. Basically, the representation of objects is not accidental; it should be such that similar objects are close in their representations (of course, one should first define what closeness means). This is the compactness hypothesis [Arkadiev and Braverman, 1964; Duin, 1999]. Ideally, also the true representation hypothesis should hold, which is the reverse of the compactness hypothesis. It says that two close object representations
Figure 4.6 Proximity as unifying the statistical and structural approaches to learning (the diagram relates sensors, raw measurements and data to both routes). This work focuses on optimized dissimilarity representations.
correspond to objects that resemble each other. Note that in order to solve a real pattern recognition problem, such a subjective hypothesis is necessary to complement the available objective information. For that reason, the bounds derived by Vapnik are strongly over-pessimistic. In summary, the answer to the question of this section is: suitable representations of objects should be considered first before using or adapting the developed learning methodology. Such representations should preferably embody a priori knowledge on the class of objects as well as their possible structural information. There might be hybrid representations as well. Our work is concerned with an example representation based on the notion of (dis)similarity, thereby relying on the compactness hypothesis. It is pioneering, not directly in the methodology that follows, but in the statement of the learning problem and the resulting adaptations.

4.2 The role of dissimilarity representations
Pattern recognition relies on the description of regularities in observations of classes of objects. A class is a set of similar objects (e.g. sharing similar characteristics or commonalities). This implies that the notion of 'similarity' is more fundamental than that of a 'feature' or a 'class', since it is the similarity which groups objects together and, thereby, it should play a crucial role in the class constitution [Edelman and Duvdevani-Bar, 1997; Edelman et al., 1998; Edelman, 1999]. Such a proximity should preferably be modeled such that a class has an efficient and compact description. In applications, however, numerical features often come before proximity is taken into account. Using the notion of proximity (instead of features) as a primary concept renews the area of automatic learning in one of its
foundations, i.e. the representation of objects [Mottl et al., 2001a,b; Guérin-Dugué et al., 1999; Duin et al., 1998]. Conceptually, it is a novel approach, but other researchers are also conscious of the essential role that proximity plays in the class description. Examples can be found in [Bunke et al., 2001; Strehl, 2002; Mottl et al., 2001a,b; Jacobs et al., 2000; Edelman et al., 1998; Edelman, 1999; Goldfarb, 1984, 1990; Goldfarb et al., 2004; Goldfarb and Golubitsky, 2001]. Proximity measures have the capability to capture both the statistical and the structural information of patterns and, thereby, they form a natural bridge between these approaches. Recently, a universal distance measure was also proposed in the MDL learning framework, based on Kolmogorov complexity [Bennett et al., 1998; Li et al., 2003]. The distance as such is impossible to compute; it can, however, be approximated by the normalized compression distance [Cilibrasi and Vitányi, 2004, 2005]. This is an important fact, as any binary files or strings can be compared by a chosen compressor. This finding extends the universality of proximity measures as a basic and most general way of representing information for learning, as they can be derived within all learning principles. Proximity representations are then the starting point. Two main types of representations are considered: the ones which are learned and the ones which are optimized (or fixed). This work builds some foundations for the latter, called simply proximity representations. Learned representations remain an issue for further research. Proximity representations can be divided into relative and conceptual ones. In a relative representation, pairs of objects are related to each other to measure the proximity value between them. Consequently, each object is described by a set of proximities to other objects [Duin et al., 1998, 1999; Pekalska and Duin, 2002a; Pekalska et al., 2002b]. They may be defined on a feature-based (vectorial) representation, see Fig. 4.6, by using the distances between feature vectors, but also on a structural representation by the distances between graphs or other structural models, or directly on the raw data, e.g. by similarities between shapes in images. So, proximity representations are very general as they combine all types of approaches. Briefly, they describe the sampled domain in a relative way, based on the pairwise comparison of objects. Remember that an object is a general notion of a real object, any entity, process, phenomenon or any abstract instance, as long as one is able to compare them in a quantitative way. Proximity representations can be extended to depict a relation of one entity to a number of them or a relation of a model to the whole concept. Such representations are called conceptual. Examples are a similarity of an
Figure 4.7 Dissimilarity representation D(T,R) for a set T = {p_1, p_2, ..., p_7} and a representation set R = {p_1, p_2, p_3}. The representation objects are elements of the set T.
object to a (sampled) domain, such as a resemblance of a particular mug to a class of mugs, a similarity of a language to a group of European languages, the growth and development of a child relative to a model of development, or an image query for retrieving similar images in a process of redefining the query. Also, in the statistical sense, the posterior probabilities of an object x (or, in fact, of its feature-based representation) with respect to C classes form a conceptual similarity representation [P(x|class 1), ..., P(x|class C)]. Conceptual representations will appear in Chapter 8, where one-class classifiers are built based on the proximity of an object to a class, and in Chapter 10 in the context of classifier combining techniques. Now, we discuss relative representations, where our main focus is on dissimilarity representations.
Definition 4.1 (Dissimilarity representation) Assume a collection of objects R = {p_1, p_2, ..., p_n}, called a representation set, or a set of prototypes, and a dissimilarity7 measure d. The dissimilarity d is computed or derived from the objects directly, their sensor representations, string representations or other intermediate representations. To maintain generality, the notation d(p_i, p_j) is used, instead of d(f(p_i), f(p_j)), where f(p_i) corresponds to some possible intermediate representation of p_i. A dissimilarity representation of an object x is a set of dissimilarities between x and the objects of R, expressed as a vector D(x,R) = [d(x,p_1), d(x,p_2), ..., d(x,p_n)]. Consequently, for a collection of objects T, it extends to a dissimilarity matrix D(T,R). The idea of a representation set is that R is a relatively small set of representative objects for the domain considered.
7 If d is a similarity (or a proximity) measure, the corresponding representation is named accordingly. d is expected to capture the notion of closeness between two objects; however, it might be non-metric. In general, we require that d is nonnegative and obeys the reflexivity condition; see Def. 2.38.
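As a concrete illustration of Definition 4.1, the following minimal sketch builds D(T,R) from an arbitrary user-supplied dissimilarity function. The Euclidean measure used in the example is only a stand-in for whatever d is available (an edit distance, a shape-matching cost, etc.), and the data are hypothetical.

```python
import numpy as np

def dissimilarity_matrix(T, R, d):
    """Build D(T, R): entry (i, j) is d(T[i], R[j]).

    T : list of objects (in any intermediate representation accepted by d)
    R : list of representation objects (prototypes)
    d : callable d(x, p) returning a nonnegative dissimilarity
    """
    D = np.empty((len(T), len(R)))
    for i, x in enumerate(T):
        for j, p in enumerate(R):
            D[i, j] = d(x, p)
    return D

# Illustrative stand-in measure: Euclidean distance between feature vectors.
def euclidean(x, p):
    return float(np.linalg.norm(np.asarray(x) - np.asarray(p)))

T = [[0.0, 0.0], [1.0, 1.0], [3.0, 2.0]]
R = T[:2]                                  # R may be a subset of T or a distinct set
D = dissimilarity_matrix(T, R, euclidean)  # 3 x 2 matrix, zero where x equals p
```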
The simplest dissimilarity representation is then D(R,R), hence a square dissimilarity matrix with a zero diagonal. In general, R might be a subset of T (R ⊆ T) or they might be completely distinct sets. See also Fig. 4.7. Although there exists a resemblance between dissimilarity and feature-based representations in their matrix notation, the meaning is completely different; see Fig. 4.8 for an illustration. Dissimilarity representations will be used in (statistical) learning. An important question refers to the characteristics of informative dissimilarity measures. For instance, for a robust real-world object description, a measure should incorporate the necessary invariance, such as translation, rotation, scale and illumination invariance. Essentially, the measure should be such that the compactness hypothesis holds, i.e. representations of similar objects are similar. This means that a small variation of an object should impose only a small change of a proximity value; hence, the natural variation of objects of the same class should be captured there. Many dissimilarity measures are constructed by solving object matching problems, often defined in terms of the minimization of the mean square error or mean absolute error under affine transformations. This corresponds to the Euclidean or city block distance, which may not fully integrate the mentioned invariances. Such computed distances cannot directly capture the structural information of the objects since they are based on sums of (weighted) independent contributions, referring only to some object properties. On the other hand, non-Euclidean or non-metric measures have become more popular, e.g. for measuring shape distances [Dubuisson and Jain, 1994; Jain and Zongker, 1997] or others [Jacobs et al., 2000; Edelman et al., 1998; Guérin-Dugué et al., 1999; Santini and Jain, 1999]. Due to measurement noise in sensory data or other uncontrolled factors, there might be a necessity to improve the resulting dissimilarity measures. Noise can be reduced either in a pre-processing stage of the measurements or by the use of (non-)linear transformations, if the measures are just given or directly result from an earlier analysis. Such transformations may also be applied to impose a more compact class description, e.g. by making large distances smaller, or (if required) by imposing particular characteristics of distances, such as a Euclidean behavior. Some of such transformations are described in Sec. 3.1 and illustrated in the sketch below.
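The sketch applies two simple options for making large dissimilarities relatively smaller: a concave power transform and a bounded sigmoidal transform. The particular forms and parameter values are illustrative choices, not necessarily those of Sec. 3.1; both preserve the zero diagonal and the ordering of the dissimilarities.

```python
import numpy as np

def power_transform(D, p=0.5):
    """Concave transform d -> d**p with 0 < p < 1: large values are shrunk
    relative to small ones, which tends to compact the classes."""
    return np.power(D, p)

def sigmoid_transform(D, s=1.0):
    """Bounded transform d -> 2/(1 + exp(-d/s)) - 1, mapping [0, inf) to [0, 1).
    The scale s controls which dissimilarities count as 'large'."""
    return 2.0 / (1.0 + np.exp(-D / s)) - 1.0

D = np.array([[0.0, 1.0, 8.0],
              [1.0, 0.0, 6.0],
              [8.0, 6.0, 0.0]])
D_pow = power_transform(D, p=0.5)
D_sig = sigmoid_transform(D, s=2.0)
```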
Figure 4.8 Left: feature-based representation A(T,F). Right: dissimilarity representation D(T,R). Assume T = {t_1, ..., t_n} is a set of training objects and F = {f_1, ..., f_m} are the features. An object t_i is then represented as a vector of its feature values a(t_i, f_j), i.e. A(t_i, F) = [a(t_i, f_1), ..., a(t_i, f_m)]. The feature f_j is represented as a vector A(T, f_j). Hence, features correspond to dimensions in a (Euclidean) vector space, where objects become points. A dissimilarity representation describes the relations between objects; hence, additionally, a collection of representatives R = {p_1, ..., p_r} is needed. In the simplest case, R = T and, for a quasimetric measure, the resulting D(R,R) is a symmetric matrix with a zero diagonal. R might be a subset of T or a distinct set. An object t_i is represented by a vector of its dissimilarities d(t_i, p_j) to the objects from R, i.e. D(t_i, R) = [d(t_i, p_1), ..., d(t_i, p_r)]. D(T, p_j) = [d(t_1, p_j), ..., d(t_n, p_j)]^T refers to the dissimilarities to a particular object p_j. Any entry in A is a feature value for a particular object, while any entry in D is a dissimilarity value between two objects.
Since different proximity measures, defined in feature spaces, between strings or graphs, or on the sensor data, may reflect various aspects of the data characteristics as well as various types of expert knowledge, their combination might be beneficial. They may be considered either jointly or exclusively, or they may form a new proximity representation. The possibility of a combination makes a dissimilarity representation a more universal representation due to the increased flexibility. Now, a complex problem can be described by a number of dissimilarity representations between their different aspects or characteristics. For instance, an article in a database can be represented (in an intermediate stage) as a point in a vector space, where each feature corresponds to the frequency of a specified keyword, but also as a tree organization of a title, an introduction, body, conclusions and references, etc. Next, two different dissimilarity measures can be designed in the statistical and structural approaches, yielding two distinct representations, which can be further combined. Combining proximity measures (or their transformed versions) [Pekalska and Duin, 2001b] is closely related to the area of combining classifiers [Kittler et al., 1998]. Examples can be found in Chapter 10. Another fundamental question refers to the learning paradigms, especially those which deal either with non-metric or non-Euclidean measures. Basically, they take place in spaces, already introduced in Chapter 2. More precisely, they build on methods of linear algebra and functional analysis,
as well as statistical learning [Hastie et al., 2001; Duda et al., 2001; Vapnik, 1995], kernel methods [Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002; Vapnik, 1998] and approximate embeddings in pseudo-Euclidean and Kreĭn spaces [Bognár, 1974; Iohvidov et al., 1982; Duin and Pekalska, 2002; Goldfarb, 1985; Pekalska et al., 2002b; Pekalska and Duin, 2002c], as presented in Chapter 3. Further on, the usefulness of pretopological spaces, offering poorer axioms than Euclidean spaces, can be studied as well. The compactness hypothesis may serve as a basic demand for building pretopological spaces from more general neighborhood relations. The learning approaches are discussed in the next section.

4.2.1 Learned proximity representations
We realize that the developed framework for dissimilarity representations is only a first step in the direction of integrating the statistical and structural approaches, the problem of constructing an informative representation and proper learning methodologies. For dissimilarity representations, the measure itself is assumed to be given. To some extent it can be optimized with respect to a set of objects, but rather in a limited way, such as the determination of parameters. The next step is to investigate how dissimilarity measures can be learned from a set of examples. For this purpose, a learned representation can be considered, primarily based on the structure present in real objects [Goldfarb, 1992; Goldfarb and Golubitsky, 2001; Goldfarb et al., 2004]. Two possibilities are considered: to learn a relative representation or to learn a conceptual representation. The first focuses on defining a dissimilarity measure and a set of prototypes to which other objects will refer. Such a representation is further used for learning. The conceptual representation describes a dissimilarity of an object to a class. Such a dissimilarity is related to the costs (weights of transformations) of generating an object from a set of primitives (basic descriptors) in the context of other objects within a class, as well as objects outside this class. This is an attempt at a truly inductive way of learning [Goldfarb et al., 2004], where not only the essential transformations and their weights are learned, but the primitives as well. Such a formulation is close to the problem of one-class classification [Tax, 2001; Tax and Duin, 2004]. Another, simpler approach is to combine the strengths of the structural and statistical frameworks on the level of a relative representation. Assume that one deals with objects that possess an identifiable structure,
Figure 4.9 Classification in a dissimilarity space: an illustration. Interpret D(·,R) as a dissimilarity vector space and design a classifier there.
such as spectra, time signals, images or text documents. The first step is to define a small collection of fundamental structural detectors, yet general enough to be applicable in many problems, independent of specific expert knowledge of the application. This means that such detectors are defined for the given measurement domain, e.g. spectra or images. The useful subpatterns should then be identified by the detectors when applied to the consecutive measurement values. The inter-relationships between the subpatterns should be captured in a relational intermediate representation (e.g. by a graph or by a string). These would be the basis for the matching process and the derivation of the final dissimilarity. The learning then relies on learning proper weights (contributions) assigned to the identified subpatterns such that the specified dissimilarity is optimal for the discrimination between the classes. The simplest example is the edit distance between string descriptions of objects (sketched below); however, more general approaches need to be developed. Note that one may also consider statistical feature extractors (such as wavelets or Gabor filters), which work on the consecutive measurements, to be the building blocks of the learned dissimilarity. How to learn such measures is open for research.
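As a concrete instance of such a string-based measure, the sketch below computes the classic Levenshtein edit distance. The unit costs for insertion, deletion and substitution are the usual defaults; a learned variant would instead optimize these operation weights for class discrimination.

```python
def edit_distance(a, b, ins=1.0, dele=1.0, sub=1.0):
    """Levenshtein edit distance between strings a and b with
    (here fixed, potentially learnable) operation costs."""
    n, m = len(a), len(b)
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = [j * ins for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * dele] + [0.0] * m
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            curr[j] = min(prev[j] + dele,        # delete a[i-1]
                          curr[j - 1] + ins,     # insert b[j-1]
                          prev[j - 1] + cost)    # substitute or match
        prev = curr
    return prev[m]

d = edit_distance("stroke", "strokes")   # 1.0: one insertion
```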
4.2.2 Dissimilarity representations: learning
Statistical learning approaches can be adapted to dissimilarity representations. The added value of a dissimilarity-based framework does not always lie in the methodology that follows, but in the representation itself. As we discussed in the previous section, a dissimilarity representation can include both the statistical and the structural properties of the data. Hence, instead of a single representation of a problem, one may also consider either a complex representation, built from many dissimilarity representations, or a hybrid representation, where different aspects of the data are described
in various ways, such as by features, dissimilarities and inference rules. Then one needs to face the task of combining various types of expertise [MCS00, 2000; MCS02, 2002], as discussed in Chapter 10. In this section, we will concentrate on the classification task. Given a training set of K classes, a classifier tries to model a functional dependence between the data representation and the class indicators (labels) such that a new object is assigned to a specific class. The goal is the minimization of (the cost of) misclassification such that novel examples are labeled as correctly as possible. The problem to be faced in establishing classification methodologies for dissimilarities is that the measures used in practice are often non-Euclidean or even non-metric. Nevertheless, they may perform well, and it remains of practical interest to study their properties fundamentally. Since dissimilarity representations encode the information on object dissimilarities in a numerical way, the nature of learning is unavoidably numerical, which leads to the use of spaces. In general, we can distinguish three main approaches to dissimilarity representations, interpreted in the context of suitable spaces. They are briefly introduced below and treated more thoroughly in the subsequent chapters. Assume a dissimilarity representation D(T,R), where R is a representation set and T is a training set. The measure d is general, as our basic requirements are only nonnegativity and reflexivity, Def. 2.38. In the first, neighborhood-based approach, making use (directly or not) of pretopological spaces, the dissimilarities between the objects are interpreted directly. This means that a dissimilarity representation describes an abstract space, where the basic neighborhoods or generalized closure operators play a key role; see also Sec. 2.3. The space is abstract in the sense that it is usually not explicitly given or it is not a vector space. It is defined by a set of available objects, e.g. a set of finite binary strings, the performed measurements and a number of additional factors, such as camera positions or lighting conditions, playing a role in the measurement process. So, this abstract space is either a set of objects or some measurement space. The neighborhoods and generalized closure operators are defined by the use of dissimilarities to the objects from R. An example classifier is the k-nearest neighbor rule (k-NN). The second, dissimilarity space approach addresses a dissimilarity representation as a data-dependent mapping specified by the representation set R. A mapping φ(·,R): X → R^n is defined as φ(x,R) = [d(x,p_1), d(x,p_2), ..., d(x,p_n)]. Note that X denotes either the objects themselves (e.g. a set of convex subsets of a finite-dimensional space), or
Figure 4.10 Classification in an embedded space: an illustration. Train: find a spatial representation of D(R,R) in the underlying space E. Test: project the test data D(S,R) onto E and apply the classifier.
a feature-based vectorial representation of objects. Note that X might not be given explicitly. The dimension of such a space is controlled by the cardinality of R. Using this formulation, decision functions can be constructed directly on dissimilarity representations, as in a vector space, in which each dimension corresponds to the dissimilarity to a representation object, d(·,p_i). This is possible, since a dissimilarity vector space is assumed to be endowed with the traditional inner product and the associated norm and Euclidean metric. Additionally, if beneficial for learning, other distance measures, e.g. from the ℓ_p-distance family, are considered. Since dissimilarities are nonnegative, all the data are mapped as points to the nonnegative orthant of a vector space. Many traditional classifiers can now be applied [Pekalska and Duin, 2000, 2002a; Pekalska et al., 2002b], as discussed in Sec. 4.4. See also Fig. 4.9. The third, embedding approach requires that a dissimilarity representation D(T,R) is such that R ⊆ T. First, a spatial representation of the symmetric D(R,R) is found, i.e. a vector space V, where the objects are mapped as points such that their distances reflect the actual dissimilarities, as presented in Sec. 3.6. Then the remaining objects T\R, if they exist, are projected there. Next, a decision function is determined in this embedded vector space; see also Fig. 4.10. This is possible, as the embedded vector space is equipped with an inner product and additional algebraic structures, if needed. In learning, new objects are first projected onto V and then subjected to the classifier. Further details are presented in Sec. 4.5. In summary, to build a classifier for dissimilarity representations, the training set T of N objects and the representation set R of n objects are used. R is a set of prototypes, possibly covering all the present classes. R is usually considered to be a subset of T, R ⊆ T, although R and T might
be different for the first two approaches. R might be chosen from T either randomly or in a systematic way. For instance, n objects can be chosen such that the minimum distance between them is maximized. Another possibility is based on a greedy approach. In an iterative manner, starting from a randomly chosen object, an object is selected which is the most dissimilar to all objects already chosen; see the sketch below. This might be applied globally or for each class separately. Such objects are likely to be atypical or positioned between the classes. If computationally tractable, one may also use the complete representation D(T,T) and try to select R to optimize a particular classifier on D(T,R). Selection methods will be discussed in Chapter 9. In the learning process, a classifier is constructed by making use of the N × n matrix D(T,R), relating all training objects to all the prototypes. The information on a set T_s of s new objects is provided by their dissimilarities to R, i.e. as an s × n matrix D(T_s, R).
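The greedy selection just described can be sketched as a farthest-first traversal over a given square matrix D(T,T). The fixed seed index used here is an arbitrary simplification of the random initialization mentioned in the text.

```python
import numpy as np

def greedy_prototypes(D, n, seed=0):
    """Select n prototype indices from D = D(T, T) (N x N) so that each new
    prototype is the object most dissimilar to those already chosen."""
    chosen = [seed]
    # dissimilarity of every object to its nearest chosen prototype
    nearest = D[:, seed].copy()
    while len(chosen) < n:
        nxt = int(np.argmax(nearest))        # most remote object so far
        chosen.append(nxt)
        nearest = np.minimum(nearest, D[:, nxt])
    return chosen

# Example usage on a hypothetical square matrix D_TT:
# R_indices = greedy_prototypes(D_TT, n=10)
```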
4.3 Classification in generalized topological spaces

Let X be either a finite set or a vector space. Consider a generalized metric space (X, ρ) with a dissimilarity measure ρ: X × X → R_0^+ such that ρ(x,x) = 0. Let B_δ(x) = {y ∈ X : ρ(x,y) ≤ δ} be a δ-ball ('closed' neighborhood ball) for some positive δ. For each x ∈ X define its minimal neighborhood as B_{δ_nn(x)}(x) = {y ∈ X : ρ(x,y) ≤ δ_nn(x)}, where δ_nn(x) = ρ(x, nn(x)) is the dissimilarity of x to its nearest neighbor nn(x). By the reflexivity property of ρ, x ∈ B_{δ_nn(x)}(x). A growth function gr is now defined on the power set P(X) as a generalized closure operator such that the following axioms hold for all x ∈ X and all A ∈ P(X):

(1) gr(∅) = ∅.
(2) gr(x) = B_{δ_nn(x)}(x).
(3) gr(A) = ∪_{x∈A} gr(x).
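As a concrete illustration of axioms (2) and (3), the sketch below evaluates the δ_nn-neighbor growth operator for a finite set represented by a precomputed square dissimilarity matrix; the indexing convention is an assumption made only for this example.

```python
import numpy as np

def gr_element(D, i):
    """Minimal neighborhood of object i for a square dissimilarity matrix D:
    all j with rho(i, j) <= rho(i, nn(i)), where nn(i) is the nearest
    neighbor of i (the zero diagonal is ignored when locating nn(i),
    but i itself is included in the result by reflexivity)."""
    row = D[i].copy()
    row[i] = np.inf
    delta_nn = row.min()
    return {j for j in range(D.shape[0]) if D[i, j] <= delta_nn}

def gr_set(D, A):
    """Growth of a set A: the union of the growths of its elements (axiom 3)."""
    out = set()
    for i in A:
        out |= gr_element(D, i)
    return out
```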
It is straightforward to check that this growth function fulfills the axioms (1)-(4) of Def. 2.19, hence (X, gr) is a pretopological space. Such a closure operator describes the δ_nn-neighbors pretopology. Imagine now that δ_nn does not depend on x, hence δ_nn(x) = ε > 0 for all x ∈ X. If ε is chosen as the smallest distance in X, ε = min_{x∈X} ρ(x, nn(x)), then the growth of the element x is gr(x) = B_ε(x), hence it is described by the ε-ball. If X is a Banach vector space equipped with a metric distance ρ, e.g. the ℓ_p-distance with p ≥ 1, then thanks to the property of the convex neighborhoods on which the ε-ball relies, gr²(x) = gr(gr(x)) = gr(B_ε(x)) = B_{2ε}(x) and, more generally, gr^m(x) = B_{mε}(x) for m ∈ N⁸. For a metric not defined by a norm or for a non-metric measure ρ, this is not true in general. Still, it may happen that B_{mε}(x) ⊆ gr^m(x). See also Fig. 4.11 for an illustration.
8 We will show that B_{2ε}(x) = gr²(x) for a metric distance ρ in a Banach vector space. Note that B_{2ε}(x) = {z : ρ(x,z) < 2ε} and gr²(x) = ∪_{y∈B_ε(x)} B_ε(y) = {z : ∃y, ρ(x,y) < ε ∧ ρ(y,z) < ε}. Let z ∈ B_{2ε}(x), i.e. ρ(x,z) < 2ε. In a Banach vector space with the metric ρ, there exists a middle point y such that ρ(x,z) = ρ(x,y) + ρ(y,z) with ρ(x,y) < ε and ρ(y,z) < ε. It follows that z ∈ gr²(x) and, consequently, B_{2ε}(x) ⊆ gr²(x). Conversely, let z ∈ gr²(x), i.e. ρ(x,y) < ε ∧ ρ(y,z) < ε for some y. Since ρ is nonnegative, ρ(x,y) + ρ(y,z) < 2ε. By the triangle inequality, ρ(x,z) ≤ ρ(x,y) + ρ(y,z). Hence, ρ(x,z) < 2ε. It follows that z ∈ B_{2ε}(x) and, consequently, gr²(x) ⊆ B_{2ε}(x).

Figure 4.11 Assume a vector space X with an additional dissimilarity ρ (it might, for instance, be a measurement space). Example growth operators gr(x) = B_ε(x) with a fixed ε > 0 are shown, when ρ is a metric distance (left) or when ρ is a non-metric dissimilarity (right). In the non-metric case, gr(gr(x)) is not necessarily identical to B_{2ε}(x), as it is in the metric case.

Alternatively, one can define the k-nearest neighbor pretopology, where the growth function of x is defined by its k nearest neighbors, gr(x) = {nn_1(x), nn_2(x), ..., nn_k(x)}, where nn_i(x) is the i-th neighbor of x. It additionally satisfies gr(∅) = ∅ and gr(A) = ∪_{x∈A} gr(x). It is then straightforward to check that gr fulfills the axioms of pretopology in Def. 2.19. Another possibility is to use neighborhoods to define the neighborhood basis at x. Let B_ε(x) = {y ∈ X : ρ(x,y) < ε} be an ε-ball for a positive ε. Then the neighborhood basis is defined as N_B(x) = {B_ε(x) : ε = c ρ(x, nn_k(x)), k = 1, 2, ...}, where c = 1 + δ is a constant and δ > 0 is very small. Consequently, (X, N_B) is a pretopological space. Consider now a training set T and a dissimilarity representation D(T,R). A generalized closure (growth) operator or neighborhood basis can be defined for every class ω_i based on D(T_i, R_i), where T_i ⊂ T and
R_i ⊂ R correspond to the objects from ω_i; the resulting N_{B_i} is then the neighborhood basis for the class ω_i. An unknown object is assigned to the class ω_k if it belongs to a generalized closure or a neighborhood of one or more objects from the class ω_k only. If no single such class exists, then the sets B_{mε}, for m ∈ N, can be used instead as an approximation of the successive growth obtained by a repetitive use of the generalized closure (in a metric vector space with a growth function defined by a metric distance, B_{mε}(x) ⊆ gr^m(x) holds). If an object belongs to the intersection of neighborhoods (or closures) of two (or more) classes, then the final decision should be made by looking at the majority of objects from a particular class within the neighborhoods. This means that decision rules built on the dissimilarities directly can be interpreted as classifiers in pretopological spaces. Examples are variants of the nearest neighbor rules. A classifier based on repetitive generalized closure operators in pretopological spaces (hence based on growing neighborhoods) is also discussed in [Lebourgeois and Emptoz, 1996; Frélicot and Emptoz, 1998].
Nearest neighbor (NN) rule. A straightforward approach to dissimilarities leads to a nearest neighbor rule [Cover and Hart, 1967; Fukunaga, 1990] or, more generally, to instance-based learning [Aha et al., 1991]. In its simplest form, the 1-NN rule assigns a new object to the class of its nearest neighbor. Originally, this nearest neighbor is chosen from the training set T. The k-NN rule is based on majority voting. An unknown object becomes a member of the class occurring most frequently among the k nearest neighbors. Usually, k is assumed to be odd to avoid ties (for two-class problems). Note that when k is fixed, no training is involved. Traditionally, the k-NN rule is applied to data represented as vectors in a vector space, for which either the Euclidean or the city block metric is derived. This means that, indirectly, the corresponding dissimilarity representation is used. This nearest neighbor principle can be extended to other (non-metric) dissimilarity measures obeying the compactness hypothesis; see also Sec. 4.1. The k-NN classifier is attractive, since it is simple, intuitively appealing and no prior knowledge of the data distributions is required. It can estimate complex boundaries locally and differently for each new instance (hence its adaptations can be seen as an example of transductive learning). Moreover, it is known [Devroye et al., 1996; Duda et al., 2001] that for the k-NN rule f_kNN in metric vector spaces, the empirical risk E_emp(f_kNN) converges uniformly to the actual risk E_kNN with increasing n, Eq. (4.5), such that
$$\mathcal{E}(f^*) \le \ldots \le \mathcal{E}_{(2l+1)\mathrm{NN}} \le \mathcal{E}_{(2l-1)\mathrm{NN}} \le \ldots \le \mathcal{E}_{3\mathrm{NN}} \le \mathcal{E}_{1\mathrm{NN}} \le \mathcal{E}(f^*)\Big(2 - \tfrac{K}{K-1}\,\mathcal{E}(f^*)\Big),$$

where E(f*) is the Bayes error and K is the number of classes. This means that the k-NN rule is asymptotically at most twice as bad as the Bayes rule. Please refer to [Devroye et al., 1996] for proofs and other bounds. In practice, when one deals with finite sample sizes, the asymptotic inequalities will not hold. The k-NN rule is expected to perform well, provided that the domain of the problem, hence the data, is well sampled. In cases where at least one of the classes is undersampled or badly sampled, the k-NN rule deteriorates. The k-NN rule can also be interpreted as a rule which locally tries to estimate the posterior probabilities. These estimates rely on a neighborhood determined by the k-th furthest neighbor. For small k, the nearest neighbors might often lie further away due to data sparseness, or the estimates might be poor due to noisy examples. Increasing k allows one to reduce the influence of noise. However, nearest neighbors with large dissimilarities in the voting scheme may lead to an unnecessary error. Therefore, a weighted voting [Devroye et al., 1996] might be an option, where the neighbor contributions are weighted according to their dissimilarities to a particular object. In our approaches, the nearest neighbors will be found among objects from the representation set R ⊆ T. For a test example t_s, the k-NN rule will then be applied to D(t_s, R).
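For dissimilarity representations, the rule needs only the rows of D(T_s, R) and the labels of the prototypes. A minimal sketch with unweighted voting follows (ties are resolved by the ordering of Counter, an arbitrary convention for this illustration).

```python
import numpy as np
from collections import Counter

def knn_predict(D_test, labels_R, k=3):
    """k-NN on a dissimilarity representation.

    D_test   : s x n matrix D(T_s, R) of dissimilarities to the prototypes
    labels_R : length-n sequence of class labels of the representation objects
    """
    labels_R = np.asarray(labels_R)
    predictions = []
    for row in np.asarray(D_test):
        nearest = np.argsort(row)[:k]                      # k nearest prototypes
        vote = Counter(labels_R[nearest]).most_common(1)[0][0]
        predictions.append(vote)
    return np.asarray(predictions)

# Example usage on hypothetical data: y_pred = knn_predict(D_ts_R, labels_of_R, k=1)
```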
Weighted nearest neighbor (WNN) rule. The use of k nearest neighbors in the k-NN rule is based on the assumption that they are relatively close to the object x in question. If this is not the case, it might be sensible to weight the neighbor contributions according to their dissimilarities to x. Asymptotically, for a fixed k, all the k neighbors should be very close to x, hence the weights should not have a large impact on the classifier performance. For finite training sets, on the contrary, weighting might be helpful. Let n_{(i)}(x) denote the i-th nearest neighbor of x (under a specified dissimilarity d) with the label y_i, and let w_1 ≥ w_2 ≥ ... ≥ w_k be the corresponding weights. Formally, for a two-class problem with the labels {0, 1}, the k-WNN rule (assume odd k to avoid ties) assigns x to class 1 if the total weight of the neighbors labeled 1 exceeds the total weight of the neighbors labeled 0, i.e. if Σ_{i=1}^{k} w_i y_i > Σ_{i=1}^{k} w_i (1 − y_i), and to class 0 otherwise.
Weights should emphasize the neighbors which are nearby x, but the contributions of the far-away neighbors should also be counted. So, d(x, n_{(j)}(x)) should be weighted by a monotonically decreasing function. For example, w_j = 1/(d(x, n_{(j)}(x)) + ε) (or w_j = 1 if the denominator is zero), where ε ≥ 0 is a small constant to avoid division by zero; w_j = exp(−d(x, n_{(j)}(x))/σ), where σ > 0 determines the size of the neighborhood; or w_j = exp(−α d(x, n_{(j)}(x))^β), with α, β > 0. In the empirical study of [Zavrel, 1997], the weights chosen according to the first possibility mentioned above gave rise to an improved k-WNN rule, also performing better than the k-NN rule.
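The three weighting schemes can be written down directly; the default parameter values below (ε, σ, α, β) are arbitrary illustrations, not recommendations from the text.

```python
import numpy as np

def inverse_weights(d, eps=0.0):
    """w_j = 1 / (d_j + eps); by convention w_j = 1 where the denominator is zero."""
    d = np.asarray(d, dtype=float) + eps
    return np.where(d == 0.0, 1.0, 1.0 / d)

def exp_weights(d, sigma=1.0):
    """w_j = exp(-d_j / sigma); sigma sets the effective neighborhood size."""
    return np.exp(-np.asarray(d, dtype=float) / sigma)

def exp_power_weights(d, alpha=1.0, beta=2.0):
    """w_j = exp(-alpha * d_j**beta)."""
    return np.exp(-alpha * np.asarray(d, dtype=float) ** beta)

# Weighted vote for a two-class problem with neighbor labels y in {0, 1}:
# predict 1 if (w * y).sum() > (w * (1 - y)).sum() else 0.
```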
Edited and condensed nearest neighbor rules. Despite the simplicity and good performance of the k-NN rule, the criticism points at both the space requirement of storing the entire training set and the computational expense of computing the dissimilarities to all training examples. Consequently, there has been an interest in condensing the training set in order to reduce its size [Hart, 1968; Dasarathy, 1991; Wilson and Martinez, 2000]. In our terminology this is equivalent to the selection of a proper representation set R out of the training set T, hence it might be called prototype selection as well. Also editing [Devijver and Kittler, 1982] is considered to increase the accuracy of the k-NN predictions in the presence of noise in the training data. A basic editing algorithm removes noisy instances as well as close border cases, leaving smoother decision boundaries. It also retains all 'internal' points; i.e. it does not reduce the number of objects as much as most other reduction algorithms. More on editing and condensing can be found in Sec. 9.2. Many variants of the NN rule, taking into account the local structure of the data or weighting the neighbor contributions appropriately, have been invented or adopted for feature-based representations; see Sec. 5.6 for brief information. The question of how such measures should be constructed is beyond the scope of this work.
In conclusion. The potential of generalized closure operators and neighborhood systems as a means of defining pretopological spaces has not been explored yet. Although Chapter 2 provides basic definitions and some intuition in this area, learning methods are not developed. Currently, only generalized variants of the nearest neighbor approaches are considered. We think that the study of pretopological spaces may bring additional understanding of the perspective of existing statistical methodologies. It may open new ways of defining classification rules, in which relations between objects are described by some binary function, such as a general proximity measure.
4.4 Classification in dissimilarity spaces
The novelty of our approach relies on interpreting D(T,R) as a representation of a vector space, called a dissimilarity space, where each dimension describes the dissimilarity to a particular object. More formally, let X denote a set of objects (e.g. a set of fruits, a set of finite strings over a particular alphabet or a set of gray-value images of a particular size), or a feature vector representation of objects. Let R = {p_1, p_2, ..., p_n} be a collection of representation objects chosen from X, and let d be a dissimilarity measure used to compare pairs of objects from X. We will assume that d is bounded, i.e. there exists a positive constant M such that d(x,y) ≤ M for all x, y ∈ X. If this is not true, then a suitable semimetric transformation, Theorem 3.7, can be used to bound it. A data-dependent mapping φ(·,R): X → R^n, described as φ(x,R) = [d(x,p_1), d(x,p_2), ..., d(x,p_n)]^T, defines an n-dimensional vector space, a dissimilarity space, denoted D(·,R).
4.4.1 Characterization of dissimilarity spaces
In order to use the apparatus of statistical learning, a dissimilarity vector space will be equipped with the traditional inner product and the associated norm and Euclidean metric. Additionally, other dissimilarity measures, especially from the family of ℓ_p-distances, will be considered. Important observations can be made for metric distances.
Proposition 4.1 Let D(·,R) be a dissimilarity space built by a metric distance d. Consider (D(·,R), d_∞), the dissimilarity space equipped with the max-norm distance. Then the space (D(·,R), d_∞) results from an embedding of the metric distance representation D(R,R).
Proof. This follows from Lemma 3.1, where it is proved that the max-norm distance in a dissimilarity space D(·,R), defined as d_∞(D(p_i,R), D(p_j,R)) = max_{p_r∈R} |d(p_i,p_r) − d(p_j,p_r)|, is equal to the original distance d(p_i,p_j). □
Proposition 4.2 Let (X,d) be a set of objects X with a metric distance d. Assume that R ⊂ X and D(·,R) is a dissimilarity space built by d. Consider a scaled dissimilarity space (D_(s)(·,R), d_p), defined as D_(s)(x,R) = n^{-1/p} D(x,R), equipped with the ℓ_p-distance d_p, p > 0. Then the metric d majorizes d_p or, in other words, the mapping ψ: (X,d) → (D_(s)(·,R), d_p) is a contraction (Lipschitz continuous with κ = 1).
Proof. The statement above means that in a suitably scaled dissimilarity space the ℓ_p-distances take values not larger than the original distances d. This also holds for p ∈ (0,1), hence for non-metric ℓ_p-distances. Since d is metric, the backward triangle inequality, Theorem 2.5, |d(x,z) − d(y,z)| ≤ d(x,y), holds for any x, y, z ∈ X. Since distances are nonnegative and the power function f_p(x) = x^p is monotonically increasing on R_0^+, this is equivalent to |d(x,z) − d(y,z)|^p ≤ d(x,y)^p for p > 0. Consider now the ℓ_p-distance in a dissimilarity space. Consider any x, y ∈ X. Using the facts mentioned above, one has

$$d_p\big(D_{(s)}(x,R), D_{(s)}(y,R)\big) = \frac{1}{n^{1/p}} \Big(\sum_{r=1}^{n} |d(x,p_r) - d(y,p_r)|^p\Big)^{1/p} \le \frac{1}{n^{1/p}} \big(n\, d(x,y)^p\big)^{1/p} = d(x,y). \qquad (4.9)$$

Hence, d_p(D_(s)(x,R), D_(s)(y,R)) ≤ d(x,y) and, by Def. 3.3, the mapping ψ: (X,d) → (D_(s)(·,R), d_p) is a contraction. □
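The contraction of Proposition 4.2 is easy to verify numerically. The following sketch is only an illustration, with arbitrary dimensionalities, p = 1.5 and the Euclidean metric playing the role of d.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # objects as points, d = Euclidean metric
R = X[:7]                             # representation set, n = 7
p, n = 1.5, 7

# D_(s)(x, R) = n**(-1/p) * [d(x, p_1), ..., d(x, p_n)]
D = np.linalg.norm(X[:, None, :] - R[None, :, :], axis=2) * n ** (-1.0 / p)

for i in range(len(X)):
    for j in range(len(X)):
        d_orig = np.linalg.norm(X[i] - X[j])                     # d(x, y)
        d_space = np.sum(np.abs(D[i] - D[j]) ** p) ** (1.0 / p)  # l_p in the scaled space
        assert d_space <= d_orig + 1e-9                          # Eq. (4.9) in action
```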
Remark 4.1 If the original dissimilarity space is considered, then ψ: (X,d) → (D(·,R), d_p) is Lipschitz with κ = n^{1/p}. This holds as, for the chosen finite representation set R, n^{1/p} is a constant, not depending on X. As n^{1/p} may be a large constant, it is useful to linearly scale the given dissimilarities.
Remark 4.2 If the dissimilarity measure d is non-metric, one may first make it symmetric, for instance by averaging, d_avr(x,y) = ½ (d(x,y) + d(y,x)) (otherwise two asymmetric dissimilarity spaces can be studied). If the definiteness and triangle inequality are disobeyed, then there exists a constant c > 0 such that the distance d_c, defined as d_c(x,y) = d_avr(x,y) + c for x ≠ y and d_c(x,x) = 0, is metric. This follows from Corollary 3.3 and the fact that d is bounded. If d(x,y) ≤ M for all x, y ∈ X, then c ∈ (0, M]. If the mapping ψ: (X, d_c) → (D_(s)(·,R), d_p), as defined in Proposition 4.2, is considered, then d_c majorizes the ℓ_p-distance in the scaled dissimilarity space.
Note that the mapping ψ of (X,d) into a dissimilarity space is a type of distorted embedding, Sec. 3.1.2. Note also that, since ψ is Lipschitz, it is continuous by Theorem 3.1.
Figure 4.12 Metric 2D dissimilarity space.
If‘ the distance measure d is a metric, then all vectors D ( z , R ) lie in an ,ri,-diniensionalprism, bounded from below by a hyperplane on which the objects from R are and which is bounded from above in case of bounded dissimilarities. Consider a 2D representation D ( z ,R ) , where R = [ p i , p j ] . For brevity, denote that d,, = d ( p i , p j ) and z = d ( z , p i ) , and y = d ( z , p j ) for an object z. The following triangle inequalities should hold: z f y 2 d i j , d i j n: 2 y and d i j y 2 z for a metric d. Depending on z and y, a prism is formed as shown in Fig. 4.12. Note that in higher-dimensional spaces, the (1iyper)prism is asymmetric and the vertices of its base do not lie on the axes. For instance, in a 3D space, the vertices lie in the zy-, yz- and z z - planes. In principle, z may be placed anywhere in the nonnegative ortliant of a dissimilarity space D ( . , R) only if the triangle inequality is completely violated. This is, however, impossible from the practical point of view, because then the compactness hypothesis would not be fulfilled. Consequently, this would mean that d would have lost its discriminating properties of being (relatively) small for similar objects. Therefore, the measure d, if not metric, it has to be sufficiently close to a metric and, thereby, D ( z , R ) will still lie either in the prism or in its relatively close neighborhood’. See Fig. 4.13 and Fig. 4.14 to get some intuition. A justification for the construction of classifiers in dissimilarity spaces is as follows. The property that dissimilarities should be small for similar objects, i.e. belonging to the same class, and large for distinct objects, gives a possibility for a discrimination. Thereby, D ( . , p i ) defined by thc dissimilarities to the representative p i can be interpreted as a ‘feature’. If p , is a characteristic object of a class w , then the discrimination power of
+
+
’This is not always true. If one considers a power transformation of a metric distance, which as a monotonic transformation preserves the order of dissimilarities, t,hen a large deviation from the triangle inequality can be expected. For instance, this happens for d = d;O for the Euclidean distance dz taking values in [0,5].
Figure 4.13 Simple illustration of 2D dissimilarity spaces. The first and third plots show the theoretical artificial data with a quadratic classifier. The ℓ_{5/2}-distance representations D(T,R) were considered, defined by two representation objects R = {p_1, p_2} and the measure d(t_i, p_j) = (Σ_k |t_ik − p_jk|^{5/2})^{2/5}. The second and the fourth plots present the dissimilarity spaces D(·,R), where the objects p_1 and p_2 are marked by circles on the first and third plots. For a well-chosen R, a linear classifier in a dissimilarity space D(·,R) separates the data well. In this example, a dissimilarity representation may be used to encode nonlinearities of the original vectorial representation. Here, a dissimilarity representation is derived for a given vector space. In general, we assume the other way around: only a dissimilarity representation is given, which is interpreted in a suitable space.
D(·,p_i) can be large, i.e. the dissimilarity values for the objects from ω become small. If p_i is an atypical object of its class, then D(·,p_i) may not be informative. On the other hand, as the usefulness of D(·,p_i) is judged with respect to objects of other classes, atypical examples of ω may still be distinctive. The overall strength lies in using the complete representation set R. Another reasoning relies on the following fact. If the objects x and y are alike and the dissimilarity value d(x,y) is small, then for other objects z, the dissimilarities d(x,z) and d(y,z) might not take similar values if the measure d is non-metric. However, if the dissimilarities of x and y to a given set of prototypes R are inspected, one can expect that, although the individual values will differ, the vectors D(x,R) and D(y,R) are correlated in their entirety. If so, then the representations D(x,R) and D(y,R) are close in a dissimilarity space (as judged e.g. by the Euclidean distance there). Consequently, the dissimilarity space approach should be useful for non-metric measures. One may wonder what the added value of such a representation over a feature-based representation is, if the traditional classifiers designed for vector spaces may be applied in the end. The strength of a dissimilarity representation relies on the flexibility of a dissimilarity measure to be defined on quite arbitrary sets of measurements, such as sequences, text documents, digital images, spectra or features. Moreover, a dissimilarity
Figure 4.14 Examples of 2D dissimilarity spaces and linear classifiers for a subset of handwritten digits 3 and 8. Two dissimilarity representations D(T,R) are shown, based on the Euclidean distance (metric) computed between blurred images and the modified Hausdorff distance (non-metric) between digit contours; see also Sec. 5.5. R is randomly chosen and consists of two examples, one for each digit.
measure has the potential to encode statistical and/or structural characteristics of the data instances. Also, dissimilarity measures can be naturally combined, e.g. by their weighted sum. A dissimilarity representation allows one to capture the properties of objects more adequately, as more emphasis and knowledge is put into a class of similar objects. Dissimilarity representations are numerical descriptions and are interpreted in suitable spaces. The choice of a vector space D(·,R) is in agreement with its mathematical concept in the following way. The dimensions of such a space are now dissimilarities to the prototypes, which are derived according to a specified measure. Hence, they convey a homogeneous type of information. This is not valid for a general feature-based representation, where features have a different character and range, e.g. weight or length. Another advantage of a dissimilarity representation is that, since a dissimilarity measure possibly already encodes the object structure and/or other characteristics, the designed classifiers might be chosen to be simple, e.g. linear models. Defining a well-discriminating dissimilarity measure for a non-trivial learning problem is difficult. Designing such a measure is equivalent to defining good features in the traditional feature-based classification problem. If a good measure can be found and a training set is representative, then the k-NN rule is expected to perform well. The decision of the k-NN is based on local neighborhoods and it is, in general, sensitive to noise. This means that the k nearest neighbors found might not be the best representatives
for making a decision on which class an object should be assigned to. For small or non-representative training sets, a better generalization can be achieved by a classifier built in a dissimilarity space. For instance, a linear classifier in a dissimilarity space is a weighted linear combination of the dissimilarities between an object and the representation examples. The weights are optimized on the training set and large weights (in magnitude) emphasize objects which essentially influence the final decision. By doing this, a more global classifier can be built, whose sensitivity to noisy representation examples is reduced. Our experience confirms that a linear or quadratic classifier can often generalize better than the k-NN rule, especially for a small representation set R; see also [Pekalska and Duin, 2001a].
4.4.2 Classifiers
A dissimilarity representation D(T,R) is an N × n dissimilarity matrix. D(x,R) is a row vector in D(T,R); for simplicity, however, it will be treated as an n × 1 column vector without further notice. A general linear function in a dissimilarity space D(·,R) has the following form:
$$g(D(x,R)) = \sum_{j=1}^{n} w_j\, d(x,p_j) + w_0 = \mathbf{w}^T D(x,R) + w_0. \qquad (4.10)$$
In the training process, g is determined as a decision boundary between two classes. A classifier is a function returning the class assignments. Usually, one assumes that the equation g(D(x,R)) = 0 defines a decision boundary; hence, for a two-class problem, the sign of g(D(x,R)) indicates to which class the object x belongs. Assume a two-class problem with the set of labels {+1, −1}. Then

$$\text{Decision boundary: } g(D(x,R)) = 0. \qquad \text{Classifier: } f(D(x,R)) = \operatorname{sign}\big(g(D(x,R))\big). \qquad (4.11)$$
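Eqs. (4.10)-(4.11) can be realized with any linear training rule. The sketch below uses a ridge-regularized least-squares fit on labels in {−1, +1}; this is merely one convenient choice for illustration, not the only classifier employed in this work.

```python
import numpy as np

def fit_linear(D_train, y, reg=1e-3):
    """Fit g(D(x, R)) = w^T D(x, R) + w0 by ridge-regularized least squares.

    D_train : N x n matrix D(T, R)
    y       : length-N labels in {-1, +1}
    """
    N = D_train.shape[0]
    A = np.hstack([D_train, np.ones((N, 1))])          # append a bias column
    W = np.linalg.solve(A.T @ A + reg * np.eye(A.shape[1]), A.T @ np.asarray(y, float))
    return W[:-1], W[-1]                               # (w, w0)

def classify(D_test, w, w0):
    """f(D(x, R)) = sign(g(D(x, R))), Eq. (4.11)."""
    return np.sign(D_test @ w + w0)
```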
For the sake of convenience, from now on we will not make any distinction between a boundary function g and its indicator function f, both denoted as classifiers. We will also assume that the dissimilarity measure d is suitably scaled to build a dissimilarity space, e.g. as described in Proposition 4.2, by a semimetric transformation or in some other way that guarantees that the dissimilarities are bounded by a not too large constant. In the description below, K classes, ω_1, ω_2, ..., ω_K, are distinguished. Usually two classes are considered, K = 2, since any multi-class classifi-
cation problem can be decomposed into a number of two-class problems. The k-th class with cardinality n_k is denoted by ω_k, k = 1, 2, ..., K, and its prior probability by p(ω_k). The mean vectors per class, determined in the dissimilarity space D(·,R), are denoted as m_i, and the overall mean is given as m. In feature vector spaces, the training pairs {(x_i, y_i)}_{i=1}^{N} are considered such that x_i is a vector and y_i ∈ {−1, +1} is a label. We will make an additional observation here. Assume that each class is modeled by a proximity function f_{ω_i}, capturing the proximity of an object to the class ω_i. This, in fact, means that a one-class model is designed for each class. An example is the log-likelihood, when the class is modeled by a single Gaussian or a mixture of Gaussians, or the reconstruction error, when the class is modeled by a PCA subspace. The boundary between two classes, ω_i and ω_j, is then defined by f_{ω_i}(D(x,R)) = f_{ω_j}(D(x,R)), which means that g(D(x,R)) = f_{ω_i}(D(x,R)) − f_{ω_j}(D(x,R)). Note that in such a situation, a multi-class problem does not require the training of all two-class classifiers. As each proximity function evaluates the commonality of a particular object x to the given class, the final decision is made by the maximum evidence (provided that these values can be compared). Assuming that f_{ω_i} is a similarity function, the classifier f is defined as f(D(x,R)) = class ω_k if f_{ω_k}(D(x,R)) = max_i {f_{ω_i}(D(x,R))}. If f_{ω_i} is a dissimilarity function, then the final decision relies on the minimum dissimilarity. In probabilistic reasoning, proximity functions estimate a posteriori probabilities, hence the decision and the final classifier are based on the maximum a posteriori estimates. Consequently, all density-based classifiers can directly be designed as multi-class classifiers. Classifiers originally designed in (inner product) vector spaces can be adopted to dissimilarity representations. Below, several decision functions used in statistical pattern recognition are described. The list is by no means complete. For instance, neural networks are not covered here, as they are highly nonlinear classifiers [Bishop, 1995; Ripley, 1996]. Since the knowledge on the problem is assumed to be incorporated into the dissimilarity measure, simpler classifiers are preferred. Please refer to [Fukunaga, 1990; Hastie et al., 2001; Ripley, 1996; Duda et al., 2001] for further details.
Normal density based linear and quadratic classifiers. Most of the commonly-used dissimilarity measures, such as Euclidean, city block or Hamming distance, are based on sums of differences between single measurements. The central limit theorem states that the sum of independent random variables with finite variances is normally distributed in the limit, provided that none of the variances of the sum’s components dominates
(otherwise, the distribution is χ²) [Wilks, 1962]. The approximation can already be good for a relatively small number of variables, such as 10, for instance. Practice shows that summation-based distances built from many components of similar variances are often approximately normally distributed (in fact, this is a clipped distribution due to the non-negativity of the dissimilarity measure). This suggests that (regularized) linear/quadratic normal density based classifiers [Fukunaga, 1990; Ripley, 1996], which assume normal class distributions, should be of use in dissimilarity spaces. For a two-class problem, the normal density based linear classifier (NLC) built on the set R is defined as
$$f(D(x,R)) = \Big[D(x,R) - \tfrac{1}{2}(\mathbf{m}_1 + \mathbf{m}_2)\Big]^T C^{-1} (\mathbf{m}_1 - \mathbf{m}_2) + \log\frac{p(\omega_1)}{p(\omega_2)} \qquad (4.12)$$

and the normal density based quadratic classifier (NQC) is defined as

$$f(D(x,R)) = \sum_{i=1}^{2} (-1)^i \big(D(x,R) - \mathbf{m}_i\big)^T C_i^{-1} \big(D(x,R) - \mathbf{m}_i\big) + 2\log\frac{p(\omega_1)}{p(\omega_2)} + \log\frac{|C_2|}{|C_1|}, \qquad (4.13)$$
where C_1 and C_2 are the estimated class covariance matrices and C = ½ (C_1 + C_2) is the pooled sample covariance matrix, all determined in the dissimilarity space. The square Mahalanobis distance between D(x,R) and the class mean m_i is given as (D(x,R) − m_i)^T C_i^{-1} (D(x,R) − m_i); see also the paragraph on quantitative data in Sec. 5.1. When the covariance matrix C (or C_1, or C_2) becomes singular, its inverse cannot be computed. A solution is to use a regularized version instead, defined as C_reg = (1 − λ) C + λ I, where I is the identity matrix [Ripley, 1996]. Since it is hard to choose a proper λ, the following regularization is used in our implementations: C'_λ = (1 − 2λ) C + λ diag(C) + (λ/n) tr(C) I, with n = |R|. The regularization term is now expressed relative to the variances, so it can be determined more easily. In practice, λ equals 0.05, 0.01 or less. The resulting regularized classifiers are named accordingly and denoted by the RNLC and RNQC10.
"Although the NLC and NQC classifiers rely on the ERWI principle, their regularized equivalents, the RNLC and RQNC, as well as the SRQC are based on the regularization of the covariance matrices. This, in turn, allows one to find their bounded inverses, which implies that the classifier's weights are bounded as well. S o , this is an indirect attempt for the use of the regularization principle; see Sec. 4.1.3.2.
188
T h e drssimilarity representatzon for p a t t e r n recognrtzon
Nearest mean linear classifiers. If the covariance matrix C is the identity matrix, the NLC reduces to nearest mean classifier (NMC), assigning an object to the class of its nearest mean vector in the Euclidean sense. If C is a diagonal matrix, then the resulting decision rule is the weighted nearest mean classifier (WNMC). Note that these are multi-class classifiers. Strongly regularized quadratic classifier (SRQC) . This classifier is similar to the RNQC defined above; the difference lies in regularization, which diminishes the influence of covariances with respect to variances. Each class covariance matrix is estimated as CT = (1 - K ) Ci ~ p ( d ?diag ) (Cz)> where K E [O, 11. If K = 0, then the classifier simplifies to the NQC, while if K, = 1, then it beconies the scaled nearest mean linear classifier [Fiikiniaga, 1990; Skurichina, 20011. So, by varying K one moves for one extreme to tlie other. We often use K = 0.2 or 6 = 0.8, where we become closer to one of the extreme cases.
+
Fisher (FLD) and pseudo-Fisher (PFLD) linear discriminants. The Fisher linear discriniinant [Fukunaga; 19901 is a linear classifier, where tlie weight vector is deterrriined by maximizing the Fisher criteyion J ( w ) defined as:
J(w)
=
wTC,w WTCWW
,
(4.14)
K
where Cg = Ck=l rbk(mk - m)(mk:- m)Tis the between-class scatter and C b l , ~is tlie within-class scatter (sum of the class covariance matrices) [Fukunaga, 1990; Duda et al., 20011. The weight vector maximizing F(w)is fourid to be w = CG’ (ml - mz), where m, are tlie class mean vectors. It is known that for a two-class problem with equally probable classes, the FLD is equivalent to the NLC. For a dissimilarity rcpresentatiori D ( T ,R ) , the FLD is constructed as
(4.15)
If tlie estimated covariance matrix Cw is singular, a pseudo-inverse operation is proposed instead, yielding the pseudo-Fisher linear discrirninant [Raudys and Duin, 19981. The pseudo-inverse relies on the singular value decomposition of the covariance matrix Cw. In practice, tlie pseudoinverse is computed as the usual inverse of Cw, but in the subspace spanned by the eigenvectors corresponding to m largest non-zero eigenvalues (or singular values). The classifier is found in this subspace. The PFLD is reached
Learning approaches
189
in the limit of the RNLC if the regularization X goes to zero [Raudys and Duin, 19981. Naive Bayes classifier (NBC). This decision function naively assumes the probabilistic independence of features, given the class. This means that, the .joint conditional probability can be equivalently expressed by the product of marginal conditional probabilities. Hence, for a vector x = [ z l , ~. . ,. , z , ] E R”,this translates to P(x!wk)= P ( z j wk). Tbc NBC is based on the Bayes rule, so f (xi)= arg rnaxk P ( w ~P(n:ij ) Iwk). Despite the unrealistic assumption on independence, hard to fulfill in practice, the naive Bayes classifier may perform well. The decision may be correct even if the probability estimates arc improper, as discussed e.g. in [Domingos and Pazzani, 1997; Rish, 20011. It is also demonstrated that the this decision function works best in two extremes: with completely independent features or functionally dependent features [Rish, ‘LOOl] . If R is a subset of T in a dissimilarity space, then the ‘features’ d ( T , p , , ) are likely to be correlated, hence possibly functionally dcperident. The NBC is defined as
ny=, ny==I
Logistic classifier (LogC). The logistic classifier belongs to a group of discrirninatiTue classifiers, which estimate tlie class boundaries or the posterior probabilities directly instead of assuming models for the class densities as inform,atiwe classifiers, such as the NLC and the NQC, do [Rubinstein and Hastie, 19971. The LogC models the posterior probabilities such that, the log-posterior ratio of the classes wi and W K is linear. For two classes, it = bTx
becomes log P(W2IX)
+ bo. which leads to P(wlIx)=
rxp (-bk-bo) ltexp
(-bk-bC,)
and P(w2Ix)= 1 - P(wl/x).The discrimination decision is based on tlie maximum a posterior probability, i.e. f(x) = argmaxi P(wi1x);see also Appendix D.1. The parameters b and bo are estimated by rnaxirriizing the conditional likelihood L(x,b, bo) = C,”=, P(wiIx7) [Rubinstein and Hastie, 1997; Hastie et ul., 20011. For normally distributed classes with a common covariance matrix, the LogC and the NLC should he similar [Hastie et al., 20011. The difference lies in the fact that the NLC maximizes the complete log-likelihood, making an assumption on the marginal P(x)as a mixture density, while the LogC maximizes the conditional log-likelihood. In principle, if the class models are correct, then ignoring the information
xfZl
190
The dissimilarity representation f o r pattern recognition
on P ( x ) may worsen the classifier, but ignoring the class models might be beneficial if' they are incorrect. In dissimilarity spaces (including also the dissimilarities which are not necessarily summation-based) , the deviation from the normality assumption may be large. Consequently, the LogC may be of interest, since it relies on fewer assumptions [Hastie et al., 20011. For K classes, the posterior probabilities are estimated as
(4.17) All the parameters b = { b i j } are determined by maximizing the likelihood C(b, D ( z ,R ) ) = C,"=,p ( ~ y i l D ( z j , Rb). ); Support vector machine (SVM).
General references explaining the details of support vector machines [Vapnik, 1998; Burges, 1998; ShaweTaylor arid Cristianini, 2004; Scholkopf, 19971. Let n training pairs {xi,yi}yZl be given in a Euclidean (Hilbert) space. Each point xi belongs to one of two classes as described by the corresponding label ui E {-1,1}. The support vector machine is the hyperplane f ( x ) = wTx+wo maximizing the margin between two separable classes or, alternatively, minimizing the norm In the case of overlap, a soft margin hyperplane is introduced, which handles the misclassified objects appropriately. The linear SVM is defined as f ( x ) = ai yyi (x,xi) 0 0 , where (x,xi) = xTx, is the dot product operation and cxi are nonnegative values determined by maximizing the (soft) margin. Note also that w = Cr=lai yi xi. Since many oi appear to be zero, only the objects corresponding to non-zero weights. the support vectors (SV), contribute to the classifier. The SVM is an elegant implementation of the SRM principle in practice (hence its importance), Sec. 4.1.3.2, by combining the theory of the largest margin with the control over the VC dimension of a class of linear functions. We will briefly recapitulate this fact here; see [Vapnik, 1998, 1995; Scholkopf, 19971 for details. Assume two sepa,rable classes. Let S h be a subclass of all hyperplanes in R', i.e. Gh = {g : g(x) = ( w , x ) wo}. The VC dimension of Gh is h,, = r + l , which means that the maximal number of arbitrary labeled
&
El"=,
+
+
Learning approaches
191
points in R‘ separated by hyperplanes into two classes is r + l . Since h,, is finite, the Vapnik’s bound Eq. (4.6) holds. Still, we need to introduce the nested structure of function classes for the SRM principle to be true. This turns out to be possible by bounding the linear functions in @. Denote $ = {g : g(x) = ( w , x ) W O , J/wll25 A}. Cleuly, if XI < X 2 , then S,”, C G,”, . To ensure that the same inequality follows for the corresponding VC dimensions, one should require that the hyperplanes are selected as the largest’ margin hyperplanes for a given labeled set of data points. It was then shown that the VC dimension for G t can be effectively bounded as h,, 5 min{X2R2+l, r + l } for /Iwliz 5 A, where R is the radius of the smallest sphere enclosing t,he data point,s [Scholkopf, 19971. An extension from a hyperplane to a nonlinear decision function is obtained by a mapping @ of the input data to a high-dimensional Hilbert space and finding a,linear classifier, f ( x ) = g(@(x))= wT@(x)+wothere. Equivalently, such a classifier can be expressed as f ( x ) = C:=l ai yi(Q,(x), &(xi))+ cvo. The inner product can be replaced by its generalized version K(x, x i ) = (@(x), @(xi)),which is a reproducing kernel K, as explained in Sec. 2.6.1. Since in a high-dimensional space, the SVM is defined by the inner products between the vectors and support vectors only, the kernel operator can be explicitly used instead of the map @. The kernel is any symmetric (or Hermitian) positive definite function; see also Theorem 2.21. Hence, in a general formulation, where the (non-)linearity of K determines the nonlinearity of f in the original space, the SVM is defined as
+
f(X)
=
c
ai
yi K(x,X i )
+
010.
(4.18)
oi,>O
As for now, K was assumed to be a pd kernel. Note that any conditionally positive definite (cpd) kernel can be used. Proposition 4.3 A p d kernel is also a cpd kernel. The SVM can he constructed o n any cpd kernel K . Proof. This is trivial to observe that any pd kernel is also a cpd kernel. For a real pd kernel, z T K z > 0 holds for any z , hence also for z such that zT1 = 0, which defines a cpd kernel. Note that The reverse formulation does not hold. Let K = ( I - IsT)I? ( I - slT),where sTl = 1. It is known that K is pd iff &7 is cpd; see Theorems 3.13, 3.14 and 3.18. It follows that an n x n matrix ? I is a matrix of negative square Euclidean distances in a vector space; i.e. ?I = -Dh2. Then I? = 2 K - diag ( K )lT-1diag ( K ) T .By the
4
192
T h e dissimilarity representation f o r pattern. recognition
use of equivalent algebraic transformations one can check that the function -$ a'diag (y)k d i a g (y)a a'l (which has to be maximized in the formulation of Eq. (4.20)) reduces t o a'diag (y) 2 K diag (y) CY + aT1. This holds thanks to the condition that a'y = 0. This gives a proper SVM optimization. 0
+
~
Nonnegative slack variables Iz. are introduced to handle linearly nonseparable classes. Their goal is to account for classification errors. The soft margin SVhl is found as the prinial solution of the quadratic programming (QP) procedure: Minimize s.t.
wTw
+ y C:='=, cz.
y? (W'@(X,)
+
70,)
2 1 - &.
2
= 1 , 2 . . . . ,n
(4.19)
Ez. 2 0 The term c:==, <+,is an upper bound on the misclassification of the training samples and y can be regarded as a regularization parameter, a trade-off between the number of errors and the width of the margin". To determine the SVM such that the inner products between vectors in the high-dimensional space, hence the kernel values, can be used directly, the dual12 QP is solved. The dual formulation relies on the Lagrange multipliers ( v i , 1: = 1 , 2 , . . . , n, called also a dual variable a. Following the standard way of determining this dual formulation for an n,xn kernel matrix K , Kij = (@(xi);@(xj)),one gets [Boyd and Vandenberghe, 2003; Cristianini and Shawe-Taylor, 20001: Maximize
-;aTdiag (y)K diag (y)a + a'l
s.t.
a'y = 0 0 5 ai 5 y 5
(4.20) i = 1,2,. .. >n.
Thc necessary condition for an optimal solution of an optimization problem with inequality constraints are the Karush-Kuhn-Tucker conditions [Bertsekas, 1995; Boyd and Vandenberghe, 20031, which for the SVM "In the literature, usually the constant C is used instead of 2 . A5 C denotes an estimatcd covariance matrix, we usc y as a trade-off parameter in this work. "Duality is an essential notion of linear and nonlinear optimization theory. One may refer to a standard textbook on mathematical programming for details such as [Bertsekas, 1995; Boyd and Vandenberghe, 20031. Since the dual of Eq. (4.19) is also QP, it is guaranteed that the values of the objective functions at the optimal solution for primal and dual problems coincide. This means that by solving the dual problem, the optimum of the primal problem can be reconstructed. Note that this duality is in fact related to the continuous dual spaces of normed (Banach) spaces.
Learning approaches
193
are [Cristianini and Shawe-Taylor, 20001: ai =
0
+
yi f ( x i ) 2 1 and
= 0,
0 < ai < 7 =+ yi f(x'i)= 1 and ai = y
+
yi f (xi) I 1 and
= 0, [i
(4.21)
2 0.
As a result, indeed, the solution is sparse in the coefficients ai. All vectors xi outside the margin, yi f(xi)> 1, do not contribute to the SVM, as ai = 0. Only the vectors which lie on the margin, y i f (xi)= 1 or inside the margin yi f(xi) I 1 yield non-zero ai, and are support vectors, rnentioncd before. According to theory from Sec. 2.6.1, a pd kernel K in the SVM is a reproducing kernel. It' defines a reproducing kernel Hilbert space (R.KHS) 7 - l ~on a set of fiinctions h(x) = a i y i K ( x , x i ) . The norm of h in % K is computed as l l f l l . ~=~ ai yi K(x, xi), t x j gj K ( x ,x,j))xHh. = cu'diag (y) K d i a g (y)CY thanks to the reproducing property of K . Minimizing the latter quantity, which is equivalent to maximizing -l]h311 7 - 1 ~in
xi (xi
xj
the dual formulation above, corresponds then to bounding a class of hyperplanes h in the regularization principle, mentioned in Sec. 4.1.3.2.
Remark 4.3 Note that a cpd kernel I? i s in general a reproducing k e 7 d in a Pontryayin, space of the rank of negativity equal t o 1. This follosws by Remark 3.7, as K = -D*2 in s o m e vector space. Th,e remon mh,y K can be used in the formulat.ion of the SVM %s that the solution function .f minimizes the n o r m aTdiag (y)K diag (y)a with respect to a in a switable subspace defined b y a'y = 0 , in which the indefinite K simplifies t o a pd K . This subspace, can he again treated as a RKHS. By a straightforward implementation, SVM can be constructed in a dissimilarity space. In the linear case, this leads to a kernel K consisting of the elements Kij = ( D ( p i , R ) . D ( p j , R ) ) .The SVM becomes then f(D(z> R ) ) = C,"=, ai yi ( D ( z ,R ) ,D ( p i ,R ) ) ~ g As . a result, K = D DT for a linear SVM in a dissimilarity space. Such a kernel K is positive definite by construction. Other positive definite kernels can be used as well. Note, however, that a sparse solution in a, provided by the method is obtained in the complete dissimilarity space D ( . ,R ) . Hence, it is not sparse in the representation objects R. It means that for the evaluation of new objects, the dissimilarities to all representation objects from R have to be still computed.
+
Relevance vector machine. Note that a sparse kernel model, identical in the form to SVM, was introduced in [Tipping, 20001. This so-called
194
The dissimilarity representation f o r p a t t e r n recognition
relevance vector machine employs a Bayesian approach to learning and gives a formulation of a probabilistic generalized linear model. It does not need to estimate the trade-off parameter y,neither use cpd functions. So, it might be an alternative to SVM.
Linear programming machines (LP). Given a properly defined objective function and constraints, a separating hyperplane can be obtained by solving a linear prograrrirriing (LP) task, making the optimization problem easier t’liari in case of the SVM. Assume N training pairs (D(mi,R), yi), i = 1 , .. . , N , y , ~= (1, -l}, arid a two-class problem, with the classes w1 arid wz of cardinalities 7‘Ll and n ~ respectively , N = nl 7’12. Let f be a separating hyperplane built by using the representation set R, i.e. f ( D ( x ,R))= wTD(z,R ) + tug. Then a simple optimization problem minimizing the number of misclassification errors <,j car1 be defined as
+
Minimize
N xi=,
s.t.
yz f ( D ( X 2 ,R ) ) 2 1 -
Bi
Ei
ti, i = 1 , .. . , N
(4.22)
2 0,
where either Bi = 1 for i = 1 , .. . , N or 8 - 1if yi = 1, and 0 . - 1 - nl nz’ otherwise. It is argued in [Bennett and Mangasarian, 19921 that the latter formulation guarantees a nontrivial solution, even when mean vectors of two classes happen to be the same. This LP task can be solved by standard optimization methods, such as the simplex algorithm or interior-point methods [Bennett and Mangasarian, 19921. Since no other constraints are included, the hyperplane is constructed in a n n-dimensional dissimilarity space D ( . , R ) . A sparse solution can be, however, imposed by minimizing the tl-norrn of the weight vector w, /Iwlll = ltujl, of the hyperplane Eq. (4.10). To formulate such a minimization problem in terms of an LP task (i.e. to eliminate the absolute value lwjl from the objective function), w,; is expressed by nonnegative variables a,; and /3j as wj = a,; - / 3 j . When the pairs (aj, / 3 j ) are determined for a feasible solution, then at least one of them is expected to be zero. Similarly to the SVM formulation, nonnegative slack variables ti, accounting for classification errors! and a regularization parameter y are introduced. The minimization problem becomes then: ~
cy=,
Minimize
(4.23)
Learning approaches
195
A more flexible formulation of such a classification problem was proposed in [Graepel et al., 1999bI. The task is to minimize I jwI 11 - p p . which nieans that the margin p becomes a variable of the optimization problem. Note that p = 1 in Eq. (4.23). By requiring that llwlil is constant, the modified version of Eq. (4.23) is introduced as Minimize
&
N
Ci. Q a ,
-
pp
B i , P 2 0.
In this approach, a sparse solution in the weights 'wj = cyj flj is obtainod. As a result, non-zero weights w j point to important objects from tlie original representation set R. This gives a reduced set Rso. Therefore, this LP formulation can also he used for the selection of the representative objects, starting from R = T . This solution is similar to tlie adaptation of the SVM for feature representatioiis defined with the L P machines [Srnola et al., 1999: Scliolkopf et ol., 2000al. From the conipiitational point of view, such an LP classifier is advantageous for two-class problems, since for new objects, only the dissimilarities to the objects from R,, have to be determined. If multi-class problems are tackled by a number of two-class problems, the reduction of R t>oR,9,might be insignificant when the individual decisions are combined. ~
~
k-nearest neighbor classifier (k-NN). The k-NN method constructed in a dissimilarity space relies on computing new dissimilarities, e.g. 4,distances, between the vector representations D ( z ,R ) . This may be interpreted as a new dissiniilarity representation built over the given one. Parzen classifier. This classifier models the class conditional probabilities P ( D ( . ,R)lwi) by density kernel estimation, the normal density function. The final decision relies on the rnaxirnurn posterior probability. Let a, bc the smoothing parameter in the i-th dimension. The posterior probability of tlie class w,? is estimated as:
(4.25) Decision tree. This decision function [Breiman et al., 19841 builds a tree. which partitions the space of all possible objects into siibregioris dcscribed
196
The dzsszmilarity representation f o r p a t t e r n recognition
in leaves. Different subsets of the original vector space are used a t different levels of the tree. Each sample is then classified by the label of the leaf it reaches. Usually, binary dccision trees are considered. The maximum entropy criterion and the Gini index are the most frequently used splitting rules [Breiman et al., 19841. In each node, they determine a feature, together with a threshold, to be used for the partition. When operating on a dissimilarity representation, feature selection is equivalent to object selection. Splitting takes place by checking whether the sample under study lies in a neighborhood (given by the threshold in the considered dissimilarity measure) of the sclected object or not.
4.5
Classification in pseudo-Euclidean spaces
Assume that R and T are identical. A symmetric dissimilarity matrix D ( R . R ) can be interpreted as a description of an underlying vector configuration X , determined by a linear embedding of D , as described in Sec. 3.5. Such a procedure relies on an embedding of the Gram matrix K = G, derived from D*2 by Eq. (3.7). K is a Hermitian (symmetric) kernel in a Euclidean or pseudo-Euclidean space. Therefore, one can also start directly from a similarity representation given by K . If D ( R ,R) is asymmetric, then two symmetric dissimilarity representations can be constructed as D1 = ( D DT) and 0 2 = ( D - DT)yielding two pseudo-Euclidean configurations X I and X 2 . Then two classifiers can be built and later combined; see also Sec. 9.7. So, without loss of generality, we focus on symmetric representations. If R c T (note that the embedding cannot be performed when IRnTI 5 l),then our reasoning relies on the embedding of D ( R ,R ) and projecting the remaining objects from T \ R to an embedded space as presented in Sec. 3.5.5; see also Fig. 4.10. Therefore, for the sake of simplicity, we assume that R = T. If X lives in a Euclidean space, then any traditional classifier can be used, whose examples were described in the previous section. If X lives in a pscudo-Euclidean space, then either the associated Euclidean space is used for building the classifiers, or the conventional classifiers should be appropriately adapted. Here, we limit ourselves to simple linear and quadratic decision rilles, since they naturally rely on the pseudo-Euclidean inner products. See also Sec. 2.7 for details on pseudo-Euclidean spaces. Other classifiers known in Euclidean (Hilbert) spaces should be still adapted. Let E = Rm = IR(”,q), m = p+q, be a pseudo-Euclidean space with the
+
197
Learning approaches
signature ( p , q ) and the fundamental symmetry is defined as
Jpq.
A linear function in E
+
This decision rule can be interpreted as f(x) = wTx uo, where w = I&] = RP+q. Remember that /El is constructed by replacing the negative definite inner product in Rq by a positive definite one. Analogous to the Euclidean case (i.e. from the Euclidean perspective), one can require that a signed distance of a vector x to the hyperplane indicates on which side of the hyperplane x lies. This means that ( W ’IlVIIZ X ) E + w o - vTJpqx+wo vT,‘T,,v is positive for objects x which lie on the same side as v is pointing to. Note that in order to satisfy this condition, one should require that llvllz > 0, otherwise the ambiguity arises if llvllz can have any sign13. Consequently, the Euclidean norm of P+v in RP, should be larger than the Euclidean norm of P-v in Iwq, where P+ arid Pare the fundamental projections. In practice, since we start from positive dissimilarities, the data will be embedded such that the ‘negative’ contribution of the space Rq is smaller than the ‘positive’ contribution of the space RP.Hence, the indefinite-norm of v will be positive. Moreover, 11xl1; > 0 for any x E Iw(p,q) corning from the embedding of D . Note that this is not any longer guaranteed when onc starts from any arbitrary indefinite kernel. The negative Contribution may be dominant, as for instance for the kernel K(x,y) = z l y l ziyi. An illustration of possible and impossible linear classifiers is shown in Fig. 4.15. Note, however, that the situations presented there cannot result from an embedding of nonnegative dissimilarities. Usually, spaces of a higher dimension are obtained by such embeddings. It is very hard to construct an example with more then a few objects, which would yield R.(’,’) as the embedded space; see also Fig. 4.16. In general we will assume a K-class classification problem with thc classes w1,. . . , W K in an embedded pseudo-Euclidean space E = R(”,q). The
vJPqin the associated Euclidean space
cz1
a(’,’),
I3Imagine a simple case in where f ( x ) = vTJllx = 0 and v = [0.5 separates two classes of objects, as presented in Fig. 4.15, right plot. Since
-1IT =
-0.75, then v and Jllv = [0.5 11 point into different directions, similarly as presented in Fig. 4.15, on the right. Assume that the class labeled by yz = 1 lies above the hyperplane (this is where v is pointing), while the class labeled by y, = -1 is below the hyperplanc (this is where rllv is pointing). Then x2 = [I 2IT has a label yz = 1 and xJ = [I -2IT has a label yJ = -1. The signed distances of xt and xJ to the hyperplane f are -3.33 and 2, respectively. So, t,he ambiguity arises.
198
T h e dissimalaraty representation f o r p a t t e r n recognition
<x,x>, < 0
iR' 3
iR
1
<x,xg < 0
:~
<x,x> >
<X
-1E~
-2L -3
-4
<x,x>,< 0
<x,x<< 0
Figure 4.15 Assume data points in J R ( l . l ) . The plots present hypothetical examples of a possiblc (left) and impossible (right) classification scheme by a linear classifier (solid line). Both cases will never result from the embedding of nonnegative dissimilarities, since there exists a number of pairs of points which yield negative square pseudo-Euclidean distances. In the right plot, this also holds for pairs of objects coming from different classes. Note also that in the lcft plot, the two hyperplanes ( v l , x ) & vo = 0 ( v 2 , x ) ~ vo = 0, (marked by solid and dotted lines, respectively) are related such that v 2 = 3 1 1 ~ 1 .
+
Hence,
IlV1IlE
= IIVZIIE.
vector representation of D ( R ,R ) is given as {xl,x2,. . . , xn}. In I ,X ( i ) denotes the mean vector for the class wi and X is the overall mean vector. Thc class prior probabilities are denoted as p ( w i ) . Nearest mean (NMC) and generalized nearest mean (GNMC) classifiers. The nearest mean classifier (NMC) is the simplest linear clakifier which assigns an unknown object to the class of' its nearest mean. In a pseudo-Euclidean space I ,obtained by a linear embedding of D , such a decision is based on the pseudo-Euclidean distance. Recall that d: (x,y) = YIIZ = (x - Y)T&,(x - Y ) . Assume a K-class problem with the classes w1, w2, . . . ,W K and the embedded vector representation {XI,x2,.. . , xn}. Let "(%), i = 1 , 2 , . . . , n be the class mean vectors. Define fj(x) = Idz(~,X(~))1, j = 1 , .. . , K . The reason of using the magnitude of the square pseudo-Euclidean distance is as follows. If x is a projection of L(: to I . then the original dissimilarities d ( z , R ) may be preserved only approximately. It may, therefore, happen that d z ( x , X ( i ) ) bccomes negative for some x and a particular class w i , especially if z is a non-representative test object. In such cases, one should rather judge the magnitude of the pseudo-Euclidean distances. The NMC rule classifies a new object 2 , represented as x, as follows IIX -
Assign
II:
to w J , iRf,(x) =
min
2=1.2,
,K
{ft(x)}.
(4.27)
Learning approaches
199
As this decision relies only on the pseudo-Euclidean distances t o the mean vectors of the classes, such a classification can be carried out in a rnodifird way without performing the exact embedding of D , as formally needed in Eq. (4.27). As a result, a generalized nearest mean classifier is obtained. Assume that the class w7 is represented by a dissimilarity matrix D(R',R') based on the set Ra = { p i , . . . , p X , } . Let a new object z be represented by the dissimilarities to the set R'. Then the proximity of x to the class w z is measured by the function ,fZ defined as:
n,
(4.28)
n,
Vd(R') is a generalized average variance for the class w,, as introduced in Sec. 3.5.4 for details. Assume that X I is a pseudo-Euclidean configuration obtained from the embedding of D ( R z , R z ) Hence, . X ' is represented in €, = R(J'ZJQ~) Note that &, usually differs from the space E referring to the complete matrix D ( R ,R ) . It follows from Sec. 3.5.4 that f t ( z ) can be equivalently formulated as = I jIxL-xzII~~/ = (x2,Z)I, where x' is the vectorial representation of the object z and X' is the mean vector of the entire representation Rz in E,. This means that f Z ( x )measures the squart distance of xz to the mean of the 2-th class in a pseudo-Euclidean space. The interesting point is that such a distance can be computed without performing the embedding explicitly, since it operates only on the given dissimilarities D , Eq. (4.28). As a result, a K-class GNMC is defined as
I,(.)
Assign
IC
to w3, iff j
=
arg min { f i ( z ) } . 1=1,...,K
(4.29)
In summary, z is assigned to the class of the nearest mean, where each mean is described in an underlying space defined by the within-class dissimilarities. Additionally, we will derive what the average of the bctwecn-class square dissimilarities stands for. Proposition 4.4 Assume an n i x n j subm,atrin: Dij = D ( R z R3) , describing the between-class dissimilarities for the classes w, and wa. Let € i j denote a pseudo-Euclidean space resulting from the embedding o,f D ( [ R zRj], [RiR.71). Let 3i.i be the fundamental symmetry. Then the average between-class dis-
The dissimilarity representatzon for p a t t e r n recognztion
200
samzlanty dE(w,, w,) equals t o
cc clIx~/l;tJ+ c nL
2 db(W,.W,)
n,nJ 1 =-
n, where
xi
n~
= ___
d2(PLP.2)
k=l k 1
nL
1
n3
ilx;ll:27 - 2 (-% x ,x -j
-
k=l
nJ
(4.30) )&?,.
1=1
and x!, as well as 2 and 5 3 are repwsentpd m the space E,, .
Proof. In general, the relation between the square distances and the corresponding inner products is D*2 = glT+lgT-2 G, where G = XJ,,,,XT is the Gram matrix of X and g = diag (G). Let G, = XzJj(XJ)T, g, = diag (GT7) and g, = diag (GJ,). Assume that l n tstands for a vector of the T length n,. One has Dl: = g,lLJ + l n ZTg-2G,, J and also, l,, g , = t r (G,?) = I\xi\\i,,.Now, one gets
If we assume that the spaces E,, EJ and &%, yield the same signatures (although this is not likely), then based on the relations Eqs. (3.16) and (3.17), one can write d;(w,,w,) = Vd(R') Vd(R3) ll?? - 3il:tJ. By this, the square pseudo-Euclidean distance between class means in EZ3 can be ex-
+
+
pressed by the use of the distances as:
((XZ -FJi[;z,
= d&J,,w,)
-
V2(R2)- %(R').
(4.31)
For two classes, the above equation is the difference between the average between-class square dissimilarities and the average within-class square dissimilarities. So. the value of d;(wz,w l ) - Vd(RZ) - Vd(RJ)computed in a general case, approximates the square pseudo-distance between the class means in the ernbedded space. In general, the NMC and the GNMC in a pseudo-Euclidean space are not identical classifiers. The NMC is trained in a pseudo-Euclidean space
Learning approaches
201
E found from a linear embedding of the complete matrix D. Therefore. the dimension of E is determined by both the within-class and between-class dissimilarities. The GNMC operates only on the within-class dissimilarities. Although the embedding is not performed directly, the GNMC works in underlying feature spaces E,,defined for each class separately. It may happen that the signatures of feature spaces €, are not the same. In such a case, the performances of the NMC and the GNMC differ, because the NMC unifies the pseudo-Euclidean spaces and the signatures for all the classes, while the GNMC treats them separately. Since the GNMC makes use of distinct signatures, its accuracy is expected to be higher for problems in which the classes arc described in different ways. Fisher linear discriminant (FLD). To construct the Fisher linear discriminant, the notion of a pseudo-Euclidean covariance matrix is needed. Recall Eq. (B.1). where for a representation of n vectors in PQ(p,q), it is defined as C& = ~ ~ = “ =-, X) xZ (xz- X)’Jp, = CJpq,where X is the overall mean vector in E and C is the covariance matrix in the associated Euclidcan spacc l€l. Note that C, is .J-pd14 (pd in a pseudo-Euclidean sense). Let CB be a between-class covariance matrix and let Cw be a pooled within-class covariance matrix. both in the associated Euclidean space 1&1 = Then CB J p q and CW .Jp4 are the pseudo-Euclidean betweenclass and pooled within-class covariance matrices, respectively. Following [Goldfarb, 19851, the weight vector v of a Fisher linear discriminant. f(x) = vTJpq x V O , is defined by maximizing the pseudo-Euclidean Fisher criterion J(v) (the ratio of the between-class scatter to the within-class scatter) in the pseudo-Euclidean sense (see also Sec. 2.7):
+
(4.32)
By w = JPqv,this criterion reduces to the standard Fisher criterion J ( w ) , as defined in Eq. (4.14), with the solution w = CG1 (XI- X2). Hcncc. for a two-class problem, the FLD is determined by v = JPq CG1 (XI- X2) arid 14First, C& is 3-self-adjoint (3-symmetric), Def. 2.93 and Def. 2.99, since Cg = use of the fact that 3pq3pq = I and that C = CT). Now, C, is 3 - p d , since J p q C ~ is pd in the Euclidean sense, by Def. 2.103. This is true, since JpqC&= 3 p q C 3 , q= JppsATA3qq = ( A J p q ) T ( A 3 p q ) , where the latter matrix is pd in the Euclidean sense by construction.
TPPq C i J P q = 3 p q 3 p p s C T J= p qC J P q= C& (here we made
202
The dissimilarity representation for pattern recognitzon
Theoretical data 5 0
-5
-5
:B71 Embedding, I o 6
Embedding, I I
Embedding, l o g
0
” . t ,
0 R’
-5 5 - 5
0
R(’
5
-
-5
Embedding, i 2 5
5
5
0
1)
R(l,l)
-5
0
5
-5
-5
R2
0
5
R2
Figure 4 16 Simple illustration of the FLD decision boundary in embedded spaces The leftmost plot presents a 2D theoretical data Only three points (marked by circles) are used for training, since then the data can be peifectly embedded in W2 The remaining points, marked by ‘+’ and ‘*’, belong to the test examples, projected on the rctrieved (pseudo-)Euclidean spaces The following plots show the embedding results of the distance representations D = ( d Z 3 ) where , d,, = (C”,,, / z , k - X ~ ~ ( P ) and ~ / P p = (0 6 , 0 9 , 1 5,2} For positive p < 1, the distance is non-metric In all the plots, the FLD determined by the three training points in the original or embedded spaces is drawn For p = 2 (the rightmost plot), the theoretical data are retrieved up to rotation
!,
110 =
eP
-4 (Xi+ X 2 ) T J P q+~ log c,
- T - 1
x2)
P(W2)
’ which can be simplified to
+
x - - (Xl Xa)TCG1 (XI - X2) 2
+ log -
This means that the FLD in a pseudo-Euclidean space coincides with the FLD built in the associated Euclidean RP+q. See also Fig. 4.16.
Quadratic classifier (QC). Consider a pseudo-Euclidean space E = C& = CJPq is a covariance matrix of the configuration X in IW(P.4) and C is the covariance matrix in the associated space RPiq.
Iw(P>”). Then
Analogous to the Euclidean case, the Mahalanobis distance between a vector x and the mean X of X in I is given as ((x - TT),CF1(x- E))e = (x - X)TJPqCF1( x - X) = (x - X)TC-l (x - X). The latter follows from the fact that CF1 = JpqCpland JpqJPq = I . So, a quadratic classifier for a two-class problem can be constructed siniilarly to the Euclidean case as:
(4.34) where C1 and C, are the estimated class covariance matrices in RP+q,and p ( q ) and p(w2) are the class prior probabilities. Consequently, this QC in IR(P.q) coincides with the NQC in the associated Euclidean space Rp+q.
Support vector machine (SVM). The principles behind the SVM in Euclidean (Hilbert) spaces are described in Sec. 4.4.2. The SVNI is defined
Learning approaches
203
+
as f(x)= Ccu,>o cyi yi K(x,xi) NO,where K is a (conditionally) positive definite kernel. A linear kernel can be written as K = XXT, which is equivalent to the Gram matrix defined by Eq. (3.7). Since the linear SVM is based on the inner products only and by the linear relation Eq. (3.7) between the square Euclidean distances 0"' and K , the SVM can be easily constructed in the underlying space without performing the embedding explicitly, provided that the dissimilarities D = D ( R ,R) are Euclidean. New objects, represented by D(T,, R), can be immediately tested by using K,, the cross-Gram matrix between new objects and the original objects, as explained in Corollary 3.7. Even simpler, if K = -D*2 is used, then by Proposition 4.3, K is cpd, hence it can be directly used in the SVM optimization provided by Eq. (4.20). Moreover, since D is Euclidean. then D*' is cnd by Theorem 3.13. K1 = ezp(-aD*2) and K2 = (a + D*2)*(-1) are positive semidefinite by Corollary 4.3, hence they can also be directly used as Mercer kernels in the SVM. For a non-Euclidean dissimilarity matrix D , the corresponding Grarn matrix K is indefinite, so the traditional SVM cannot be used. However, any symmetric K may be interpreted as an indefinite reproducing kernel in a suitable Krein space (or a pseudo-Euclidean space, in fact). The corifiguration X is found by Eq. (3.14), i.e. X = Q lAI1I2, for which a linear classifier is defined by Eq. (4.26). If we now assign w = JPqv; then the classifier f(x)= vTJP,x+v" can be treated as f ( x ) = w T x + u ~in thc associated Euclidean space. The operation vTJpqis seen as flipping the values of the vector v in all 'negative' directions of the pseudo-Euclidean space. This is equivalent to flipping the negative eigenvalues to positive ones and considering the pd kernel K' = Q ( A /QT in the associated Euclidean space [Graepel et al., 1999aI. As a result, K' is a proper pd kernel to be used in the SVM. This procedure is costly as it relies on a complete embedding of D . One may also try to define an indefinite SVM, directly in the pscndoEuclidean space. Consider a Hermitian kernel K in a pseudo-Euclidean space, Def. 2.105. K can be considercd in a Krein space, but since K is finite, this effectively reduces to the pseudo-Euclidean case. Consider a linear classifier f ( x ) = vTJPq x + 110 in E . Analogous to the Euclidean space, the margin between two separable classes is defined as 2 The llvllz traditional SVM relies on finding a hyperplane, which maximizes the margin: hence minimizes the norm of the weight vector. This translates to the minimization of the indefinite-norm IIvIIE, which is a proper formulation '
3
T h e dissimilarity representation f o r p a t t e r n recognition
204
in a pseudo-Euclidean space by Theorems 2.7 and 2.30. Remember that I /vI1; > 0 is required, by our discussion in the first paragraph of this section, so the indefinite-norm of v is bounded by zero from below. From the Euclidean perspective, one guarantees that the stationary point v , ~ (of a constrained problem) is the minimum of the function F ( v ) = IlvII;, if tlie Hessian of F ( v ) is positive definite for v,. As the Hessian is equal to H = Jpq, from the Euclidean perspective, one requires that vTHv is positive. This holds due to our requirement I lv/1; > 0. On the other hand, form the pseudo-Euclidean point of view, the stationary point v, yields a minimum, if the Hessian H of F ( v ) is defined by an indefinite matrix described by the fundamental symmetry; Theorem 2.30. This is indeed fulfilled, as H = Jpy. (Note that there is no restriction on the sign of 11v11;.) Consequently, although Jpq, g > 0, is not pd in the Euclidean sense, the optimized function is positive definite at the stationary point. This means that the primal formulation of a soft margin indefinite SVM can be solved by a non-convex QP, given as:
i
i
Minimize s.t.
vTJp,v
+ y C:=l <,
+ wg)2 1 ti,
yi (vTgPqx,
-
i = 1,2 , . . . , n
(4.35)
G 2 0.
cy=l
The term 0, then for v’ = gpqv, we have ~ ~ = (v, J’~ ~ ~ V ) ~~ J = ~~, vT (~ J p~q~ =vV 11) v11;, which means that v’ is also a solution; see Fig. 4.15 for a simple illustration. Let K be the Gram matrix determined in the embedding of D . Then the dual formulation becomes the same as Eq. (4.20), but for non cpd K :
-;
Maximize aTdiag (y)K d i a g (y)a s.t. aTy = 0,
O
+ aTl (4.36)
i - l , 2 , . . . , ri.
+
The SVM becomes then f ( x ) = Ca,,O ai y, K(x,xi) a g . Analogous to a Hilbert kernel, based on the theory from Sec. 2.7.1, K is a reproducing kernel, hence it defines a reproducing kernel Krein (or pseudo-Euclidean) space ICK on a set of linear functions h(x) = azyIaK(x; xi). The indefinite norm of h in ICK is computed as I J h l l =~ ~ ( C , ai yi K ( x ,xi),C, aj y j K ( x ,xj) ) x K = aTdiag (y)K diag (y)a due to the reproducing property of the kernel. Minimizing the latter quantity in
xi
Learning approaches
205
the Kreki sense, (or maximizing - l l h l l ~ Kin ) the dual formulation above, corresponds then to bounding a class of hyperplanes h in the regularization principle, Sec. 4.1.3.2. Hence, an indefinite SVM with a positive indefinite norm llhlIxK is a proper statistical learning technique. Note also that instead of a kernel K , directly -D*’ can he used in the optimization Eq. (4.36). For a further analysis arid the connection between the SVM and the problem of separation of convex hulls in pseudo-Euclidean spaces, see [Haasdonk, 20051. Additional insights can be found in [Ong et al., 20041. In summary, given a dissimilarity representation D ( R ,R ) , the SVM can be built in the underlying feature space as follows. First, the Gram matrix K is derived according to Eq. ( 3 . 7 ) . If K is not pd, then either the problem is treated in a Euclidean space by considering the pd kernel K’ = Q (A1QT on which the SVM is built according to Eq. (4.18), or an indefinite SVM is determined directly on K by solving Eq. (4.36). The latter case can only be accepted if the found solution is such that a’diag (y)K diag (y)a is positive. In pract,ical algorithmic implementations, one needs to solve an indefinite QP problem by finding a specific saddle point (defined by J&). LIBSVM [Chang and Lin] may be of help here.
4.6
On generalized kernels and dissimilarity spaces
Kernels are usually defined as symmetric (Hermitian) operators in some Hilbert space being pd or cpd, see also Def. 2.82. Here, we will focus on kernels over the field of R. Any Mercer kernel, such as a finite symmetric pd matrix, can be seen as a Gram operator in a Hilbert space ‘FI, hence as a (nonlinear) generalization of the similarity measure based on inner products. This holds thanks to the Mercer’s condition, Theorem 2.21, which guarantees the existence of a mapping q5 : X + ‘FI from an input space X (which might not be explicitly given) to a Hilbert space ‘FI such that K ( z ,y) = (q5(z),d(y)), where q5(z) is the image of z E X in X.The squared distance in IFI is defined by using thc norm as d h ( z . y ) = II$(x) q!(y)ll’. Thanks to the relation of K ( z .y) = (q5(x),4(y)), one has: ~
d & ( 2 , y) = IIq5(z)- q5(y)II2 = K ( z , z )- 2 K ( z ;y)
+ K(y.y).
(4.37)
Note that we will write d&(z,y) instead ofd&(q5(z),d(y)), since d& can be determined by the kernel values only (without knowing 4).
T h e d i s s i m i l a n t y representation f o r p a t t e r n recognition
206
Corollary 4.1 d k ( x ,y) is a cnd (conditionally negative definite) kernel.
Proof. By Def. 2.82, it is sufficient to show that Crj c i c j d g ( z i , z j ) is positive for all TZ E N and all sets (21, 2 2 , . . . , x,} G X and {cl, ca,, . . , cn} C@ such that C,"=, ci = 0. One has: ci c j d & ( z i , z j )=
cc;=1Ci)(C,"_,
CjK(Zj,Z j ) )
2 CYj ci c j K ( z i , x j ) = -2 K is pd.
+
xxj
(Ly, cj)(c:=l
C i K ( G ,Xi))
Crj e i c j K ( z i , z j )< 0, since Cy=lci
=
-
0 arid 0
d g ( z .y) is cnd only because K is pd. This is in agreement with our previous results discussed for finite Euclidean matrices and the corresponding Gram matrices, as presented in Theorem 3.13 and Theorem 3.18. By fixing the origin in 'H such that K ( z ,2 ) = d & ( x ,O ) , Eq. (4.37) becomes
K ( x , y ) = - -1 [ d2H ( 2 , 1 / ) -d&(z,O) - d L ( y , O ) ] 2
(4.38)
Note that in practice, the distances refer to the mapped vectors q5(z1),. . . . q5(zn)in 'H, hence the origin can only be chosen in their convex hull. So, the zero vector 0 in 'H can be set to a weighted mean of these vectors, i.e. 0 = $ C,"=, si q5(zi) = q5(Zs) = q5s, where sT1 = 1 and q5s stands for a weighted mean in X. By straightforward algebraic operations, similar to the ones in Eq. (3.4), one can find that d&(zi,z,)= n s k sl d & ( z k , zl),where 4 is omitted. Z;=l s k &2( z i , z k ) $ k = l El=, Assuming that. this expresses a square distance to the origin in 'H, for K € I R n X n Eq. , (4.38) translates to ~
K
=
c"
1 ( I - I s T ) D.;I"( I - slT), where sT1 = 1. 2
--
(4.39)
K is pd iff D.;I" is cnd (or equivalently, iff -D@ is conditionally positive definite). This follows from considerations in Sec. 3.4. Some of the kernel properties can be expressed in the continuous domain. Now we briefly present a few characteristics of the positive and conditionally negative definite kernels. Then we explain how to interpret dissimilarities as distances from a possibly higher-dimensional space, where the mapping from an underlying abstract space is known only by the (gerieralized) inner product. This part is essential for understanding of our classification methods, introduced in Chapter 9. The class of pd kernels is closed under addition, multiplication by a positive constant and pointwise limits [Berg et al., 19841. Moreover, it is also closed urider a tensor product arid a direct sum. Formally, one has:
Learning approaches
207
Corollary 4.2 [ B e y et al., 1984; Cristianini and Shawe-Taylor, 2000/ (1) Let K1, Kz be Hermitian pd (psd) kernels. T h e n K ( z ,y) 1 ( 1 ( ~ a) , K2(2,y) is also pd (psd).
=
Proof: Proof follows from the Schur th,eorem, [Horn, and .Johnson, 19911 that the Hadamard product of positive definite matrices i s also positive definite [Berg et al., 19841. (2) Let K 1 : X x X + C and Kz : y x y 4C be Hermitian kernxls. T h e n K1 @ K 2 ( ( . 2 . 1 , ~ 1 ) , ( . 2 . 2 , ~ 2 )= ) K 1 ( . 2 . 1 , 2 2 ) K z ( y 1 , ~ 2i s) a pd kernel on ( X X y ) x ( X X y ) . (3) Let K 1 : X x X + C and K2 : y x y 4C be Hermitian kemels. T h e n K I $ K Z ( ( ~ I ,( ~2 I2 ), ,~ ~=) K ) l ( z l , x z ) + K z ( y l , y 2 ) is a p d kernel on ( X X y ) x ( X X y ) . The relations 2 and 3 above hold also for positive definite (defined over R) kern&. Corollary 4.3 (Relations between pd and cnd kernels) Let K and D be real kernels and let a > Q . One has [Berg et al., 1984; Cristianini and Shawe- Taylor, 2000/: (1) If K i s psd, then K
= e*uK =
(e°KzJ) is psd.
+
+
Proof: By the Taylor expansion, one gets e*aK = (. llT+ K & K*2 l3 !K * 3+ ...). Thanks t o Corollary 4.2, K*' i s psd for a positive integer r and the s u m o,f psd kernels is psd. A s a result, K i s psd as well. (2) D*2 i s end aff K = e-*' D*' i s psd. Proof: We know that D*' = diag ( K ) l T+ l d i a g is end is equivalent t o K being pd. T h e n one has K , - * ~ ( d i a g ( K ) l ~ + l d i a g ( K ) ~ - Z K ) = e-*udiag(K)1Te*2u
K
2 K . D*' ePaD" =
-
=
e -*u
ldiag(K)T
-
Ue*2uK U , where U is a diagonal matrix of positiue n.umbers U
=
diag(e-*uK). Sine U i s p d and e * 2 a K is psd by the statement above, then K is psd by the Schur th,eorem. (3) D*' is cnd ifl = .( D*2)*(-1)= (&) is psd.
K
+
Sketch of proof: First, it is triuial to show that if D*' is c n d , then D*' + a l l T is cnd as well. Next, note that e-('+2)z;dll: = T h e n cTKc = ~ ' ( 0+ o*')*(-') c = J '' c (e-('+o*2) * .( 1 1 T ) ) c d.2..
Jr
B y the point above, f o r positive z, the matrix (e-D*'+ullT) i s psd. So, K i s psd as well.
(4)
If Dh2 is cnd and for all x, one has D * 2 ( x , x ) 2 0 , then Kl with r E (0.1) and K 2 = log(1 + D*') are end.
A.
* (:z =
llT)
D*2T,
T h e dissimilarity representatzon for p a t t e r n recognition
208
From Sec. 2.6.1, we know that any symmetric pd kernel defined on a compact set or an index set T is a reproducing kernel for a Hilbert space RK consisting of bounded linear maps defined by the evaluation map 4 : z + K ( T . . ) . Hence, RK contains all finite linear combinations of the form h ( z ) = C ka k K ( J ~ J ),. As a result, K ( z ,y ) = ( K ( z ,.), K(y, . ) ) x K . If' T is a set of finite cardinality, say n, then the functions are evaluated only at a finite number of points. Consequently, the RKHS becomes an ndimensional space, where the functions simplify to n-dimensional vectors. Now. we propose to consider generalzzed kernels as arbitrary symmetric countable matrices. Such similarity matrices are kernels of the pseudoEuclidean space € or, more general, of a Krein space. Remember that a matrix K is J-symmetric, i.e. self-adjoint in the pseudo-Euclidean sense, if J p y K = ITT&,; see Def. 2.93. Therefore, one has K ( z ,y) = ($(x),$ ( y ) ) ~ , where $(z) is the image of an object z in E . Based on a logically appealing cxtcnsion from the positive definite inner product to the indefinite inner product, thc squared distance in E is defined as d$(x,g ) = ~ ~ $ ( z ) - + ( y ) ~ ~ $ , which reduces to d z ( z , y ) = K ( z , z ) 2 K ( z , y ) K ( y , y ) . By similar considerations as for the pd and cpd kernels, the equivalent formulations for an indefinite K are obtained as: ~
+
(4.40) and also
K
=
1 ( I - I s T ) D22 ( I - slT), where sT1= 1, 2
--
(4.41)
where D;2 is a matrix of' square pseudo-Euclidean (Krein) distances in E . Hcnce, K and D*2 described above are related by linear operations. Any of them can determine the corresponding space E . Remember also that K is a reproducing kernel in E as follows from Sec. 2.7.1. An asymmetric matrix K can uniquely be described by two symmetric matrices, i.e. K 1 = ( K KT)and K2 = (K - KT),where each of them can be treated as a generalized kernel. If K is nearly symmetric, then K2 contains little information, hence negligible. In such cases, K zz K1. Such transformations are needed for the interpretation of K in pseudoEuclidean spaces. For an asymmetric K ,the corresponding D*2 is defined as D*' = diag ( K ) l T l d i a g (K)T K - KT, which is symmetric. An asymmetric square dissimilarity matrix can then be considered as D*2 = diag ( K ) l T +l d i a g (K)T-2 K . Anyway, asymmetric D*2 or K can directly be treated for building classifiers in a dissimilarity space.
+
+
~
Learning approaches
209
Figure 4.17 Assume two-dimensional theoretical banana data. Four dissimilarit,y representations are considered: D k ( T ,R ) , k = 1 , 2 , 3 , 4 , based on the Po.7-distance (nonmetric), !?-distance (metric, non-Euclidean), Euclidean and square Euclidean distance, correspondingly. A linear classifier f ( D k ( z ,R ) ) = C , w j D k ( z , p , ) is trained on each D k ( T ;R ) , where T is a training set of 200 points and R is either a subset of T consisting of 20 points chosen by the k-centers procedure (such points minimize the maximum of the dissimilarities over all objects to their nearest neighbors; see also Sec. 7.1.2) or R = T . Formally, a (R)NLC classifier f ( D k ( z , R ) ) is built in a dissimilarity space of the dimension IRI. Since the theoretical data are 2D, a discrimination boundary can he drawn in the original 2D space. The subplots show the data points and the projected discrimination boundaries found originally in four dissimilarity spaces D k ( . >R ) . The left subplot presents the results when R C T , where points of R are marked by circles. The right subplot shows the results when R = T , hence a regularized classifier had to be used. Note that a linear classifier in a square Euclidean dissimilarity space D 4 ( T , R)is quadratic in the original space, which is in agreement with our observations made in Sec. 4.6.1. Other classifiers are nonlinear with respect to D*’k. This example shows that the decision boundaries in both plots look similar, so an adequate and small representation set R may serve for a good discrimination.
4.6.1
Connection between dissimilarity spaces and pseudoEuclidean spaces
The idea of building classifiers in dissimilarity spaces is general, since they can be interpreted as decision functions in the underlying pseudo-Euclidean spaces. Hence, there is a connection between these two concepts. Assiirrie a dissimilarity representation D ( R ,R ) and the corresponding matrix K of inner products derived as K = -$ (I - ~11T)D*2((l - ;1lT).Let X be a pseudo-Euclidean configuration in IW(P.4) obtained from the embedding of D . This means that there exists a mapping 4 :p j 4xj, where xj = 4 ( p j ) . Then K = X J P q X T . Consider now a general linear classifier built in a dissimilarity space D*2(.,R ) . We have:
c,”=,
Proposition 4.5 A linear classifier f(D* ’(x, R ) ) = wj d’(x,pj)+wo constructed in a dissimilarity space D*’(., R ) is a quadratic classifier in t h e underlying pseudo-Euclidean space R(P>q).
The dissimilarity representation for p a t t e r n recognition
210
Proof. Let X result from the linear embedding of D ( R , R ) into IR(P>q). Let the object z be represented in R ( P i 4 ) as a vector x. Based on the relations between the square distances D*2 and inner products K , one can write: f(D*’(z, R ) ) = C,”=, w j d 2 ( z , p J ) wo = C,”=, wJ [ K ( z z) , 2 K ( z , P j ) + K ( P j , P j ) ] + w o= W j [xTJ P q X - 2 X ~ J P q X j + X ~ J P q X j ] += W~
+
c;=,
wT1xT&x - 2 wTXJPqx+ wTdiag (XJPqXT)+ wo. The latter formulation describes a quadratic classifier in R ( P i q ) . 0 If D is Euclidean, then R ( P , 4 ) simplifies to a Euclidean space RP.Without loss of generality, a similar relation holds for a dissimilarity space D ( . , R ) ,which can be seen as D*’(.,R), where D = D * i . So, the linear classifier f ( D ( z , R ) )is in fact a quadratic classifier in the iinderlyirig pseudo-Euclidean space R(P’>q’)as determined by the embedding of D*i . If D is Euclidean, then D * f is Euclidean as well, as guaranteed by Theorem 3.20. Another important observation is that the quadratic classifier in R(P A 1 , becomes an even more nonlinear decision rule, when projected to the R ( p ) qspace. ) In fact, any monotonically increasing nonlinear transformation of g ( D ) , such as D*‘, where T E ( 0 , l ) or sigm(D) will influence the nonlinearity of f ( g ( D ( x ,R ) ) )as observed in the R(p,q) space. Analogous to the linear case, a quadratic classifier in a dissimilarity space would translate to a 4-th order polynomial in the corresponding pseudoEuclidean space. Note also that a linear classifier built in a similarity space K ( . > R )i.e. , f ( K ( z , R ) )= C j w 3 K ( z , p j ) wo = W ~ X J ~ ~ X wo is a linear classifier in R ( P > qThis ) . can also be used for any similarity kernel derived from the dissimilarities by a monotonically decreasing transformation, e.g. K = ( e - ‘ : ~ / ~ ’ ) or K = ((d:J 0 2 ) ) - ’ )If . D is Euclidean, then based on Corollary 4.3, such transformed kernels describe relations in some Hilbert spaces. For a dissimilarity representation D ( T ,R ) , where R c T , a linear classifier in dissimilarity spaces can be approximated by a quadratic classifier in the underlying pseudo-Euclidean space corresponding to the embedding of D ( R ,R) and projecting the remaining T\R objects there. The reason of such an approximation is caused by the orthogonal projections of the T\R objects which are likely to yield errors. This means that the dissimilarities D(T\R, R)are not ideally preserved. Such an approximation can still be very good. Fig. 4.17 should help in getting some intuition. I
,
+
+
+
Proposition 4.6 A s s u m e a two-class classification problem described by D ( T . R ) and the labels y j E (1, -1). A linear classifier f ( D * 2 ( x R , )) = wJyj d 2 ( z , p j )+’uQ,constructed s u c h that w’y = 0 in a dissimilarity
c,”=,
Learning approaches
211
space D*2(.,R ) is a linear classifier in the underlying pseudo-Euclidean space E%(P,Q).
Proof. Let the object z be represented in Pg(p,q) as a vector x. Following the same reasoning as above, one has: f(D*’((z,I?)) = w j y j d2(x,pJ) UJO = wTyxTJpyx- 2 w’diag (y)XJpqx+wTdiag (diag ( y ) X J p q X y uu10= -2 wTdiag(y)XJppqx+w’diag (diag (y)XJp,Xy 7UO. The quadratic term vanishes due to the requirement w’y = 0. So, the latter formulation describes a linear classifier in W(Pig). From a coniput,ational point, of view, it might be useful to consider a classifier on -D*2(.3R),which becomes f(-D*’(z, R ) )= 2 w’diag (y)Xj;,x - wTdiag (diag (y)X,&’,,XT) w g .
~ ~ = ,
+
+
+
+
An example of such a classifier is the SVM. Note that the same reasoning as above holds for the classifier f ( g ( D ( z ,I?))), where g is a monotonically increasing nonlinear transformation.
4.7
Discussion
Since the notion of proximity underpins the description of a class as a group of similar objects, we propose to move the emphasis from features to a proper proximity measure. This leads to a representation based on proximities. Since kernels, which are particular types of similarity representations, have thoroughly been studied [Kernel Ma.chines], [Cristianini and Shawe-Taylor, 2000; Scholkopf et nl., 1999b; Scholkopf and Smola, 2002; Vapnik, 19981, we here consider dissimilarity representations D ( T ,R). These are relative representations describing the pairwise dissimilarities between the objects from a (training) set T and a representation set R. The strength of such representations lies in their applicability, as they can be derived from any measurement or structural description, such as strings or graphs, or ot,her intermediate representations. Since a learning problem can be characterized by various kinds of expert knowledgc, as a result a number of dissimilarity representations can be created and combined to better describe the underlying concept. This is studied in Chapter 10. This chapter focuses on learning techniques, mostly classification aspects. on dissimilarity representations. Three main strategies are distinguished, which rely on various interpretations of the dissimilarities: (1) The first approach focuses on the relations in local neighborhoods, de-
fined for each object by the dissimilarities to its neighboring objects.
212
The dissimilarzty representation for pattern recognitzon
This is always applicable, although in general, a large representation set R is needed for good performance. ( 2 ) The second strategy defines classifiers in a dissimilarity space, a vector space, equipped with additional algebraic structures such as the traditional inner product, in which each dimension corresponds to a dissimilarity to a representation object. This paradigm can be applied for any dissimilarity measure. Classifiers built here rely on the dissimilarities to all objects from R. Hence, this is a more global approach than the variants of the nearest neighbor. ( 3 ) The third methodology is applicable for symmetric measures and when R C T . However, since any square asymmetric representation D can be expressed as a sum of two symmetric representations D1 = ( D DT) and 0 2 = ( D DT),each of them can be considered separately and the results can be combined. The learning algorithm relies first on determining a pseudo-Euclidean vector configuration such that the dissimilarities D ( T ,R) are preserved as well as possible. Then traditional classifiers can be rnodificd and applied in such a spacc.
+
~
All dissimilarity-based learning strategies are designed for numerical representations interpreted in some spaces. Ineluctably, they make use of statistical methodologies, already developed in vector spaces, by their appropriate adaptations. The innovation of our methods lies in the acceptance of any nonncgative dissimilarity measure satisfying the reflexivity condition (hence also non-Euclidean and non-metric dissimilarities). These two requirements are not only logical, but enable clear interpretation of the cornpactness hypothesis, where a small dissimilarity depicts good agreement of the compared objects. Our algorithms can handle negative dissimilarities as well. The problem lies, however, in determining an adequate meaning for such dissimilarities. Although our focus is on dissimilarities, there exists an algebraic relation between dissimilarity and similarity representations through generalized inner products. One can be derived from the other by proper linear operations. This holds for their interpretations in (indefinite) inner product spaces. namely Euclidean (Hilbert) and pseudo-Euclidean (Krein ) spaces. Therefore, any symmetric n x n similarity matrix can be seen as a generalized inner product (Gram) matrix in the corresponding (pseudo-)Euclidean space. Based on such an inner product, a symmetric square distance can be defined. So, any symmetric nxn square dissimilarity matrix can be understood as a matrix of square pscudo-Euclidean distances. Because of
Learning approaches
213
such relations, linear decision rules built in dissimilarity spaces can be presented as quadratic (or linear) classifiers in the underlying pseudo-Euclidean spaces. All these considerations refer to pairwise representations. A natural extension is to depict a relation of one entity to a number of them or a relation of a partial concept to the whole concept, e.g. a resemblance of an object to a (sampled) domain, or of a particular process to a model process. This would require learning of a measure itself from a collection of objects belonging to a class, as well as other non-class representatives. Such representations are still an open issue.
This page intentionally left blank
Chapter 5
Dissimilarity measures
I n physical science the first essential step in the direction of learm in,g any subject is to find principles of numerical reckoning arid practicable methods for measuring some quality connected with it. I often say that ,when you ca,n measure what you are speaking about, and express it in numbers, you Srmui something about it; but ,when you cannot measure it, when, you cannot express it in, numbers, your knowledge is of a meager and unsatisfactory kind; it muy be the beginning of knowledge, but you h,ave scarcely in your thoughts advanced t o the state of Science, whatever the matter m a y be. “POPULAR LECTURESA N D ADDRESSES”, LORDKELVIN
Relative similarity can be defined as a relationship between two entities which are of the same nature or possess the same characteristics, but in a different measure or degree’. The larger the sirriilarity value, the greater the resemblance between the objects. Relative dissimilarity, on the other hand, fociises on t h e differences; t,lie smaller t,he dissimila,rit,y,the more dike the objects. Both similarity and dissimilarity values express the notion of likeness between objects, but their emphasis is different. Which is more suitable to define depends on the type of data and the problem at hand. In general, such a proximity is a function of the observed variables or t,hc measurements collected. We will refer to it as to a measure, although it might riot be such in the strict matlicmaticnl scnse. In this chapter, we will present a brief overview of (dis)sirnilarity measures for various types of data, together with their characteristics. Some of them are well known, while others are relatively new. The measures defined on the features are described in Sec. 5.1. Section 5.2 elaborates further on probabilistic measures, i.e. dissimilarity measures between distributions. Such measures are iniportant when we deal with images: sets of ‘The word ‘relative’ emphasizes the pairwise comparisons of objects. Conceptual measures defined to compare an example to a concept, such as a class of objects, are not discussed in this chapter. 215
216
The dissimilarity representation for pattern recognztzon
points, or representations of the data by clouds of vectors in a vector space, since such data can be described by probability functions. In Secs. 5.3 and 5 . 5 , we will move to measures more specifically used in the pattern learning area; these are measures created in the process of matching two sequences, shapes or digitally represented objects. A few more important dissimilarity measures are described more thoroughly to emphasize their properties and potential use. Section 5.6 finishes this chapter with a brief survey on measures developed for particular applications, while Sec. 5.7 presents a general summary.
5.1
Measures depending on feature types
Iri the statistical approach, data objects arc described by features. Although such representations are not our main concern here, the learning methods designed for them constitute an important basis; see Sec. 4.4. Therefore, some attention will be devoted to features. Moreover, the use of dissimilarities is an option for data consisting of mixed features. We distinguish the following feature types: binary, categorical, ordinal, symbolic and quantitative, introduced in Def. 5.1. These types might not be sufficient for a complete description, since the real-world data may suffer from (selective) lack of information, which leads to imprecise, vague, probabilistic or even missing data. Definition 5.1 (Feature types) Let F = { f l , f z , . . . , f m } be a set of features, also called variables or attributes, and V fa set of valid values for a fcature ,f. The following features f EF can be considered: 0 binary if D f is a set of two symbols or two numbers, e.g. 0/1 to encode the gender. 0 categorical if Vf is a finite, discrete set of numbers, e.g. from 1 to 4 to encode hair color. Here, we also include the case of a discrete feature, i.e. a feature with distinct and separate values, which can be counted, such as the number of children. 0 qumtitative if f is measured on an interval and D f is a convex subset of R;e.g. height, temperature, or the time required to reach a chosen place by car. 0 ordinal if Df is a finite, discrete set of ordered symbols, e.g. a scale from 1 to 5 representing the answers of ‘strongly dislike’, ‘dislike’, ‘neutral’, ‘like‘ arid ‘strongly like’, after tasting a particular food product. The distinction between consecutive points on the scale is not necessarily
Dissimilarity measures
217
object j
Figure 5.1
0
Counters for dichotomous data.
always the same; the difference in taste expressed by giving a rating of 2 rather than 1 might be much less than by giving a rating of 4 instead of 3 . symbolic or nominal if D Dis~ a finite, discrete set of symbols; e.g. nationality. Symbolic features represent a set of possible values, symbols or modalities. Their values can be counted, but not ordered.
Measures for dichotomous data. Dichotomous (or binary) features have only two values possible. They represent either the presence (1) or absence (0) of a particular characteristics or some opposite qualities, e.g. such as large (I) and small (0). The i-th object is represented by a binary vector m xi E t?", where t? = (0, l}. For xi, xj E t?", xlxj = Z i k z3k is the binary scalar product and (1- x) is the complementary vector of x. This allows us to define the following counters:
ck=,
0
0 0
0
the number of properties common t,o both objects xj) - the number of properties which i has and j lacks c i j = (1-x~)~x,? - the number of properties which j has and i lacks d i j = (1- ~ , ) ~-(xj) 1 - the number of properties that both objects lack ai,i = x:xj
bi,
= xT(1
+
-
~
+
+
where a,j b,, ciJ d,j = m2. For various definitions of similarity measures, a 2 x 2 contingency table is considered for each pair of objects i and j as presented in Fig. 5.1. A number of measures is proposed based on these values; see e.g. [Baulieu, 1989, 1997; Cox and Cox, 1995; Gower, 19861. Examples are presented in Tables 5.1 and 5.2, where the suffices i and j are omitted for simplicity. Such measures are often binary equivalents of other well-known formulations. For instance, in Table 5.1, the first measure is the binary dot product, the Jaccard measure is the similarity ratio, the Ochiai measure refers to the cross-product ratio, while the Pearson2 measure corresponds to the binary correlation coefficient. Gower introduced also two families of binary similarity coefficients depending on a parameter 6' arid defined as 'Although d is used to denote both the counter and the dissimilarity, its use is a p parent from the context.
T h e dzssimilarity representation f o r p a t t e r n recognition
218
(the suffices i and j are dropped) [Gower, 19861:
s,= a + d a++0d( b + c)
and
TQ=
a
+Q (b+ c)'
(5.1)
0,
For particular values of Q tlie above measures reduce to some of the forms 1 corresponds to the simple matching prcsented in Table 5.1. For instance, S similarity and T; refers to the Dice similarity. The metric and Euclidean properties of tlie dissimilarities 1- S Q ,1~ T and B their square roots depend on 8. Thcy are sumniarized below:
Theorem 5.1 (Gower) [Gower, 19861 ( I ) (1
~
SQ)and (1 - To) are metric for Q 2 1. (1 - So)i an,d (1 - T Q3)
Q 2 113. ('2) If (1 - S Q )i s~ Euclidean, then so is (1 - S ,)h for q5 2 8. The same relation, holds for TQ. (3) (1 S Q )is~Euclidem for Q 2 1 and (1 - To)$ i s Euclidean for 0 2 1 - So and 1 TQare not necessarily Euclidean. w e metric for
i.
~
~
Measures for categorical data. Let X be a categorical n x m data matrix and let the feature fk take values in c k categories such that c = c k . Dissimilarity measures defined for binary data, Tables 5.1 and 5.2, can now be adapted for the categorical data, as well. To achieve that, one has to code each m-dimensional data vector xi into a c-dimensional . . . X(,)]. x ( k ) is a vcctor of'the length c k consisting binary vector x i = [x(l); of all zeros except for 1 at tlie j - t h position assuming that xik belongs to the ;j-th category [Esposito et al., 20001.
z'FLl
Measures for ordinal data. Lct X be an ordinal nxm data matrix such that the feature f k has ck categories and c = CT="=,k. In case of ordinal variables, the dissimilarity measure should take into account the positions of categories in the ordering, and it should be larger for more distant Categories than for close ones. Here, a generalization of the Jaccard dissimilarity, Table 5.1, can be used for a comparison of the objects p i and p j , as follows:
Another approach relies on coding the ordinal vectors into the binary ones. The object pl can be represented as a c-dimensional binary vector yi = is a binary vector of the length c k consisting of [ Y ( ~.). . y(,)] T, where
Dissimilarity measures
219
Table 5.1 Similarity and dissimilarity measures for dichotomous data. The counters a , 6, c and d are defined on the previous page. ‘hf’ denotes the metric properties, while ‘E’ stands for the Euclidean behavior. The numbers below refer to the following authors: (1) Russel & Rao, (2) simple matching, ( 3 ) Kulczynski, (4) Jaccard, (5) Dice, (6) Sokal & Sneath, (7) Anderberg, (8) Rogers & Tanimoto, (9) Kulczynski2, (10) Andcrberg2,
(11) Hamman, (12) Yule, (13) Pearson, (14) Pearson2 and (15) Ochiai.
-
Dissimilarity D Similarity S
Range
a+b+c+d
a+d
b+c
a a+b+c a a+ i(b+c)
U t d
a++(b+c)+cl a a 2(b c)
+ +
a+d
a
+ 2(b + c) + d
l a 4(=+&
~
(a
+ d ) (6 t c ) + b + c+ d -
a,
bc
ad
-
ad
+ bc
+-+-) d c+d
d b+d
S psd D
= (1-S)t
M --
E
D = 1-S M E -
Yes
Yes
Yes
Yes
No
Yes
Yes
Yes
Yes
No
No
No
No
No
No
Yes
Yes
Yes
Yes
No
Yes
Yes
Yes
No
No
No
Yes
No
No
No
Yes
Yes
Yes
Yes
No
Yes
Yes
Yes
Yes
No
No
No
No
No
No
NO
No
No
No
No
Yes
Yes
Yes
Yes
No
No
No
No
No
No
Yes
Yes
Yes
No
NO
Yes
Yes
Yes
No
No
Yes
Yes
No
No
Yes
-
-
T h e dassamilarzty representation f o r p a t t e r n recognition
220
Table 5.2 Dissimilarity nicasurcs for dichotomous data. The counters a , 6, c and d are dcfincd on the previous page. 'M' denotes the metric properties, while 'E' stands for the Euclidean behavior Ref.
Dissimilarity D
Binary Euclidean
(b
Hamming
b+c
M
Range
+c)i
Variance Bray-Curtis
Binary size diff.
Binary pattcrn diff.
Binary shape diff.
2a+b+c (b
~
c)"
+ + c + d)'
(a b
bc
+ b + c + d)2 (a + b + c + d ) ( b+ (b (a+ b + c + d)2 (a
C) -
-
c ) ~
E
Yes
__ Yes
Yes
No
Yes
No
No
No
No
No
No
No
No ~
No ~
first h k ones, followed by ( c k - h k ) zeros. The observation X i l ; takes hk-th o f thc c k ordered values for the feature fk. Now, any binary dissimilarity can be applied.
Measures for quantitative data. Many measures exist for quantitative variables, mostly constructed in an additive way after counting the differences for each variable separately; see [Everitt and Rabe-Hesketh, 1997; Gower. 1986; Esposito et al., 2000; Cox arid Cox, 1995; Borg and Groenen. 19971. Some of them are presented in Table 5 . 3 . The basic measures come from the family of &distances. The tpmetric, for p > 1 is defined as dp (x.y) = = [ C ~ l ( x i - ~ i ) p ] l / which p, forp = 1 becomes the city block distance arid for p = 2, the Euclidean distance; see also Example 2.5. A second order statistical dependence among m quantitative variables can be described by their covariance matrix C. Then, the Euclidean distarice can be generalized into the Mahalanobis distance d2ni(x,y) = (x Y ) ~ C - (x ' - y). If C is unknown, its sample estimate C based on n objects is used. C is then estimated either as C = C:'"=,xt- X)(x( - X)T or, when k classes of the cardinalities ni are known, it becomes: C = 1 ri-k C':' J = 1 (xj - Xci))(xj - X ( ~ I )where ~, X(()is the mean for the i-th class. For the transformed data with the identity covariance matrix, d:, becomes Euclidean.
xf=l
Dissimilarity measures
221
Table 6.3 Dissimilarit,y measures for quantitative data in RTrL.‘M’ denotes the properties, while ‘E’ stands for the Euclidean behavior. __ D M Ref. Dissimilarity d(x, y) __ Yes Euclidean - YIT(X - Y ) -~
J(x
metric __ E __ Yes
Weighted Euclidean
Yes
Yes
City block
Yes
No
Max norm
Yes
No
ep or Minkowski
Yes
No
Yes
Yes
Median distance
No
No
Correlation-based
No
No
Correlation-based
No
No
Cosine
No
No
Divergence
No
No
Bray and Curtis
No
No
Soergel
No
No
Ware and Hedges
NO
NO __
Mahalanobis
J(x
- Y ) ~ C -(x I
7; C is psd
Measures for symbolic data. Symbolic objects are described by m variables f,,each on the domain D f , arid a logical statement of the form [ f z E X z ] .where X, C D f b ,e g [color E {red,green,yellow}] or [weight E (10,20)]. A symbolic object 2 is expressed as the Cartesian product of the ) the total event being a conjunction of all the fcatim values 2%= f % ( zwith events. The dissimilarity between two objects J: = [fl E X L ]A . . . A [,fill E X m ] and g = [ f l E Yt] A . . . A I f m E Ym]can be defined with respect to the components due to position ( d p ) ,span (d,) and content (&), all riornializrd
The dissimilarity representation for p a t t e r n recognition
222
to [O, l]?as [Gowda arid Diday, 19911:
c rri
d(z,y) =
[ d p ( % , Y i ) +ds(zi,vi)+dc(zi,yi)l,
(5.3)
i=l
The component d,, valid for quantitative variables only, indicates the relative positions of two variable values. By writing Xi = [z::z?] and Yi = [y:,yy] with the lower zr arid upper limits, one has d,(zi,yi) = . 1 - y:l/lDfzl, where IDfiI is the range of f i over all the objects. The remaining two measures, d s and d, are defined for quantitative, symbolic or ordinal attributes. The component d, indicates the relative sizes of the variable values without referring to the common parts between them as &(xi, yi) = 11, - 1,1/span (zi, yi). For quantitative values, 1, = lxy - zfI and 1, = IyP-y/fI, and the span, the length of the minimum interval containing both xi and yi, equals to span (zi, yi) = I max{zy, y?} - rnax{x:, yf}l. For other features, 1, = IXiI, 1, = lJ’i1 and the span becomes lXi u J’il. The component d, measures the common parts between the variables: &(xi, y i ) = 11, 1, - 2 length (Xin J’i)l/span ( x i , yi). For other dissiniilarity measures for symbolic objects, see for instance [Ichino arid Yagiichi, 1994; de Carvalho, 1994, 1998; Malerba et al., 20011.
XY
:
+
Gower’s generalized dissimilarity coefficient. A classical measure for data of mixed types is the Gower’s [Gower, 19711 dissimilarity. First, a general similarity measure for m variables is introduced as:
where s i j k = s ( p i , p j ) k is the similarity between objects p i and p j based on tlie k-th variable only, and S i j k = 1 if the objects considered can legitimately be compared and zero otherwise, as e.g. in case of missing values. For the dichotomous variables, b i j k = 0 if x i k = xjk = 0 and 6 i j k = 1, otherwise. The strength of feature contributions is determined by the weights W k , which can also be omitted if W k = 1 for each k . The similarity s i j k ? for i, j = 1 , .. . , R and k = 1 , .. . , m is then defined as:
1
Iw-.%kl
rk
,
Z(xik = xjk = l), Sijk =
fk
is quant.itative,
fk
is dichotomous,
fk
is categorical,
fk
is ordinal,
s (Pi,Pj)k= Z(.ik
1- g
=Zjk),
(
,
‘zzkT~zjk’)
(5.5)
Dissimilaritv measures
223
where ?“k is the range of the k-th variable arid g is a chosen monotonic transformation. Let SG = ( s i j ) , then the Gower’s dissimilarity matrix DG is defined as DG = (llTSG)*$. The Gower’s distance is Euclidean if no missing values occur [Gower, 19711.
Cox and Cox’s generalized dissimilarity measure. Cox and Cox proposed an extension to the Gower’s measure [Cox and Cox, 20001. It can be used for both mixed and non-mixed data, producing simultaneously dissimilarities between pairs of objects and dissimilarities between pairs of’ variables. As for the Gower’s dissimilarity, additional feature weights should be supplied by the user, while here, they are determined in an automatic manner. Let dij and d$ be the dissimilarity measures between objects arid between variables, correspondingly. Let the unweighted dissimilarity between the objects p i and p j , as measured by the k-th variable, be denoted as cuijk = d ( p i , p j ) k : . Its value can be found, e.g. based on Gower’s suggestions in Eq. (5.5). Let the unweighted dissimilarity between the variables f k and f i with respect to the object pi be ,&li = dF(fk, fi)i. Assume F = { f l ,f i , . . . , f m } is a set of all variables. The dissimilarity measure P k l i between the variables f k and f i (for the object p i ) will depend on their types: f k
is quantitative and
fi
is
quantitative or ordinal (suitably scaled). PkZz
=
,
f k
and
fi
are both dichotomous/categorical,
fk
and
fi
are both ordinal.
fk is quantitative or ordinal (suitably scaled) and f i is dichotomous or categorical, (5.6)
where p = 1 , 2 . n: is the number of objects for which thc k-th variable has the category A, nk is the number of objects for which the I-th variable has the category B and nilB is the number of objects with shared properties. f~ is the mean of the fk: values for which f i is recorded as the category A. Other measures between variables can also be considered, but they should be properly scaled. since they have an impact on the scaling of dissimilarities between the objects.
224
The dissimilarity representation f o r p a t t e r n recognition
As the unweighted dissimilarities between objects and features can be computed. one can derived the overall dissiniilarities as
k
k
2
i
(5.7)
where wk arid IU: are the weights for a variable arid for an object, respectively. It is assumed that a t 3 k = a j i k , Pkli = Plki and Q i i k = P k k i for all i , j , A:, d . The weights u i k arid w : are chosen to be proportional to the sum of the dissimilarities which refer to the variable fl or to the object p j , respectively. Therefore, for c, and c b being constants, the following equations are obtained:
1
The weights
1
2
WE
and w: can be scaled, e.g. by imposing that = The abovc equations arc solved iteratively: as explained in [Cox arid Cox, 2OoOl.
xi
wk
( 7 1 1 ~= ) ~ 1.
Difference metrics. Wilson and Martinez proposed three heterogeneous distance measures: heterogeneous value difference metric (HVDM), interpolatcd value difference metric (IVDM) and windowed value difference metric (WVDM), which can handle missing data and nominal variables [Wilson and Mart,inez, 19971. They use class information. First, the value difference rnetric(VDM) was introduced in [Stanfill and Waltz, 19861 for nominal variatblcs. The unweighted VDM distance between two values .?;I; and yk of the fcature f k is defined as:
(5.9) where K is the number of classes, n k , z k is the number of instances that take value of Z k for the feature f k , ng,Zkis n k . Z k restricted to the class c arid ;up.,, = P ( c / f ~=; zk) is the conditional probability that the output class is c given that fk has the value of 21~.Based on the VDM distance,
Dissimilarity measures
225
the HVDM metric is defined as: (5.10) where dfk returns a distance between two values for the feature
I
if x k or
yk
fk:
are missing
and for i s k being a standard deviation of f r ; . If there are no nominal variables, the HVDM distance reduces to tlie Euclidean one. Continuous values can be discretized into specified s equal-width intervals; K s << 'n. [Wilson and Martinez: 19971 indicates that thc choice of s might not be critical, provided that K 5 s << n. The width 7111, of the discretized interval for f k equals to wk = rangef,/s. The continuous value 2 can be now discrctized as discrk(x) = s , if 2 = inax(fk) or as discrk(z) = [(x - rniii(fk))/wcL]+1, otherwise. The IVDM distance is defined as:
<
m
IVDILf(x,y) =
C ivdrnk(xk, '
(5.12)
gk).
k= 1
where i v d m (a, uk) =
{
vdrrlk(Zk, ~ k ) ,
C,"==, &(xk)
if
fr;
is discrete
(5.13) -
pk(ylc)l2, otherwise
and pt(xk) is an interpolated probability of a continuous value z k for tlie z-midk feature f k and the c, i.e. P g ( ( z ) = P i , 1 ~ + ( ~ ~ ~ i d ~ , , , + ~ ~( P, li ,lui + d~ l -,p,i), u ) . midk,, and rnidk,++l are midpoints of two consecutive discret'ized ranges such that midk,7L5 x < midk,,+l. pi.u is the probability of the discretized range Y , where 2~ is found as u = discrk(u) - Z(x < midk,,) and rnidk,?,,= rnin(fk) wk (u+ 0.5). The IVDM distance can be interpreted as sampling the value of p i . , at the midpoint midk;, of each discretized range u. The probability p i is now interpolated based on the fixed number s of sampling points. Instead, a Parzen window centmeredat a given point can be used, giving rise to WVDM distance measure; see [Wilson and Martinez, 19971 for details. zi
+
226
T h e dissimilarity representation f o r p a t t e r n recognition
Other heterogeneous measures. Many other measures can be designed for rnixturc types, e.g. by combining the coefficients from Tables 5.1 5.3. either with, or without appropriate weighting.
~
Model of Tversky. Some models studied in cognitive sciences assume that human similarity assessment is based on tlie measurement of a distance in a psychological space [Goldstone, 1999. 1998, 1994; Wharton et al., 1992; Borg and Groenen, 19971. Objects are treated as points in a perceptual space and their difference is expressed by a metric. Tversky argued that from a liunian perception’s point of view, metric requirements are not verified in practice [Tversky, 19771. He claimed that a comparison of individuals is described by different sets of attributes. Hence, a feature contrast model was proposed, where instances are characterized by sets of features, instead of interpreting them as points in a metric space. Assume feature sets F?and .Fj given for the instances xt and z3,respectively. Then, the similarity between z, and x3 can be evaluated as
where ,f is a non-negative function. This measure describes the contrast between the corninon and distinctive features. Depending on the choice of a,P and f , different models can be obtained. An underlying assumption is that objects are characterized either by binary features or by features whose values correspond to the presence or absence of some attributes. Consequently, if Fi and Fj are sets of dichotomous features and f is the cardiiiality of a set, f ( P ) = IP/, then the Tversky similarity can be expressed as s T ( x i ,z j ) = a i j / ( a i j + (1 a ) bi.7 (1 + p) c i j ) , where a i j , bij and c i j are the counters defined before in the paragraph on dichotomous features. For suitable choices of a and p, some of the similarity measures presented in Table 5.1 can be obtained. Alternatively, the Tversky similarity can be expressed in the following form
+
s ~ ( z z,?) ~ , = f (Tin F j )
~
CY
+
f (Fi - Fj)- p f
(F? - EL).
(5.15)
The feature information can also be graded. This is achieved by tlie use of fuzzy features [Santini and ,Jain, 1997, 1996, 19991, represented as membcrship functions pf : D + degree, where each legal value of the domain 2) has a dcgree indicating to what extent this value is true. The membership functions can be subjected to arbitrary simplifications; usually continuous fiimctions are used such as logistic, Gaussian or piecewise
Dissamilanty measurea
227
linear. Let 4%correspond now to a set of measurements of the object xi and pk(&) be the k-th fuzzy feature. Given 711 feature, p(di) = 3, = {pi ( h )p,2 (4i);. . . , p m (&,),}, Let' 11s denote ,u& = , u k ( b j i ) . The intersection and the difference between 3 i and .Fj can be then defined as: -5n-Tj = { m i n ( p ~ k ~ , p ~ k j ) )ainj dk 3~i~- 3 j = {max(pl,rcz- p k j . O ) } l < k s m . Let rn$ = p k i - pk3. The Tversky similarity Eq. (5.15) becomes then: m
m
(5.16) and the dissimilarity is given as &(xi, z j ) = m - S T ( Z ~:c,i). , The Tversky similarity relies on considerations from set theory. Still, the niiri arid inax operators can be approximated by smooth functions. If h is the Heaviside function: h ( z ) = Z(z 2 0), then a logistic functioii h,(z) = l + e x p1( - n s ) , approximates h with any desired error for any non-zero ic (for :I:= 0: the error is 0.5, independently of g). The niin and max operators can he approximated3 by the functions s,(z,y) = zh,(y - x) y h , ( z - y) and l D ( x y) , = J: h,(z - y) :y h,(y - x), respectivcly. So, in Eq. (5.16). these operators can be replaced appropriately. This leads to the following dissimilarity d T :
+
+
m
m
m
k=l
k=l
k l
The Tversky's idea can be adopted to define new dissimilarity measures for continuous features. For instance, the similarity between two instances with respect to the feature f k can be measured as
where f i n is the value of the Ic-th feature for the i-th object arid r'k is the range of f k . If cy = /3 = 0, then the original Tversky's model simplifies to f ( F zn F j ) / f ( 3 % u F7).Let f be a weighted linear combination of the features, then we can define the total similarity ST as (5.19) 3This approximation holds thanks to min(z,y) = z h ( y max(z, y) = z k ( z - y ) y h(y - z).
+
-
z)+ y k ( z
-
y ) and
228
T h e dissimilarity representation for p a t t e r n recognition
wlierc wk are suitable weights assigned to the features. An asymmetric similarity can be obtained in the same way as above, but for Q = 0 and = -1. Yet, such a similarity should express the degree of inclusion of x, into x3 by using wk in the denominator of Eq. (5.19) instead.
xzLl
5.2
Measures between populations
To analyze tlie differences between populations described by vectors in a feature space, a number of dissimilarity measures can be considered. If the mean vectors are used to represent entire populations, they can be used to compute the between-group dissimilarities according to formulas from Table 5.3. Another possibility is to characterize a population by a multivariate probability distribution function (pdf) F(x). Then, the difference between two populations is measured by the dissimilarity between two pdf’s Fl and F2. A Kolmogorov metric [Gibbs and Su, 20021 is commonly used. For two distribution functions Fl and F2 it is defined as
(5.20) For some general probability measures and their relations, see [Gibbs and Su, 20021 and the following sections. As an extension, the evaluation of the inter-population dissimilarity niay also rely on describing each distribution as a point in a Riemann space with tlie coordinates specified by the population parameters. Example: a population characterized by a normal density function is defined by the coordinates of ( p ,C) in a rn m(m 1 ) / 2 dimensional space. Populations described by similar parameters will be mapped into neighboring points in this space. Provided that a suitable metric can be defined, a dissimilarity between tlie groups is the geodesic length (the shortest path connection two points on a manifold) between the points representing the populations.
+
5.2.1
+
Normal distributions
The assumption of data being drawn from a normal distribution is often made in practice. Hence, there is a need for proper dissimilarity measures. A classical measure between two normal distributions N ( p 1 C) arid N ( p zC) , with the equal covariance matrices C is the square Mahalanobis
Dissimilarity measures
229
distance LIM between their means:
Since the true distribution parameters are hardly known, in practice they 11 i are replaced by sample estimates: = C,i=l xj, i = 1 , 2 arid C = ( ( n l- 1)Cl+ ( n 2 - l)Cz),where n, denotes the sample sizes and 5 ' 1, 111 +nn -2 and Ci, i = 1 , 2 , represents the sample mean vectors and sample covariance matrices, respectively. The estimated Mahalanobis distance becomes then D z ( X l ,X2; C) = (XI- Xz)TC-1(5T1 - %a). If C = I or C = diag (oi),then the D R becomes the Euclidean or weighted Euclidean distance between the mean vectors, correspondingly. Note, however, that if the Mahalanobis distance is considered with respect t o the space X = N ( p , C ) , then the space ( X ,d ~ is )premetric; see Example 2.7. The Mahalanobis distance is based on the assumption of equal covariance matrices. For heterogeneous covariance matrices, its generalization leads to the normal information radius [Jardine and Sibson, 19711. Given two normal distributions NI = N ( p l ,El) and N2 zz N ( p 2 &), , one has:
5
(5.22) Another distance measure between normal distributions, suitable for heterogeneous covariance matrices, was proposed in [Anderson and Bahadur, 19621. Let b, = ( a x 1 (1 - a ) C 2 ) - ' ( p I - p 2 ) for a ~ ( 0 , l Then, ).
+
As before, the distribution parameters are replaced by sample estiniates. Other measures for normal distributions are presented in the next section. 5.2.2
Divergence measures
Many classical measures expressing the difference between two probability distributions Fl and Ez with the density functions .fl and f 2 are special cases of the &divergence proposed by CsiszAr [CsiszAr,19671, which is based
230
T h e dissimilarity representation for p a t t e r n recognition
on the likelihood ratio X(x) = fro. f l (X) ’
where 4(X) is a real, convex function defined on R+ such that 4(1) = 0, and p is a measure over the domain 23. Note that by inverting the argmnents Fl and Fz of d d ( F 1 , F z ) , another &divergence is obtained, i.c. d$(Fz,F l ) becomes dx4(1/x)(Fl, Fz). Moreover, the symmetric divergence, dq(F1,Fz)+dd(Fz, F I ) , can be considered as d$(x)+x4(1~~j(F~, Fz) [Esposit,o et d., 20001. Some well-known divergence measures for continuous and univariate histogram-like distributions are given below, together with the equivalent formulations for two normal distributions. Formulations for discrete distributions are omitted since they are straightforward generalizations of the continuous ones; by using summations instead of integrals. The presentation follows [Esposito et al., 20001. The study of relations between preseiited divergence measures as well as their generalizations can be found e.g. in [Taneja; 1989, 19951 or in the on-line book [Taneja]. For brevity, lct us denote Ni= N ( p i ,Xi), for i = 1 , 2 , and C = C1 = C2, for equal covariance matrices and the square Mahalanobis distance by O i l . The histogra,m-like distributions f1 and f z are constant on disjoint intervals I;’),. . . ,I:: and I;’), . . . ,I$:, respectively such
xTLl N
that fi(x) = h p ) Z ( z ~ I : ) )i, = 1 , 2 , where hp) are positive weights. J,t = I,:” n I j 2 )stands for the intersection of the two intervals I:” and It(2) arid p(,JSt)is the length (Lebesgue measure) of J,st.
Kullback-Leibler divergence. This measure, known also as inforrnatiori distance or relative entropy [Esposito et al., 20001, is obtained for 4 ( X ) = X log(X), X>0 and $(0) = 0: (5.25) Thc usual convention is log(0/b) = 0 for all b and log(a/0) = m for all non-zero a. Hence: ~ K yields L values in [O, m]. The Kullback-Leibler measure is based on the concept of information gain. If two populations are described by the probability distributions, d K L
Dissimilarity measures
231
expresses the average information for rejecting the first population in favor of the second one, when x belongs to the second one. This measure is asymmetric, hence non-metric. For two m-dimensional normal distributions, ~ K becomes: L
(5.26) or ~ K L ( N2) N ~=, D K ( p l ,p 2 ;C) when the covariance matrices arc equal. For two histogram-like distributions, dKL is given as
J-coefficient. For Leibler divergence:
#(A)
=
(A
-
1) log(A), we get a symmetric Kullback-
For two m-dimensional normal distributions, dJ becomes:
or d~(N1 ,N2) = D i J ( p l ,p 2 ;C ) , when the covariance matrices are equal. For two histogram-like distributions, one has:
Information radius. +l
+A)
log(1
+ S):
This is a symmetric measure obtained for
4(A) =
For two normal distributions, d l R becomes the normal information radius, as given by Eq. (5.22).
T h e dissimilarzty representation for p a t t e r n recognition
232
X2-divergence. This asymmetric measure (thus not a metric) is obtained for 4(A) = ( A - 1)’:
For two normal distributions, with (2.E;’ dX2 becomes: dX2
-
C,’) being positive definite,
(% , h ’ 2 )
+ ~L(CLlrO;C1) ~
2G/&,O;W)
1
-1
(5.31)
dX2 (nil,h”2) = exp { D;,(pl , p 2 ;C)} - 1, when the covariance matrices are identical. For two histogram-like distributions, d,z equals to:
or.
(5.32) Hellinger coefficient. This similarity measure is obtained for 4(X) = At, where I E ( 0 , l ) :
s g ( F 1 , F 2 )=
L
f2(Xy
fl(x)l-tdx.
(5.33)
For two m-dimensional normal distributions. sg’ becomes either
(5.34)
matrices are the same. Chernoff and Bhattacharyya coefficients. For t = $, the Hellinger similarity becomes the Bhattacharyya symmetric coefficient [Fukunaga, 19901. The Bhattacharyya distance is then given as:
dm(F1,Fz)= -log(sg)(FI,F2)).
(5.35)
Dissimilarity measures
233
For two normal distributions, it becomes:
(5.36) The Bhattacharyya distance is a special case of the Chernoff distance [Fiikunaga, 19901:
d&FI,F2)
=
-1og(&F1,F2)).
(5.37)
The Chernoff and Bhattacharyya distances are important in the classification area since they provide upper bounds on the Bayes error of two classes described by normal distributions [Fukunaga, 1990; Duda et al., 20011.
Variation distance and the l 2 distance. For the choice of $(A) = 11 A / or 4(X) = 11 - XI2, symmetric equivalents of the I I - and 12-distances are obtained: -
dP(F1.F2)=
L
For two m-dimensional normal distributions. d2
P = 1.2.
l f 2 ( x )- f l ( X ) I P d X ,
d2
(5.38)
becomes:
(Ni N z ) %
-
1
2m7rY
1
1 + ((det(C1))h ( d e t ( C 2 ) ) i )
-
2 (27r)T (det(C1
+&))$ (5.39)
or. when the covariance matrices are equal, onc has:
5.2.3
Discrete probability distributions
Let us consider n objects, described by rn categorical variables arid belonging to two groups. The groups are then treated as separate distributions. Let p p = be the relative frequency, where is thc riurriber of instances belonging to the j - t h category present of the k-th variable for thc i-th group, where i = 1 , 2 . Let p i = [ p 1 1 . . p ~ c 1 p , 2 1 . . p ~ " 2 . . pand ~ C 7 Cn k] be
ny/n
ny
The dissimilarity representation for pattern recognition
234
the number of different categories for the k-th variable and c = The inter-group distance can be computed as follows:
c;=!=, ck.
(5.41) Another possibility is to extend the Mahalanobis distance by replacing the continuous variables by the categorical ones. If C is a cxc sample covariance matrix, such a measure is given by: Dn-lkat(P1,Pz) 2 (P1 - p J T C - l (P1 - P2).
(5.42)
The affinity coefficient can be used as well. It is related to the Hellinger similarity, Eq. (5.2.2), arid it measures the resemblance between two categorical or modal features. or two histograms. Let p? = nkt 3 / n ,as above. Thus, those frequencies generate a discrete probability distribution. The affinity between two frequency distributions for the variable f k is expressed as afA = Cq:, the groups:
( p y p?)
'.
This leads to the affinity dissimilarity between 71L
(5.43)
where
5.3
wk
are appropriate weights.
Dissimilarity measures between sequences
Let A be an alphabet, i.e. a finite collection of symbols, also called letters, from which sequences or strings are composed.Let s = ~ 1 ~ 2. s,. . be a sequence of letters from A. An empty word is denoted by E and it has a null length. Such strings are used in the pattern recognition and machine learning areas for encoding objects of relatively homogeneous structure. Here, we will briefly introduce the most common distance measures.
Hamming distance. This is one of the most simple measures: for two sequences of equal length, it counts the symbol positions in which they differ; sec also Table 5.2. Without loss of generality, let s = ~ 1 . ~ 2. .s,., and t = tlt2 . . . t , be binary sequences. The Hamming distance is then defined as dHam ( s ,t ) = Z(sk # t k ) . It is not a flexible measure as it assumes sequences of a fixed length. In many problems, however, the sequences have a variable length and, moreover, there might be no fixed correspondence
xi=,
Dissimilarity measures
235
between their symbol positions. A small shift of the position in one of the two nearly identical sequences can lead to exaggerated values in the Hamming distance.
Fuzzy Hamming distance. A fuzzy Hamming distance has beeri proposed to make the Hamming distance be sensitive to local neighborhoods [Bookstein et ad., 20011. This is a type of an edit distance for sequences of equal length. Edit distance relies on transforming one sequence into another by using the so-called edit operations. The following edit operations are introduced: insertion: deletion and shift, with the costs (:ins, cde1 and Csub assigned to them, correspondingly. The shift operation allows to transform a 1-bit in one string to the nearest 1-bit in the other string at smaller costs than by both deletion and insertion. The operations are now used to transform one string into another and the resulting dissimilarity d f H r L m is computed by adding up the costs of the operations such that it has a total minimal cost. The fuzzy Hamming distance is metric if cdel = tins arid for the absolute size of a shift h 2 0, Csub ( h ) 2 0 and c,,b(h) = 0 iff / L = 0, c,,b (h,) incrcases monotonically and it is concave on the integers [Bookstein et al.) 20011. Levenshtein (edit) distance. The most popular edit distance is the Levenshtein distance [Levenshtein, 1966; Wagner and Fisher, 19741, expressing a local similarity between the sequences of arbitrary lengths. It is based on three edit operations: insertion, deletion and substitution. The costs cins,cdel and csub are associated to each of them, correspondingly, giving rise to a weighted version of this distance. In the edit distance, c , , b > Cde] Girls, meaning that a deletion of a and an insertion of b are preferred to the substitution of a by b. If all the costs are such that a single one is not larger than the sum of two other costs: then c l is ~ a metric ~ ~ ~Levenshteiii , dis[Bunke et al., 20021. Similarly to d f ~ the~ weighted tance d L is determined by the minimal total cost related to the Operations transforming a sequence s into t . (Note that the solution might not be unique). Assuming that such a transformation requires &,b substitutions, 72ins insertions and n d e l dilations, d L is expressed as:
+
d L (s,
t)=
rriin
(nsubCs,,h +rLins Ciris + n d e l C M ) .
(5.44)
nsub Inins .ndrl
The traditional edit distance with all costs equal to one is often considered. The probleni is, however, that d L depends then on the lengths of cornpared sequences and may be badly influenced by comparing two scquerices, where one is short and the other is very long. To make it independent of thc
T h e dissimilarity representation for p a t t e r n recognition
236
lengths, a normalization can be used, yielding the nornialized Levenshtein distance [Marzal and Vidal, 1993; Vidal et al.; 19951: (5.45) However, since the triangle inequality may not hold4, dnL is quasimetric.
Other related distances. Two sequences can also be compared based on the coninion longest prefix, suffix or just a subsequence. Assume we are given two sequences s and t of the length n and m 5 n, respectively. Then, the distance between them can be defined as d ( s , t ) = m n - 2 /common(s,t)l. The problem of finding of the common longest subsequence is complementary to determining the edit distance. It can also be solved by the use of dynamic programming; see also [Stephen, 19981. A survey to approximate string matching can be found in [Navarro, 20011.
+
Information distance and its approximation. Assume a set of binary strings. The Kolmogorov complexity K ( s ) of a binary sequence s is the length (in bits) of the shortest computer program of' a fixed reference computing system that produces s as a result. The change of a computing system changes this value by an additive fixed constant at most [Li and VitBnyi, 19971. A possible interpretation of K ( s ) is the length of the ultimate compressed version of s from which s can be recovered by a deconipression program. To measure the difference between two strings, s and t . the normalized information distance was proposed in [Li et al., 20031:
(5.46) Note that K ( s ,t ) is the length of the shortest program that prints S and t and a description how to tell tliern apart. Since the NID distance is imcomputable, an approximation was suggested to use data compression programs to approximate K . This leads to the normalized compression distance defined as [Cilibrasi and VitBnyi, 20041: (5.47)
4Consider three sequences s, t and u consisting of 9, 10 and 15 zeros, respectively. Assume all the costs equal one. Then dTLT,(s,t ) = d n L ( t , u ) = and d n T 2 ( u , s ) = 35 ' Clearly, < hence the triangle inequality is violated.
&+
2,
&,
4
Dissimilarity measures
237
where C is the chosen compressor and C ( s )is the length of the compressed string. Any strings (after proper recoding to binary strings) can be now compared by using this distance, such as DNA sequences [Li et al.; 20031 or binary files such as music pieces in MIDI format [Cilibrasi et al., 20041.
5.4
Information-theoretic measures
In an information-theoretic sense, a universal definition of similarity. applicable to the domains which have a probabilistic model, was proposed by [Lin, 19981. It is based on the common sense observation that the similarity between two objects is connected to their commonality and their difference and that two identical objects reach the maximum similarity, This leads to the following assumptions [Lin, 19981:
(1) The commonality between A arid B is measured by I ( c o m ( A , B ) ) , where I is the amount of information, usually the negative logarithm of‘the probability of the event it refers to. (2) The difference between A and B is measured by I(desc(A,B)) I(corn(A, B ) ) 2 0, where desc(A, B ) is a proposition that describes what A and B are. ( 3 ) The similarity is a function f :R: x R + + [0,1] of commonalities and differences given as sim(A, B ) = f ( I (com(A,B ) ) ,I (desc(A,B ) ) ) such , that f ( z , z )= 1 and f ( 0 , y ) = 0. (4) The overall similarity of two objects is a weighted average of their similarities computed from different perspectives. The similarity derived from these assumptions is measured as the ratio between the amount of information needed to state the commonality of two objects and the amount of information needed t o describe them. It is given as sirn(A, B) = log P (com(A,B)/log P (desc(A,I?))). [Lin, 19981 presents how this general definition is applied to a number of domains, resulting in a similarity between strings, words or concepts in taxonomy. A general and universal distance metric was proposed in [Bennett et al.; 1998; Li et al., 20031 and further explored in [Cilibrasi and Vitiinyi, 2005; Cilibrasi et al., 20041. As the authors claim, their metric is general as it can be applied in many domains such as: music, text, genomes, executable programs or natural language descriptions and it does not focus on particiilar features or commonalities between instances, but it takes them all simultaneously into account. The basic idea is to express the closeness of two objects if they can be significantly ‘compress’, one given the infor-
The dissimilarity representation for pattern recognition
238
Figure 5.2 Illustration of the Hausdorff distance between sets ‘4 and B: d H ( A , B)= E .
mation about the other. This is formalized by the notion of Kolmogorov complexity. As a result, a normalized information distance was defined as in Eq. (5.46), which minorizes other normalized distances [Li et al., 20031. In practice; it is approximated by the normalized compression distance, given in Eq. (5.47). The same principle is further used to define a Googlebased distance measuring comparing two search ternis x and y as indexed by Google5 [Cilibrasi and VitAnyi, 2005; VitBnyi, 20051. 5.5
Dissimilarity measures between sets
Dissimilarities can also be considered between two closed and bounded subregions of a (Euclidean) space, sets of points or elements. Let us first formally introduce the Hausdorff distance [Robinson’s notes, site; Klein and Thompson, 19841.
Hausdorff metric. Let ( X , p ) be a metric space and C ( X ) C: X be a space of nonempty, closed and bounded subsets of X . Let N,(A) = U L E A B E (be z )thecoverofAEX byopenc-ballsB,(z) = EX: p ( z , y ) < E } . Since B E ( x )is a neighborhood of .r. Theorem 2.6, then NE(A) is the neighborhood of A according to Def. 2.14. The Hausdorff distance between A and B is defined as the smallest &-neighborhood of A which covers B and the other way around; see also Fig. 5.2. On the other hand, the directed Hausdorff distance between A and B , dD,(A, B ) can be expressed as the maximum taken over the collection of minimum distances between elements of A and the set B. Then, the Hausdorff distance ~ H ( A B), is the maximum over the two directed distances. Formally, one has: Definition 5.2 (Hausdorff distance) In a (semi-)metric space ( X ,p ) , the Hausdorff distance with the base p is defined for all A, B EC(X)in one ‘http://ww.google.com
Dissimilarity measures
239
of the following ways: (1) d H ( A , B )= inf{A
c N,(B) & B c N,(A)}.
E>O
( 2 ) ~ H ( B) A , = max{d&(A, B ) ,dD,(B,A)}, where dorff distance & ( A , B) = sup inf p(a, 6).
dg
is a directed Haus-
aEA bEB
If the domain of d g is restricted, then supremum becomes maximum and infinium becomes minimum, namely d g ( A ,B) = max min p(n. b). aEA b E B
Corollary 5.1 The two formulations of the Huusdorff distance given in
Def. 5.2 are equivalent. Proof. We start from definition (1) and by equivalent transformations, the formulation of definition (2) is reached. inf,s {A c N E ( B ) }= infc>O{ V a t A a E N E ( B ) }= inf~>O {VaEA a E U ~ E B (p(z, ~ 6) &)} = infE>o{ v I a E ~ i n f b a ~ ( a b ),< E } = { ~ u P , ~infbcB A p(a, 6 ) ) = & ( A , B ) . Based on this, we have: ~ H ( A , B =)inf,,o{B c N,(A) & A c N,(B)}= max{inf,,O{B c NE(A)},infE,o{A c N,(B))} 0 inax { d R ( A ,B), d&(B,A ) } ,which finishes the proof.
Theorem 5.2 If ( X ,p ) is a metric (semimetric) space, then (semi.metric).
dH
as metric
Proof. First, we will prove that if p is semimetric, then d, is semimetric. We will make use of the second formulation in Def. 5.2. Since for all a E A , infOEAp(a, a) = 0, then ~ H ( AA,) = 0. The max operation is symmetric, so d~ is symmetric. Let A , B , C E C ( X ) .Let p ( n , B ) = infbEBp(a:b).If ~ E A , then there exists b such that infbcB p(a, b ) 5 supaEAp(a, B) = & ( A , B)5 d,(A, B). Given such b, we can also write p(b, C ) = infc,,c p(b>c ) 5 ~ H ( C). B , By applying the triangle inequality to p, for each a E A the following holds: p(a, C ) 5 p(a, B)+ p ( B , c ) 5 ~ H ( AB) , + ~ H ( C). B, Since the above inequality remains true for all a E A, then d g ( A , C ) = sup,,p(a,C) 5 ~ H ( A , B ) + ~ , ( B , CBecause ). the ordering of A arid C is arbitrary, we also know that dg(C,A) 5 d,(A,B)+d,(B.C). Hence: ~H(C A ), I ~H(A B ), + ~ H ( C B ), . To prove that d H is metric if p is, the definiteness axiom, Def. 2.38, should be considered. Let ~ H ( B A ), = 0. Then dD,(A,B ) = &(I?, A) = 0. Consequently, for each a E A, infbcB p(a,b) = 0. This means that every neighborhood of a contains an element from B. We know that a € (-B) = B , since B is a closed set. Since this holds for all a E A , then A c B. By symmetry of our definition, we also get B c A. Thus. A = B.
240
T h e dissimilarity representataon for p a t t e r n recognitaon
The Hausdorff distance is invariant with respect to a transformation only if the base metric is invariant; see also Theorem 3.12. Thereby, every isometry in the base metric is an isometry in the Haiisdorff metric. Moreover, two sets are within the Hausdorff distance d from each other if any point of one set is within the distance d from some point of the other set. Such a distance is sensitive to single outliers. For instance, think of a case where a point a is a t some large distance d, to all points in the set A . Then, ~ H ( A , B= ) d, is determined by this point. Therefore, generalizations of the Hausdorff distance have been considered, which are more robust against outliers or noise.
Variants of the Hausdorff distance. Let ( X , p ) be a metric space (usually Euclidean) and C ( X ) C X be a space of nonempty, closed arid bounded subsets of X . Let A, B E C ( X ) be sets of T L A and n~ elements, correspondingly. The distance between an element a E A and the set B can be defined as: d ( n , B ) = d ( { n ) , B )= m i n p ( a , b ) . btB
(5.48)
The directed dissimilarities between two sets can be then found as [Dubuisson and Jain, 19941:
(5.49)
where hl,.,, is k-th ranked distance such that k = s n ~ For . instance, for becomes the median of the distance sequence d(x, Y ) and .s = 0.5, for s = 0.75, this is the upper quartile. Since the values d b ( A , B ) arid d D ( B , A ) are usually not identical, the symmetry is imposed by applying one of the following operators: ,fmm ( 2 ,Y) = r w z , Y j , fmax .( Y) = max{z, Y j , f a v r ,.( Y) = ;(z Y) or .fwa7v ( G Y ) = nnfrLn ( nI(: ~ TLBy). Combining them with the distances defined by (5.49), 24 symmetric dissimilarity coefficients can be obtained, which all but one are non-metric. Two of them are of a significant importance, especially for the purpose of object matching in binary images [Dubuisson and Jain, 19941, namely the Hausdorff distance (the only met-
+
+
Dissimilarity measures
241
ric), already introduced in Def. 5.2, and the modified HausdorE distance. The latter, although non-metric, has been found useful [Dubuisson arid Jain, 19941 and more robust against outliers. Also other variants obtained by replacing the max operation in the Hausdorff measure by a k-th rank are often less noise sensitive [Huttenlocher et al., 19931.
Definition 5.3 (Modified Hausdorff) In a (semi-)metric space ( X ,p ) , the modified Hausdorff distance with the base p is defined for all A, B E C ( X ) as:
(5.50)
Measures on fuzzy sets. A Hausdorff-like distance can also be defined for fuzzy sets; see [Chaudhuri and Rosenfeld, 1996, 19991 for details. Consider two non-empty fuzzy sets Af and B f on a support set S in a metric space. Let ;c* = max{Af(t): A f E S } be the maximum membership of .c. Let A,,,, = {t:A f ( t )= x*} be a non-fuzzy set and let A, be a non-empty, non-fiizzy subset of S such that A,,, c A , and such that for two fiizzy sets A, and Bf, A, = B, iff A,, = B,,,. Define the family of non-fuzzy sets A,, ~ L [0, E 11 by if p 5 IC* A,,= { t : A f ( t )E [p,z*]}, if p > z*.
(5.51)
Note that A, = A,,,,, if p = z*for z*# 1. Assume that the fuzzy sets can take only values from a discrete set of the membership values p1, p2,. . . , pC. Let ~ H ( A Bbt) , ~ , be the crisp Hausdorff distance between sets Apsand Elpt. Then, the fuzzy Hausdorff-like distance between Af and B f is defined as: (5.52) which is metric [Chaudhuri and Rosenfeld, 19991. dHf can be seen as a membership-weighted average of the Hausdorff distances between the modified level sets of the fuzzy sets considered. Note that the fuzzy modifiedHausdorff-like distance can be defined by using the d M H instead of d~ in the formula above.
242
5.6
The dissimilarity representation for pattern recognition
Dissimilarity measures in applications
There exists a large arsenal of various proximity measures developed for the purpose of data organization, image and text retrieval, clustering and classification. Before presenting a brief survey of the measures, we want to emphasize the importance of invariance in tJhe process of their design.
5.6.1
Invariance and robustness
Invariarice is an important issue for designing informative dissimilarity ineasurcs. To compare two objects, one wishes to focus on their basic underlying characteristics. This may be difficult to be handled by automatic means as objects often in somewhat different forms and sizes, reflecting all variability. If one starts from sensory measurements of objects, noise is likely to be present, which is also contributing to tlie overall variability. In general. the comparison of objects should not be influenced by their location, somewhat different scale, or rotation. Moreover, the measure should be robust with respect to srriall distortions arid abberations of the objects measurements. This leads to the study of invariant and robust measures, especially for the sensory measurements. Alt,hough we do not thoroughly discuss this problem here, as our focus is on learning methodologies, we want to emphasize its importance. Invariant measures will lead to compact class descriptions, wliicli will assure good generalization abilities of statistical learning functions. For a general review on invariant pattern recognition, see [Wood, 19961. More applied methods can be found in [Rodrigues, 20011. For examples of invariant distance measures, the reader is referred to [Hagedoorn and Veltkamp, 1999a,b;Simard et al., 1993, 19981. Since kernels, see Sec. 2.6.1, are interpreted in our framework as specific similarity representations, the study on invariant support vector machines is related to invariant ‘kernel’ reprcscntations [Scholkopf et al., 1998b; Mika et al., 20031. The forma’lization of irivariarice is presented in Sec. 3.3.3.
5.6.2
Example measures
Exccpt for the nieasurcs already presented in the sections above. examples of dissimilarity measures will be presented in some application areas. The list is by no means complete.
Feature type data. For data represented in feature spaces. a number of distance measures have been designed to account for the distribution of
points in local neighborhoods. Such distances are then used by the k-NN rule or by some variant of locally weighted learning; see [Atkeson et al., 1997] for a survey of methods. We will mention a few. Some techniques for flexible metric construction are proposed in [Friedman, 1994]. These methods are based on a recursive partitioning strategy to adaptively shrink and shape rectangular neighborhoods around the test point. Also Hastie and Tibshirani developed an adaptive NN rule that uses local discriminant information to modify the neighborhoods appropriately [Hastie and Tibshirani, 1996]. The distance metric is the square Euclidean distance weighted by a product of suitably weighted between- and within-sum-of-squares matrices. They show that this metric approximates a chi-squared distance between true and estimated posterior probabilities for spherical Gaussian classes. Generalizing both previous approaches, a flexible metric for computing neighborhoods based directly on the chi-squared distance is estimated in [Domeniconi et al., 2002]. The property of the neighborhoods is such that they are elongated along less informative features and compact along the most influential ones. Also Avesani and colleagues proposed two metric measures for the NN rule: a local asymmetrically weighted similarity metric and a minimum risk metric based on a probability estimation that minimizes the risk of misclassification [Avesani et al., 1999]. They found experimentally that the 1-NN rule based on their measures performs well. Lowe introduced a variable kernel classifier based on a similarity metric, by combining the k-NN rule with smooth weighting defined by Gaussian kernels [Lowe, 1995]. The Gaussian kernels are based on a weighted Euclidean distance, where the weights are learned in a cross-validation procedure. In classification problems, the nearest neighbor (NN) rule is usually based on the (weighted) Euclidean distance. However, other dissimilarity measures can be computed for data of mixed types; see Sec. 5.1. All these approaches can be encompassed by a general framework based on similarities computed between the features, as proposed in [Duch et al., 1998; Duch, 2000; Duch et al., 2000]. Such a model involves the steps of selecting distinctive features, weighting and scaling them appropriately, and computing a distance suitable for the feature type and the problem at hand.
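As a minimal illustration of such weighted-distance NN rules, the sketch below implements a k-NN rule with a globally weighted Euclidean distance; the inverse-variance weights are a simple illustrative choice, whereas the cited methods learn the weights locally or by cross-validation.

```python
import numpy as np

def weighted_euclidean(x, Z, w):
    """Weighted Euclidean distances between a vector x and the rows of Z."""
    return np.sqrt(((Z - x) ** 2 * w).sum(axis=1))

def knn_predict(X_train, y_train, X_test, k=3, w=None):
    """k-NN rule based on a (globally) weighted Euclidean distance."""
    if w is None:
        # inverse-variance scaling; adaptive methods would learn w locally instead
        w = 1.0 / (X_train.var(axis=0) + 1e-12)
    y_pred = []
    for x in X_test:
        d = weighted_euclidean(x, X_train, w)
        nn = np.argsort(d)[:k]                      # k nearest neighbours
        labels, counts = np.unique(y_train[nn], return_counts=True)
        y_pred.append(labels[np.argmax(counts)])    # majority vote
    return np.array(y_pred)

# toy usage with a hypothetical two-class problem
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(knn_predict(X, y, np.array([[0.5, 0.5], [2.5, 3.0]]), k=3))
```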
Text. Many information retrieval models make use of statistical properties of text [Manning and Schutze, 1999]. For a collection of text documents, a vocabulary set is often chosen for indexing purposes.
Text documents are then represented as vectors of term weights for every term from the vocabulary set. The term weight is often proportional to the frequency of occurrence within the document, and inversely proportional to the number of documents the term occurs in. The similarity measure between the documents is often an appropriately weighted variant of a cosine similarity, which measures the cosine of the angle between the document vectors, or an ℓ_p-distance [Strehl et al., 2000]. Many weighting schemes can be used, as well as binary measures focusing on word occurrences; see the proceedings of the SIGIR conferences [SIGIR, site]. Examples of statistical word similarity measures can be found e.g. in [Terra and Clarke, 2003]. Another possibility are information-theoretic measures, as described in [Lin, 1998; Bennett et al., 1998; Li et al., 2003]. When document collections are described by graphs, various graph dissimilarities related to the maximum common subgraph [Bunke and Shearer, 1997, 1998], as well as to the graph union or minimum common supergraph [Schenker et al., 2003], can be used.

Shapes. In computer vision, image processing and pattern recognition, many shape description techniques have been developed for both quantitative and qualitative measurements. Such descriptions mostly rely either on segmentation followed by external characteristics of the resulting binary shape, defined by spatial arrangements of elements such as edges and junctions, or on internal shape characteristics, such as texture or intensity-based features, in the given grey-level image. For a general introduction to shape description methods, see [Costa and Cesar, 2001]. Here, we are interested in the comparison of objects, hence in measures of their similarity. Many such measures exist, both general and application-specific, mostly developed for solving pattern matching problems. A typical example of dissimilarity-oriented pattern matching relies on finding geometric transformations (from a specified class) of one pattern (shape, contour, image) into another one such that a predefined cost is minimized. A survey of shape matching approaches can be found in [Veltkamp and Hagedoorn, 1999], while some similarity measures and algorithms are described in [Veltkamp, 2001]. For the purpose of matching binary images (hence also contours), variants of Hausdorff distances can be used, as described in Sec. 5.5. For some practical considerations, see [Dubuisson and Jain, 1994; Huttenlocher et al., 1993]. Since these measures are in fact measures between sets of points, some further extensions can be found in [Eiter and Mannila, 1997].
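As a small illustration of such point-set measures, the following sketch computes the Hausdorff distance and an average-based (modified) Hausdorff variant between two 2D point sets; the toy data and the particular combination rule are illustrative assumptions.

```python
import numpy as np

def pairwise(A, B):
    """Euclidean distances between all points of A (rows) and B (rows)."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def hausdorff(A, B):
    """d_H(A, B) = max(h(A, B), h(B, A)) with h(A, B) = max_a min_b ||a - b||."""
    D = pairwise(A, B)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def modified_hausdorff(A, B):
    """One modified variant: the max over points is replaced by an average,
    which makes the measure less sensitive to single outlying points."""
    D = pairwise(A, B)
    return max(D.min(axis=1).mean(), D.min(axis=0).mean())

# two noisy samplings of the same contour (hypothetical data)
t = np.linspace(0, 2 * np.pi, 100)
A = np.c_[np.cos(t), np.sin(t)]
B = A + 0.02 * np.random.randn(*A.shape)
print(hausdorff(A, B), modified_hausdorff(A, B))
```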
Also, mathematical expressions for the distance between 2D point sets with known correspondences were suggested in [Werman and Weinshall, 1995]. They are invariant to either affine transformations or similarity transformations of the sets. First, images are normalized and aligned by the use of affine transformations, such as rotations, translations and scaling. Next, the square Euclidean distances between the points in the images are computed. Since the images are represented as coordinate matrices, all the transformations and the distance can be expressed in matrix notation. To our judgment, this is similar in formulation to the Procrustes analysis [Cox and Cox, 1995; Borg and Groenen, 1997]. A more general metric distance measure, the so-called absolute difference, was introduced in [Hagedoorn and Veltkamp, 1999a]. This measure is invariant under affine transformations and deals well with objects having multiple connected components. It is robust against perturbation and occlusion. Attempts to capture human judgments of similarity are made in [Basri and Jacobs, 1997; Basri et al., 1996, 1998]. For instance, in [Basri et al., 1998], the dissimilarity between image contours is studied as a cost of matching by summing up the costs of local deformations that reflect the differences between two contours. A cost function is proposed which depends on the local curvature and obeys the constraints of continuity, metric properties and invariance under some classes of transformations. The cost function should also grow with the increase of bending or stretching, but bending should be less costly at a point of high curvature. Some other ideas of curve matching can be found in [Gdalyahu and Weinshall, 1999], the definition of an elastic distance is considered in [Younes, 1998, 1999] and the use of deformable templates for handwritten digits in [Jain and Zongker, 1997]. A shape descriptor, the shape context, along with a framework for deformable matching, is developed in [Belongie and Malik, 2000; Belongie et al., 2002; Mori et al., 2001]. The shape context at a particular point location on the shape is defined by the histogram of the relative log-polar coordinates of all other points. Since corresponding points of two different shapes have similar characteristics, the alignment of shapes is simplified. The overall distance is given as the weighted average of three contributions: the sum of the best shape matching costs, the appearance distance due to brightness differences and the bending energy. A novel approach to the alignment between two curves, which leads to the derivation of their dissimilarity, is proposed in [Sebastian et al., 2003].
Figure 5.3 Chain code representation. (a) Result of resampling. (b) Chain code based on the 8-connectivity.
As reported there, this method is robust under a variety of affine transformations, as well as viewpoint variations and small deformations, and it can be applied to object recognition problems. Algorithmically, the alignment is solved by dynamic programming [Bellman, 1957]. Another possibility of comparing two binary shapes is by the use of a distance transformation. It is an operation on a binary image which transforms it into a grey-level image, a distance map, where non-object pixels have a value corresponding to the distance to the nearest object pixel. Objects can be shapes, but also curves, edges or points. Matching relies on positioning the template shape at various locations of the distance map. The matching cost, hence the dissimilarity between the object shape and the template, is determined by the pixel values of the distance map which lie under the data pixels of the template. The target is considered as detected when e.g. the average distance value is below a chosen threshold. The most common distance is the Euclidean one, but due to its computational cost, often the Chamfer distance, as its best approximation, is used; see [Borgefors, 1986]. An example of shape matching using the Chamfer distance transform can be found in [Gavrila, 2000; Gavrila and Philomin, 1999]. It covers the detection of arbitrary-shaped objects, either parameterized or not, like pedestrian contours. A comparison of the shape context and Chamfer matching methods applied to object detection, where objects are described by contours, is done in [Thayananthan et al., 2003]. It is reported there that, in the case of cluttered scenes, Chamfer matching based on a number of templates is more robust than the shape context approach. Shapes can also be described in a structural way. A chain code represents a digital boundary as a sequence of direction vectors based on the 4- or 8-connectivity principle [Freeman and Glass, 1961]; see also Fig. 5.3. In general, it is not unique, since it depends on the starting point. However, given a starting point, it reconstructs a shape perfectly.
Unfortunately, chain codes become very long for complex objects, but more importantly, they reflect all the noise present on the boundary, e.g. due to small disturbances. Still, for a comparison of two shapes, their chain codes can be compared. Since their starting points can be arbitrary, the matching should be performed between all their cyclic permutations. Let s = s_1 s_2 ... s_n and t = t_1 t_2 ... t_m be the chain codes of two contours. Let S and T represent the sets of all cyclic permutations of s and t, respectively. Then, the comparison of two chain codes is based on the weighted Levenshtein distance d_WL as follows: d_chain(s, t) = min{d_D(s, t), d_D(t, s)}, where d_D(s, t) = min_{t* ∈ T} d_WL(s, t*) is a directed distance. In this way, d_chain is robust against rotation of shapes; however, not against scaling. Alternatively, a contour can be represented as a sequence of points (x_m, y_m) in a two-dimensional space, resampled if necessary such that the distances between any consecutive pair of points are identical. Then, a string z = z_1 ... z_n describing the contour is derived such that z_i is the direction vector pointing from (x_i, y_i) to (x_{i+1}, y_{i+1}). The distance between the strings is an edit distance with fixed insertion and deletion costs and some substitution cost. Different substitution costs, e.g. based on an angle or the Euclidean distance between the vectors, lead to different distance measures; see [Bunke et al., 2001, 2002]. It is claimed in [Bunke et al., 2002] that such an approach has a number of advantages, such as a higher angular resolution, robustness to shape distortion under rotation and invariance under scaling. Also Fourier descriptors for closed contours can be found, for which distance measures can be defined, such as distances from the ℓ_p-family [Zahn and Roskies, 1972; Persoon and Fu, 1974]. A structural description of complete shapes is based on a coarse description of the geometric relations between the parts that compose them. Similarity between the shapes can, therefore, be evaluated as a metric edit distance between shock graphs representing the shapes, as advocated by Kimia and his colleagues [Sharvit et al., 1998; Klein and Thompson, 1984; Sebastian et al., 2001, 2002]. This measure is computed as the optimal cost of the deformation path between two curves and it is robust against small deformations, occlusions and boundary disturbances. Also, a comparison between retrieval based on shock graphs (structural approach) and curve matching (metric approach) is presented in [Sebastian and Kimia, 2001, 2003]. Some other approaches based on the representation of shapes by medial axes can be found in [Liu and Geiger, 1999; Torsello and Hancock, 2003; Zhu and Yuille, 1996].
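Returning to the chain-code comparison described above, the following sketch implements a weighted Levenshtein distance with an angular substitution cost between 8-connectivity direction codes, minimized over all cyclic permutations; the particular insertion, deletion and substitution costs are illustrative choices.

```python
import numpy as np

def sub_cost(a, b):
    """Substitution cost between two 8-connectivity direction codes (0..7):
    proportional to the smallest angular difference between the directions."""
    d = abs(a - b) % 8
    return min(d, 8 - d) / 4.0                       # in [0, 1]

def weighted_levenshtein(s, t, ins_del=1.0):
    """Edit distance with fixed insertion/deletion costs and angular substitution cost."""
    n, m = len(s), len(t)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = np.arange(n + 1) * ins_del
    D[0, :] = np.arange(m + 1) * ins_del
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(D[i - 1, j] + ins_del,
                          D[i, j - 1] + ins_del,
                          D[i - 1, j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return D[n, m]

def chain_code_distance(s, t):
    """Minimum weighted Levenshtein distance over cyclic permutations, symmetrized."""
    d_st = min(weighted_levenshtein(s, t[k:] + t[:k]) for k in range(len(t)))
    d_ts = min(weighted_levenshtein(t, s[k:] + s[:k]) for k in range(len(s)))
    return min(d_st, d_ts)

# two chain codes of the same square contour, differing in starting point (toy example)
s = [0, 0, 2, 2, 4, 4, 6, 6]
t = [2, 2, 4, 4, 6, 6, 0, 0]
print(chain_code_distance(s, t))                     # close to zero: shifted start only
```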
Finally, the statistical properties of the object's shape can also be used for comparison. This means that shape information can be encoded by moment descriptors, which describe the center of mass, elongation aspects and overall orientation. Also other, more specific features with respect to the overall shape can be found, such as: perimeter, area, boundary straightness, curvature in terms of the zero-crossings of the curvature around the shape contour, or bending energy. All these quantitative features may be used to construct a dissimilarity measure, e.g. as given in Table 5.3.

Histograms and spectra. Emission and reflectance spectra become more popular for the identification of certain materials, e.g. types of plastics or minerals and rocks. Also autofluorescence is emerging as a useful tool for the detection of cancer, e.g. in the oral cavity or in the bronchi. It relies on the spectroscopy of the tissues of interest. The measurements are usually performed on healthy and diseased tissues (in various stages of cancer) at several excitation wavelengths. The emission spectra are then analyzed to support the diagnosis of a doctor. Histograms and spectra can be interpreted in the probabilistic framework, where their normalized versions are considered as probability distributions. This allows one to use divergence measures or general measures between distributions, some of which are mentioned in Sec. 5.2. Since the structure of such data is organized by an underlying factor, such as the order of bins, wavelength or time, it might be beneficial to incorporate such knowledge into the measure. This is possible e.g. by computing the difference, such as the ℓ_p-distance, between the approximated derivatives of the histograms or spectra [Paclik and Duin, 2003b,a; Pekalska et al., 2004a]. For instance, the distance between the first-order derivatives emphasizes the difference in positions between the local minima and maxima of the histograms. Also the distance between the cumulative histograms can be used, e.g. as we did for the comparison of chromosome band profiles in [Pekalska and Duin, 2002a].

Images. Assume that grey-level images are represented as vectors in a space. A tangent distance, which is locally invariant to any set of chosen transformations (such as rotation and thinning) and relatively cheap to compute, is proposed in [Simard et al., 1993, 1998]. It was found to be especially effective in the domain of handwritten digit recognition [Simard et al., 1993]. When an image is transformed (e.g. scaled and rotated) with a transformation that depends on some parameters (like the scaling factor and rotation angle), the set of all transformed patterns creates a manifold of
a dimension at most equal to the number of free parameters in the vector space. The distance between two image patterns can now be defined as the minimum distance between their respective manifolds and is, by this, invariant with respect to the considered transformations. Such a distance is hard to compute. A compromise is offered by the tangent distance, which is defined as the minimum distance between the tangent subspaces that best approximate the nonlinear manifolds. See [Simard et al., 1993] for details. Since two grey-value images can be considered as fuzzy sets (by rescaling them to the range [0, 1]), the fuzzy Hausdorff (or modified-Hausdorff) distance can be used for their comparison. Also binary images can be regarded as fuzzy sets in the following manner: white pixels have zero membership values and a black pixel takes a value of k/(K² − 1) if it has k black neighbors in its K × K neighborhood. In this way, noisy black pixels will have either zero or a very small membership value. If the binary images are converted to fuzzy sets as described, [Chaudhuri and Rosenfeld, 1999] reports that the noise has much less effect on the fuzzy Hausdorff distance than on the original Hausdorff distance between binary images. Consequently, the fuzzy Hausdorff distance is relatively robust to noise. On the other hand, grey-value images can be interpreted from the probabilistic point of view, e.g. as bivariate histograms. This allows one to use various divergence measures or general measures between distributions, as presented in Sec. 5.2. Since the intensity values of the images might differ, some normalization might be crucial. The description of images can also be simplified to univariate histograms, for instance intensity histograms. Then, the distance between two images A and B, each described by an intensity histogram with b bins, can be computed e.g. based on the histogram intersection as d_∩(A, B) = 1 − (Σ_{i=1}^{b} min(h_i(A), h_i(B)))/#pixels, where h_i(A) describes the number of pixels whose intensity falls in the i-th bin. Note that the intersection is an estimate of the Bayes error, i.e. the overlap between two probability density functions P(A) and P(B) approximated by histograms. An extension of such a measure is proposed in [Cha and Srihari, 2000], which takes into account the similarity of both overlapping and non-overlapping parts. There exist a number of dissimilarity measures to support content-based image retrieval. For a brief summary, see e.g. [Vasconcelos and Kunt, 2000]. In the probabilistic framework, usually measures defined between distributions, such as the Kullback-Leibler divergence, Bhattacharyya distance, or Mahalanobis distance, are used. A brief analysis of their inter-
relations is reported in [Vasconcelos and Lippman, 2000]. Also, the earth mover's distance [Rubner et al., 1998b] is designed to evaluate a dissimilarity between two distributions based on a so-called ground distance measure between single features. Loosely speaking, one distribution can be interpreted as a mass of earth spread in space, while the other distribution as a collection of holes in the same space. Then, the earth mover's distance defines the least amount of work needed to fill the holes with earth. Computing this distance is based on a solution to the transportation problem [Rubner et al., 1998b]. This measure is successfully applied for an evaluation of texture and color similarities in images [Rubner et al., 1998b,a; Rubner, 1999]. It has, however, a rigorous probabilistic interpretation, as shown in [Levina and Bickel, 2001]. In the probabilistic framework, also Puzicha and colleagues empirically investigated some dissimilarity measures for the purpose of texture segmentation and image retrieval [Puzicha et al., 1997] and for color and texture [Puzicha et al., 1999a]. In both papers, images are compared by distribution-based dissimilarity measures, of Gabor coefficients in the filtered images in the first paper, and between histograms in the latter. An approach to incorporate human similarity assessment in the dissimilarity measure is based on extensions of Tversky's model by fuzzy logic, as presented in [Santini and Jain, 1997, 1996, 1999].
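As a small illustration of the histogram-based measures mentioned in this section, the sketch below computes the intersection-based distance between intensity histograms and an ℓ1 distance between cumulative histograms (a simple one-dimensional transport-like cost); the images, the bin number and the normalization are hypothetical choices.

```python
import numpy as np

def intensity_histograms(img_a, img_b, bins=32):
    """Histograms of two grey-value images over a common intensity range."""
    span = (min(img_a.min(), img_b.min()), max(img_a.max(), img_b.max()))
    h_a, _ = np.histogram(img_a, bins=bins, range=span)
    h_b, _ = np.histogram(img_b, bins=bins, range=span)
    return h_a, h_b

def intersection_distance(h_a, h_b):
    """d = 1 - sum_i min(h_i(A), h_i(B)) / #pixels (equal-sized images assumed)."""
    return 1.0 - np.minimum(h_a, h_b).sum() / h_a.sum()

def cumulative_l1_distance(h_a, h_b):
    """l1 distance between normalized cumulative histograms (a 1D transport cost)."""
    c_a = np.cumsum(h_a) / h_a.sum()
    c_b = np.cumsum(h_b) / h_b.sum()
    return np.abs(c_a - c_b).sum()

# hypothetical 64x64 grey-value images
rng = np.random.default_rng(1)
A = rng.normal(100, 20, (64, 64))
B = rng.normal(110, 20, (64, 64))
h_a, h_b = intensity_histograms(A, B)
print(intersection_distance(h_a, h_b), cumulative_l1_distance(h_a, h_b))
```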
5.7 Discussion and conclusions
This brief overview of similarity and dissimilarity measures indicates not only their variability, but also their different origins and underlying principles. The use of dissimilarity (proximity) is especially popular in computer vision and pattern matching applications, information retrieval and the evaluation of human judgments. In the pattern recognition area, it is widely accepted to use the k-nearest neighbor rule (usually considered for a given feature representation), at least as a reference method when solving a classification task. Still, more and more attention is devoted to the assessment of dissimilarity as a natural means for the comparison of objects. For instance, in computer vision, Edelman recognized the importance of proximity by stating that 'representation is representation of similarities' [Edelman et al., 1998]. He advocated the use of dissimilarities in [Edelman et al., 1996, 1998; Edelman and Duvdevani-Bar, 1997] and in his book [Edelman, 1999].
The universality of a dissimilarity lies in the fact that it can be approached from both a statistical and a structural point of view. Conventionally, one tries to develop either a measure based on statistical, and hence quantitative or metric, properties of object representations (examples are measures in feature spaces, between sets of points, and probabilistic measures), or a measure based on structural, hence qualitative, properties (examples are measures based on chain codes, graphs, and trees). Various attempts have been made to combine these two research lines, as addressed already in [Fu, 1982]. They are, however, often hybrid in the sense that subproblems of a larger problem are tackled separately by either one or the other approach and the complete system is optimized part by part. The significance of finding new measures unifying these two approaches is emphasized in [Watanabe, 1974; Duin et al., 2002]. A number of researchers attempted to define universal or general dissimilarity measures, as proposed in the following frameworks: [Duch et al., 1998, 2000; Duch, 2000; Griffiths and Bridge, 1997; Lin, 1998] or [Bennett et al., 1998; Li et al., 2003]. Most of the dissimilarities are defined for the problem at hand. Still, a number of them allow for the existence of some free parameters, like weights of particular contributions, to be learned (adapted) in the (usually off-line) training process. Assume that one deals with objects that possess a structure, such as spectra, time signals, images, or text documents. A completely novel way of thinking, trying to unify the statistical and structural lines, has been promoted by Goldfarb. A dissimilarity measure is determined in a process of inductive learning realized by so-called evolving transformation systems [Goldfarb, 1990; Goldfarb and Deshpande, 1997; Goldfarb and Golubitsky, 2001]. Such a system is composed of a set of primitive structures, basic operations that transform one object into another or which generate a particular object, and some composition rules which permit the construction of new operations from existing ones [Goldfarb and Golubitsky, 2001; Goldfarb et al., 1995, 1992, 2000a] (which is a structural contribution). The statistical component is defined by means of a dissimilarity. Since there are costs related to the operations, the dissimilarity is determined by the minimal total cost of transforming one object into another. In this sense, the operations play the role of features and the dissimilarity, dynamically learned in the training process, combines the objects into a class. A simpler approach is to first define a small set of fundamental structural detectors, yet general enough to be applicable to many problems, independent of specific expert knowledge of the application. This means that such
detectors work for the given measurement domain, e.g. spectra or images. The useful subpatterns should then be identified by the detectors when applied to the consecutive measurement values. The inter-relationships between the subpatterns should be captured in some relational intermediate representation (e.g. by a graph or a string). These are the basis for the matching process and the derivation of the final dissimilarity. The learning then relies on learning proper weights (contributions) assigned to the identified subpatterns such that the specified dissimilarity is optimal for the discrimination between the classes. The most simple example is the edit distance between string descriptions of objects; however, more general approaches need to be developed. Note that also statistical feature extractors (such as wavelets or Gabor filters), which work on the consecutive measurements, may be considered as the building blocks of the dissimilarity learned. How to learn such measures is open for future research.
PART 2
Practice
In theory, there is no difference between theory and practice, but in practice, there is a great deal of difference. ANONYMOUS
Chapter 6
Visualization
... when you are describing A shape, or sound, or tint, Don't state the matter plainly, But put it in a hint; And learn to look at all things With a sort of mental squint. "POETA FIT, NON NASCITUR", LEWIS CARROLL
This chapter begins the experimental part, in which dissimilarity data are practically studied. As we would like to present a systematic analysis, we start from the most basic questions. In order to gain insight into data, usually various tools to represent them and their relations in some visual form are used, subjected to human judgment. We will investigate a number of well-known visualization techniques and their usefulness for dissimilarity data. The most simple representation of dissimilarity relations is achieved by plotting a dissimilarity matrix as an intensity image, where the increase in pixel intensity corresponds to the increase in dissimilarity values (going from black to white). If data items are grouped, then potential clusters are emphasized by dark rectangular areas. An example is given below:
Figure 6.1 Intensity images of a symmetric square dissimilarity representation. On the left, the order of objects is random, while on the right, the matrix is permuted such that the objects are grouped. This allows observation of cluster tendencies. The black diagonal line corresponds to the zero dissimilarities.
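A minimal sketch producing such an intensity-image view of a dissimilarity matrix, before and after grouping the objects by (here artificial) labels, could look as follows; the data are a hypothetical three-cluster example.

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical dissimilarity data: three Gaussian clusters, Euclidean distances
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in (0, 3, 6)])
labels = np.repeat([0, 1, 2], 30)
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

perm = rng.permutation(len(D))           # random object order
order = np.argsort(labels)               # objects grouped per class

fig, axes = plt.subplots(1, 2)
axes[0].imshow(D[np.ix_(perm, perm)], cmap='gray')   # dark = small dissimilarity
axes[0].set_title('random order')
axes[1].imshow(D[np.ix_(order, order)], cmap='gray') # dark blocks reveal clusters
axes[1].set_title('grouped order')
plt.show()
```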
Dissimilarity relations can also be represented in a low-dimensional space, usually a two- or three-dimensional one. This can be achieved by continuous spatial models, which rely on linear and nonlinear projections of pairwise dissimilarities such that the configuration determined in an output space preserves all (or some) dissimilarities under a specified criterion. Usually, a Euclidean space is used, but other ℓ_p-normed spaces can also be considered. The basic theory of spatial representations realized by means of multidimensional scaling (MDS) techniques and more general models referring to pseudo-Euclidean spaces was discussed in Sec. 3.6. For completeness of the overall presentation, the basics of MDS are briefly recapitulated in Sec. 6.1. The focus is, however, on illustrative examples. Other types of spatial representations are obtained by nonlinear mappings concentrating e.g. on the preservation of dissimilarities in local neighborhoods or the approximation of geodesic distances in a manifold. These and other alternative projection methods are briefly summarized in Sec. 6.2. Dissimilarity relations can also be represented by weighted, fully connected graphs, where the vertices refer to individual objects, and weights coincide with the given dissimilarity values. This can be structured further by tree distance models, usually understood in terms of the shortest paths between the vertices. Here, particularly important are the additive and ultrametric distance trees, which are discrete spatial models. They are widely used in data analysis, since they support hierarchical clustering schemes and, by this, they enhance the process of structuring the data. The tree models are presented in Sec. 6.4. An overall summary is given in Sec. 6.5. Although this chapter partly relies on [Pekalska et al., 1998a,b,c; Pekalska and Duin, 2002; Pekalska, 2002], this study is mostly new. The following issues are discussed here: the nonlinearity of variants of Sammon mappings, the formulation of the MDS techniques for missing data, and explanations of generalization possibilities (adding new objects to existing maps) for a number of projection algorithms, including the Sammon mapping. Additionally, the use of LLE and Isomap for non-Euclidean distances is considered by correcting the local Gram matrices by adding a suitable constant to the diagonal. In short, this chapter visualizes dissimilarity relations and tries to provide some intuition as to how dissimilarity data can practically be explored.
(Figure panels: (a) Sammon map; (b) LSS map with p_2.)
Figure 6.2 MDS maps of auditory confusion measurements for letters and numerals. For the LSS map, δ̂_ij = p_2(δ_ij), where p_2 is a second-order polynomial.
6.1 Multidimensional scaling
Multidimensional scaling (MDS) is a collection of techniques providing spatial representations of objects by representing them as points in a low-dimensional space [Kruskal and Wish, 1978; Cox and Cox, 1995; Borg and Groenen, 1997]. This is achieved by (non)linear projections which aim to preserve all pairwise, symmetric dissimilarities between data objects. A spatial configuration is usually found in a Euclidean space, although any other ℓ_p-normed space (p ≥ 1) can also be considered. Such a map is believed to reflect significant characteristics, as well as 'hidden structures', of the data. Therefore, objects perceived as similar to one another result in points being close to each other in the projected space. The larger the dissimilarity between two objects, the further apart they should be in the resulting map. In general, the dissimilarities describe the relations between objects, originally represented in a high-dimensional space, measured as costs of pattern matching in a template matching procedure, similarities between text documents or road distances, or they are just given, like human judgments. Consequently, MDS is then treated as a dimension reduction technique. The MDS methods used here rely on quantitative dissimilarities, assuming that both the input data and the output configuration are metric. These techniques are realized by the linear methods of classical scaling and FastMap, and the nonlinear methods, the variants of LSS and Sammon mappings. They were already discussed in Sec. 3.6. In brief, the goal of a metric MDS is to find a faithful representation X in a low-dimensional space such that
(Figure panels: (a) Retrieved MDS map; (b) Rotated MDS map.)
Figure 6.3 A reconstructed map of The Netherlands.
the approximated distances d_ij, i, j = 1, 2, ..., n between the n points match the disparities δ̂_ij as well as possible. Disparities are functional dependencies (e.g. continuous monotonic functions) of the original dissimilarities, i.e. δ̂_ij = f(δ_ij). Depending on the way the structure of the dissimilarity data is preserved, somewhat different techniques arise; see Sec. 3.6 for details. Basically, the MDS methods called least squares scaling (LSS) mappings minimize a normalized version of the raw stress Σ_{i<j} (δ̂_ij − d_ij)².
A similar technique is the Sammon mapping, originally proposed in the pattern recognition area as a method of nonlinear projection to a low-dimensional space by optimizing the normalized square differences between the original and approximated distances [Sammon Jr., 1969]. Assuming that δ̂_ij = δ_ij, the variants of the Sammon stress functions are:

S_t = (1 / Σ_{i<j} δ_ij^{t+2}) Σ_{i<j} δ_ij^t (δ_ij − d_ij(X))²,    t = ..., −2, −1, 0, 1, 2, ...
Due to the obvious similarity to the LSS techniques, we regard them as examples of MDS, although this is not practiced in the MDS literature. Since the optimization of the Sammon stress functions is easier in the variants of gradient descent procedures, Sammon mappings are preferred.
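For reference, a small sketch evaluating the raw stress and the Sammon stress S_t of a given configuration, following the formulas above, is given below; the example data are arbitrary.

```python
import numpy as np

def upper(D):
    """Vector of the upper-triangular (i < j) entries of a square matrix."""
    i, j = np.triu_indices(D.shape[0], k=1)
    return D[i, j]

def raw_stress(delta, X):
    d = upper(np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)))
    return ((upper(delta) - d) ** 2).sum()

def sammon_stress(delta, X, t=-1):
    """S_t = (sum_{i<j} delta_ij^{t+2})^{-1} sum_{i<j} delta_ij^t (delta_ij - d_ij)^2."""
    dlt = upper(delta)
    d = upper(np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)))
    return (dlt ** t * (dlt - d) ** 2).sum() / (dlt ** (t + 2)).sum()

# example: how well a random 2D configuration fits some target dissimilarities
rng = np.random.default_rng(0)
delta = np.abs(rng.normal(size=(10, 10)))
delta = delta + delta.T
np.fill_diagonal(delta, 0)
print(sammon_stress(delta, rng.normal(size=(10, 2)), t=0))
```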
Figure 6.4 MDS distances vs. the original road distances between Dutch towns.
6.1.1 First examples
Map reconstruction. A standard MDS example is the reconstruction of a map of a country, given either the road or air distances between main cities [Manly, 1994; Borg and Groenen, 1997]. One important aspect of an MDS map is that the axes are, in themselves, meaningless. In the case of a Euclidean space, additionally, the orientation of the projection is arbitrary, since any rotation of the resulting configuration does not change the distances. This means that the MDS map of the city locations need not be oriented such that north is up and east is right. What is important is the relative positions of the cities; the retrieved configuration may be mirrored or rotated, if needed. The road distances between 12 major towns in The Netherlands are considered as an example. The MDS result is presented in Fig. 6.3. A comparison between the original map of the country (see an atlas) and the result given by the MDS technique makes clear that the MDS method is successful in recovering the locations of the towns, however, up to rotation. In general, the cities are shown in good relation to each other, maybe with the exception of Den Haag and Rotterdam in relation to Amsterdam. This is also confirmed by Fig. 6.4, showing a plot of the MDS estimated distances against the original road distances. The latter only slightly deviate from the ideal solution (the line y = x). Such plots may help in data analysis.
Cluster identification. MDS helps in data exploration, e.g. to identify possible clusters as groups of points which are close together in the represented space. As an example, let us consider auditory confusion (dissimilarities) between 25 letters (all excluding 'O') and 10 Arabic numerals, computed by Lee [Lee]. The spatial Sammon map and the LSS map,
Figure 6.5 MDS map of human dissimilarity judgments on sports.
obtained by us, are shown in Fig. 6.2. Although they give somewhat different results (see Sec. 3.6.2), the basic characteristics are the same. Clusters of similarly sounding letters or numerals can be clearly observed. For instance, we can justify the similarity of 'I', '5', '1' and 'Y', since, when spoken, there is an obvious resemblance between their sounds. Interpretation of underlying principles. Another purpose of MDS is to find rules that would explain the observed dissimilarities and would help to describe the data structure in simple terms. This may be especially useful for data describing human judgments of similarity between objects. In this case, interpreting an MDS configuration requires making a link between the geometrical properties of such a map and prior knowledge about the objects represented as points [Borg and Groenen, 1997]. By identifying points which are far apart, a line between them can be drawn, defining a perceptual axis, which describes a direction of change between opposite or significantly different characteristics. This involves a data-guided speculation. An example is based on human judgments of dissimilarities between some sports [Lee]. Our MDS representation is given in Fig. 6.5. To interpret why humans consider some sports to be more alike than others, we distinguish one perceptual axis, the degree of aggression involved. The axis was added by us as a possible (not unique) interpretation, as a help to understand the relations better. Another possibility could be to emphasize the differences between sports depending on whether a ball is used or not;
such perceptual axes do not need to be perpendicular. A more scientific approach would be to find such an axis as a regression line in the projected space and then, given additional knowledge, attach a meaning to it.

6.1.2 Linear and nonlinear methods: examples
To understand the properties of MDS, one needs to study various linear and nonlinear techniques for dissimilarity data with a particular structure. The examples given below are meant to illustrate the difference between the linear and nonlinear methods.
Artificial data. Two data sets describe 200 points lying on two circles, both of radius 1.0, in a 3D space. The circles are placed either on planes parallel to the yz-plane at a distance of 1, or on two perpendicular planes. The data and the (non)linear metric MDS projections onto a two-dimensional space are presented in Fig. 6.6. The projections are based on the 200 × 200 distance matrices, either the Euclidean or the city block (ℓ1) distance. If the distance is Euclidean, then the mapped result is identical to the principal component projection (PCA) in the original three-dimensional space [Cox and Cox, 1995]; see also Sec. 3.5.1. If the ℓ1-distance measure is used, the output Euclidean distances approximate the original ℓ1-distances, computed in a three-dimensional space. Therefore, the 3D spatial representations are only shown for the ℓ1-distances¹, since for the Euclidean distances, the retrieved configurations are rotations of the original data points. Note that since the ℓ1-distances have larger values than the Euclidean ones, the projected circles also become larger. In classical scaling, two corresponding points from two parallel circles are mapped onto a single point in 2D. It seems, therefore, that the data describe one circle. In the case of perpendicular circles, one of them is reduced to a line. Therefore, some important information is lost in classical scaling, namely the existence of the second oval. Note, however, that FastMap, Sec. 3.6.1, reveals two closed curves. Although for this example it may seem that FastMap is superior to classical scaling in discovering the structure, this is not true for more complex, multi-class real data; see for instance Fig. 6.10. The nonlinear Sammon mapping outputs clearly illus-

¹If one computes the Euclidean distances between a set of points in a three-dimensional space, an MDS projection of such distances will recover the relative locations of the original data points in 3D, such that the distances are preserved. However, if the ℓ1-distances are used, an MDS projection will find a configuration of points in 3D whose Euclidean distances approximate the original ones.
Figure 6.6 Two circles in 3D (left) and their two- or three-dimensional MDS maps based on either Euclidean or city block distance representations. The scales within the 2D and 3D maps are identical.
Figure 6.7 MDS maps of the three straight non-crossing lines in 5D represented by Euclidean distances. The LSS map is provided as a reference. The scale is preserved in all the subplots.
trate two ovals with similar shapes. In general, nonlinear mappings tend to reveal more 'hidden' structure in the data. To illustrate the differences between the stress measures and the nonlinearity aspects of the projections involved, an artificial example of points lying on three non-crossing and non-parallel lines in a five-dimensional space is considered with the Euclidean distance representation. Fig. 6.7 shows 2D linear MDS maps and 2D nonlinear MDS maps obtained by optimization of various S_t stresses. On the basis of either the classical scaling or FastMap results, one could draw the false conclusion that the data set represents three straight crossing lines in a higher-dimensional space. On the contrary, the Sammon maps suggest that the data consist of three non-crossing curves, but of course, not necessarily straight lines. Therefore, linear and nonlinear mappings are useful when studied together, as they complement each other. Sammon maps are ordered with respect to the nonlinearity involved in the projections. By minimizing the S_{-2} stress, one focuses on preserving very small distances, by which local perturbations may appear, as observed in Fig. 6.7, top row, third plot from the left. By optimizing the S_2 stress, on the contrary, one tries to preserve large distances, and as a result, the curves start to become 'more straight'. The stress S_0 keeps a balance between preserving small and large distances. The choice of a stress function depends on the required geometric properties that an MDS map should
(Figure panels, left to right: classical scaling, Sammon map S_{-2}, Sammon map S_0 and Sammon map S_2, with the corresponding raw stress values.)
Figure 6.8 MDS maps of the ℓ1-distance representation of the Pump data. Three operating states are distinguished: normal, marked by circles, imbalance, marked by squares, and bearing failure, marked by crosses. The result of the FastMap is not presented, since it looks very similar to the classical scaling output. The scale is preserved in all the plots.
have. When no preferences are given, from our experience, we recommend the stress S_0.

Pump vibration data. The Pump data set consists of 500 observations with 256 spectral features of the acceleration spectrum, as described in Appendix E.2. The data have a low intrinsic dimension [Ypma et al., 1997]. The MDS projections based on the city block distances are shown in Fig. 6.8. Both classical scaling and FastMap reveal three non-overlapping clusters, while the Sammon mappings with the stresses S_0 and S_2 reveal much more structure in the data. From the Sammon results, new information can be obtained: the class of bearing failure (marked by crosses) is composed of two or even three subclasses, corresponding, in fact, to the three operating speeds used.
Figure 6.9 S_0 stress versus the dimension for the Pump data.
MDS maps can provide additional insight into the data, especially when the data are highly nonlinear. In practice, this means that many dimensions are necessary to explain a high percentage, like 80% or 90%, of the total
(Figure panels: classical scaling, FastMap, Sammon map S_0, Sammon map S_2, LSS map with p_2 and LSS map with p_3.)
Figure 6.10 MDS outputs for the Zongker dissimilarity data, describing dissimilarities between digit images; see Appendix E.2. The disparities in the LSS maps are modeled by p_2 and p_3, i.e. polynomials of the second and third orders, respectively. The scales in the plots are not comparable.
variance in the data when classical scaling is used. For the ℓ1-distance representation of the pump vibration data, two dimensions explain about 36% of the total variance and 107 dimensions would be needed to reach 80%. Basically, in order to judge the intrinsic dimension, a series of MDS mappings should be performed to spaces of growing dimension. Then, a plot of the stress as a function of the dimension can be obtained, as in Fig. 6.9. From such a figure, one can determine the intrinsic dimension by the point where the rapid decrease in the stress function stops. For the Pump data, it is around six or seven dimensions. Another indicator of the intrinsic dimension can be provided by the number of significant eigenvalues in classical scaling; however, nonlinear MDS techniques usually need far fewer dimensions.

Zongker digit data. This data set describes the NIST digits, originally provided as 128×128 binary images [Wilson and Garris, 1992]. Here, the similarity measure based on deformable template matching, as defined in [Jain and Zongker, 1997], is used; see also Appendix E.2. For visualization purposes, a random subset of 25 digits per class is chosen. Fig. 6.10 presents
Figure 6.11 MDS outputs of the Sammon mapping S_{-1} based on the ℓ1-distance representations between two circles, either parallel (top row) or enclosing each other (bottom row), with missing values. The first two plots, starting from the left, present the results where only the distances between the circles were supplied. The differences are due to different initializations. The rightmost plots show the results when around 64% of the distances were randomly removed from the data.
Figure 6.12 MDS output (left) of the Sammon mapping S_0 on the ℓ1-distance representation with 50% of missing values for the Pump data and the corresponding dissimilarity matrix presented as an image (right), where white pixels denote the missing information.
2D MDS maps. It can be observed that, according to the classical scaling result, the classes of the '0', '1' and '6' digits are the most distinguishable. The first two classes are also mostly separated in the other MDS results. On the other hand, in the nonlinear MDS outputs, the class of '2' is overall the most scattered.
Missing values. Any nonlinear MDS can handle missing values. This can be implemented by incorporating extra weights w_ij of zeros and ones, as given in formula (3.33), such that zeros account for the missing information. Provided that the data items are labeled, it is even possible to consider a case where only the dissimilarities between classes are available; see Fig. 6.11. In general, even a large amount of the data can be missing, as illustrated in Figs. 6.11 and 6.12.

6.1.3 Implementation
Initialization. Starting configurations are important, as they influence the resulting projections. Each initialization potentially gives the possibility to end up in a different configuration. It follows from our experience that initializing a Sammon projection by classical scaling (CS) often gives good results [Pekalska et al., 1998a,b,c]. Another advantage is that the minimization process is also relatively short. Therefore, such an initialization is applied in most cases. It is, however, always useful to analyze the MDS result based on a pseudo-random initialization as well. The optimization procedure initialized by classical scaling may, in some cases, easily get stuck in a local minimum. To avoid that, one may also add noise to the output of classical scaling and only then use it for the initialization. Malone et al. argue in [Malone and Trosset, 2000] that a better initialization can be obtained by finding a proper term which scales the classical scaling output; see also Sec. 3.6.2.
Algorithms. There exist a number of different MDS implementations ready for use; see [Borg and Groenen, 1997; Cox and Cox, 1995] for an overview. In our experiments with the variants of Sammon mappings [Pekalska et al., 1998a,b,c], we found that both the Newton-Raphson minimization technique [Press et al., 1992] with a line search algorithm and the scaled conjugate gradient method [Moller, 1993] provide good results. Besides the gradient information, the Newton-Raphson method also uses second order information, approximating the full Hessian by its diagonal matrix. The scaled conjugate gradient approach is a combination of a nonlinear conjugate gradients technique [Press et al., 1992] with a trust-region variant. In the beginning, it attains a very large decrease of the stress function, slowing down considerably after the first few iterations. Therefore, it might be beneficial to start the optimization process with this technique and switch to the Newton-Raphson algorithm after some time for a better
combination of efficiency and performance. Below, for completeness, the first and second partial derivatives of the Sammon stresses are given. Assume that c = Σ_{i<j} δ_ij^{t+2} and t = ..., −2, −1, 0, 1, 2, .... Then, for the q-th coordinate x_iq of the i-th projected point,

∂S_t/∂x_iq = −(2/c) Σ_{j≠i} δ_ij^t (δ_ij − d_ij) (x_iq − x_jq)/d_ij,

∂²S_t/∂x_iq² = −(2/c) Σ_{j≠i} (δ_ij^t/d_ij) [ (δ_ij − d_ij) − ((x_iq − x_jq)²/d_ij) (1 + (δ_ij − d_ij)/d_ij) ].
To stop the iteration process, the following criteria can be used:

Criterion 1:  S_t^i − S_t^{i+1} < ε_stop S_t^{i+1},
Criterion 2:  ||X^{i+1} − X^i|| < ε_stop (1 + ||X^{i+1}||),
where ε_stop stands for a chosen precision and || · || is the Euclidean or max-norm. The superscript indicates the iteration number. All our results presented here are based on Criterion 1 with ε_stop equal to 10^{-6} or 10^{-7}.
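A compact sketch of the overall procedure discussed in this subsection is given below: classical-scaling initialization, diagonal-Hessian updates of the S_{-1} stress using the derivatives given above with the usual Sammon 'magic factor' step, and Criterion 1 as the stopping rule. It is a simplified stand-in for the Newton-Raphson with line search and scaled conjugate gradient implementations referred to above; the toy data are illustrative.

```python
import numpy as np

def classical_scaling(delta, m=2):
    """Classical scaling: embed a dissimilarity matrix into an m-dimensional space."""
    n = delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (delta ** 2) @ J                     # double-centred 'Gram' matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:m]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

def sammon(delta, m=2, mf=0.35, eps=1e-6, max_iter=200):
    """Sammon mapping (stress S_{-1}): CS initialization, diagonal-Hessian updates,
    and Criterion 1 as the stopping rule."""
    n = delta.shape[0]
    X = classical_scaling(delta, m)
    off = ~np.eye(n, dtype=bool)
    c = delta[off].sum() / 2.0                          # c = sum_{i<j} delta_ij
    prev = np.inf
    for _ in range(max_iter):
        diff = X[:, None, :] - X[None, :, :]            # x_i - x_j
        d = np.sqrt((diff ** 2).sum(-1))
        np.fill_diagonal(d, 1.0)                        # dummy values, masked out below
        stress = ((delta[off] - d[off]) ** 2 / delta[off]).sum() / (2.0 * c)
        if prev - stress < eps * stress:                # Criterion 1
            break
        prev = stress
        e = delta - d
        A = 1.0 / (delta * d + np.eye(n))               # 1/(delta_ij d_ij) for i != j
        np.fill_diagonal(A, 0.0)                        # exclude the i = j terms
        grad = -(2.0 / c) * ((A * e)[..., None] * diff).sum(axis=1)
        curv = e[..., None] - (diff ** 2 / d[..., None]) * (1.0 + e / d)[..., None]
        hess = -(2.0 / c) * (A[..., None] * curv).sum(axis=1)
        X = X - mf * grad / np.maximum(np.abs(hess), 1e-12)
    return X

# toy usage: Euclidean distances of random 2D points are reproduced almost exactly
rng = np.random.default_rng(0)
P = rng.normal(size=(30, 2))
delta = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(-1))
X = sammon(delta, m=2)
```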
6.2 Other mappings
In real applications, large high-dimensional data can be modeled as points lying close to a nonlinear low-dimensional manifold or a linear subspace. Examples include image vectors of the same digits, scaled, thickened and tilted, or image vectors of the same objects under different camera positions and lighting conditions. Another example is given by document vectors in a complete database related to a specific topic. Usually, such feature representations live in very high-dimensional spaces (described e.g. by the number of image pixels or the number of terms/phrases in the vocabulary of the text database). The intrinsic dimension, however, is often limited, e.g. due to physical constraints or the degrees of freedom of measuring tools. This observation has recently led to a growing interest in developing algorithms for finding nonlinear low-dimensional manifolds (or subspaces) from data represented in high-dimensional spaces. This can serve the purpose of data visualization as well as the identification of the underlying variables, such as the degree of tilting, angle of elevation or direction of light, given the high-dimensional data. Two main directions can be identified: one based on the preservation of the geodesic distances between the data points (or objects in general)
with respect to the assumed underlying manifold, and the other describing the global structure in terms of (overlapping) local structures. The latter research line follows the already established methodology of self-organizing maps (SOMs) [Kohonen, 2000], generative topographic mappings [Bishop et al., 1996, 1998], principal curves [Hastie and Stuetzle, 1989] or topology-preserving networks [Martinez and Schulten, 1994], however, with emphasis on simple and reliable implementation. Two recent examples of both research lines will be discussed.
Locally linear embedding (LLE). A conceptually simple, but powerful visualization method was developed in [Roweis and Saul, 2000; Roweis et al., 2002; Teh and Roweis, 2003; Saul, 2003]. Given a set of points in ℝ^n, this technique constructs a manifold such that local geometric structures are preserved when collectively analyzed. Although its main goal is unsupervised learning, it can also be applied to classification, as suggested in [de Ridder et al., 2003b]. Assume a collection of N points X = {x_1, x_2, ..., x_N} ⊂ ℝ^n. The following three major steps summarize the LLE algorithm:

(1) For each data point x_i among the points of X, find its k nearest neighbors x_{i_j}, j = 1, ..., k.
(2) Determine the weights that best reconstruct each point from its k neighbors by constrained linear fits.
(3) Find the vectors in a low-dimensional space ℝ^m which are best reconstructed by the derived weights in terms of a constrained least-squares problem.
In step (2), the differences between the points x_i and their linear reconstructions from the k nearest neighbors, Σ_{j=1}^{k} w_j^{(i)} x_{i_j} with Σ_{j=1}^{k} w_j^{(i)} = 1, should be minimized. This leads to the following cost function:

ε(W) = Σ_{i=1}^{N} || x_i − Σ_{j=1}^{k} w_j^{(i)} x_{i_j} ||²,    (6.2)
which can alternatively be expressed as:

ε(W) = Σ_{i=1}^{N} Σ_{j=1}^{k} Σ_{l=1}^{k} w_j^{(i)} w_l^{(i)} Q_{jl}^{(i)},    (6.3)
where Q^{(i)} is a k × k local Gram matrix, determined for the i-th point, with the elements Q_{jl}^{(i)} = (x_i − x_{i_j})ᵀ(x_i − x_{i_l}). The weights are determined
by solving the least-squares formulation of Eq. (6.3) with the constraints Σ_{j=1}^{k} w_j^{(i)} = 1. This is done for each x_i separately, giving

w_j^{(i)} = Σ_{l} R_{jl}^{(i)} / Σ_{p,l} R_{pl}^{(i)},
where R^{(i)} = (Q^{(i)})^{-1}. As in practice the matrix Q^{(i)} may become singular (e.g. if k > n), its regularized version R_reg^{(i)} = (Q^{(i)} + λI)^{-1} may be used instead. In step (3), the weights are fixed, and an m-dimensional configuration {y_1, y_2, ..., y_N} is sought which minimizes a similar reconstruction error:
ε_Y = Σ_{i=1}^{N} || y_i − Σ_{j=1}^{k} w_j^{(i)} y_{i_j} ||².
As the weights can be stored in a sparse N × N matrix W, and denoting M = (I − W)ᵀ(I − W), the above formulation simplifies to:

ε_Y = Σ_{i=1}^{N} Σ_{j=1}^{N} M_{ij} y_iᵀ y_j = tr(Yᵀ M Y).
To find the minimum of tr(Yᵀ M Y), an additional constraint is introduced to prevent the trivial zero solution. It requires that the covariance matrix of Y is the identity matrix. By using Lagrange multipliers α_i and setting the derivative of the Lagrangian to zero, one gets the eigen-equation (M − diag(α_i)) Y = 0. As the eigenvector of M corresponding to the smallest eigenvalue is the mean vector of Y (which is the result of the added constraint), it is disregarded. The eigenvectors of M corresponding to the next m smallest eigenvalues give the sought low-dimensional representation. In brief, the solution to LLE resolves into the problem of finding eigenvectors of a large, but sparse, matrix M, which encodes information on local neighborhoods. Thanks to this sparsity, the implementation can be made efficient. A difficulty, however, arises since the weights in step (2) rely on the inverses of the local Gram matrices, which should be regularized to avoid singularities. In our experience, the value of the regularization strongly influenced the results. This is also confirmed by [de Ridder et al., 2003b]. Since the computation of the weights is based on local Gram matrices, there exists a straightforward implementation of the LLE method based on Eu-
clidean distances [Saul, 2003]; see also Sec. 3.4.2 on the linear relation between the Gram matrix and square Euclidean distances. Hence, the inner product matrix Q^{(i)} for the point x_i can be derived from the square distances only; its elements are:

Q_{jl}^{(i)} = ½ ( d²(x_i, x_{i_j}) + d²(x_i, x_{i_l}) − d²(x_{i_j}, x_{i_l}) ).
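A minimal sketch of this distance-based variant of LLE could look as follows; the regularization constant, the neighborhood size and the toy data are hypothetical choices.

```python
import numpy as np

def lle_from_distances(D, k=8, m=2, reg=1e-3):
    """Locally linear embedding computed from a (Euclidean) distance matrix D."""
    N = D.shape[0]
    D2 = D ** 2
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(D[i])[1:k + 1]              # k nearest neighbours (skip i itself)
        # local Gram matrix from squared distances:
        # Q_jl = 0.5 * (d^2(i, i_j) + d^2(i, i_l) - d^2(i_j, i_l))
        Q = 0.5 * (D2[i, nbrs][:, None] + D2[i, nbrs][None, :] - D2[np.ix_(nbrs, nbrs)])
        Q = Q + reg * np.trace(Q) / k * np.eye(k)     # regularization against singularity
        w = np.linalg.solve(Q, np.ones(k))            # solve Q w = 1, then normalize
        W[i, nbrs] = w / w.sum()
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:m + 1]                           # skip the constant eigenvector

# toy usage: points on a smooth 1D curve embedded in 3D
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 2 * np.pi, 200))
X = np.c_[np.cos(t), np.sin(t), 0.3 * t] + 0.01 * rng.normal(size=(200, 3))
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
Y = lle_from_distances(D, k=10, m=1)
```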
Based on the same principle, non-Euclidean distances can also be used, giving rise to indefinite local Gram matrices. For small neighborhoods, non-Euclidean distances will approximate the Euclidean ones well. For large neighborhoods (k is large), however, the deviation from Euclidean behavior might be significant. Still, the derived information can be used as an approximation. Here, our proposal is to use local corrections of the indefinite Gram matrices to make them positive definite by adding proper constants, as discussed in Sec. 3.5.2. We will denote this modification as the corrected-LLE method.

Isomap. The Isomap technique [Tenenbaum et al., 2000] shares some virtues of LLE. Its philosophy is, however, different, since it is based on the notion of geodesic distances (the shortest distance between two points on a manifold). Geodesic distances can be approximated by summing up a sequence of distances between neighboring points. These approximations are computed efficiently by finding the shortest paths in a graph with edges connecting neighboring data points. The Isomap algorithm takes as input all pairwise distances d_ij between N points, which are assumed to lie in a high-dimensional space ℝ^n. The distance measure d is either Euclidean or problem-specific. The output is a collection of points y_i in a low-dimensional space ℝ^m, such that they best represent the intrinsic (path-connected) geometry of the original distance data. The Isomap method can be summarized in three steps:
(1) Construct a neighborhood graph G by connecting the points i and j if d_ij is smaller than a chosen ε (ε-Isomap) or if j belongs to the group of the k nearest neighbors of i. Initialize d^G_ij = d_ij if i and j are connected by an edge. Set all other values to ∞. As a result, G is a graph weighted by the distances between neighboring points.
(2) Estimate the geodesic distances between all pairs of points by computing their shortest-path distances. This is done by replacing d^G_ij by min{d^G_ij, d^G_ik + d^G_kj} for each k = 1, 2, ..., N in turn. The final distance values d^G_ij yield the geodesic distance matrix D_G.
(3) Apply classical scaling, Sec. 3.5.1, to the shortest-path distance matrix D_G to derive the m-dimensional configuration; a small code sketch of these steps follows below.
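The following sketch, in Python with NumPy, illustrates the three steps. It is our own illustration, not the authors' implementation; the function name isomap_embedding and the parameter defaults are assumptions, and the neighborhood graph is assumed to be connected.

import numpy as np

def isomap_embedding(D, k=5, m=2):
    """Minimal Isomap sketch for an (N x N) symmetric dissimilarity matrix D."""
    N = D.shape[0]
    G = np.full((N, N), np.inf)
    np.fill_diagonal(G, 0.0)
    # step (1): k-nearest-neighbour graph weighted by the given dissimilarities
    for i in range(N):
        nn = np.argsort(D[i])[1:k + 1]
        G[i, nn] = D[i, nn]
        G[nn, i] = D[nn, i]
    # step (2): geodesic distances via the Floyd-Warshall shortest-path update
    for t in range(N):
        G = np.minimum(G, G[:, t:t + 1] + G[t:t + 1, :])
    # step (3): classical scaling of the geodesic distance matrix D_G = G
    J = np.eye(N) - np.ones((N, N)) / N           # centering matrix
    B = -0.5 * J @ (G ** 2) @ J                   # Gram matrix from square distances
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:m]                 # m largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))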
Isomap is a nonlinear extension of classical MDS, in which the embedding is optimized to preserve geodesic distances. Isomap is asymptotically guaranteed to recover the true dimension and geometric structure of a class of nonlinear manifolds whose intrinsic geometry is that of a convex region of a Euclidean space (provided that Euclidean distances are used as inputs), even though the manifold might be highly folded, twisted or curved in a high-dimensional space; see [Tenenbaum et al., 2000] for proofs. A continuous characterization of Isomap is discussed in [Zha and Zhang, 2003]. Both Isomap and LLE construct low-dimensional manifolds in a nonlinear way. Their applicability to general data represented by dissimilarities will be, however, limited due to their underlying assumption of a densely sampled manifold. Moreover, the choice of a proper neighborhood size, i.e. the number of neighbors, or equivalently, the ε-neighborhood, might be problematic. We observed that especially LLE is sensitive to this. A possible failure of LLE is then to map far-away points to nearby outputs in the projected space. On the other hand, Isomap is dominated by the preservation of far-away (geodesic) distances at the expense of distortions in local geometry. This holds since classical MDS minimizes the raw stress. Consequently, their usefulness is justified for densely sampled data. From that point of view, traditional MDS techniques might be preferable to get insight into the structure of the, possibly undersampled, data. Still, we think that the preservation of geodesic distances can reveal additional aspects of the data. Then, we would propose to perform a nonlinear MDS on the approximated geodesic distances instead of classical scaling as done in Isomap. The reason is to put more emphasis on local geometry. We will denote this modification as Sammon-Isomap.
Kernel PCA. Another technique trying to discover an underlying spatial structure of the data is the kernel PCA method [Schölkopf et al., 1997b, 1998a]. It finds projections onto principal directions in the space defined by the kernel map. It starts from an N × N positive semidefinite kernel K, a reproducing kernel, interpreted as a generalized inner product matrix. Such a kernel defines a reproducing kernel Hilbert space K_K, where the data vectors are mapped as K(x_i, ·); see Sec. 2.6.1. We will assume that the kernel K is centralized, i.e. the mean vector in K_K lies at the origin. Given any kernel K, this can be achieved by K = JKJ, where J = I − (1/N) 1 1^T is the centering matrix, as in Sec. 3.5.1. The covariance matrix in K_K becomes
then C = (1/N) Σ_{j=1}^N K(x_j, ·) K(x_j, ·)^T. The principal components cannot be directly determined if only the kernel K is given. However, they lie in the span of the data vectors in K_K defined by K. Hence, there exist coefficients α_1, α_2, ..., α_N such that u = Σ_{i=1}^N α_i K(x_i, ·) is an eigenvector of C. By straightforward operations, the eigen-equation λ u = C u can be rewritten as N λ K α = K² α. Provided that K is non-degenerate, this simplifies to N λ α = K α. The vectors α are found as eigenvectors of the kernel K. They describe the coefficients of the linear combinations of the data vectors K(x_i, ·) in K_K. They are normalized such that λ_k (α^k)^T α^k = 1, where α^k is the k-th eigenvector. The k-th principal component of a test point x is extracted as Σ_{i=1}^N α_i^k K(x_i, x). So, kernel PCA is used to find the principal directions of the reproducing kernel Hilbert space it describes; see also Sec. 2.6.1. If one starts from a square Euclidean distance matrix, the kernel PCA performed on the appropriately derived Gram matrix is similar to an approximate embedding of the distances into an underlying Euclidean space, as discussed in Secs. 3.5.1 and 3.5.6. For that reason we will not investigate the kernel PCA here.
The techniques of LLE, Isomap and kernel PCA provide low-dimensional representations which are determined as solutions to (generalized) eigenvalue problems defined on specially derived matrices. These matrices should reflect the local geometry or similarity between the data instances. The methods are nonlinear extensions of the traditional PCA and classical scaling approaches. Since 2000, many techniques based on variants of this principle have been proposed. Examples are the Hessian eigenmap [Donoho and Grimes, 2003] and the Laplacian eigenmap [Belkin and Niyogi, 2002a,b]. Given a weighted neighborhood graph emphasizing the local geometry in the data, the latter technique defines a spatial representation by eigenvectors of the graph Laplacian². The way the weights are selected should reflect the local geometry of the data. A list of papers on various manifold learning techniques is collected by Law [Law].

²Let G = (V, E) be an undirected, unweighted graph without graph loops (i, i). V, with |V| = N, is a set of vertices and E is a set of edges. The graph Laplacian L is an N × N symmetric matrix defined for each pair (i, j) of vertices by L_ii = deg i, i.e. the vertex degree given by the number of edges that touch i; L_ij = −1 if i ≠ j and (i, j) ∈ E; and L_ij = 0 otherwise. If the graph is weighted, then each edge (i, j) has a weight w_ij. The Laplacian is then computed as L = diag(W1) − W, where W is a symmetric weight matrix and diag(W1) is a diagonal matrix consisting of the elements Σ_j w_ij.
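To illustrate the kernel PCA eigen-equation above, a minimal NumPy sketch follows. It is our own illustration, not the authors' code; the function name kernel_pca_scores is an assumption and the eigenvectors are normalized so that the corresponding principal directions in the feature space have unit length.

import numpy as np

def kernel_pca_scores(K, m=2):
    """Kernel PCA scores of the training objects for an (N x N) psd kernel matrix K."""
    N = K.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J                                # centre the kernel in the feature space
    w, A = np.linalg.eigh(Kc)                     # eigenvectors alpha of the centred kernel
    order = np.argsort(w)[::-1][:m]
    w, A = w[order], A[:, order]
    A = A / np.sqrt(np.maximum(w, 1e-12))         # unit-length principal directions
    return Kc @ A                                 # row i: projections of x_i on the m directions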
Curvilinear Component Analysis. Another research line focuses on unfolding a nonlinear structure present in the data. It was started by Curvilinear Component Analysis (CCA) [Demartines and Hérault, 1997; Hérault et al., 1999; Guérin-Dugué et al., 1999], which draws its inspiration from the MDS techniques and Kohonen self-organizing maps (SOMs) [Kohonen, 2000]. Similarly to MDS, a least-square loss function is minimized, but an additional weight function F is used, which depends on the current estimates of the approximated distances in a Euclidean space. F is a decreasing and bounded function of its argument, such as an exponential, a sigmoid or a step function. Its particular choice is used to favor local topology preservation, similarly to SOM. Consequently, the CCA tries to reproduce the short distances first and then the large ones. An additional value is the efficient implementation; see [Demartines and Hérault, 1997] for details. Basically, the loss function is given as E(X) = ½ Σ_{i<j} (δ_ij − d_ij(X))² F(d_ij(X), λ_X), where δ_ij are the given dissimilarities, d_ij are the Euclidean distances in the projected space and λ_X is the neighborhood parameter. By the focus on distances in local neighborhoods, the unfolding of a manifold is revealed more prominently than in the case of the MDS techniques. This means that, on average, d_ij tends to be larger than δ_ij for large dissimilarities. It is emphasized in [Demartines and Hérault, 1997] that due to this special loss function, the CCA method is able to better preserve the local topology when mapping data from dissimilarities to a Euclidean space. On the other hand, although the CCA might be very beneficial for well-sampled manifolds, it may locally get into too much detail in reproducing the dissimilarity structure, especially for data exhibiting clusters. Then, information may be lost.
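As a minimal illustration of the weighted CCA loss E(X), the following sketch (our own, not the authors' implementation) evaluates it for a step-function choice of F; the function name cca_stress is an assumption. CCA itself minimizes this quantity by stochastic updates while the neighborhood radius λ_X is gradually decreased.

import numpy as np

def cca_stress(delta, Y, lam):
    """Weighted CCA stress E(X) for given dissimilarities delta (N x N),
    a current configuration Y (N x m) and a neighbourhood radius lam.
    F(d, lam) is chosen here as a step function: 1 if d < lam, else 0."""
    d = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1))
    F = (d < lam).astype(float)
    iu = np.triu_indices_from(d, k=1)            # every pair i < j once
    return 0.5 * (((delta - d) ** 2) * F)[iu].sum()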
Curvilinear Distance Analysis. Curvilinear Distance Analysis is an extension of the CCA method [Lee et al., 2000, 2002]. The novelty lies in the use of curvilinear distances, expressing the distance measured along the data structure, instead of the original dissimilarities δ_ij. Such curvilinear distances are computed as the shortest paths between two chosen prototypes, after their quantization and linking.
6.3
Examples: getting insight into the data
We will present embedding examples of artificial and real dissimilarity data. The LLE and Isomap routines come from the dedicated web pages; see [LLE, site; Isomap, site]. Since, in general, Isomap or LLE are suitable for locally linear, but globally nonlinear, embeddings, the dissimilarity data should be more complex than representing the distances between two cir-
[Figure 6.13 shows the following panels: Classical scaling, Sammon map S0, Sammon map S-2, FastMap, CCA, Isomap (k = 5), LLE (k = 100), Sammon-Isomap (k = 5) and Sammon-Isomap (k = 50).]
Figure 6.13 Outputs of various projection methods based on the Euclidean distance representations of the Hypercube data in 100D. k denotes the number of neighbors used to define local neighborhoods. The scales are not comparable.
[Figure 6.14 shows the following panels: LLE (k = 100), Isomap (k = 100), Sammon map S0, Sammon-Isomap (k = 100), corrected-LLE (k = 100), LLE (k = 200), Isomap (k = 200), CCA (random start), CCA, Sammon-Isomap (k = 200) and corrected-LLE (k = 200).]
Figure 6.14 Outputs of various projection methods based on the l1-distance representation of the Pump data. k denotes the number of neighbors used to define local neighborhoods. For k < 100, the LLE projection reduces to three points, while Isomap determines the geodesic distances between 300 vibration spectra only. The scales are not comparable.
[Figure 6.15 shows the following panels: Isomap (k = 3), Sammon-Isomap (k = 3), Isomap (k = 10), Isomap (k = 100), corrected-LLE (k = 5) and corrected-LLE (k = 30).]
Figure 6.15 Outputs of various projection methods for the Zongker data. The dissimilarities between the digit images are computed in a template matching process. k denotes the number of neighbors used to define local neighborhoods. The scales are not comparable.
[Figure 6.16 shows the following panels: Classical scaling, Sammon map S0, CCA, Isomap (k = 10), Sammon-Isomap (k = 10), Isomap (k = 20), LLE (k = 100), corrected-LLE (k = 100) and corrected-LLE (k = 20).]
Figure 6.16 Outputs of various projection methods for the News-cor data defined by the correlation-based dissimilarities between the text newsgroups: 'comp.*', marked by crosses, 'rec.*', marked by circles, 'sci.*', marked by squares, and 'talk.*', marked by stars. k denotes the number of neighbors used to define local neighborhoods. The CCA result is presented after 200 iterations; however, even 2000 iterations did not change the results significantly. The scales are not comparable.
cles. By Sammon-Isomap, we mean that the embedding procedure follows the Isomap routine until the estimation of geodesic distances, but then it uses the Sammon mapping S0 instead of classical scaling to find the low-dimensional representation. The CCA is initialized by the classical scaling followed by 50-100 iterations. We also noticed that in a number of cases, when a random initialization is used for the CCA, 2000 iterations were not sufficient to discover the structure in the data; see Fig. 6.14. Let us consider the Euclidean distance representation of the Hypercube data as described in Appendix E.1. The data points are generated inside two enclosing hypercubes in a 100-dimensional space. The results of various mappings are presented in Fig. 6.13. Concerning the distance data as shown in Fig. E.1, right plot, according to our judgment, the Sammon mapping and Isomap reveal the data structure most appropriately. They discover one compact cluster (corresponding to the smaller hypercube) with points around it, possibly suggesting another cluster. Of course, there is an inherent side effect of the Sammon stresses to produce sphere-like rather than square-like shapes, which has already been pointed out in Sec. 3.6.2. The pump vibration spectra represented by the l1-distances (see Appendix E.2 for the data description and Sec. 6.1 for the MDS results) are a difficult case for both the LLE and Isomap techniques. The reason is that the data describe well separated classes. The sampled 'manifold' is not continuous, hence many nearest neighbors have to be taken into account in order to discover such a structure. For fewer than k = 100 nearest neighbors, the LLE method collapses to a result of three points in a 2D space. Also Isomap projects points on top of each other. Many nearest neighbors have to be included and, as a side effect, both methods become more costly than the nonlinear MDS mappings. The results of various mappings are shown in Fig. 6.14. Note that the CCA focuses on the locality so much that it loses the ability to show the separateness of the classes. It also has difficulties presenting a good solution when the initialization is random; see the second plot, top row in Fig. 6.14. Of all the plots, the Sammon map S0 is the only one which detects three subclusters in the bearing failure mode of the pump; see Fig. 6.8 for the MDS results. This example clearly shows that the assumption of a reasonably continuous underlying manifold is essential for the LLE, Isomap and CCA methods. While they are shown to be successful in such cases, they have to be used carefully if one wishes to analyze possible cluster tendencies. Spatial representations of the Zongker dissimilarity data are presented in Fig. 6.10 (MDS maps) and Fig. 6.15 (other maps). Note that these data
[Figure 6.17 shows pairs of plots for: FastMap, Classical scaling, Isomap (k = 5) and Sammon map S0.]
Figure 6.17 Illustration of the generalization abilities of the following projection methods: classical scaling, Sammon mapping, LLE, Isomap and CCA. Each two subsequent plots correspond to one method and their scales are identical. From each pair, the left plot presents the projection of the Euclidean distance representation of the Hypercube data based on all points; the right plot shows the result when first the map was established by 200 randomly selected points (marked by dots) and then the remaining 400 points were added to the existing map (marked by circles).
are an example of significantly non-metric dissimilarities. While the MDS methods find, in general, the classes of '0' and '1' digits to be the most confined, Isomap considers the classes of '3' and '5' (and '0' for k = 10) as the most distinguishable. The remaining classes are heavily overlapping, as judged from the Isomap result. LLE could not detect any sensible structure in the data, also for larger neighborhoods (which is not presented here). Depending on the neighborhood size, the corrected-LLE distinguishes the classes of the '5', '9', '4' and '0' digits. Still, the results vary tremendously with the increasing size of the local neighborhood, hence it is hard to draw clear conclusions. According to the CCA map, '8' is the central class, similar to all other classes, '1' is the most compact class and '2' is the most
confusing, as single examples of '2' appear all over the place. Since the CCA method 'unfolds' the data, it is hard to judge which classes are potentially overlapping. The last example refers to the News-cor data, the newsgroups data, for which the non-metric correlation-based dissimilarity representation was computed; see Appendix E.2 for the description. The results of the mappings are presented for 100 randomly chosen objects per class. They can be observed in Fig. 6.16. In general, the newsgroup 'rec.*' is the most well-defined class, followed by the 'talk.*' group, as revealed by the MDS maps and Isomap. The corrected-LLE seems to detect the cluster of 'rec.*', however only for a large neighborhood.
Generalization abilities. Classical scaling and Isomap can be naturally extended such that new data are added to an existing map. Such a generalization relies on an orthogonal projection, which can be easily applied; see Sec. 3.5.5 for details. The possibility of adding new points to the Sammon map, by an iterative minimization of a modified stress function, has already been discussed in Sec. 3.6.3. The extension of LLE is straightforward: for each new object, its k nearest neighbors are found and the weights are determined such that, in the lower-dimensional space, the projected point is represented in the best way as a linear combination of its neighbors. The generalization of the CCA is also apparent and described in [Demartines and Hérault, 1997]. In principle, this suggests that any of the mappings described so far can be used for classification purposes. An example of their generalization abilities is presented in Fig. 6.17. In our case, however, the CCA does not seem to generalize well.

6.4
Tree models
A tree structure over the dissimilarity data enhances a natural interpretation of the relations between the objects. It is a useful tool supporting the understanding of the data structure, e.g. by inferring the organization of the objects, especially when there are not too many of them. Moreover, trees support the hierarchical clustering scheme based on proximities. Such discrete models can be considered as complementary to the continuous spatial representations obtained e.g. by the MDS techniques. The key discrete model is the additive tree model, which represents objects by nodes of a tree and defines distances as path metrics between two nodes. An additive tree is a connected, undirected graph where each pair of
nodes is joined by a unique path. An n × n dissimilarity matrix D defines a unique additive tree if D is additive, hence ℓ1-embeddable. This means that the distance between two points is a path metric realized by the sum of positive weights along the path connecting the points; see Sec. 3.2. From an algorithmic point of view, the additivity of D stands for D being a metric and fulfilling the four-point inequality as presented in Def. 3.10. A special case of an additive tree is an ultrametric tree, which is intimately related to hierarchical clustering of the data. It is an additive rooted tree in which the distance from the root to every leaf is identical, as in dendrograms. Formally, an n × n distance matrix D defines a unique ultrametric tree if the ultrametric inequality as in Def. 3.11 holds. The root is not determined in additive trees, hence different interpretations may be suggested by choosing different roots. Basically, the root helps in distinguishing clusters in the data, so it could be chosen to enhance the interpretability of the data. This, however, requires some prior knowledge. Another possibility is to place the root at a node which minimizes the variance of the distances from the root to the leaf nodes, so that it splits the data into homogeneous clusters. In practice, there might be no tree metric coinciding exactly with the given dissimilarity matrix D, hence no exact representation by an additive or ultrametric tree. This means that a tree metric D̂ can be sought which provides the best approximation of D under some criterion, e.g. given by a loss function such as the ℓ1, ℓ2 or ℓ∞ norms. This is a formulation of the numerical taxonomy problem; see e.g. [Barthélemy and Guénoche, 1991; Kim and Warnow, 1999]. Such tasks of fitting an additive or ultrametric tree are known to be NP-hard under the ℓ1 and ℓ2 loss [Sneath and Sokal, 1973; Chepoi and Fichet, 2000; Kim and Warnow, 1999]. In the case of the ℓ∞ norm, the same holds for an additive tree [Agarwala et al., 1999], however the optimal ultrametric tree can be computed in polynomial time [Farach et al., 1995]. There exists a number of other methods trying to construct such trees so that the path distances approximate the given distances as well as possible; see e.g. [Agarwala et al., 1999; Cohen and Farach, 1997; Farach et al., 1995; Gascuel, 1997, 2000; de Soete, 1984c; de Soete and Caroll, 1996] for specific algorithms. Below we briefly mention some of these tree fitting techniques.
Approximation under the ℓ2 norm. The dissimilarity data D can be approximated by an additive or ultrametric tree in terms of the least square error. Given D = (d_ij), the distances D̂ = (d̂_ij), defining either an additive or ultrametric tree, are sought such that (in the terminology of MDS) the raw stress Σ_{i<j} (d̂_ij − d_ij)² is minimized. This can be formulated as:

Additive tree:
  Min.  L(D̂) = Σ_{i<j} (d̂_ij − d_ij)²
  s.t.  d̂_ij + d̂_kl ≤ max {d̂_ik + d̂_jl, d̂_il + d̂_jk},  ∀ i, j, k, l

Ultrametric tree:
  Min.  L(D̂) = Σ_{i<j} (d̂_ij − d_ij)²
  s.t.  d̂_ij ≤ max {d̂_ik, d̂_jk},  ∀ i, j, k
A practical algorithm to solve these constrained optimization problems by transforming them into a series of unconstrained problems is proposed in [de Soete, 1984b,a].
Approximation under the ℓ∞ norm. It is known [Gower and Ross, 1969] that given a distance matrix D, there exists a unique ultrametric distance matrix D_U such that D_U(i,j) ≤ D(i,j) for all pairs (i,j) and D_U is maximal, i.e. all other such ultrametric distance matrices are dominated by D_U. One way to find D_U is to construct a minimum spanning tree³ T on the complete graph whose edge weights are the distances of D. Then, D_U is built from the maximum weights of the edges in T. The same tree is obtained in the greedy agglomerative approach of the single-linkage (SL) algorithm, which is of quadratic complexity. It starts with all objects in their own clusters. Then, repetitively, it finds the two clusters with the smallest distance and merges them into one cluster, until one cluster is left. After every merge, the distance from the new cluster to the other clusters is recomputed as the smaller of the two original distances, so all distances can only be reduced. Due to its simplicity, the SL algorithm has become popular and it is widely used in cluster analysis.
A possibility to fit an additive tree to a given dissimilarity matrix D is to use the neighbor joining heuristic [Saitou and Nei, 1987]. Conceptually, the method is related to the SL algorithm, but without resorting to the assumption of an ultrametric tree. The idea here is to join the clusters that are not only close to one another, but are also far from the rest. The method begins with all objects in their own clusters (leaves). In each
³A minimum spanning tree (MST) is a tree T_mst that spans all the nodes and minimizes the total weight of the tree, i.e. Σ_{e∈T} w_e. An MST constructing algorithm starts from an arbitrary root node and grows until the tree spans all the nodes. The algorithm is greedy since the tree is augmented, step by step, with an edge that contributes the minimum amount possible to the total weight cost. MSTs can be used to solve tree optimization problems.
[Figure 6.18 shows three panels: an ultrametric tree, Additive tree I and Additive tree II, with leaves labeled by sports such as ice hockey, hockey, softball, baseball, polo, golf, football, lacrosse, basketball, volleyball, swimming, track, canoeing, skiing and gymnastics.]
Figure 6.18
Tree models for human dissimilarity judgments on various sports.
step, the algorithm attempts to find the direct parent of two nodes in the tree. For the i-th node, its average distance to the other nodes is estimated as m_i = Σ_{j≠i} D(i,j)/(n−2). In order to minimize the sum of all branch lengths, the nodes i and j that are clustered next are those for which D(i,j) − m_i − m_j is smallest. The distances between the nodes are then recomputed appropriately. The algorithm stops when all objects belong to one cluster. Its time complexity is O(n²).
Another approach to fitting an additive tree relies on the property that an additive metric D_A can be characterized by an associated ultrametric via a centroid metric. A centroid metric D_C is a metric which is realized by a weighted tree with a star topology (i.e. a tree in which all nodes but one are leaves) and edge weights w_i. Then, D_C(i,j) = w_i + w_j. More formally, for a chosen object a, let m_a = max_i D_A(a,i). Then, the centroid metric is defined by the weights w_i = m_a − D_A(a,i) such that D_C(i,j) = w_i + w_j = 2 m_a − D_A(a,i) − D_A(a,j).
D_A is an additive metric iff D_A + D_C is ultrametric [Agarwala et al., 1999; Deza and Laurent, 1994; Chepoi and Fichet, 2000]. Since the nearest ultrametric can be found in quadratic time by the SL algorithm, this suggests a general strategy for fitting an additive metric D_A to D in quadratic time. Loosely speaking, given D, a centroid metric D_C is chosen and added to D. Then, an ultrametric D_U approximating D + D_C is found. The additive metric D_A is determined as D_U − D_C, which then serves for the reconstruction of the tree; see [Agarwala et al., 1999; Chepoi and Fichet, 2000] for specific algorithms.
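The subdominant ultrametric D_U discussed above can be obtained directly from the single-linkage dendrogram, since the cophenetic distance of single linkage equals the maximum edge weight on the minimum spanning tree path between two objects. A minimal SciPy sketch (our own illustration; the function name is an assumption) is:

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, cophenet

def subdominant_ultrametric(D):
    """Largest ultrametric matrix D_U with D_U(i,j) <= D(i,j), for a symmetric
    distance matrix D with a zero diagonal."""
    condensed = squareform(D, checks=False)       # condensed (upper-triangle) form
    Z = linkage(condensed, method='single')       # single-linkage dendrogram
    return squareform(cophenet(Z))                # cophenetic distances = subdominant ultrametric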
Generalization abilities. It is not clear to us how new objects can be added to existing trees. To our knowledge, this aspect is not discussed in the literature, although it is possible to think of constructing additive and ultrametric distance trees for rectangular dissimilarity matrices D(T,R), where the sets R and T are distinct. Conceptually, the most reasonable approach would be to construct a tree again, based on all the dissimilarities, including the ones of the newly coming objects. This is, however, not a true generalization. Surely, one can think of approaches which add objects to existing trees, e.g. by appending them to the objects for which the distances are the smallest, but then the complete additive structure of the tree will be destroyed. So, this remains an open issue.
Two examples. Let us consider the auditory confusion (dissimilarity) measurements for letters and numerals and the human judgments on sports. The fitted ultrametric and additive trees are presented in Figs. 6.19 and 6.18, and are made by using the routines of Lee or Strauss [Lee; Strauss]. The same figure also contains a representation of a minimum spanning tree pictured between the points of the MDS map. In an additive tree the root is not determined. Choosing different roots may suggest different interpretations. Therefore, two different additive trees are shown in the figures: the first one (I) is found such that the root is placed at a node which minimizes the variance of the distances from the root to the leaves, and the second tree (II) is unrooted and determined such that it has three or four apparent clusters (or, in fact, internal nodes). All the presented trees agree in their basic interpretations. These are: the existence of a clear cluster consisting of 'I', '5', 'R', '1', 'Y' and a somewhat more remote '9', the identification of remote objects such as '4' and 'W' in the case of Fig. 6.19, and a basic division of sports into team sports and individual sports, as observed in Fig. 6.18.
[Figure 6.19 panels include: Additive tree II, the MST pictured over the MDS result and an MST-based grouping.]
Figure 6.19 The plots in the top and middle rows show tree models of the auditory confusion measurements for letters and numerals; see also Fig. 6.2. The plots in the bottom row show the minimum spanning tree models drawn on the 2D MDS maps applied to the same data.
6.5
Summary
Dissimilarity data can either be modeled spatially by linear and nonlinear projections to an output space or by tree representations.
In the first group of methods, multidimensional scaling (MDS) techniques play a special role, since they aim to preserve all pairwise, symmetric dissimilarities, resulting in a faithful, low-dimensional representation of the geometrical relations between the points. They usually do this in a Euclidean space. Other methods concentrate on the preservation of dissimilarities in local neighborhoods, like locally linear embedding and curvilinear component (distance) analysis, or on the preservation of locally estimated geodesic distances, like Isomap. Nonlinear methods can reveal more structure and cluster tendencies than linear ones. However, they also consume much more time. To understand data better, one should use both linear and nonlinear techniques, since they reveal complementary information. Classical scaling (a linear projection), accompanied by the Sammon map S0 and Isomap, can provide good insight into the data. Due to the inherent property of nonlinear MDS techniques of projecting data onto spherical shapes, some judgments might be biased. Also, the Sammon map and Isomap complement each other. Isomap preserves far-away geodesic distances at the expense of distortions in local geometry, while the Sammon mapping penalizes large distances to maintain the local geometry. Our conclusion, therefore, is that for general dissimilarity representations of possibly undersampled problems, the most revealing projections are those based on MDS principles (including the kernel PCA technique) and Isomap. Other methods, such as locally linear embedding and curvilinear component (distance) analysis, seem to need dense samplings and a clearly identifiable low intrinsic dimension, hence their usage is limited. Tree models focus on the organizational aspects of dissimilarity data. They enhance an understanding of data in terms of hierarchical or nested structures and, moreover, they are easy to interpret. However, to make the interpretation process feasible, the objects should be distinct from each other and, obviously, there should not be too many of them. Trees naturally support evolutionary processes in which all the objects have a common initial structure, and additional distinctive features are developed later on. Examples are the evolution of species or of a language over time.
Chapter 7
Further data exploration
If you torture data sufficiently, it will confess to almost anything. FRED MENGER
Understanding data is essential in the process of designing and validating learning algorithms. Visualization is often the first step in data exploration. In the previous chapter, we described continuous and discrete spatial representations of dissimilarity data, attained either by vector configurations in low-dimensional spaces or by weighted, fully connected graphs. Such techniques facilitate visualization. Subsequent steps in the analysis require a more profound comprehension of the relations among data instances. Therefore, this chapter focuses on methods that help in the exploration of dissimilarity data, enabling an assessment of their organization and (underlying) structures. Three main issues are discussed here, related to the structure and complexity of a dissimilarity data representation. These are clustering techniques, the estimation of intrinsic dimension, and sampling. Initially, all given objects are candidates for the representation set, so the analysis starts from an n × n dissimilarity matrix D(R,R). The first question, investigated in Sec. 7.1, deals with cluster tendencies in the data. Since the clustering problem has gained a great deal of attention over the years, we are not able to study the numerous methods that are available, as this would be a research issue in itself. We will, therefore, focus on essential algorithms naturally related to dissimilarities. The second question of this chapter involves the estimation of intrinsic dimension. This is a difficult problem and may be relative to the given learning task. Some ideas are discussed in Sec. 7.2. The third question is about sampling issues, i.e. whether the given dissimilarity data are represented by a sufficient number of objects. Some proposals, together with an empirical study, are presented in Sec. 7.3 and rely on [Duin and Pekalska, 2001, 2005].
7.1
Clustering
Clustering has been addressed in many contexts and disciplines, reflecting its significance in exploratory data analysis. The purpose of clustering is to improve understanding and to enhance interpretation of the data by organizing them in meaningful groups, such that examples within one group are more closely related than those from different groups [Jardine and Sibson, 1971; Sneath and Sokal, 1973; Hartigan, 1975; Jain and Dubes, 1988; Jain et al., 1999]. Therefore, such techniques are often used to analyze the structure in the data. Some of the most important applications are image segmentation, data mining, information retrieval and categorization. The clustering task is subjective, since the data can be partitioned differently depending on what is taken into account. It basically reflects the user's needs. For instance, one may be interested in finding 'natural clusters' in the data, representatives of homogeneous clusters, other types of useful (i.e. easily interpretable) data groupings or even outliers. Consequently, there is no universally applicable technique that is able to uncover the variety of all structures present in the data. Depending on the final aim, a suitable method should be used.
7.1.1
Standard approaches
Two basic strategies have been developed for clustering: hierarchical and partitioning methods, both encompassing a variety of algorithms. Most of them rely on the notion of a (dis)similarity and an additional criterion specifying how the clusters are formed. The dissimilarity is not relative, i.e. between pairs of objects, but conceptual, comparing objects (or concepts) to concepts. Objects are grouped according to their fit to the specified concepts; see also Sec. 4.2. The concept of a cluster is represented by a model, such as a specified density function of a cluster or an average dissimilarity within the cluster.
Hierarchical clustering. Hierarchical clustering proceeds successively either by merging smaller groups into larger ones or by splitting larger groups into smaller ones. The methods are then either agglomerative or divisive. The final result is a tree of nested clusters, a dendrogram, such that the complete set is represented by the root, while the leaves are the individual examples. The internal nodes are defined as the union of their children. As a result, each level of the tree represents a partition of the set into several
(nested) clusters. By cutting a dendrogram at a specified level, a clustering into disjoint groups is obtained. The way the current clusters are merged (or split) depends on the criterion which defines the (conceptual) dissimilarity between the clusters. Divisive methods often rely on constructing neighborhood graphs, such as the minimum spanning tree, and on using some principle to remove edges and create the clusters. Agglomerative methods start from a partition in which each example forms a single cluster, and proceed by a repetitive merging of the two clusters with the smallest conceptual dissimilarity until one cluster is left or a specified number of clusters is reached. Due to the sequential nature of such algorithms, i.e. objects once assigned to a cluster cannot change their label later on, they will not necessarily produce the optimal clustering, even with prior knowledge of the desired number of clusters. Hierarchical methods are often applied in Euclidean feature spaces by using the square Euclidean distance as a basic measure. The reason behind this is the interpretability of the results, since the Euclidean distance captures the (imposed) geometry between the clusters in a Euclidean space. Yet, the techniques can be applied to any dissimilarity measure.
Partitioning clustering. Partitioning methods usually operate in (Euclidean) vector spaces. They split the objects into (a priori specified) k groups according to some criterion. They are often model-based techniques. Clusters are characterized by parametric or non-parametric distributions, by some representatives or by assumed specific types of geometrical structures, such as planes, spheres, etc. The conceptual dissimilarity is then a goodness of fit of an object to the assumed cluster model. The primary difference to hierarchical methods is the need to specify k beforehand. Given a hypothesized number of clusters, a general representative-based partitioning procedure chooses the cluster representatives with some strategy. The remaining objects are then assigned to the clusters according to their conceptual dissimilarity, which may be calculated based either on the initial cluster members or on their merged versions such as the weighted average. New representatives are estimated for each cluster and the whole procedure is repeated until a stable solution is reached. Methods differ primarily in the choice of initial representatives, the assignment of objects to clusters and the estimation of representatives.
EM-clustering, based on the expectation-maximization (EM) algorithm (a general maximum likelihood optimization procedure for problems with hidden variables or missing data' [Dempster et al., 19771) is an extension of this basic approach; see also [Bilmes, 19973). It computes probabilities of cluster memberships based on the assumed probability distribution models. Then, the goal is to maximize the overall probability or the likelihood of the data, given the (final) clusters. Usually, one hypothesizes the number of clusters and the Gaussian cluster models [McLachlan and Basford, 19881, hilt other probability distributions, like multinomial, can be used. Note that in contrast t o the k-means algorithm, the EM-clustering uses 'soft' assignments (memberships) to clusters. More details on EM algorithm can be found in Appendix D.4. EM-clustering can also be realized by an iterative 'self-improving' classification. Starting from an initial partition to k clusters, a normal density based classifier is trained and the initial assignments are changed according to the new estimates of the posterior probabilities. This proceeds iteratively until stable assignments are obtained. A generic EM-clustering can be considered when any probabilistic classifier is employed instead of the normal density based classifier. One can go even further, by using any arbitrary classifier (e.g. a logistic discriminant, decision tree, support vector machine) with crisp label assignments. We will denote this approach as the classz~er-clustering,in particular, the NMC-clustering, NQC-clustering, etc. In this light, the k-means can be called the NMC-clustering. In all such approaches, one must realize and take precautions in judging the obtained partition, as the results of the EM- and classifier-clustering depend on initialization. In practice, the initial labels are often provided by another clust,ering algorithm such as a hierarchical clustering.
Cluster validity. Finding the right number of clusters to retain is difficult, since the answer depends on the scale (size of clusters) one is interested in. One usually chooses to optimize a criterion function capable of recogniz'To maximize the likelihood, the EM algorithm iterates between the E-step and t,he M-stcp until convergence. In the E-step, a posterior probability distribution on the hidden or unobserved variables is estimated, which serves for a further estimation of the model parameters in the M-step, where the likelihood is maximized. EM is usually employed for finding the parameters of the distribution estimated by a mixture of Gaussian densities (MoG). Assume K Gaussian models, where the i-th model is given as M , = { p i , C , , 7rt} and the total model structure is M = { M I ,M 2 , . . . , M K } . Thcn, the MoG is described as p(x,M ) = C,"=,7rtp(xIpi, &), where 7ri 2 0 and C , 7rz = 1. Given a population X = {XI... . , x ~ } , the optimized log-likelihood is then L L ( X ) = l o g p ( X I W = c : ~ ' l o g P ( x z , ~ r= ) z~~llo~{z~l;r,P
ing the 'correct' number of clusters. This is recognized when the optimum is reached as the criterion is evaluated for an increasing number of clusters. If the true cluster labels are known, various evaluation measures can be used. They rely on a confusion matrix, classification accuracy, average entropy or mutual information. Some cluster validity proposals can be found in [Bezdek and Pal, 1998; Fraley and Raftery, 1998; Guo et al., 2002; Halkidi et al., 2001; Fred and Leitão, 2003]. An interesting statistic used to estimate the number of clusters is the gap statistic [Tibshirani et al., 2001]. We will briefly describe it. Suppose the data are clustered into k clusters. The cardinality of the r-th cluster is n_r. Let W_k = Σ_{r=1}^k (1/(2n_r)) Σ_{i,j∈C_r} d²_ij denote the within-cluster sum of square distances, usually the square Euclidean distances computed in a vector space. The idea is to standardize the graph of log(W_k) for a growing number of clusters by comparing it to its expectation under some reference distribution of the data. The gap is then defined as gap_M(k) = E_M[log(W_k)] − log(W_k), where E_M[·] denotes the expectation over a sample of size M. E_M[log(W_k)] is estimated as an average of M copies of log(W*_k), where each of them is a Monte Carlo sample [Metropolis and Ulam, 1949] drawn from the reference distribution, which is a uniform distribution either in the range of the observed feature values or in the observed values of the principal components. The quantity s_k = std_k(log(W*_k)) √(1 + 1/M) denotes the standard deviation of the M Monte Carlo replicates corrected for the simulation error [Tibshirani et al., 2001]. Given a set of potential cluster sizes {1, 2, ..., k_max}, the cluster size k is determined as the smallest k for which gap(k) ≥ gap(k+1) − s_{k+1} holds. In probabilistic approaches to clustering, likelihood ratio measures are used; see also Appendix D.3 for more details. In the framework of the k-means and EM-clustering, new patterns can be assigned to the known clusters, so in fact, classification rules are indirectly derived. The clustering may proceed in an N-fold cross-validation fashion to determine the average distance to the cluster means for the k-means procedure or the average log-likelihood for the EM-clustering. Such values may indicate the right number of clusters in agreement with the assumed cluster distributions; see also Banfield and Raftery [1993]. For hierarchical approaches (except for the centroid linkage), the change in the dissimilarity between the merged clusters (the gap) can be inspected, as it will grow. A large value indicates that two dissimilar clusters are merged, as exploited in [Fred and Leitão, 2003].
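A minimal sketch of the gap statistic for feature vector data is given below. It is our own illustration, assuming the scikit-learn KMeans implementation and a uniform reference box over the observed feature ranges; note that for squared Euclidean distances the k-means inertia equals Σ_r (1/(2n_r)) Σ_{i,j∈C_r} d²_ij, so it can be used directly as W_k.

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=8, M=20, rng=np.random.default_rng(0)):
    """Estimate the number of clusters in feature data X with the gap statistic."""
    def log_wk(data, k):
        # k-means inertia = within-cluster sum of squares = W_k for squared Euclidean d
        return np.log(KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        ref = np.array([log_wk(rng.uniform(lo, hi, size=X.shape), k) for _ in range(M)])
        gaps.append(ref.mean() - log_wk(X, k))
        s.append(ref.std() * np.sqrt(1.0 + 1.0 / M))
    for k in range(1, k_max):                     # smallest k with gap(k) >= gap(k+1) - s_{k+1}
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k
    return k_max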
Cluster ensembles. Ideally, a clustering algorithm should possess a number of useful properties, such as the ability to discover clusters of arbitrary shapes, easily determined input parameters, the handling of noise and outliers, the ability to find the right number of clusters, interpretability and usability. The basic difficulty of clustering algorithms lies in their limitation to focus on clusters of specific shapes or structures (e.g. spherically shaped), failing to reveal clusters whose shapes do not match the assumed models. To address the above-mentioned requirements more adequately, cluster ensembles are an appealing alternative. Indeed, there has been a growing interest in studying cluster ensembles to discover clusters of variable shapes and to improve the robustness of clustering techniques. Examples of such work can be found in [Ayad et al., 2004; Fred and Jain, 2002a,b, 2003; Strehl, 2002; Strehl and Ghosh, 2002b,a; Topchy et al., 2004]. An interesting approach is to transform data partitions resulting from various clustering methods into co-associations, as proposed in [Fred and Jain, 2002a, 2003], encoding the co-occurrences of pairs of objects in the same cluster. In fact, a new higher-level similarity representation is created in this way, where each similarity value is a numerical vote towards gathering a pair of objects together. The final grouping is then derived from such similarities, e.g. by a single linkage. This approach has been shown to be able to discover clusters of arbitrary shapes. See [Fred and Jain, 2002a, 2003] for details.
Other views on clustering. The division of clustering techniques into hierarchical and partitioning methods is not the only possibility. Another way to inspect the arsenal of clustering approaches is to consider them as hard and fuzzy algorithms. In a hard clustering process, each object is allocated to a single cluster, which is indicated by a crisp label. A fuzzy clustering method assigns to each object the degrees of cluster memberships [Bezdek et al., 1999; Höppner et al., 1999]. The final crisp result is obtained by assigning objects to the clusters which yield a maximum membership degree. One may also distinguish deterministic and stochastic approaches, applicable to criterion-based minimization techniques. Stochastic approaches are able to find a near-optimal partition reasonably fast and guarantee convergence to an optimal partition asymptotically [Jain et al., 1999]. Deterministic methods are often variants of greedy descent techniques and EM, while stochastic methods often rely on simulated annealing or mean field annealing [Buhmann and Hofmann, 1995; Puzicha et al., 1999b; Jain
et al., 1999]. Another possibility is incremental versus non-incremental algorithms. The former methods are especially important for the organization of huge data sets, when they are designed to be efficient with respect to both the execution time and memory.

7.1.2
Clustering on dissimilarity representations
In this section, we will describe clustering techniques derived for dissimilarity representations. This is not a thorough investigation into the subject, but rather a brief survey and adaptation of the basic existing techniques. The dissimilarity representations will be interpreted in three frameworks: the neighborhood-based approach, the embedding approach and the dissimilarity space approach. For the sake of simplicity, symmetric representations D(R,R), R = {p_1, p_2, ..., p_n}, are considered. The elements of D are the pairwise dissimilarities d(p_i, p_j) between the objects p_i and p_j.
Neighborhood relations. The rationale is to group those objects which are characterized by small dissimilarities to other objects or which lie in close neighborhoods of the selected representatives. Concerning hierarchical clustering, agglomerative methods are popular. They begin with each object being considered as a single cluster. The closest two clusters, as judged by the conceptual dissimilarity measure ρ, are merged and this process continues until all objects belong to one cluster or a specified number of clusters k is reached. Let C_k and C_l be two clusters of the cardinalities n_k and n_l, respectively, and let ρ_kl be a measure of their dissimilarity. The clusters are combined based on the following criteria:

• Single linkage (SL). The dissimilarity ρ_kl between two clusters is the dissimilarity between their nearest neighbors, ρ_kl = min_{p_i∈C_k} min_{p_j∈C_l} d(p_i, p_j). This rule emphasizes cluster connectedness, resulting in elongated, chain-like clusters.
• Complete linkage (CL). The dissimilarity ρ_kl is defined by the furthest neighbors of the two clusters, ρ_kl = max_{p_i∈C_k} max_{p_j∈C_l} d(p_i, p_j). This usually performs well when the objects form naturally distinct clouds, since it emphasizes compactness. It is inappropriate if the clusters are somehow elongated or of a chain type.
• Average linkage (AL). The dissimilarity ρ_kl is the average between-cluster dissimilarity, ρ_kl = (1/(n_k n_l)) Σ_{p_i∈C_k} Σ_{p_j∈C_l} d(p_i, p_j). This performs well in both cases, when the objects form natural distinct clouds and when they form elongated clusters. It tends to produce clusters of a similar spread (or variance in a vector space).
• Density linkage. This criterion derives a new dissimilarity d_dens based on density estimates and adjacencies, which is then used by the single linkage clustering. For instance, in the k-nearest neighbor approach, the estimated density f(p_i) at p_i is the number of objects within the k-nearest neighbor ball divided by its volume. The new d_dens is computed as d_dens(p_i, p_j) = ½ (1/f(p_i) + 1/f(p_j)) if d(p_i, p_j) ≤ max{d_k-NN(p_i), d_k-NN(p_j)}, and ∞ otherwise.

The methods mentioned above work directly on dissimilarities. Two other popular criteria require a Euclidean vector space representation, since they work with the estimated cluster means. These are [Anderberg, 1973; Everitt et al., 2001]:

• Centroid linkage. The dissimilarity ρ_kl between two clusters is the square (Euclidean) distance between their estimated mean vectors x̄_k and x̄_l, ρ_kl = d²(x̄_k, x̄_l) = ||x̄_k − x̄_l||₂².
• Ward's linkage. In each step, the two clusters are merged that give the smallest increase in the within-cluster sum of squares, which is the sum of the squared Euclidean distances between the vectors and their cluster means, ρ_kl = (n_k n_l)/(n_k + n_l) d²(x̄_k, x̄_l) = (n_k n_l)/(n_k + n_l) ||x̄_k − x̄_l||₂². Ward's method is known to join clusters which maximize the likelihood at each level of the hierarchy under the assumptions of a mixture of normal distributions with equal spherical covariance matrices and equal sampling probabilities. It tends to create clusters of similar sizes.

The centroid linkage and Ward's linkage are mentioned here, as we will propose their generalizations for the case when only dissimilarity representations are provided. This can be done in the three interpretation frameworks:

• Generalized centroid linkage (GCL). The extension of the centroid linkage may refer to:
(a) neighborhood relations. Cluster centers are used instead of the vector means. The centers c_k and c_l of the two clusters are defined as the objects for which the maximum distance to all other objects within the clusters is minimum². The conceptual dissimilarity then becomes ρ_kl = d²(c_k, c_l).
(b) pseudo-Euclidean embedded space, by Proposition 4.4 and Eq. (4.31); the square pseudo-Euclidean distance between the cluster means can be approximated (without deriving the embedding explicitly) as ρ_kl = ρ_avr(C_k, C_l) − ½ ρ_avr(C_k, C_k) − ½ ρ_avr(C_l, C_l), where ρ_avr is the average square dissimilarity ρ_avr(C_k, C_l) = (1/(n_k n_l)) Σ_{p_i∈C_k} Σ_{p_j∈C_l} d²(p_i, p_j). This is the merging criterion.
(c) dissimilarity space. The centroid linkage is applied in a dissimilarity space. Hence, ρ_kl = ||d̄_k − d̄_l||₂², where d̄_k and d̄_l are the mean vectors of the two clusters computed in a dissimilarity space D(·, R).
• Generalized Ward's linkage (GWL). The extension of the Ward's linkage may refer to:
(a) neighborhood relations. Cluster centers are used instead of the vector means, leading to ρ_kl = (n_k n_l)/(n_k + n_l) d²(c_k, c_l).
(b) pseudo-Euclidean embedded space, by Proposition 4.4 and Eq. (4.31). Based on Eq. (4.28), the square pseudo-Euclidean distance of a single point to the mean of the cluster C_k in an embedded space is determined as d²(p_i, mean_k) = (1/n_k) Σ_{p_j∈C_k} d²(p_i, p_j) − (1/(2n_k²)) Σ_{p_j∈C_k} Σ_{p_t∈C_k} d²(p_j, p_t). The GWL criterion now relies on the estimated within-cluster sum of squares, Σ_{p_i∈C_k} d²(p_i, mean_k) = (1/(2n_k)) Σ_{p_j∈C_k} Σ_{p_t∈C_k} d²(p_j, p_t).
(c) dissimilarity space. Ward's linkage is applied in a dissimilarity space.

²Alternatively, cluster centers can be chosen as the objects which minimize the average square dissimilarity within the clusters.

If a hierarchy of nested clusters is produced in an agglomerative clustering process, a dendrogram can be built. Remember that a dendrogram is an additive distance tree (which becomes an ultrametric distance tree for the single linkage) approximating the original dissimilarity matrix. These issues were discussed in Secs. 3.2 and 6.4. In most cases, the conceptual dissimilarity (hence the merging criterion) monotonically increases as the agglomerative methods progress from many to few clusters. This holds for all the methods mentioned above except for the centroid linkage. If this happens in the latter case, a true dendrogram cannot be produced.
Concerning the implementation issues, a general recurrence formula for agglomerative clustering methods has been developed. The iterative approach starts with N clusters, each consisting of a single object, and a given dissimilarity matrix between all object pairs. This is the initial conceptual dissimilarity between the clusters, i.e. ρ_kl = d(p_k, p_l). At any level of the hierarchy, the conceptual dissimilarity between a newly created cluster and other existing clusters can be computed from the current grouping by using the following recurrence formula:

  ρ_{i,(kl)} = α_k ρ_ik + α_l ρ_il + β ρ_kl + γ |ρ_ik − ρ_il|,    (7.1)

where ρ_kl is the dissimilarity between the clusters C_k and C_l, ρ_{i,(kl)} is the dissimilarity between the cluster C_i and the new cluster formed by joining C_k and C_l together, and α_k, α_l, β and γ are constants which are set for a particular hierarchical method. Table 7.1 shows how these parameters are set for some clustering approaches. See [Lance and Williams, 1967; Everitt et al., 2001] for details.

Table 7.1 Parameters for the recurrence formula Eq. (7.1) for the hierarchical clustering methods.

  Clustering method    α_k                        α_l                        β                        γ
  Single linkage       1/2                        1/2                        0                        −1/2
  Complete linkage     1/2                        1/2                        0                        1/2
  Average linkage      n_k/(n_k+n_l)              n_l/(n_k+n_l)              0                        0
  Centroid linkage     n_k/(n_k+n_l)              n_l/(n_k+n_l)              −n_k n_l/(n_k+n_l)²      0
  Ward's linkage       (n_i+n_k)/(n_i+n_k+n_l)    (n_i+n_l)/(n_i+n_k+n_l)    −n_i/(n_i+n_k+n_l)       0

Concerning partition methods, the k-centers [Ypma et al., 1997] and the mode-seeking [Cheng, 1995] will be described.
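To make Eq. (7.1) concrete, the sketch below (our own illustration, not code from the book) performs one agglomeration step with the average-linkage parameters from Table 7.1, i.e. α_k = n_k/(n_k+n_l), α_l = n_l/(n_k+n_l) and β = γ = 0.

import numpy as np

def merge_average_linkage(rho, sizes, k, l):
    """One agglomeration step of Eq. (7.1) for the average linkage:
    merge clusters k and l of the conceptual dissimilarity matrix rho."""
    a_k = sizes[k] / (sizes[k] + sizes[l])
    a_l = sizes[l] / (sizes[k] + sizes[l])
    new = a_k * rho[k] + a_l * rho[l]             # beta = gamma = 0 for the average linkage
    rho = rho.copy()
    rho[k, :] = new
    rho[:, k] = new
    rho[k, k] = 0.0
    sizes = sizes.copy()
    sizes[k] = sizes[k] + sizes[l]
    rho = np.delete(np.delete(rho, l, axis=0), l, axis=1)   # drop the absorbed cluster l
    sizes = np.delete(sizes, l)
    return rho, sizes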
k-centers. This technique works directly on a dissimilarity representation D(R,R). It looks for k objects from R such that they are approximately evenly distributed with respect to the dissimilarity information. The algorithm proceeds as follows:
1. Select an initial set J = {p_1^(0), p_2^(0), ..., p_k^(0)} of k objects, e.g. randomly chosen from R.
2. For each object p_i ∈ R find its nearest neighbor in J. Let J_i be the subset of R consisting of the objects that yield the same nearest neighbor p_i^(0) in J, i = 1, 2, ..., k. This means that R = ∪_{i=1}^k J_i.
3. For each J_i find its center c_i, i.e. an object in J_i for which the maximum distance to all other objects in J_i is minimum (this value is called the radius of J_i).
4. For each center c_i, if c_i ≠ p_i^(0), then replace p_i^(0) by c_i in J. If any replacement is done, then return to step 2, otherwise STOP.
Except for step 3, this routine is identical to the k-means algorithm performed in a vector space. The result of the k-centers procedure heavily depends on the initialization. For that reason, we use it with precautions. To determine the set J of k objects, we start from a chosen center for the entire set and then, gradually, more centers are added. At any point, a group of objects belongs to each center. J is enlarged by splitting the group with the largest radius into two and replacing its center by two other members of that group. This stops when k centers are determined. The entire procedure is repeated M times (say 50), resulting in M potential sets, from which the one yielding the minimum of the largest final subset radius is selected. Note that if we continue with the splits, the k-centers may become a hierarchical divisive method.
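A compact sketch of this k-centers loop in NumPy is given below. It is our own illustration of steps 1-4 (the more careful initialization by repeated splitting described above is omitted), and the function name k_centers is an assumption.

import numpy as np

def k_centers(D, k, rng=np.random.default_rng(0)):
    """Return indices of k center objects for an (N x N) dissimilarity matrix D."""
    N = D.shape[0]
    centers = list(rng.choice(N, size=k, replace=False))     # step 1: random initialisation
    while True:
        labels = np.argmin(D[:, centers], axis=1)             # step 2: nearest center in J
        new_centers = []
        for i in range(k):
            members = np.where(labels == i)[0]                # assumed non-empty
            sub = D[np.ix_(members, members)]
            # step 3: the center minimises the maximum distance within its group
            new_centers.append(members[np.argmin(sub.max(axis=1))])
        if set(new_centers) == set(centers):                  # step 4: stop when stable
            return new_centers
        centers = new_centers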
Mode-seeking.
The mode-seeking technique [Cheng, 1995] focuses on modes in dissimilarity data, which are determined by focusing on a specified neighborhood size s. The algorithm proceeds as follows:
1. Set a relative neighborhood size to an integer s > 1.
2. For each object p_i ∈ R find the dissimilarity d(p_i, nn_s(p_i)) to its s-th nearest neighbor.
3. Find a set J consisting of all p_i ∈ R for which d(p_i, nn_s(p_i)) is minimum within the set of the s nearest neighbors of p_i.
The objects from the set J are the estimated modes of the class distribution in terms of the given dissimilarities. The final number of clusters k depends on the choice of s: the larger s, the smaller k.
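A minimal NumPy sketch of this mode-seeking procedure follows; it is our own illustration and the function name mode_seek is an assumption.

import numpy as np

def mode_seek(D, s):
    """Estimated modes for an (N x N) dissimilarity matrix D and a neighbourhood size s > 1."""
    N = D.shape[0]
    order = np.argsort(D, axis=1)                 # neighbours sorted per object
    d_s = D[np.arange(N), order[:, s]]            # distance to the s-th nearest neighbour
    modes = []
    for i in range(N):
        neigh = order[i, :s + 1]                  # the object itself and its s nearest neighbours
        if d_s[i] <= d_s[neigh].min():            # minimal s-NN distance within its neighbourhood
            modes.append(i)
    return modes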
Embedded spaces. Symmetric dissimilarity representations can be embedded into complete or approximated (pseudo-)Euclidean spaces, where standard partition methods, such as the k-means and the classifier-clustering, can be used. The embedding may focus on the preservation of all original dissimilarities or on the preservation of the dissimilarities in local neighborhoods only. Such spaces are determined by the use of multidimensional scaling methods or other techniques, such as Isomap or locally linear embedding, described in Sec. 6.2. In fact, by performing an approximate embedding, some information, possibly reflecting the noise in the data, is neglected. This might be seen as a purification of the dissimilarity infor-
information³. Here, we will use an approximate linear embedding to a pseudo-Euclidean space.
Dissimilarity spaces. Traditional clustering algorithms can be applied in a dissimilarity vector space. A reduced representation D(R, R_r), where R_r ⊂ R, is recommended both from the efficiency (computational) and the representational (using only informative objects as the representatives) point of view. The cardinality of R_r can be specified either as a fraction of the total number of objects, such as 5–20% of |R|, depending also on the hypothesized number of clusters to be retrieved, or as an estimated intrinsic dimension of D(R, R). R_r can be selected randomly or by using the k-centers or mode-seeking procedures. Additionally, to ensure that the objects in R_r convey various dissimilarity information, they may be chosen in the following way. First, for each object in R, the average dissimilarity to all other objects is computed, resulting in the sequence a_i = a(p_i) = (1/|R|) Σ_{p_z ∈ R} D(p_i, p_z). This sequence is then sorted in a decreasing order and the objects corresponding to each q-th value of this sequence are selected. The first few objects may be disregarded as possible outliers. We will call this a sparse average selection. Alternatively, one may also retrieve principal components from the dissimilarity space (treating it as a usual vector space), reflecting e.g. 90% of the variance. We will call the resulting space a PCA-dissimilarity space.
Generic EM-clustering or classifier-clustering approaches in a dissimilarity space are advantageous for reasonably-well sampled clusters of significantly different radii (i.e. the maximum dissimilarity between the objects within a cluster), or where at least one cluster is very sparse in comparison to other compact clusters. In such cases, the neighborhood-based clustering approaches (e.g. AL or CL hierarchical clusterings or the k-centers) tend to fail. In a (reduced) dissimilarity space, clusters might be well separable. Note, however, that if the dissimilarity between two objects does not capture the cluster characteristics, the dissimilarity space will not help in detecting such clusters⁴.
³It is also possible to re-compute the dissimilarity representation derived from the approximate embedding. Hence, the embedding can be treated as a de-noising step in obtaining a more discriminative dissimilarity representation, which will be further used by neighborhood-based clustering approaches.
⁴Imagine e.g. artificial banana data in a two-dimensional space with a Euclidean distance representation, Fig. E.2. The curved banana clusters will be even more pronounced in a distance space, so no EM-clustering algorithm would be able to find such a structure without a perfect initialization. To detect curved structures, the dissimilarities should be recomputed appropriately, e.g. along a path.
A possible solution is to consider a flexible nonlinear monotonic transformation of the given dissimilarities. The role of such a transformation is to emphasize the importance of local neighborhoods and diminish the influence of outliers. An example of such a function is a sigmoid, f_sigm(x) = 2/(1 + exp{−x/s}) − 1, applied in an element-wise way. Such a transformation will change the neighborhoods perceived in a dissimilarity space, although it will not influence the methods based on neighborhood relations, such as hierarchical methods or the k-centers. Another transformation that may enhance local neighborhoods derives new dissimilarities as d'_{ij} = d_{ij} − ½ (Σ_k d_{ik} + Σ_k d_{kj}).
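As an illustration of the element-wise sigmoidal rescaling mentioned above, a small NumPy sketch follows; the scale parameter s and the function name are assumptions made for the example.

```python
import numpy as np

def sigmoid_transform(D, s):
    """Element-wise sigmoidal transformation of a dissimilarity matrix.

    Maps every dissimilarity d >= 0 into [0, 1), compressing large values
    and thereby reducing the influence of outliers. Because the mapping is
    monotonic, the ranking of the dissimilarities (and hence the nearest
    neighbor relations) is preserved.
    """
    D = np.asarray(D, dtype=float)
    return 2.0 / (1.0 + np.exp(-D / s)) - 1.0
```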
Related work. An interesting approach to general proximity-based (neighborhood-based) clustering, both partitioning and hierarchical, has been advocated by Buhmann and colleagues, where clustering is formulated as a combinatorial optimization problem; see e.g. [Buhmann and Hofmann, 1994, 1995; Hofmann and Buhmann, 1997; Puzicha et al., 1999b]. The authors specify an objective function, incorporating a suitably weighted average of the within-cluster and between-cluster dissimilarities, and derive some optimization heuristics based on annealing. Another idea, proposed in [Fischer et al., 2001], discusses a path-based pairwise clustering, which emphasizes the within-cluster connectivity by the use of graph methods. Two objects are considered as similar if there exists a within-cluster path between them without any edge of a large dissimilarity. As a result, a new dissimilarity is developed that is further used for grouping. Another proximity-based algorithm, called evidential clustering (EVCLUS), is proposed in [Denoeux and Masson, 2004]. The method relies on evidence theory and attaches to each object a mass function such that the degree of conflict between the masses of any two objects reflects their proximity. This proximity is measured by a suitable stress function from metric multidimensional scaling (see Sec. 3.6.2). Practically, it relies on the optimization of the stress function penalized by an entropy measure, added to prevent the resulting model from being too complex. The application of spectral graph theory to the clustering problem results in spectral clustering algorithms; see e.g. [Ng et al., 2002; Belkin and Niyogi, 2002a]. Such procedures rely on finding the eigenvectors of some similarity matrix derived from a vectorial (feature-based) representation of a set of objects. This is a suitably scaled Gaussian similarity matrix (based on the Euclidean distances). An interesting property of spectral clustering is the ability to pull out non-convex or even disjoint clusters. In the final stage, however, partitioning clustering algorithms perform the final grouping.
Figure 7.1 Intensity images of the permuted protein dissimilarity representation. The panels show a random permutation of the original data, the visual assessment of cluster tendency (VAT) algorithm, which reorders the data items with respect to the within-cluster dissimilarity, the true classes present in the data, and groupings into 2 to 7 clusters. The data objects are grouped by the NQC-clustering (a MoG-clustering) in the PCA-dissimilarity space. To make an intensity image, the detected clusters are shown in the order of a growing within-cluster average dissimilarity, which is a simple visualization proposed by us below. From the intensity image representing the two-cluster grouping, one may already expect that two more clusters are present.
The specification of σ in the Gaussian function, as well as the number of clusters, are the main questions to be solved. In conclusion, such algorithms determine a specific embedded Euclidean space of an appropriately transformed similarity representation, which is later used by traditional partitioning clustering methods.
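The spectral route sketched above can be illustrated as follows. This is a generic sketch in the spirit of [Ng et al., 2002], not the exact algorithm of the cited papers; the choice of σ, the number of clusters k and the use of k-means in the final stage are assumptions of the example.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(X, k, sigma):
    """Generic spectral clustering of feature vectors X (n x d).

    Builds a Gaussian similarity matrix, normalizes it, takes the k
    leading eigenvectors and runs k-means on the row-normalized
    spectral coordinates.
    """
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    L = W / np.sqrt(np.outer(d, d))                       # D^{-1/2} W D^{-1/2}
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, -k:]                                      # k leading eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)      # row normalization
    _, labels = kmeans2(U, k, minit='++')
    return labels
```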
Visual cluster validity. Since clustering is subjective, one must not forget to inspect the results visually. The most appealing approach is to represent the dissimilarity matrix D as a gray intensity image, in which each pixel value corresponds to the dissimilarity between a pair of
objects. To observe the detected clusters, D should be permuted according to the cluster assignments. If a fuzzy or soft clustering method is used, then the objects within a cluster can be sorted based on their membership values. To keep it simple and general, we propose to permute the objects within one cluster based on their growing average dissimilarity to all other objects. Assume that P is the final permutation matrix. Hence, one needs to display PDPᵀ as an intensity image. Example displays for a growing number of clusters are shown in Fig. 7.1, where the results for the protein dissimilarity data, grouped by the NQC-clustering in the PCA-dissimilarity space, are presented; see the next section for details. A more profound way to visualize cluster validity was proposed in [Bezdek and Hathaway, 2002; Hathaway and Bezdek, 2003], called the visual assessment of cluster tendency (VAT). Additionally, one may analyze the clustering results by labeling the objects accordingly, while presenting them in the two- or three-dimensional spatial maps obtained by multidimensional scaling or other techniques.
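A small sketch of this permuted-intensity-image idea: the ordering rule (clusters sorted by growing within-cluster average dissimilarity, objects within a cluster sorted by their average dissimilarity) follows the description above, while the plotting call is an assumption of the example.

```python
import numpy as np
import matplotlib.pyplot as plt

def permuted_image(D, labels):
    """Return the permutation and the reordered dissimilarity matrix.

    Clusters are placed in the order of growing within-cluster average
    dissimilarity; within each cluster, objects are sorted by their
    growing average dissimilarity to all other objects.
    """
    avg = D.mean(axis=1)
    clusters = np.unique(labels)
    within = [D[np.ix_(labels == c, labels == c)].mean() for c in clusters]
    order = []
    for c in clusters[np.argsort(within)]:
        members = np.where(labels == c)[0]
        order.extend(members[np.argsort(avg[members])])
    perm = np.array(order)
    return perm, D[np.ix_(perm, perm)]

# usage: show the reordered matrix as a gray intensity image
# perm, Dp = permuted_image(D, labels)
# plt.imshow(Dp, cmap='gray'); plt.show()
```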
7.1.3 Clustering examples for dissimilarity representations
Four dissimilarity data sets are considered here for which the true labels are known: the 400 × 400 Euclidean distance representation of the artificial two-class ringnorm data describing two somewhat overlapping Gaussian clouds in a 20-dimensional space, the 65 × 65 cat-cortex dissimilarity data (four classes), the 213 × 213 protein dissimilarity data (four classes) and the 400 × 400 newsgroup correlation-based dissimilarity data News-cor2 (four classes); see Appendix D.4.5 for the data description. The protein dissimilarity data set is nearly Euclidean, while the cat-cortex data and the newsgroup data are non-Euclidean. Our assumption is that the dissimilarity measure used is able to capture the underlying cluster differences, so we should perceive the clusters as Gaussian-type clouds either in embedded or in dissimilarity spaces. We assume that the number of clusters is known and we will try to find out whether the true classes given in the data can be detected. The following clustering methods are used: evidential clustering (EVCLUS) [Denoeux and Masson, 2004], the standard hierarchical clusterings such as single linkage (SL), average linkage (AL) and complete linkage (CL), the k-centers, mode-seeking and the NQC-clustering (which is a mixture-of-Gaussians EM-clustering for soft labels), the latter both in a pseudo-Euclidean embedded space and in a dissimilarity space. The NQC has been chosen since it is an appropriate classifier for detecting all types of Gaussian-like clusters. To avoid singular covariance matrices for small clusters, the NQC is
Figure 7.2 Eigenvalues determined by linear embeddings of the four dissimilarity data sets (Ringnorm, Cat-cortex, Protein and Newsgroup). The number of the most significant eigenvalues describes the effective intrinsic dimension; here, they lie above the black horizontal line. For the Ringnorm distance data, only the first 20 non-zero eigenvalues are shown.
slightly regularized with λ = 10⁻⁶; see Sec. 4.4.2 for details. The dimension of an embedded space is chosen based on a small number of significant eigenvalues determined in the embedding. The dissimilarity space D(R, R) is reduced to D(R, R_r) by the sparse average selection, as described above. We will denote this procedure as the NQC-clustering in a dissimilarity space. Another possibility is to extract the largest principal components in the dissimilarity space. Here, as a default, the dimension is chosen based on the preservation of 90% of the total variance. We will denote this approach as the NQC-clustering in a PCA-DS. If a square dissimilarity is used instead, it will be indicated by DS*2. Squaring the dissimilarities might be useful if one expects clusters of different spreads: small-spread clusters will become more compact and large-spread clusters will become even more spread. This might be beneficial for the NQC. EVCLUS is used here, since it was applied to the cat-cortex and protein data in [Denoeux and Masson, 2004]. The authors claim that their fuzzy-like method performs the same or much better than other state-of-the-art fuzzy techniques. Since EVCLUS is initialization-sensitive, we followed the authors' suggestions by running their code [Denoeux et al.] 50 times
and determining the final result as the one for which their penalized stress objective function was minimum. In this way, our results will be compared to a good method. Since the NQC-clustering depends on an initial labeling, a criterion is needed for the selection of the final result. Let k be the number of clusters and n_i be the i-th cluster cardinality, with n being the total number of objects. Inspired by [Puzicha et al., 1999b], we propose to use a goodness-of-clustering measure J_GOC that relates the cluster separability to the cluster compactness, where the separability is expressed through A_ij, the average dissimilarity between the i-th and j-th clusters, and the compactness through A_ii, the average within-cluster dissimilarity of the i-th cluster. In our approach, the NQC-clustering is run 50 times in the chosen embedded or dissimilarity spaces. The final result is chosen as the one corresponding to the maximum of J_GOC.
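The exact formula of J_GOC is not reproduced here; the sketch below only assumes, as suggested above, that the score rewards large between-cluster average dissimilarities relative to the within-cluster ones. The specific functional form is an assumption of the example.

```python
import numpy as np

def goodness_of_clustering(D, labels):
    """A plausible separability/compactness score for a grouping.

    A[i, j] is the average dissimilarity between clusters i and j
    (A[i, i] is the average within-cluster dissimilarity, with the zero
    diagonal included for simplicity). The score averages, over cluster
    pairs, how much the between-cluster dissimilarity exceeds the two
    within-cluster ones.
    """
    clusters = np.unique(labels)
    k = len(clusters)
    A = np.zeros((k, k))
    for a, ca in enumerate(clusters):
        for b, cb in enumerate(clusters):
            A[a, b] = D[np.ix_(labels == ca, labels == cb)].mean()
    score = 0.0
    for a in range(k):
        for b in range(a + 1, k):
            score += A[a, b] - 0.5 * (A[a, a] + A[b, b])
    return score / (k * (k - 1) / 2)
```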
Other clustering methods provide deterministic results. Only in the case of mode-seeking, a proper neighborhood size should be detected to retrieve the specified number of clusters. In our clustering approaches, the number of clusters k is assumed to be known. Concerning the k-centers, hierarchical clustering and mode-seeking algorithms, a larger number of clusters is sometimes retrieved, as these methods either suffer from the presence of outliers (objects with large dissimilarities) or have difficulties to accommodate sparse clusters. For the NQC-clustering in embedded and PCA-dissimilarity spaces, we notice that the results depend on the space dimension. In our understanding, the dimension should be chosen close to the effective intrinsic dimension of the problem, i.e. the smallest dimension which can reveal the structure in the data. (Note that this reasoning is valid for clustering, but not necessarily for classification.) Since in an embedded space and a PCA-dissimilarity space the determined dimensions depend on the dissimilarities to all objects, a small dimension is preferred. For each dissimilarity data set, all eigenvalues of a pseudo-Euclidean embedding are found and plotted, as shown in Fig. 7.2. The dimension of an embedded space is chosen according to the number of dominant eigenvalues, which are the eigenvalues that lie apart from the 'continuous stream' of eigenvalues. By our visual judgement, the effective intrinsic dimension is
Figure 7.3 Clustering results of the Ringnorm Euclidean distance data, visualized by a proper labeling of the 2D Sammon map S0 obtained on the unlabeled data. The objects are labeled according to the specified clustering algorithms (EVCLUS, CL clustering, the k-centers, mode-seeking, the NQC-clustering in a PCA-DS*2 and the NQC-clustering in a DS*2); TRUE stands for the true class labels and the first two features of the Gaussian clusters are shown in the top leftmost plot. The number of clusters is fixed to 2; however, for the CL and mode-seek clusterings, the results for three clusters are presented, because the two-cluster groupings find one tiny cluster of a few objects only. Note that the Sammon map is only a visualization of the original 20-dimensional space. See text for details.
chosen to be: 10 for the ringnorm data, 6 for the cat-cortex data, 4 for the protein data and 12 for the newsgroup News-cor2 data. We admit that this is not the best approach, since it is not automatic, but it makes intuitive sense. We would need to develop an automatic procedure, based e.g. on a spline interpolation of the eigenvalue plot, for which the change in the speed of its decline (from fast to moderate steepness) should be determined. This requires future attention. To simplify our procedures, the same dimensions as reported above were used for the PCA-dissimilarity space. Note that they
Figure 7.4 Clustering results of the cat-cortex dissimilarity data, visualized by a proper labeling of the 2D Sammon map S0 obtained on the unlabeled data. The objects are labeled according to the specified clustering algorithms (the k-centers, AL clustering, mode-seeking and the NQC-clustering in a PCA-DS*2); TRUE stands for the true class labels. The number of clusters is fixed to 4. See text for details.
might not be optimal at all in this space. Concerning the ringnorm Euclidean distance data, the two Gaussian clusters are not discovered by the EM-clustering algorithms in the initial Euclidean space. This is caused by the sparseness of one of the clusters. Presumably, the path-based clustering or the spectral clustering should be able to detect these clouds. In a dissimilarity space, however, the clusters are better separated, since the distances to the objects from the compact cloud are discriminative for both clusters. The clustering results of the applied algorithms are presented in Fig. 7.3. The cat-cortex dissimilarity data set is challenging. It is not only small, but the dissimilarities are also ordinal instead of continuous. This makes it hard to build an NQC in both embedded and dissimilarity spaces. In
Figure 7.5 Clustering results of the protein dissimilarity data, visualized by a proper labeling of the 2D classical scaling representation obtained on the unlabeled data. The objects are labeled according to the specified clustering algorithms (EVCLUS, the k-centers, CL clustering, mode-seeking, the NQC-clustering in the embedded space and the NQC-clustering in a PCA-DS*2); TRUE stands for the true class labels, shown on the classical scaling map and on the Sammon map S0. The number of clusters is fixed to 4; however, for the CL hierarchical clustering, the results for nine clusters are presented, because the groupings found for fewer clusters detect two clusters only. See text for details.
fact, the dissimilarities should be de-noised or smoothed out. Since the dissimilarities are not very discriminative (only five different ordinal values are used: 0, 1, 2, 3 and 4), the task becomes difficult. Some of the clustering results are illustrated in Fig. 7.4. The protein dissimilarity data are reasonably well clustered, so the clusters can be recovered. Clustering results are presented in Fig. 7.5. The newsgroup dissimilarity data are defined on weak and poorly informative vectors, namely word occurrence vectors. So, the resulting dissimilarities are not discriminative between the clusters and it is difficult to discover them properly. The within-cluster dissimilarities of the 'sci.*' news group
Figure 7.6 Intensity images of the newsgroup News-cor2 dissimilarity data. On the left, the data objects are permuted according to the true cluster memberships (TRUE). On the right, the visual assessment of cluster tendency (VAT), meant to detect clusters in the dissimilarity data, is shown. These intensity images suggest that there is no strong structure in the dissimilarity data.
are of the same order as the between-cluster dissimilarities; see Fig. 7.6. Consequently, the majority of these objects is assigned to other clusters. Our clustering results are presented in Fig. 7.7. The overall numerical results are shown in Table 7.2. It can be observed that EVCLUS indeed performs well, provided that a suitable trade-off parameter is chosen. If the parameter deviates from the optimal value, very bad results are found. An additional disadvantage of EVCLUS is the high computational burden. For instance, for the protein dissimilarity data, the task of 50 groupings takes about 90 minutes, while the NQC-clustering (with the embedding included) takes about 0.5 min, both run in Matlab. The authors of EVCLUS reported in [Denoeux and Masson, 2004] that their algorithm competes with other state-of-the-art fuzzy clustering algorithms. We must, therefore, report that our handcrafted NQC-clustering approach in a PCA-dissimilarity space (or in an embedded space) performs similarly to or better than EVCLUS (especially for the ringnorm data). Although our results are preliminary, they indicate that more may be gained if the methods are improved further by designing an automatic selection of the parameters.
7.2 Intrinsic dimension
If a certain phenomenon or process can be described or generated by k independent variables, then its intrinsic dimension is k. In practice, however, due to noise, inaccuracy in measurement devices or procedures, or other uncontrolled factors, more variables may seem to be needed to characterize
Figure 7.7 Clustering results of the newsgroup News-cor2 dissimilarity data, visualized by a proper labeling of the 2D classical scaling representation obtained on the unlabeled data. The objects are labeled according to the specified clustering algorithms (EVCLUS, the k-centers, AL clustering, mode-seeking, the NQC-clustering in the embedded space and the NQC-clustering in a PCA-DS*2); TRUE stands for the true class labels, shown on the classical scaling map and on the Sammon map S0. The number of clusters is fixed to 4.
such a phenomenon. If all these factors are not 'too prominent' to completely disturb the original phenomenon, one should be able to re-discover the 'true' number of significant variables. Intrinsic dimension is then defined as the minimum number of variables that explain the phenomenon in a satisfactory way. In pattern recognition, one usually discusses intrinsic dimension with respect to a collection of data vectors in a feature vector space. Intrinsic dimension (of a given problem) may then be specified as the minimum number of features needed to obtain a similar classification performance as by using all of them. In unsupervised learning, intrinsic dimension is defined as the number of independent parameters characterizing the data. Usually, intrinsic dimension is determined by the use of (nonlinear)
Table 7.2 Clustering results as compared to the true class assignments for the four dissimilarity data sets. The numbers describe the absolute number of mismatches; hence, a small value indicates a faithful grouping.
Clustering method                  Ringnorm   Cat-cortex   Protein   News
TRUE                                              65          213     600
EVCLUS                                 2                               190
Hierarchical clustering               10                               195
k-centers                             14                               402
Mode-seek                             38                               348
NQC-clustering in PE                   4                               229
NQC-clustering in PCA-DS               2                               253
NQC-clustering in PCA-DS*2             5                               199
NQC-clustering in reduced DS*2        23                               298
feature reduction techniques, performed either by selection or extraction. In a geometrical sense, this intrinsic dimension can be defined as the dimension of a manifold that approximately (due to noise) embeds the data. In practice, the estimation of intrinsic dimension depends on a chosen criterion (e.g. whether one searches for a linear or a nonlinear manifold) and may vary between the criteria. Moreover, intrinsic dimension depends also on the scale, resolution or interpretation we consider. For instance, the intrinsic dimension of a noisy spiral in a three-dimensional space can be 2 if we focus on small local neighborhoods, 1 if we are able to detect a one-dimensional curve, and 3 if we notice that the data are represented in all three dimensions. Therefore, the estimated intrinsic dimension will likely change with a growing neighborhood size. Although the notion of intrinsic dimension is conceptually clear, it is ambiguous in practice, as one may define it not only relatively to the task, but also differently with respect to topology and geometry, and with respect to linear or nonlinear structure at different scales. Some ideas in this direction can be found in [Bruske and Sommer, 1997; Pettis et al., 1979; Duin and Verveer, 1995]. At the moment of finishing this book, we have also become aware of the proposal of [Kégl, 2002], who discusses the use of packing numbers for intrinsic dimension estimation in metric spaces. This may be an interesting approach to follow. In our case, we are concerned with dissimilarity representations. While studying them, we may choose an embedding approach and try to estimate the intrinsic dimension of the embedded space, or perform dimension reduction in a dissimilarity space. Examples of such techniques are briefly explained in Chapter 6, where projections of dissimilarity data are dis-
Figure 7.8 Estimated intrinsic dimension (top row) and variance (bottom row) for various Gaussian samples drawn from N(0, 2I), shown as functions of the sample size, without and with added noise. Different marks correspond to different dimensionalities, as described in the legend.
cussed. Here, we will focus on two linear techniques: the pseudo-Euclidean embedding of a symmetric dissimilarity representation D(R, R) and principal component analysis (PCA) applied in a dissimilarity space. Intrinsic dimension estimated by these methods will be judged globally in linear subspaces with respect to the dissimilarity information. In the embedding process, the number of dominant eigenvalues can be used to estimate intrinsic dimension. Given labeled data, intrinsic dimension might be judged not only for the complete set, but also for each class separately. Given unlabeled data, one may first determine meaningful groups and then proceed as in the labeled case. In this particular embedding, the estimated intrinsic dimension cannot be larger than the total number of objects considered. Additional theory concerning eigenvalue spectra of covariance and Gram matrices that may be used for the estimation of intrinsic dimension can be found in the work of [Hoyle and Rattray, 2003a,b, 2004a,b].
Statistical estimation of intrinsic dimension for a Gaussian sample. We will first focus on the Euclidean distance representation of a Gaussian sample. Assume a Euclidean m-dimensional vector space.
Let X be a normally distributed variable with a zero mean vector, zero covariances and equal variances in all dimensions; hence, X ~ N(0, (σ²/2) I). Consider now the square Euclidean distance variable τ, which for two realizations x_k and x_l of X is given as τ_kl = Σ_{i=1}^m (x_{ki} − x_{li})². Since y = τ/σ² is χ²_m distributed⁵, after straightforward calculations one obtains that E[y] = m and E[y²] = m² + 2m, where E[·] denotes expectation. Hence, E[τ] = m σ² and, similarly, E[τ²] = (m² + 2m) σ⁴. Using these results, we find that

   m = 2 (E[τ])² / (E[τ²] − (E[τ])²)   and   σ² = E[τ] / m.
As a result, both the dimension m and the variance of the spherical Gaussian variable X can be estimated from the square Euclidean distance variable τ only. Given a sample of X, i.e. a finite set of examples X = {x_1, x_2, ..., x_n}, and the corresponding square Euclidean distance matrix D*², E[τ] and E[τ²] are computed as E[τ] = (1/(n(n−1))) 1ᵀD*²1 and E[τ²] = (1/(n(n−1))) 1ᵀD*⁴1. Consequently, the dimension m and the variance σ² can be estimated as

   m̂ = 2 (1ᵀD*²1)² / ( n(n−1) 1ᵀD*⁴1 − (1ᵀD*²1)² ),
                                                                  (7.4)
   σ̂² = (1/2) ( 1ᵀD*⁴1 / 1ᵀD*²1 − 1ᵀD*²1 / (n(n−1)) ).
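A small NumPy sketch of these moment-based estimates, as reconstructed in Eq. (7.4); the function name is illustrative and D is assumed to be an n × n matrix of (non-squared) Euclidean distances with a zero diagonal.

```python
import numpy as np

def intrinsic_dim_and_variance(D):
    """Estimate the dimension and variance of a spherical Gaussian sample
    from its Euclidean distance matrix, following Eq. (7.4).

    Uses the first two moments of the squared distances, estimated from
    the off-diagonal entries of D (the zero diagonal does not contribute).
    """
    n = D.shape[0]
    ones = np.ones(n)
    s2 = ones @ (D ** 2) @ ones            # 1^T D*2 1
    s4 = ones @ (D ** 4) @ ones            # 1^T D*4 1
    m_hat = 2.0 * s2 ** 2 / (n * (n - 1) * s4 - s2 ** 2)
    var_hat = 0.5 * (s4 / s2 - s2 / (n * (n - 1)))
    return m_hat, var_hat
```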
The goodness of these estimates is illustrated in Fig. 7.8, where Gaussian samples drawn from N(0, 2I) in vector spaces of various dimensions are considered. The results are good even for a small sample (with respect to the space dimension): from n × n Euclidean distance matrices (for n > 10), an estima-
⁵A basic statistical fact is that given m independent, one-dimensional variables Y_i ~ N(0, 1), the variable y = Σ_{i=1}^m Y_i² is χ²_m distributed with m degrees of freedom [Wilks, 1962]. The probability density function of χ²_m is p_{χ²_m}(y) = y^{m/2−1} exp{−y/2} / (2^{m/2} Γ(m/2)), where Γ(g) = ∫₀^∞ t^{g−1} exp{−t} dt. If Z_i ~ N(0, σ²), then Z = Σ_{i=1}^m Z_i² = σ² Σ_{i=1}^m Y_i² = σ² y. Consequently, Z is distributed according to σ²χ²_m. Consider now two independent m-dimensional variables X, Y ~ N(0, (σ²/2) I). This means that for the one-dimensional variables X_i and Y_i, we have X_i, Y_i ~ N(0, σ²/2). The square Euclidean distance τ between X and Y is given as τ = Σ_{i=1}^m (X_i − Y_i)² = Σ_{i=1}^m X_i² + Σ_{i=1}^m Y_i² − 2 Σ_{i=1}^m X_i Y_i. Therefore, τ is σ²χ²_m-distributed, since the variables X_i and Y_i are independent.
tiori I ~ Lsufficiently , closc to the true value rn is found. Also in the presence of noise the estimation of intrinsic dimension is not significantly affected, although the estimated variance is (it becomes larger). n variable X - n / ( O , $1). Then the square EuAssume now a Gau C J ; K ~ where , ~i x:. Conclidean distance variable r is described by sequently, 7 is a linear combination of xf distributions with one degree of freedom. Note that if cr; E a;, t,lieri K ~ + K ~ 2 a4x:. One can, therefore, de-
czl
N
-
O ’ K ~ = CJ’ Ef=, ~ i where , = CE, CJ? scribe r approximately as and k: is equal to rn or less depending on the number of dominant variances 0:. So. 7 is approximately distributed as 0 ~ x 2 . Effectively, the number of degrees of frcedom is determined by the dorninaiit variances” 111 summary, if data points originate from a general normal distribution, still a sort of an average intrinsic dimension can be roughly estimated by the use of Eq. (7.4). This will he, however, influenced by the largest variances iii tlie data. Basically, the hyper-ellipsoidal data will be judged by its volume (dcterniined by the given dissimilarities) and the derived intrinsic dimension will reflect the dimension of a hypersphere of the same volume. In grrieral, the above formula can be applied to any dissimilarity measure.
k
CJ‘
Examples. We will illustrate the difficulty of determining the 'true' intrinsic dimension for square dissimilarity representations. Here, we are only concerned with data describing a single class. Even though artificial Gaussian samples are considered, they already give some indication of the difficulties to be met in real problems, especially when different dissimilarity measures are used. Three different Gaussian samples are drawn in a 30-dimensional space:
1. Case 1: a Gaussian sample N(0, diag(v)), where v is such that v_{1,2,3} = 10 and v_i = 1 for i = 4, ..., 30.
2. Case 2: a Gaussian sample N(0, diag(v)), where v is such that v_i = 5 for i = 1, ..., 15 and v_i = 1 for i = 16, ..., 30.
3. Case 3: a Gaussian sample N(0, diag(v)), where v is such that v_i = 5 for i = 1, ..., 10, v_i = 2 for i = 11, ..., 20, and v_i = 1 for i = 21, ..., 30.
In all cases the 'true' intrinsic dimension is 30, since the data are generated by 30 variables.
⁶For instance, let X ~ N(0, diag(v)), where v = [3 3 3 3 3 3 3 1 1 1]ᵀ. Then, by the above formulation, σ_i² = 2v_i. The variable τ can be described as 6χ²_7 + 2χ²_3 and approximated by 5.33χ²_9, since 5.33 ≈ 48/9. Effectively, σ̂² should then be 2.67. m̂ and σ̂² can be derived for each sample realization of X.
Since the Gaussian samples are not hyperspherical, the hyperellipsoidal data will be treated as such, so they would be indirectly reshaped to a hypersphere of a similar volume. The dimension estimated from the Euclidean distances by Eq. (7.4) will, therefore, be smaller than 30. Other dissimilarity measures will also influence the estimation of intrinsic dimension by the amount of 'departure' from the Euclidean behavior. One may also be concerned with the effective intrinsic dimension, i.e. the number of significant variables, that is the variables with the largest variance (spread). Such an effective intrinsic dimension can be thought of as 3, 15 and 10 for the cases 1, 2 and 3, respectively. This might be detected by the number of the most significant eigenvalues in the pseudo-Euclidean embedding.
Figs. 7.9-7.11 show the eigenvalues of the pseudo-Euclidean embeddings for the three cases mentioned above. For each case, three ℓ_p-distance representations are considered, for p = 2, 1, 0.8, reflecting a proper Euclidean representation, the city block representation (metric and non-Euclidean) and a non-metric representation. Also, four different sample cardinalities N are taken into account: 20 (undersampled), 50, 100 (a small sample) and 500 (a large sample). Additionally, the original samples are also contaminated with a hypothetical Gaussian noise with a variance of 0.5.
A few general conclusions can be drawn from the analysis of these figures. As expected, the estimated intrinsic dimension in all cases is smaller than 30, although the smallest is found for the Euclidean distances in case 1. For non-noisy Gaussian samples in cases 2 and 3, the estimated intrinsic dimension varies between 21 and 24, provided that the sample is larger than 20. Since the added noise is quite large, it disturbs the estimations. For sufficiently large samples (such as 500 points) and the Euclidean distances, the most informative directions can be revealed, even in the presence of noise. The number of the most significant eigenvalues identified in the cases 1-3 is 3, 15 and 10, respectively. When other distance measures are used, 3 significant eigenvalues can be clearly detected in case 1, but the estimation of the number of informative eigenvalues becomes more difficult in the other cases.
In general, case 1 seems to be the easiest; it is possible to distinguish 3 characteristic eigenvalues even for small samples. Although the number of characteristic eigenvalues becomes less apparent for cases 2 and 3, still an indicative value is the point where an eigenvalue-curve (a curve interpolating the eigenvalues) changes its steepness (or convexity). The first rapid change suggests the major breaking point, after which the
Figure 7.9 Case 1: Estimation of intrinsic dimension for three dissimilarity representations, D2, D1 and D0.8, derived for a Gaussian sample N(0, diag(v)) in a 30-dimensional space, where v is such that v_{1,2,3} = 10 and v_i = 1 for i = 4, ..., 30. Every plot shows the eigenvalues of the pseudo-Euclidean embedding, for the original sample and for the sample contaminated with Gaussian noise N(0, (1/2)I). The sample size increases from left to right as 20, 50, 100 and 500.
Figure 7.10 Case 2: Estimation of intrinsic dimension for three dissimilarity representations, D2, D1 and D0.8, derived for a Gaussian sample N(0, diag(v)) in a 30-dimensional space, where v is such that v_i = 5 for i = 1, ..., 15 and v_i = 1 for i = 16, ..., 30. Every plot shows the eigenvalues of the pseudo-Euclidean embedding, for the original sample and for the sample contaminated with Gaussian noise N(0, (1/2)I). The sample size increases from left to right as 20, 50, 100 and 500.
Figure 7.11 Case 3: Estimation of intrinsic dimension for three dissimilarity representations, D2, D1 and D0.8, derived for a Gaussian sample N(0, diag(v)) in a 30-dimensional space, where v is such that v_i = 5 for i = 1, ..., 10, v_i = 2 for i = 11, ..., 20 and v_i = 1 for i > 20. Every plot shows the eigenvalues of the pseudo-Euclidean embedding, for the original sample and for the sample contaminated with Gaussian noise N(0, (1/2)I). The sample size increases from left to right as 20, 50, 100 and 500.
subsequent eigenvalues describe dimensions of a lesser importance. To determine the smallest number of informative dimensions, one may, therefore, estimate it by finding the eigenvalue rank for which the eigenvalue-curve changes its steepness. This is a lower bound for the determination of the number of significant directions. The upper bound is given by the estimated intrinsic dimension, Eq. (7.4). Although a single Gaussian sample is analyzed, the situation becomes more realistic when noise is added. When multiple Gaussian samples are considered, the problem becomes much more difficult, since the various Gaussian samples might have different numbers of important variables. In such a procedure, all samples are judged in one combined description, so the resulting description might have properties which are very different from those of each single sample. Such a combined description is analyzed to find significant directions, for which, in principle, the 'eigenvalue-curve' reasoning can be applied. However, Eq. (7.4) cannot be recommended any longer, since the assumption of a single cloud is completely violated. Another type of estimation must be searched for.
7.3 Sampling density
We will now study criteria which judge whether a dissimilarity representation is sufficiently well sampled. Consider an n × n dissimilarity matrix D(R, R) based on the representation set R = {p_1, p_2, ..., p_n}. We assume that R = T, although, in general, R may be a subset of a training set T. The entire set R is represented by the dissimilarity vectors D(p_i, R), i = 1, 2, ..., n. The question to be addressed now is whether the sample size, n = |R|, is sufficiently large for capturing the variability in the data. This question can be reformulated to ask whether only little new information can be gained by increasing the number of representation examples. This is directly related to the complexity of a learning problem, as discussed in [Duin and Pękalska, 2005]. To start the analysis, we will restrict ourselves to a simpler issue concerning the sampling of a set of unlabeled objects, possibly forming a single class. Next, the problem will be formulated and some criteria will be proposed to estimate the goodness of sampling for dissimilarity data. Note that the issue of a good sampling is related to finding out whether new objects can be expressed in terms of the ones already present in R or not. The usefulness of such criteria will be experimentally investigated on two data sets. See [Duin and Pękalska, 2005] for more extensive experiments.
Statistics that might be used here are based on the compactness hypothesis [Arkadiev and Braverman, 1964; Duin, 1999; Duin and Pękalska, 2001], which states that similar objects are close (similar) in their representation. Consequently, the dissimilarity measure d has to be constrained such that d(x, y) is small if the objects x and y are very similar, i.e. it should be much smaller for very similar objects than for objects that are very different. Suppose that the measure d is definite, i.e. d(x, y) = 0 iff the objects x and y are identical. This implies that they belong to the same class. This can be extended by assuming that all objects z for which d(x, z) < ε, ε > 0, are so similar to x that they belong to the same class (cluster) as x. Consequently, the dissimilarities of x and z to the other objects in the representation set R should be close (or positively correlated, in fact), i.e. d(x, p_i) ≈ d(z, p_i). This implies that their representations d(x, R) and d(z, R) should also be close. We conclude that for dissimilarity representations that satisfy the above continuity, a stronger property than formulated by the compactness hypothesis holds, as similar representations of two objects impose that the objects are also similar. As a result, they belong to the same class (cluster). Such representations will be called true representations. A representation set R is judged to be sufficiently large if any arbitrary new object is not significantly different from the other objects of the same class. This can be expected if R already contains many objects that are very similar, e.g. if they have a small dissimilarity to at least one other object. All the statistics studied below are based, in one way or another, on this observation.
7.3.1 Proposed criteria
We will propose a number of criteria and illustrate their performance on an artificial, non-metric dissimilarity example, which is chosen as the ℓ_0.8-distance⁷ representation derived for a Gaussian sample of n points in a k-dimensional vector space. Both n and k vary between 5 and 500. If n < k, then the generated vectors lie in an (n−1)-dimensional subspace, resulting in an undersampled and difficult problem. If n >> k, then the data set may be judged as sufficiently sampled. Large values of k lead to difficult (complex) problems, as they demand a large sample size n. The results are averaged over 20 experiments, each time based on a new, randomly generated data set. The criteria are discussed below.
⁷Remember that the ℓ_0.8-distance is d_0.8(x, y) = (Σ_{i=1}^k |x_i − y_i|^0.8)^{1/0.8}, for x, y ∈ R^k.
The proposed sampling criteria for dissimilarity representations are addressed in three different ways: by judging the dissimilarity values directly (skewness and mean relative rank), in dissimilarity vector spaces (PCA dimension and correlation) and in embedded pseudo-Euclidean vector spaces (intrinsic embedded dimension and compactness).
Skewness. This criterion evaluates the dissimilarity values directly. A new example added to a set of objects that is still insufficiently well sampled will generate many large dissimilarities and just a few small ones. As a result, for unsatisfactorily sampled data, the distribution of dissimilarities will peak for small values and will show a long tail in the direction of large dissimilarities. After the set becomes 'saturated', however, adding new objects will cause the appearance of more and more small dissimilarities. Consequently, the skewness will grow with the increase of |R|. The value to which it grows depends on the problem. Let the variable d denote the dissimilarity value between two arbitrary objects. In practice, the off-diagonal values d_ij from the dissimilarity matrix D = (d_ij) are used for this purpose. As a criterion, the skewness of the distribution of the dissimilarity variable d is considered:

   J_sk = E[ ( (d − E[d]) / (E[(d − E[d])²])^{1/2} )³ ],    (7.5)
where E[·] denotes expectation. The performance of the skewness criterion as a function of |R| is shown for the Gaussian sets in Fig. 7.12, top left. Small representation sets appear to be insufficient to represent the problem well, as can be concluded from the noisy behavior of the graphs in that region. For large representation sets, the curves corresponding to the Gaussian samples of a chosen dimension 'asymptotically' grow to some specific value of J_sk. The final values may be reached earlier for simpler problems in a low-dimensional space, such as k = 5 or 10. In general, the skewness curves derived for various dimensions show the expected pattern: the simplest problems (in low-dimensional spaces) reach the highest values of J_sk, while the most difficult problems are characterized by the smallest values of J_sk.
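A minimal NumPy sketch of the skewness statistic of Eq. (7.5), computed over the off-diagonal dissimilarities; the function name is illustrative.

```python
import numpy as np

def skewness_criterion(D):
    """Skewness of the distribution of the off-diagonal dissimilarities."""
    n = D.shape[0]
    d = D[~np.eye(n, dtype=bool)]          # all off-diagonal values d_ij
    centered = d - d.mean()
    return np.mean((centered / np.sqrt(np.mean(centered ** 2))) ** 3)
```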
Mean relative rank. Let d_ij be the dissimilarity value between the objects p_i and p_j. The minimum of d_ij over all indices j points to the nearest neighbor of p_i, say p_z, with z = argmin_{j≠i} d_ij. So, in the representation set R, p_z is judged as the most similar to p_i. Our suggestion now is that a
representation D(p_i, R) describes the object p_i well if the representation of p_z, i.e. D(p_z, R), is close to D(p_i, R) in the dissimilarity space D(·, R). This can be measured by ordering the neighbors of the vector D(p_i, R) and determining the rank number r_i^{nn} of D(p_z, R) in this list of neighbors. As a result, we compare the nearest neighbors found on the original dissimilarities to the neighbors in the dissimilarity space. For a well-described representation, the mean relative rank

   J_mrr = (1/n) Σ_{i=1}^n (r_i^{nn} − 1) / (n − 1)    (7.6)
is expected to be close to 0. The performance of this criterion for the Gaussian sets is illustrated in Fig. 7.12, top right. It can be concluded that sizes of the representation set R larger than 100 are sufficient for Gaussian samples in 5 or 10 dimensions.
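A sketch of the mean relative rank; the use of Euclidean distances between the rows of D to define neighbors in the dissimilarity space, and the normalization by (n − 1), are assumptions tied to the reconstruction of Eq. (7.6) above.

```python
import numpy as np

def mean_relative_rank(D):
    """Mean relative rank criterion.

    For every object, find its nearest neighbor according to the given
    dissimilarities and look up that neighbor's rank in the dissimilarity
    space (here: Euclidean distances between the rows of D).
    """
    n = D.shape[0]
    Dm = D + np.diag(np.full(n, np.inf))    # exclude self as a neighbor
    nn = np.argmin(Dm, axis=1)              # nearest neighbor on the dissimilarities
    row_dist = np.sqrt(((D[:, None, :] - D[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(row_dist, np.inf)
    total = 0.0
    for i in range(n):
        order = np.argsort(row_dist[i])
        rank = int(np.where(order == nn[i])[0][0]) + 1   # rank of nn in the DS
        total += (rank - 1) / (n - 1)
    return total / n
```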
Principal Component Analysis (PCA) dimension. This criterion is applied in a dissimilarity space. A sufficiently large representation set R will contain at least some objects that are very similar to other objects. This means that their vectors of dissimilarities to the representation objects are very similar. This suggests that the rank of D should be smaller than |R|, rank(D) < |R|. This can be verified by performing PCA in the dissimilarity space and inspecting the sorted eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_n of the estimated covariance matrix. The criterion is the fraction of dimensions needed to preserve a fraction α of the total variance,

   J_pca,α = n_α / n,    (7.7)

where n_α is such that α = Σ_{i=1}^{n_α} λ_i / Σ_{i=1}^n λ_i. There is usually no integer n_α for which the above condition holds exactly, so it is found by interpolation. In our experimental results, Fig. 7.12, middle left, the value of J_pca,0.95 is shown for the artificial Gaussian samples as a function of |R|. Gaussian samples are studied for various dimensions k. It can be concluded that data sets consisting of 100 or more objects may be sufficiently well sampled for small dimensions, such as k = 5 or 10. On the other hand, the
Figure 7.12 Sampling criteria (skewness, mean relative rank, PCA dimension, correlation, intrinsic embedded dimension and compactness) for the ℓ_0.8-distance representations D_0.8(R, R) computed for artificial Gaussian data sets of a varying dimension k (from 5 to 500), as indicated in the legends.
considered number of objects is too small for the Gaussian sets drawn in higher-dimensional spaces. These generate problems of a too high complexity for the given data size.
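A sketch of the J_pca,α statistic as reconstructed in Eq. (7.7); the linear interpolation of the generally non-integer n_α is an assumption of the example.

```python
import numpy as np

def pca_dimension_criterion(D, alpha=0.95):
    """Fraction of PCA dimensions in the dissimilarity space needed to
    preserve a fraction alpha of the total variance (J_pca,alpha)."""
    C = np.cov(D, rowvar=False)                    # covariance of the vectors D(p_i, R)
    lam = np.sort(np.linalg.eigvalsh(C))[::-1]
    lam = np.clip(lam, 0.0, None)
    frac = np.cumsum(lam) / lam.sum()
    # linear interpolation of the (generally non-integer) n_alpha
    n_alpha = np.interp(alpha, frac, np.arange(1, len(lam) + 1))
    return n_alpha / D.shape[0]
```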
Correlation. The correlations between the objects in a dissimilarity space will also be used. Similar objects show similar dissimilarities to other objects and are, thereby, positively correlated. As a consequence, the ratio of the average positive correlation ρ̄₊(D(p_i, R), D(p_j, R)) to the average negative correlation ρ̄₋(D(p_i, R), D(p_j, R)) (in absolute value),

   J_ρ = ρ̄₊ / (1 + |ρ̄₋|),    (7.8)

will increase for large sample sizes. The constant of 1 added in the denominator prevents J_ρ from becoming very large if only small negative correlations appear. For a well-sampled representation set, J_ρ will be large and it will increase only slightly when new objects are added (new objects are not expected to influence significantly the average positive or negative correlations). Fig. 7.12, middle right, shows that this criterion works well for the artificial Gaussian example. For less complex problems, J_ρ reaches higher values and exhibits a flattened behavior for sets consisting of at least 100 objects.
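A sketch of this ratio; averaging the positive and the negative pairwise correlations between the rows of D separately follows the description above, and the function name is illustrative.

```python
import numpy as np

def correlation_criterion(D):
    """Ratio of the average positive to the average absolute negative
    pairwise correlation between the dissimilarity vectors D(p_i, R)."""
    n = D.shape[0]
    C = np.corrcoef(D)                      # correlations between the rows of D
    r = C[np.triu_indices(n, k=1)]          # each pair of objects once
    pos = r[r > 0]
    neg = r[r < 0]
    rho_plus = pos.mean() if pos.size else 0.0
    rho_minus = neg.mean() if neg.size else 0.0
    return rho_plus / (1.0 + abs(rho_minus))
```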
Intrinsic embedded dimension. Another possibility of judging whether R is sufficiently sampled is to estimate the intrinsic dimension of an underlying vector space, determined such that the original dissimilarities are preserved. This can be achieved by a linear embedding (provided that D is symmetric) into a pseudo-Euclidean space, as described in Sec. 3.5. The resulting representation X is found in an m-dimensional vector space, m ≤ n, such that it is centered at the origin and has uncorrelated features. The dominant variances, as captured by the eigenvalues determined in this embedding, should reveal the intrinsic dimension (small variances are expected to reflect just noise). Consequently, the dimensions corresponding to small variances can be neglected. (Note, however, that when all variances are similar, the intrinsic dimension is approximately n.) Let n_α^{ev} be the number of dominant dimensions for which the sum of the corresponding magnitudes of variances equals a specified fraction α, such as 0.95, of the total sum. Of course, n_α^{ev} may not be found exactly, so it is interpolated. Since n_α^{ev} determines the intrinsic dimension, the following index is proposed as a criterion:

   J_ie,α = n_α^{ev} / n.    (7.9)

For low intrinsic dimensions, smaller representation sets are needed to describe the data characteristics. Fig. 7.12, bottom left, shows the behavior of this criterion as a function of |R| for the Gaussian data sets. With varying
k, the criterion curves clearly reveal different intrinsic dimensions. If R is sufficiently large, then the intrinsic dimension remains constant. Since the number of objects is growing, the criterion should then decrease and reach a relatively constant small value in the end (for very large sets). From the plot, we can then conclude that data sets of more than 100 objects are satisfactorily sampled for original Gaussian data of a low dimension, i.e. k ≤ 20. In the other cases, the data are too complex.
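For completeness, a sketch that estimates the fraction of dominant embedded dimensions from the eigenvalues of the linear (pseudo-Euclidean) embedding; the double-centering step and the use of absolute eigenvalues as the 'magnitudes of variance' follow standard practice and are assumptions of the example.

```python
import numpy as np

def intrinsic_embedded_dimension(D, alpha=0.95):
    """Fraction of dominant dimensions of the linear (pseudo-Euclidean)
    embedding of a symmetric dissimilarity matrix D (J_ie,alpha)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J                 # double-centered Gram matrix
    lam = np.sort(np.abs(np.linalg.eigvalsh(G)))[::-1]   # magnitudes of embedded variances
    frac = np.cumsum(lam) / lam.sum()
    n_alpha = np.interp(alpha, frac, np.arange(1, n + 1))
    return n_alpha / n
```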
Compactness. As already mentioned, a symmetric distance matrix D = D(R, R) can be embedded in a pseudo-Euclidean space E. When the representation set is sufficiently large, the intrinsic dimension is expected to remain constant during a further enlargement. Consequently, the mean of the data should remain approximately the same and the average distance to the mean should decrease (as new objects do not 'surprise' anymore) or be constant. The larger the average distance, the less compact the class is, requiring more samples for its description. Therefore, a compactness criterion is investigated. It is estimated by the leave-one-out procedure as the average square distance to the mean vector in the embedded space:

   J_comp = (1/n) Σ_{j=1}^n [ (1/(n−1)) Σ_{i≠j} d²_E(x_{ij}, m_{−j}) ],    (7.10)

where x_{ij} is the vector representation of the i-th object in the pseudo-Euclidean space determined by all the objects except for the j-th one, and m_{−j} is the mean vector of such a configuration. Fig. 7.12, bottom right, illustrates the performance of this criterion, clearly indicating a high compactness of low-dimensional Gaussian data. The case of k = 500 is judged as not having a very compact description.
In general, while studying these criteria curves, one should remember that the height of a curve is a measure of the complexity and that a flat curve may indicate that the dissimilarity data are sufficiently well sampled. For the skewness, mean relative rank and correlation criteria it holds that lower values are related to a higher complexity. For the other criteria, it is the other way around: lower values are related to a lower complexity. An exception is the compactness criterion, as its behavior is scale dependent.

7.3.2 Experiments with the NIST digits
A set of four classes of handwritten digit shapes of 0, 1, 2 and 3 from the NIST database [Wilson and Garris, 1992] constitutes a representation set R. For each class, n = 200 objects are considered. The modified Hausdorff
distance D_MH, Def. 5.3, is computed between the digit contours derived from binary images. Three variants of the distance representation are studied. They are based on the element-wise (Hadamard) power transformation of D_MH: a concave power (a power smaller than one), the original D_MH and a convex power (a power larger than one). These power transformations do not change the ranking of the dissimilarity data, but they influence the corresponding dissimilarity and embedded spaces and, thereby, the proposed criterion values in a non-linear way. Figs. 7.13-7.15 present the performance of the six criteria introduced in the previous section as a function of a growing representation set. The experiments are repeated 20 times for randomly chosen subsets. The following observations can be made:
1. The criteria perform somewhat differently on the four digit sets. In general, the set of '1'-s is the simplest (the most compact one) and the set of '3'-s is the most difficult one.
2. Power transformations of the dissimilarity representations influence the criteria significantly. The complexity of the original data increases in the convex-power representation. This means that more objects are needed to represent the data characteristics well than for the original data D_MH. Consequently, the convex-power sets seem to be undersampled. The complexity of the original data decreases by using the concave power. Most data sets may be judged as well sampled in these cases.
3. The skewness criterion, Eq. (7.5), can provide useful information on a distribution of dissimilarities. Negative skewness denotes a tail of small dissimilarities, while positive skewness denotes a tail of large dissimilarities. Large positive values indicate possible outliers, while large negative values indicate that one deals with a heterogeneous class, having clusters of various spreads. Skewness can also be noisy for very small sample sizes, as observed in Fig. 7.13.
4. The mean relative rank is a criterion judging the consistency between the direct nearest neighbors (found on the given dissimilarities) and the nearest neighbors in a dissimilarity space. It should converge to zero with an increasing sample size. As non-decreasing transformations do not change the original nearest neighbor relations (although they affect the relations in a dissimilarity space), this criterion is not very indicative for such modifications. Except for the artificial example, the curves behave similarly. Studying Fig. 7.13, one may also observe the linearizing effect of a power transformation with a small power applied to the dissimilarity data; the difference in complexity between the digit sets has disappeared.
Figure 7.13 Skewness and mean relative rank criteria as functions of |R| for the four sets of NIST handwritten digits represented by the modified Hausdorff distances. Three different power transformations are used: a concave power of D_MH, the original D_MH and a convex power of D_MH.
Figure 7.14 PCA dimension and correlation criteria as functions of |R| for the four sets of NIST handwritten digits represented by the modified Hausdorff distances. Three different power transformations are used: a concave power of D_MH, the original D_MH and a convex power of D_MH.
Figure 7.15 Intrinsic embedded dimension and compactness criteria as functions of |R| for the four sets of NIST handwritten digits represented by the modified Hausdorff distances. Three different power transformations are used: a concave power of D_MH, the original D_MH and a convex power of D_MH.
5 . The PCA dirnensiori criterion; Eq. (7.7). describes the fraction of significant eigenvalues judged in dissimilarity spaces of a growing dimension. If the dissimilarity data are ‘saturated’, then the criterion curve approaches a value close to zero since intrinsic dimension stays constant. If the critclriori does not approach zero, the problem is characterized by many relatively similar eigenvalues, hence many similar intrinsic variables. In such cases, the problem is judged as complex. Studying Fig. 7.14, left, this criterion indicates that for the D$’$ representation, the cardinality of R is far from being sufficient. For the D M Hrepresentation; the set of .1’-s is well sampled and for the representation, all four digit sets are sufficiently large. 6. The correlation criterion, Eq. (7.8), indicates the amount of positive correlations versus negative correlations in a dissimilarity space. Positive values larger than 0.5 may suggest the presence of outliers in the data. From Fig. 7.14, the set of ‘1‘-s is the best sampled with respect to other digit classes. 7. Intrinsic embedded dimension, Eq. (7.9), is judged by the fraction of dominant dimensions determined by the number of dominant eigenvalues in a linear embedding. In contrast to the PCA dimension, it is not likely to observe that this criterion curve will approach zero for large representations sets. Large dissimilarity values influence how the embedded space is determined arid considerably affect the presence of large eigenvalues. Therefore, this criterion may be close to zero if many eigenvalues tend to be so or if there are some outliers. In this case, a flat behavior of the curve may give an evidence of a reasonable sampling. However. the larger the final value the criterion curve reaches, the larger variability in the studied class. One ca,n conclude from Fig. 7.15 that the set of ‘l’-sis relatively simple and well sampled for the original and D& representations. The class of ’0‘-s has the largest intrinsic dimension. Intrinsic dimension of the combined set of all four digits seems to be relatively low for the original data, indicating that the classes share some descriptions. 8. The compactness criterion, Eq. (7.10), estimates the compactness of a set of objects based on the square distances to the mean vector in an embedded space. A flattened behavior of the corresponding criterion curve may not be very indicative. What is more important, is the value to which the criterion curve arrives at: the smaller the value, the more compact the description. The set of ‘O’-s is judged as the most compact class, as it can be observed in Fig. 7.15. The sets of ‘1’-s and ‘2’-s are
much less compact, indicating possible subclasses or elongated distributions. Not surprisingly, this criterion judges the combined set of all characters as more complicated than any single one of them.

A global comparison of Fig. 7.12 to the set of Figs. 7.13-7.15 shows that the characteristics of high-dimensional Gaussian distributions cannot be found in a real-world problem. This observation is empirically confirmed in our study on complexity and sampling issues [Duin and Pękalska, 2005]. Concerning the criteria used, the following can finally be concluded. All criteria are informative when treated as complementary. The most indicative ones are skewness, PCA dimension, correlation and intrinsic embedded dimension. They are addressed in three different ways: by the dissimilarity values directly, in dissimilarity spaces and in embedded spaces. So far, these criteria focused on unlabeled problems. Other proposals using label information may also be considered in relation to classification problems. See [Duin and Pękalska, 2005] for an additional investigation.
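To make the two space-based criteria concrete, the sketch below computes a PCA-dimension-style fraction and the compactness value directly from a square dissimilarity matrix treated as a dissimilarity space. It is a minimal illustration, not the exact Eqs. (7.7) and (7.10); the 95% variance level, the random placeholder data and the function names are assumptions made for the example.

```python
import numpy as np

def pca_dimension_fraction(D, alpha=0.95):
    """Fraction of PCA eigenvalues needed to retain a fraction alpha of the
    total variance when the columns of D(., R) are treated as features."""
    eig = np.sort(np.linalg.eigvalsh(np.cov(D, rowvar=False)))[::-1]
    k = np.searchsorted(np.cumsum(eig) / eig.sum(), alpha) + 1
    return k / D.shape[1]

def compactness(D):
    """Average squared distance to the mean vector of the dissimilarity space;
    smaller values correspond to a more compact class description."""
    X = D - D.mean(axis=0, keepdims=True)
    return float(np.mean((X ** 2).sum(axis=1)))

# criterion curves for a growing representation set, as in Figs. 7.14-7.15
D = np.abs(np.random.randn(200, 200)); D = (D + D.T) / 2; np.fill_diagonal(D, 0)
for n in (5, 10, 20, 50, 100, 200):
    idx = np.random.choice(len(D), n, replace=False)
    sub = D[np.ix_(idx, idx)]
    print(n, round(pca_dimension_fraction(sub), 3), round(compactness(sub), 3))
```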
7.4 Summary
This chapter pertains to techniques that enable exploration of dissimilarity data. Many clustering algorithms already exist in the neighborhood-based framework, often called proximity-based clustering. Our contribution here is to propose the use of both embedded and dissimilarity spaces. As such, the use of a dissimilarity space for clustering is new. One of our interesting conclusions is that the dissimilarity space approach might be especially useful for problems in which at least one of the sought clusters is compact and the others are widely spread. In general, the NQC-clustering (a mixture of Gaussians or EM-clustering) seems to work well in the PCA-dissimilarity space. Grouping techniques realized in both embedded and dissimilarity spaces give promising results.

The intrinsic dimension of the data can be globally estimated in both embedded and dissimilarity spaces. In general, the former methods are based on detecting the satisfactory dimension of an embedded space, while the latter rely on various reduction-based techniques. A simple linear pseudo-Euclidean embedding, as well as the principal component analysis in a dissimilarity space, may provide reasonable indications.

A number of criteria have been considered that can be used for examining whether a representation set contains a sufficient number of objects to describe the characteristics of a class. The problem itself is ill-defined as
it depends on the specific application what 'sufficient' means for a single class. One might imagine that classes are well sampled, but positioned with respect to each other in such a complicated way that the classification problem is difficult for most decision functions. As a consequence, the size of the training set should be judged from an evaluation of the classification result using a test set. In the presented study, an attempt is made to find out whether it is possible to judge the sampling density from a dissimilarity matrix alone. Some criteria are proposed which work well overall. The most indicative are the ones based on the number of most dominant eigenvalues, either in the PCA-dissimilarity space or in the pseudo-Euclidean embedding, and on the skewness and correlation.
Chapter 8
One-class classifiers
The whole is more than the sum of its parts. ARISTOTLE, Metaphysica
In the problem of one-class classification¹, one of the classes, called the target class, has to be distinguished from all other possible objects, which are considered as outliers, non-targets or anomalies². Such a task needs to be addressed in practical applications. Examples are any type of fault detection (machine wear) [Ypma and Duin, 1998] or target detection (face detection in images), abnormal behavior (intruder attacks on networks, suspicious behavior in surveillance checks), disease detection [Tarassenko et al., 1995], person identification, authorship verification [Koppel and Schler, 2004], etc. By extension, the methodology for handling such situations can also be useful for imbalanced data, as one-class classifiers can be trained for each class separately [Juszczak and Duin, 2003].

The problem of one-class classification is, therefore, characterized by the presence of a well sampled target class. The goal is to determine a proximity function of an object to the target class such that resembling objects will be accepted as targets and outliers will be rejected. It is assumed that a well-sampled training set of target objects is available, while no (or very few) outlier examples are present. The reason for this assumption is practical, since outliers may occur only occasionally or their measurements might be very costly. Moreover, even when outliers are available in a training stage, they may not always be trusted, as they are badly sampled, with unknown priors and ill-defined distributions. In essence, outliers are weakly defined, as they may appear as any kind of deviation or anomaly from the target examples.

¹ This term originates from Moya [Moya et al., 1993].
² Outliers and non-targets are somewhat different concepts. Non-targets denote examples with characteristics which are 'opposite' to the ones that targets possess (e.g. 'ill' versus 'healthy'), while outliers denote anomalies, i.e. examples that are different from the targets (e.g. an advanced versus a mild stage of a disease). For simplicity, we will henceforward only refer to outliers.
Still, one-class classifiers need to be trained to optimize the error on both classes. In principle, one-class classification methods are concept descriptors, i.e. they refer to all possible knowledge that one has about the target class. The model description of this class should be 'large' enough to accept most new targets, yet sufficiently 'tight' to reject the majority of outliers. This is, however, an ill-posed problem, since knowledge about a class is deduced from a finite set of target examples, while the outliers are sampled infrequently or not at all.

Outlier identification is an old topic in statistical data analysis [Tukey, 1960], usually approached through robust statistics. In general, robust statistics emerged as a family of techniques for estimating the parameters of parametric models while dealing with deviations from idealized assumptions [Hampel et al., 1986; Huber, 1981; Rousseeuw and Leroy, 1987]. It investigates the effects of deviations from modeling assumptions, usually those of normality and independence of the random errors. Robust parameter estimators are proposed that make use of quantiles, ranks, trimmed means, medians, censoring of particular observations, sample weighting, etc. Deviations include all types of rounding errors due to inaccuracy in data collection, contamination by outliers and departure from assumed sample distributions. Outliers are believed to deviate severely from the characteristics exhibited by the majority of the data, usually due to errors in data collection, copying or computation. So, they are often assumed to be caused by human error. Outliers can also arise from sampling error, where some members are drawn from a different population than the remaining examples, from faulty research methodology, or from faulty distributional assumptions. But they can also be legitimate cases sampled from the correct population. As outliers generally increase the error variance of parametric methods and can seriously bias or influence statistical estimates, their identification is important. Multivariate methods used for their detection often rely on a robust estimate of the Mahalanobis distance, Sec. 5.2.1, and the comparison with critical values of the χ² distribution [Rousseeuw and van Zomeren, 1990]. Such methods can be applied in one-class classification; however, the problem as such is more general and will be described here from a broader perspective.

The most common approaches to one-class classification are probabilistic; some of them naturally follow the methods of robust statistics [Barnett and Lewis, 1994]. They model the target class by a probability density function, equipped with a suitable threshold. A test sample is judged as a
target example if the estimated probability is higher than the given threshold [Chow, 1970]. A probability density function can be estimated by a parametric method, such as a Gaussian distribution or a mixture of Gaussians [Scott, 2004] equipped with a threshold based on the Mahalanobis distance, or by a non-parametric method such as the Parzen density estimation [Parzen, 1962] or k-nearest neighbor estimators [Tax, 2001].

Other approaches rely on reconstruction methods, which fit a model to the data by using prior knowledge. They mostly make assumptions on the clustering characteristics of the data or their distribution in subspaces. Examples include PCA and a mixture of probabilistic PCA, neural networks, such as auto-encoders [Japkowicz et al., 1995] or self-organizing maps [Parra et al., 1996], and various clustering techniques, such as k-means or k-centers [Jiang et al., 2001]. These techniques are adopted in one-class classification as they are means to characterize unlabeled data.

Finally, boundary methods have been proposed, which are domain descriptors; see [Tax and Duin, 1999, 2004; Tax, 2001] for an introduction to the subject. In a vector space, they optimize a closed boundary around the target class. Determining such boundaries is often related to minimizing the volume of the model description and heavily relies on (Euclidean) distances. Besides variants of the nearest neighbor approaches, another example is the support vector data description (SVDD) introduced in [Tax and Duin, 2004]. This OCC finds the smallest hypersphere that encloses (almost) all target points. Other flexible descriptions are enabled by the use of (conditionally) positive kernels, Def. 2.82, in the spirit of the support vector machines. A similar technique, the ν-SVM, is proposed in [Scholkopf et al., 2000b]. It uses a hyperplane to maximally separate the target data from the origin. These two methods are equivalent for data scaled to a unit norm, as e.g. for the Gaussian kernel [Tax and Duin, 2004]. Although different methods have been developed for one-class classification, other researchers prefer visual inspection of the data to identify potential outliers. A number of ideas in this direction can be found in [Bartkowiak, 2000, 2001; Bartkowiak and Szustalewicz, 2000].

In this chapter, we will first introduce the general one-class classification problem and present some approaches in vector spaces. Then, we will describe a few class descriptors built on dissimilarities. These are constructed based on neighborhoods, in pseudo-Euclidean or dissimilarity spaces. The results presented here are based on [Pękalska et al., 2002b, 2003, 2004a].
8.1 General issues
The issues presented here rely on the work of Tax; see [Tax, 2001]. One-class classifiers (OCCs) are trained to accept target examples and reject outliers. The basic assumption about an object belonging to a class is that it is similar to other examples within this class. Such a proximity can be judged e.g. with respect to an average representative or to a set of essential class members. In general, an assignment of an object x to the target class ω_T is realized by a proximity function or a typicality measure h_{ω_T}(x) equipped with a threshold γ. When objects are represented in a feature vector space, h_{ω_T} is a (non-)linear combination of the original features. Alternatively, when objects are represented by dissimilarities, h_{ω_T} becomes a function of the given dissimilarities. The role of γ is to decide whether an instance belongs to the class or not. One-class classifiers can, therefore, be seen as boundary descriptors, where the complexity (and type) of a boundary is specified by h_{ω_T}, which models data characteristics, while a specific threshold is set by γ. In general, the more complex h_{ω_T}, the more flexible the boundary and the tighter the fit to the target class. This may, however, lead to overfitting, as the boundary will eventually adapt to the noise. The interest is, therefore, to obtain a reliable and smooth boundary, which is as complex as the data permit, but not more. Note that in this context, when an object is accepted by a one-class classifier, it lies inside it, otherwise it is outside. Moreover, as a one-class classifier is assumed to be determined by a closed boundary, we may discuss its volume.

Table 8.1 Four situations of classifying an object in one-class classification. The true positive (r_tp) and true negative (r_tn) rates correspond to objects which are correctly classified. The false negative (r_fn) and false positive (r_fp) rates correspond to objects which are wrongly classified and describe the errors of the first (ε_I) and the second (ε_II) kind, respectively. Moreover, r_tp + r_fn = 1 and r_fp + r_tn = 1 hold.

True label  |  Estimated label: target              |  Estimated label: outlier
target      |  true positive, r_tp                  |  false negative, r_fn or ε_I
outlier     |  false positive, r_fp or ε_II/ε_out   |  true negative, r_tn
Since only the target class is representative and easily available, it is hard to decide how tightly the boundary should fit around the data in the input space. As only the distribution p(x|ω_T) of the target class is known,
the error on the target class, the target rejection rate or the false negative ratio, can be minimized. The outlier acceptance rate, ε_out, or the false positive ratio, can only be estimated when example outliers are used or when additional assumptions about their distribution are made. Note that these two errors are also called the errors of the first and the second kind, ε_I and ε_II, respectively. Table 8.1 shows all possible situations of classifying an object in one-class classification problems.

The most general assumption about outliers is their uniform distribution in a bounded area [Tax, 2001; Tax and Duin, 2002]. In such a case, by minimizing the outlier acceptance rate (r_fp), a classifier with minimal volume is obtained as a result. So, instead of minimizing both ε_I and ε_II, the volume of an OCC can be optimized together with ε_I. It may, of course, happen that the true outlier distribution deviates heavily from the uniform one. This, however, cannot be predicted without example outliers; moreover, as the distribution of outliers may be ill-defined, the assumption of uniformity seems to be the least imposing.

8.1.1 Construction of one-class classifiers
A one-class classifier can be presented in the following form:

C_OCC(x; w, α) = I(h_{ω_T}(x; w, α) ≥ γ) = 1 if x is a target object, and 0 if x is an outlier,    (8.1)
where h_{ω_T} is a proximity function, w are the free parameters of h_{ω_T} and I is the indicator (characteristic) function. The parameter α influences the flexibility of the boundary, hence it denotes the complexity of the resulting model. For instance, it refers to the number of clusters k in an OCC based on the k-means clustering or the number of Gaussian components K in an OCC defined by a mixture of Gaussian densities. Without loss of generality, we will assume that h_{ω_T} measures a similarity of x to the target class. If we choose the proximity h_{ω_T} to be a distance function, then the inequality sign in Eq. (8.1) is reversed, hence C_OCC(x; w, α) = I(h_{ω_T}(x; w, α) ≤ γ). Given h_{ω_T}, one needs to optimize its parameters and the threshold γ to specify a one-class classifier. Most of the classifiers focus first on the optimization of h_{ω_T} and the threshold γ is usually selected afterwards; only a few methods can optimize them both together. One-class classifiers are characterized by their trade-off between the false positive ratio (ε_II) and the false negative ratio (ε_I). Different values of γ will result
in different trade-offs. Usually, the threshold γ is optimized to reject a certain (user-supplied) fraction r_fn of the target class, such as r_fn = 0.05, for instance. It should be small to prevent a high acceptance of outliers as targets. In theory, this value specifies the error on the target class; in practice, however, due to estimation problems, the true error may differ from r_fn. Given this constraint, h_{ω_T} should be optimized such that the error on the outlier class ε_out is minimum. Assuming a uniform outlier distribution, the volume of the classifier C_OCC can be minimized for a given complexity parameter α. Assume that the data are represented in an n-dimensional space, hence x ∈ R^n. Suppose that p(x|ω_T) describes the target distribution. Without loss of generality, we will assume that h_{ω_T} is a similarity function, taking nonnegative values. h_{ω_T} is optimized such that:

min_w ε_out = min_w ∫ I(h_{ω_T}(x; w, α) ≥ γ) dx
s.t. ∫ I(h_{ω_T}(x; w, α) ≥ γ) p(x|ω_T) dx = 1 − r_fn    (8.2)

When the optimization of h_{ω_T} is independent of γ, then the constraint becomes:

∫_0^γ p(h_{ω_T}(x; w, α)) dh_{ω_T}(x; w, α) = r_fn,    (8.3)
where p(h_{ω_T}(x; w, α)) is the probability distribution of the proximity function. It is known that the optimal solution to min_w ε_out as in Eq. (8.2) is p(x|ω_T) with the threshold γ determined by Eq. (8.3) [Tax and Müller, 2004]. Given a fixed target acceptance rate, r_tp = 1 − r_fn, the threshold γ will be derived from the training set such that the OCC accepts the fraction r_tp of the target class (so the empirical target error is r_fn). Given N training samples x_i, γ is determined such that:

(1/N) Σ_{i=1}^{N} I(h_{ω_T}(x_i; w, α) ≥ γ) = r_tp.    (8.4)

Remember that h_{ω_T} is derived on the training set. To avoid overfitting, a better estimate might be obtained by using an additional validation set, if one has a lot of data. Suppose r_thr is a user-specified fraction, a rough estimate of r_tp. Then, γ may be determined as the (1 − r_thr)-percentile of the sorted sequence of the
339
1
c L O8 U +
EO 6
H
go4
-
P 02
'0
04 0 6 0 8 outliers accepted (FP)
02
1
Figure 8.1 Receiver Operator Characteristics (ROC) curves. One is marked by a solid line and one is marked by a dashed line. The gray area corresponds t o the AUC measure associated with the ROC solid-line curve.
proximity outputs computed from the training examples. This is a practical implementation of Eq. (8.3), where ( l - r t h r ) is an estimation of rfnas judged by the estimated probability density function from the proximity values hWT(xZ).If the number of samples is small and the proximity function creates wide boundaries, (1- p h r ) may diffcr from the assumed rfn.This is important to realize, since in our study, due to the implementation of oneclass classifiers [Tax, 20031, some of them estimate the threshold directly based on rfnand some others use Tthr.
ROC curve. To study the behavior of one-class classifiers, a RcceivcrOperator Characteristics (ROC) curve can be used [Bradley et al., 1998; Tax and Duin, 20011, which is a function of the true positive ratio (target acceptance) (I versus the false positive ratio (outlier acceptance). Eout . Of course, example outliers are necessary to evaluate it. Outliers are provided either in a validation stage, or they are generated according to an assumed distribution. In principle, an OCC is trained with a fixed target rejection ratio rfn(or the threshold fraction q h r ) for which the threshold is determined. This classifier is then optimized for one point on the ROC curve. In order to compare the performance of various classifiers, the AUC measure can be used [Bradley, 19971. It computes the Area Under the Curve (AUC), which is the total performance of a one-class classifier integrated over all thresholds:
340
The dissimiluraty representation for pattern recognition
where y is estimated on the target set. The AUC value of 0.5 or less indicates that a particular OCC is worse than random guessing. The larger AUC, tlie better the one-class classifier is. For instance in Fig. 8.1, the solid curve indicates better performance since the corresponding AUC value is larger than the AUC value for the dashed curve. The black dots indicate points for which the thresholds of the two OCCs were optimized. In practice, may be limited to a tighter range, so the integration above can be performed for a specified region of interest, such as [0.05.0.3],for instance. Moreover. a s the costs of making wrong decisions may differ, additionally a weighting function w(ytar) may be introduced. Estimation of complexity Another important problem in one-class classification is the determination of the right complexity Q of huT. This is merely a parameter which influences the flexibility of the resulting boundary. For instance, if one models the target class as a mixture of K normal density functions, then K is such a complexity parameter, while the free parameters w correspond to the parameters (mean vectors and covariance matrices) of the individual distributions. A criterion was proposed for this purpose in [Tax and Miiller, 20041, where no implicit assumption on the outlier distribution is made. Basically, a one-class classifier is trained, starting from the most simple complexity parameter for an increasing a till it becomes inconsistent with the specified error on the target training set. The incorisistency is evaluated on a target validation set of M examples as Etar > Etar 2,/MP(l P), where E:\ is the target rejection error on
+
-
the validation set. Words of precautions. To estimate the outlier acceptance rate or derive a ROC curve, outlier examples can be uniformly generated in a bounded region around tlie target data. Note, however, that the needed number of outliers becomes huge in high dimensional spaces. Assume that the target class is centered at the origin in Rn and modeled by a hypersphere with the radius R . Suppose further that the outliers are generated from a hypercube [-R. R ] " . It is known that the volume of the hypercube is Vout = 2"R". while the voliinie of the hypersphere is = 2BE& n r ( y ) , where I? is the gamma function r(t)= zt-' exp{ -z}dz, t > 0. With the increasing dimension n, V,,, goes to infinity, while V,,, goes to 0. These facts become already very apparent for n such as 6-10 [Tax, 20011. The consequence of this fact is that in high dimensional spaces, (n>l o ) , practically all the outliers drawn from tlic hypercube will be rejected by the OCC-hypersphere. This means that such an OCC will be optimized for the target rejection rate only. In
som
One-class classifiers tiaussian density
Moti density
341
Parzen density
PCA 1
I
-5
k-means
0
5
1-NN
SUM
-5
0
svuu
5
Figure 8.2 Example one-class classifiers in feature spaces for a theoretical banana class. The 1-NNDD is trained with rthr = 0.1, the remaining classifiers arc trained with ~~h~ = 0.05. Two Gaussian distributions are selected for the MoG-OCC. Since the data are in 2D, the PCA-OCC is forced to find a one-dimensional subspace.
general, the shape of the OCC will have a different form, however. if the volumes of the target description and the region of outlier generation differ (such that their ratio V,,,/VOut approaches zero with increasing dimension), one must be careful with the analysis. In such cases, dimension reduction techniques should be used first. 8.1.2
One-class classijiers in feature spaces
Following [Tax, 20011, we will briefly describe the three main approaches to one-class classification in feature spaces: density methods, reconstruction methods and boundary methods. Many of such techniques are adaptations of well-known methods in unsupervised and supervised learning. For a discussion on their robustness to outliers, possibilities tBoincorporate oiit,liers into training, setting of the parameters, and both coniputational and storage cost, the reader is referred to [Tax, 20011 and his other articles. In the presentation below, we will assume that the data live in an n-dimensional space. Examples of the methods discussed below are shown in Fig. 8.2 for an artificial data. Density methods. In the probabilistic setting, an OCC is obtained as a density estimate of the target class with a suitable threshold to neglect the tails. The may work well if the sample size of the training target class
342
The dissimilarity representation for pattern recognition
is sufficiently large (and the dimension of the feature space is restricted) arid a flexible density estimator is used. However. the density model cannot be too complex, as the curse of dimensionality should be avoided [Jain and Chandrasekaran, 19871. On the other hand, too siniple density model will introdiice a large bias. So a right trade-off between the model complexity arid fit to the data should be found. When a good probability model is assumed (with a small bias) and the number of training target examples is siifficieritly large, this approach gives a boundary in the areas of high density of the target distribution. The simplest model is the normal density. given as :
where p is the mean vector and C is the covariance matrix. By the central limit theorem [Billingslcy, 1995; Wilks, 19621, this model assumes that all target points originate from the mean by introducing a large number of small independent and identically distributed disturbances. To avoid badly scaled target data or a singular covariance matrix, a regularization can be applied or a pseudo-inverse can be computed instead of the inverse, as discussed in Sec. 4.4.2 for the normal density-based classifiers. The advantage of the normal density is that the optimal threshold y for a given target rejection rate, rfn,can be computed, provided that target data are normally distributed. Following the same reasoning as for the distribution of square Euclidean distances, as discussed in Sec. 7.2, the square Mahalanobis distance D i f = (x - p)TCp'(x - p ) is X$-distributcd with r n degrees of freedom. The threshold y on D i I should be set at the specified target rejection rate rfnsuch as x k ( D i f )d ( D i f ) = rfn. Robust estimation of the parameters of a normal distribution IS discussed in [Rousseeuw and Leroy, 1987; Rousseeuw arid van Zomeren, 19901. In practice, a more flexible density model may be needed, which is a mixture of Gaussian models (MoG). This is a linear combination of normal distributions:
where aII are the mixing coefficients. Although this model has a smaller bias than the single normal distribution, it requires much more target examples for the estimation of the parameters. If K is defined beforehand,
One-class classafiers
343
the parameters of the individual components can efficiently be estimated by the EM algorithm [Bishop, 19951: see Appendix D.4 for more details. To reduce the number of free parameters, often diagonal covariance matrices are assumed. A further extension is hy the Parzen density estimation [Parzen, 19621. The estimated density becomes a mixture of, most often, Gaussian kerncls centered on individual training target points with identical diagonal covar iance matrices C = hI (more general covariance matrices can he used a5 well). Given N training target points { x ~ }one , has:
The parameter h is optimized by maxiniurn likelihood [Duin et (11.. 2004bI. Note that this density estimator is sensitive to scaling.
Reconstruction methods. Characteristics of the target data are niodeled by reconstruction methods; which make use of prior knowledge and make assumptions about the generating process of the data. Usiially, methods of' unsupervised learning are adapted to find clusters or extract feat,iirc:s to represent tlie data more efficiently. In these one-class classification rnethods, a set of prototypes or subspaces is used and a reconstruction crror is defined. An OCC is determined by minimizing the chosen reconstriiction error. The basic assumption is that outliers do riot follow tlie t,arget distribution? and, as a result, they should yield a high reconstruction error'. The reconstruction error (the proximity function) of a test object is a sort of a distance to the target set, either to specific prototypes or to a subspace. The simplest method is the k-means clustering [MacQueen, 19671 wlierc the target data are characterized by a few prototypes p j . Tlicse arc created as mean vectors of a set of vector in local neighborhoods, measmed by the square Euclidean distance. The placing of p j is then optimizcd by minimizing the following reconstruction error: ~
The k-means clustering can be solved by thc EM-algorithm [Bishop, 19951: see also Sec. 7.1.1. Another possibility is to use a self-organizing map (SOM) [Kobonen. 20001. Then, the prototypes are not, only optimized with resptct to the
344
T h e dissimilarity representation f o r p a t t e r n recognition
target data, but they are also constrained to a low-dimensional, usually two- or three-dimensional manifold with some regular grid. This is optimized, often by the use of neighborhood function decreasing with time, such that the neighborhood relations between the vectors and prototypes on this manifold, are also preserved in the original feature space. Note that this may be suboptimal if this low-dimensional manifold does not fit the data well. In both cases above, the proximity function is based on the square Euclidean distance h;irn(x) = min /Ix- pill 2 . I
Concerning feature extractors, a Principal Component Analysis (PCA) can be used to estimate a linear subspace. The PCA mapping finds this subspace such that the variance in the data is preserved as well as possible in the square error sense; see also Appendix D.4.3. Without loss of generality, we assume that the mean vector of target data lies at the origin. The eigenvectors of the estimated covariance matrix are determined and the ones corresponding to the largest eigenvalues are chosen as orthonormal vectors. They become the principal axes, which point in the directions of the largest variances. A linear K-dimensional subspace is then defined by a K x m matrix Q K consisting of K most significant eigenvectors, where K is chosen to explain a specified fraction of the total variance in the target class. The projection of x onto this subspace is computed as Qx; and the reconstruction of x is estimated as Q-Qx = Q'(QQ')-'Qx = QTQx, where Q- is the pseudo-inverse of Q and QQT = I , since the eigenvectors are orthonormal. The reconstruction error of a test object x is then
EPCA(x;Q, K ) = h;,CA(x; &) = I / x
-
QTQxlI2
(8.10)
This OCC method performs well when a clear linear subspace is present, even if a small target sample is only available. However, when the target class lies in a nonlinear subspace or when is distributed in separate subspaces (or clusters), the PCA will produce an intermediate subspace which may be a poor representation of the target class. Note that the PCA is incapable of incorporating known outliers into the training, as it focusses on the variance of the target set only. The PCA can be extended to a mixture of (probabilistic) principal component analyzers, formulated in a probabilistic context by [Tipping and Bishop. 19991; see Appendix D.4.4 for more details. Basically, by introducing several (orthonormal) principal component bases Q k , a mixture of Gaussians is applied, where each density is estimated in one of such PCA subspaces. This leads to covariance matrices of the form C I , = a21+ QLQk.
One-class classifiers
345
The proximity of an object x to the total mixture model is defined as: K
MPCA LT (x;bJ. QJ>,"=,,
c2, K)=
hET(x;{ P ~2, 1
+ QjQ,)).(8.11)
j=1
The free parameters can be estimated by the EM algorithm. which minimizes the log-likelihood of the training target set [Tipping and Bishop, 19991. Autoencoders [Japkowicz et al., 19951 or diabolo networks are neural network approaches to learn a representation of the data. They are trained to reproduce the input vectors at their output layer. The difference between them is in the number of hidden layers and their sizes. Autoencoders have one hidden layer with a large number of hidden units, while diabolo networks have three or more hidden layers with nonlinear transfer functions. where the 'middle' hidden layer contains a very low number of hidden units. It is known that autoencoders give a PCA type of solution [Tax, 20011. Both networks are trained by minimizing the mean square error. The reconstruction error is the square Euclidean distance between thc original and output (reconstructed) vector:
Boundary methods. These methods find ways of determining a closed boundary around the target set. Usually, they lead to a description with a minimal volume. In most cases, they rely on (Euclidean) weighted distances to a subset of the training target objects. As they directly focus on the boundary, the threshold on the proximity function is also directly obtained. Some of these methods will be described in Sec. 8.2.1. Here, we will only introduce the support vector data description (SVDD). The SVDD was proposed in [Tax and Duin, 1999, 20041. Let a target set be represented by N vectors in a n-dimensional feature space, i.e. {x, E R",i = 1 , 2 , . . . , N } . A hypersphere of the minimal volume is sought which encloses the target points. Suppose this hypersphere is defined by the center a E R n and the radius R. To accommodate outliers in the training set, the distance from xi to the center a must not be strictly smaller than R2, but larger distances should be penalized. This is done by the use of slack variables & , which measure the distance to the boundary. The tradeoff between the volume of the hypersphere and the fraction of outliers is controlled by an extra parameter C. The task becomes now to minimize both the radius of the hypersphere (and indirectly the volume) and the
346
T h e dissimilarity representation for p a t t e r n recognztion
distance from the outliers to the boundary L ( R ,C , & ) = R2 requiring that (almost) all the data lie inside the hypersphere: R2+CCiEi
Min.
s.t.
+ CCi[i,
lIxi
-
all2 5 R2 +ti, i = 1 , 2 , . . . , N .
(8.13)
(8.14)
The above constrains can be incorporated into L by applying Lagrange multipliers Q? arid optimizing the resulting Lagrarigian [Bishop, 19951. Then, the center a is found to be a linear combination of the data vectors xi, a = Cjaixj such that Cia, = 1 and 0 5 c i i 5 C , for all i. As many ( 2 , become in practice zero, the vectors x, corresponding to the positive cri become then support vectors (SVs). They appear to lie on the boundary of the hypcrsphere. The radius R is determined by computing the distance from tlie center a to any support vector. Then, the SVDD becomes: CSVDD(X)
=z(IIx-aII2 5 R
~ )
More flexible descriptions can be introduced analogously to the SVM method [Vapnik, 19981; see also Sec. 4.4.2.All the inner products (x,y) above can be replaced by a suitable kernel function K ( x ,y ) . Especially, the Gaussian kernel K(x.y ) = e x p ( - w ) provides a good transformation [Tax, 20011. Such a Gaussian kernel contains a complexity parameter CJ, which influences also tightness of the boundary. For a small C J , the SVDD resembles a Parzcn density estimator, while for a large CJ, the original hypcrsphere is obtained [Tax and Duin, 1999, 20041. CT can be selected by setting thc maximally allowed rejection rate rfnon the target set. If a new variable v = is defined, it describes an upper bound for the fraction of target vectors outside the description [Scholkopf e t al., 2000bl. Tlic width parameter o is optimized for a specified target rejection rate.
&
8.2
Domain descriptors for dissimilarity representations
Having introduced the one-class classification problem, we can now focus on dissimilarity representations. In general, we will assume that reflectivity and positivity conditions are fulfilled for the dissiniilarity measure d. (Note that in conceptual representations, Sec. 4.2, one can also use negative dissimilarities.) Although for convenience the symmetry requirement
One-class classz5ers
347
is adopted (it is required for an embedding of dissimilarity data into a pseudo-Euclidean space), it is not necessary for constructing OCCs in pretopological and dissimilarity spaces. We do not require that the measure a! is metric. Remember that a dissimilarity representation D ( T ,R ) based on the representation set R = { p l , p 2 , . . . , p n } and the training set T can be now interpreted in three ways. In the pretopological approach, one-class classifiers will be defined by the dissimilarities to neighboring objects. In the embedding approach, where R C T and the symmetry condition holds, OCCs will be built in embedded pseudo-Euclidean spaces. In the dissimilarity space approach, classifiers will be constructed in n-dimensional dissimilarity spaces D ( . ,R ) . Unless stated otherwise, both R and T consist of the target objects only.
Concave transformations of dissimilarities. Transformations of dissimilarities play a two-fold role. If a measure is unbounded, then some atypical objects of the target class (i.e. with large dissimilarities) may badly influence the solution of a one-class classifier. Therefore, a transformation to a bounded interval is useful. For instance, a transformation to [O, 11 can be used, such that the dissimilarities are scaled linearly in local neighhorhoods and, globally, all large dissimilarities become close to 1. Another issue is to impose an extra flexibility (complexity) of a model description by a nonlinear transformation equipped with a parameter to be tuned. The purpose of such a transformation is e.g. to enhance the compactness (expressed by small dissimilarities) of local neighborhoods by setting a proper parameter value. To determine a suitable value is not trivial; usually an additional validation set, possibly containing outlier examples, should be used or the inconsistency criterion applied, as proposed in [Tax and Miiller, 20041. Transformations that we have in mind are non-decreasing concave functions since they preserve the order of original dissimilarities. The concavity ensures that metric properties are preserved; see Theorem 3.7. Examples of such transformations are the following functions3 (defined on : )XE linear, f(z)= ax, power, f ( x ) = z p , where p E [0,1], logarithmic, f ( x ) = 2 - 1 or f ( x ) = log(1 a x ) , or sigmoid f(x) = 2 -1,
+
l+exp{- >}
where s controls the ‘slope’ o f f (the size of the local neighborhoods). Such transformations are applied in an element-wise way to dissimilarity representations such their transformed versions f ( D ( z R)) , are obtained. 31t is straightforward to check their monotonicity and concavity. The latter is guaranteed by non-positive second derivative [Fichtenholz, 19971.
348
T h e dissimilarity representation for p a t t e r n recognition
Below we will introduce one-class classifiers constructed in three different frameworks. as discussed above. Their behavior will be illustrated on an artificial example of a theoretical banana target class described by a Euclidean distance representation D ( R ,R ) and its sigmoidal transformations. The following notation will be used: P ( z ,R ) = 2 / ( l +exp{--}) D ( x , R ) - 1 (sigmoidal-I) and Db2(x,R ) = 2 / ( 1 8.2.1
+ e x p {D-( zTR)’} )
-
1 (sigmoidal-11).
Neighborhood-based OCCs
We will consider domain descriptors built by the use of neighborhood relations to the representation objects. Such objects are chosen as the ones that have relat,ively many close neighbors, as judged by their dissimilarities. The classifiers introduced here will rely on the k-centers algorithm or ont the dissimilarities to the k-nearest neighbors [Tax, 2001, 20031. The k-centers data description relies on the dissimilarities to k center objects only. Remember that the k-center method is originally a clustering technique. It looks for k center objects, i.c. examples ~ ( ~ ~ 1( 2, 1. ,. . , p ( k ) that rriiriiniize the rnaxiniuni of the dissimilarities over all the objects to their nearest neighbors. Starting from a random initialization, the error E = Iriaxi,l,,,,,~ min, D ( t i , p ( , ) ) is minimized in a forward search strategy. With M trials, such as M = 50, the objects corresponding to the minimal valuc of E are determined; see also Sec. 7.1.2. Given a dissimilarit,y representation D ( T ,R), the k-centers R, = >p(k. are found among the representation objects from R. Given N training objects t i , the dissimilarities to their nearest centers are computed as d,,(t,, Rcent)= niinz=l,,..,kd ( t t , q Z ) ) . The threshold y is determined as the (1-rthr)-th percentile of the sorted sequence of d,, where r t h r is a specified fraction, such as r t h r = 0.1. Alternatively, a suitable y can be sought such that a specified false negative ratio, qn= 0.05, i s reached. The classifier C~--CDDis then defined as
In the k-ncarcst neighbor data description (k-NNDD), the proximity fimction relies on the nearest neighbor dissimilarities. Given N target training objects t,, the average k-nearest neighbor dissimilarities are derived as drtrL(t7. R) = d ( t , , p f ) , where p: E R is the j - t h nearest neighbor of t,. The threshold y is determined as the (l-rthr)-th percentile of the sorted sequence of d n n . Similarly, as above, y can be first found to ensure the
I:=,
349
One-class classifiers
k-CDD
k-NNDD
I
I
Figure 8.3 Neighborhood-based OCCs for a Euclidean distance representation D of a theoretical banana class. The plots on the left and right sides show the OCCs with the thresholds Tthr = 0 and Tthr = 0.1, respectively. Remember that rthr is a threshold on the derived conceptual proximity values and not the false negative ratio. That is why for Tthr = 0.1, one cannot expect 10% of points to be outside the class boundary. The legends refer to various choices of k .
that the fraction of rcjected target examples is qn.The classifier C~-NNDD becomes then
where p.', E R is the j - t h nearest neighbor of ic. So, Ck-NNDD relies on the dissirnilarities to all objects from R. Note that as non-decreasing and concave transformations preserve the order of dissimilarities, they will hardly change the OCCs with respect to the ones built on the original dissimilarities. Note that the proximity function can also be used to define not one: but many thresholds, e.g. thresholds either for each center (the k-CDD) or for each object (the k-NNDD). This, however: requires a lot of dat,a to guarantee a reasonable estimation. These one-class classifiers are built using the target information only. It is not clear to us how potential outliers can be incorporated to define the boundary. In general, if IRI < JTI.then the OCCs defined on D ( T , R ) will be denoted by their reduced versions,
350
T h e dissimilarity representation f o r p a t t e r n recognition
i.e. the k-nearest neighbor reduced data description and k-centers reduced data description, the k-NNRDD and k-CRDD, respectively. To give an example, neighborhood-based OCCs are trained on a Euclidcan distance representation D of the theoretical banana class. Since the data are two-dimensional, the boundaries of the derived classifiers can be drawn. The results are presented in Fig. 8.3. The ‘bubble’-like character of the I;.-CDD is caused by the Euclidean balls (containing neighboring objects) around the centers. One can easily see that the value of k influences the boundary a lot; it is a complexity parameter of the k-CDD. On the other liarid, although the boundary of the k-NNDD relies on the averaged k-NN distance. it is estimated by using all training objects. As a result, it boundary becomes smoother. 8.2.2
Generalized mean class descriptor
One of the simplest way to describe a class relies on the proximity to the ’average, representative. If objects are described as vectors in a feature space, then the mean vector plays such a role. In Sec. 4.5, wc discussed that a proximity of an object to the average representative can also be formulated when only a dissimilarity representation is given. This leads t o the generalized nearest mean. Assume R represents the target class W T . Any symmetric dissimilarity matrix D ( R , R ) can be interpreted as a distance matrix of an underlying, pseudo-Euclidean space I such that the pseudo-Euclidean distances are preserved. Assume that x results from the projection of D ( z ,R ) onto E . We know from Sec. 4.5 that the proximity function h U T ( D ( zR, ) , w T )= IlIx - x&I/z/ can equivalently be computed by the use of dissimilarities
cy=i
& cr=~ c,”=i
as h u T ( D ( X , R ) , W T ) = ; 1 d 2 ( z , p i )d 2 ( P i > p j ) 1 . To construct a one-class classifier based on this principle, the threshold y can be chosen as the (1- rthr)-th percentile of the sorted sequence of the huT(D(ti,R), W T ) values. The generalized mean-class data description (GMDD) is defined as
(8.18)
& lTD*’l( I 7).
R )T(1; ) lTD*’(z,R) Alternatively, C G M D D ( D ( X , = Sirice this inequality implicitly corresponds to I I Ix,-X& 1;
I
5 y,this one-
One-class classzfiers
351
class classifier accepts objects as targets if they lie inside a pseudo-Euclidem hypersphere with the radius of y. We assume that R = T holds, however R might consist of a fixed, small subset of the objects from T . The objects in R should represent the dissimilarity information on the original objects such that the pseudo-Euclidean mean defined by R lies close to the original mean. In our proposal, the selection of R relies on Eq. (4.31), as discussed in Sec. 4.5. There, we showed that given two classes, the difference between the average between-class square dissimilarities and the average within-class square dissimilarities approximates the square pseudo-Euclidean distance between the two class nieans in the embedded space. This knowledge can be used as follows. Suppose R is a raridoni subset of T . Consider two classes: one defined on I2 and the other on T . Now, the square pseudo-distance between the two class means can be approximated using Eq. (4.31). Hence, we can proceed with random choices of R and finally choosing the one which offerers the sniallest difference t o the mean defined on the complete set. This is computed fast, so e.g. N = 100 of possible sets R can be considered. This selection will be called the mmm-resemblance. Some flexibility can be gained by nonlinear transformations of the dissimilarities. As an example, the GMDD is trained on a Euclidean distance representation D of the 2D banana class, as well as its sigmoidal transformatioris D" and 0"'. The boundaries of the resulting one-class classifiers are shown in Fig. 8.4. Since the original representation is Euclidean, then the GMDD on D yields a spherical description (the boundary is defined by the square distance to the mean of the target class), as observed in the figure. The parameters of sigmoidal transformations are not optimized: they were chosen in relation to local neighborhoods. Depending on the parameter, the resulting boundary may become either tighter or wider. Generalized weighted mean class descriptor. Since the GMDD relies on all objects of the set R , a natural extension is to define a similar classifier, but based on a few objects only. This leads to the concept of a weighted mean in a pseudo-Euclidean space E l ?To = /?$xi, where all pi are nonnegativc arid C:=,/Ti = 1. Ideally, the flZshoiild he selccted such that many of them are zero and only a few of them are positive. This would imply a sparse formulation based on the dissimilarities to the nonzero objects only. Remember that R C T and D ( R ,R ) is assumed t,o be syrrimet,ric. A one-class classifier in the embedded I can be now desigiied based
The dassimalaraty representation j o r pattern recognition
352
D'; ~ = d "
I
050
Figure 8.4 Generalized mean class descriptor (GMDD) for dissimilarity representation ll (orig) and its signioidal transformations: D s and Ds2 of a theoretical banana class. s is a parameter of such transformations. The legend describes various choices of s, i.e. s = 0.5m, m, 20, where o is defined for the original distances D either as the averaged m - n e a r e s t neighbor distance (d") or t h e standard deviation ( d s l d ) . The threshold r t h r = 0.1 has been used.
on the square distance to the weighted mean of the target class. Let X = {XI, x 2 , . . . , xn} be a vector representation in & resulting from the embedding of D ( R ,R ) . The remaining T \ R objects are then projected to E . Let Xg be the weighted mean vector of all objects. The classifier is now described as a pseudo-Euclidean hypersphere with the center placed a t the weighted mean and the radius f i . This leads to the proximity fimction huT(D(z, R).wT)= J J J-xZ$JI:J. Similarly as above, such a proximity can be equivalently expressed by the dissimilarities only 1 n as L , ( D ( ~ , R ) , w T=) I Ckkl P d 2 ( z : , P i )- z C,=I C,"=IP d j d 2 ( P i , P j ) I = IpTD*2(x,R)- z1 pT p 2 (R,R)PI.This formulation can be derived analogously as in Sec. 4.5 by using Xg instead of X. Such a formulation is similar to the support vector data description (SVDD), described in Sec. 8.1.2. The SVDD is reformulated for kernels, hence positive definite similarity representations, while we focus here on general dissimilarities. The question now arises how the weights pi should be found. A logical approach is to determine such that the pseudo-hypersphere has a minimal (positive) radius, hence the square pseudo-Euclidean distances to the wcighted mean are minimized (in the pseudo-Euclidean sense). Training objects p , (T = R ) can be then forced to lie inside the pseudo-Euclidean hypersphere. Nonnegative slack variables ti, accounting for possible errors, are introduced to obtain a tighter boundary: niin s.t.
One-class classifiers
353
N
where u E (0,1]is a user-specified parameter. The idea of using C7=l
=Z(I
C P D * ~ ( ~ , P -+I~ )
5 Y),
(8.20)
P h f o
where dp = ;PTD*2(R,R)P. As before, concave transforniations car1 be applied to the dissimilarities to add an extra flexibility. We presented a simple one-class classifier in a pseudo-Euclidean space. Other methods can be obtained by suitable adaptations of tlie class descriptors described in Sec. 8.1.2. These are Gaussian and Gaussian mixture models. a PCA subspace model or a mixture of probabilistic PCA model.
8.2.3
Linear programming dissimilarit3 data description
If the representation set R contains target objects, then the objects I): with large dissimilarities D ( z , R ) are considered as outliers and should be remote from the origin in the dissimilarity space. This characteristic is used for designing a one-class classifier [Pqkalska et al., 20031. If the dissimilarity measure d is metric, then all vectors D ( z ,R ) , lie in a prism, bounded from below by a hyperplane on which the representation objects lie and boiindetl
354
The dissimilarity representation for pattern recognition
Figure 8.5 Illustrations of the LPDD (left) and the LPDD-I1 (right). The dashed lines indicate the boundary of the area which contains the genuine objects if the measure is metric. The LPDD tries to minimize the max-norm distance from the bounding hyperplane to the origin, while the LPDD-I1 tries t o attract the hyperplane towards the average of the distribution. The LPDD-II is defined below.
from above by tlie largest dissimilarity. We will assume that d is bounded, otherwise it can be scaled to be such; see also Fig. 4.12 for an illustration of this. Consider a dissimilarity representation D ( T ,R ) , where R is the representation set R = { p l , p 2 , . . . ; p n } and T = {tl,t a , . . . , t N } is a set of objects. Let H be a hyperplane in R";i.e. H = {x E R" : wTx = p, w # 0 E R",p E R} and let x E R" be any point outside H . x can be projected onto H by using an arbitrary norm .tP,p 2 1 [Mangasarian, 19991. Then, tlie distance between x and the hyperplane H or, in fact, the distance between x and its projection XH onto H , is measured by the dual norm eq such that q satisfies = 1. The &-distance of x to H is defined as
+
IWTX-LJ
d , ( X l H ) = IIX - XHlIrl = llwl~T, Target objects are represented as points in a non-negative part of' a dissimilarity space. As a result, the points are naturally bounded by the hyperplanes of tlie form zi = 0, for z = [zl, z z ? .. . , znIT€ Rn. To describe boundaries of' thc target class, one could minimize the volume of a prism; cut by a hyperplane H : wTD(z,R ) = p; see Fig. 8.5 for an illustration. (Note that in general H is riot expected to be parallel to the prism's bottom hyperplane) In this case, non-negative dissimilarities impose both p 2 0 and 'UJ; 2 0. However, this task might be infeasible. A natural extension is to minimize the volume of a simplex with the main vertex coinciding with the origin of a dissimilarity space and the other vertices, say vj, resulting from the intersection of H and the axes of the dissimilarity space. Note that v j is a vector of all zero elements except for t ~ j i= p / w z ,provided that wi #O. '
One-class classzfiers
355
Assume now that there are n < N non-zero weights of the hyperplane H , such that, effectively, H is constructed in RTL.From geometry we know that the volume V of such a simplex can be expresscd as (Vbase/r!). (p/[lwl\z). where Vbase is the volunie of tlie base, defined by the vertices vj. The minimization of h = p/llw/l~,i.e. tlie Euclidean distance from the origin to the hyperplane H is then related to the minimiza,tiori of V . Let D ( T ,R ) be a dissimilarity representation bounded by tlie hyperplane H , i.e. , wTD(ti,R ) p for i = 1 , 2 , . . . , n, such that the &distance to the origin d , ( Q , H ) = p/IIwJlp is minimal (remember that y satisfics = 1 for p 2 1, since tq and -4, are the dual norms) [Mangasarian, 19991. This means that the hyperplane H can be determined by minimizing p - / ( w ( l PHowever, . in order to avoid any arbitrary scaling of w, wc may require that lJwJl,= 1. As a result the construction of H can kit. solved by the minimization of p only. The mathematical programming formulation of such a problem is [Mangasarian, 1999; Bennett and Mangasarian, 19991:
<
$+$
rriin p s.t. WTD(ti,R ) 5 p ,
liw/lp = 1, p
i = 1 , 2 , .., N ,
(8.21)
2 0.
From the algorithmic point of view, by assuming (lw(lp= 1, one rcqiiires that the dissimilarities are bounded by small values, siicli as 1 or 10, for instance; otherwise p should become very large t o fiilfill the constraints, which niay lead to an unbounded minimization problem. If p = 2, then the hyperplanc H is det,erniinetl by minimizing /!,,the Euclidean distance from the origin to H . Such an minimization problem results in a quadratic optimization. A simpler, linear programming (LP) formulation is of interest to us. This can be realized for p = 1. Knowing 1 1. simple that /(w/Iz5 llwlll 5 fillwliz and by assuming that ( ( w / [= calculations lead to p < h = p//lwli2 < f i p . Therefore, by minimizing d , ( O , N ) = p (and requiring that / J w / J = 1 1): h will be hoiinded, (for a, fixed and small R, the minimization of p bounds h) arid, therefore. the volume of the considered simplex, as well. By the above reasoning and Eq. (8.21), a class represented by dissirnilarities can be characterized by a linear proximity function with the weights U J and ~ the threshold p. Such a hyperplane simply ‘piishes’ the objects in the directioii of the origin in the dissimilarity space. Our one-class classifier,
356
T h e dissimilarity representation for pattern recognition
the Linear Programming Dissimilarity-data Description C ~ D is D defined as: (8.22)
The proximity function is found as the solution t o a soft-margin forniulation*, which is a straightforward extension of the hard-margin case (by neglecting the slack variables), as:
(8.23)
wi>_O, p > O , < i > O ,
i=1,2
,.., N
where are the slack variables, allowing objects to lie above the hyperplane, i.e. accommodating some targets as outliers. In the L P formulations, sparse solutions are obtained, meaning that only some weights w j are positive. Objects corresponding to such non-zero weights, will be called support objects (SO). (Note that they cannot be called support vectors, since they directly refer to the objects and not to their vector representations.) These support objects construct the effective representation set Re, JR,I = r and, as a result, test objects need to be evaluated by computing dissimilarities to objects from R, only. The left plot of Fig. 8.5 is a two-dimensional pictorial illustration of the LPDD. The data are represented in a metric dissimilarity space, and by the triangle inequality, the dissimilarities can only lie inside the prism indicated by the dashed lines. The LPDD boundary is given by the hyperplane, as close to the origin as possible in terms of the &-distance (determined by the minimization of p ) , while still accepting (most) target objects5. The outlicrs should be remote from the origin. Proposition 8.1 Consider Eq. (8.23) for D ( T , T ) . T h e n , v E (0,1] is the upper bo,imd o n the outlier fraction f o r the target class, i e . the fraction of objects that lie outside the boundary; see also [Scholkopf et al., 2001, 200Ub]. This means that CE1(l- C L P D D ( D ( ~ 5 ~ ,v. T)) We abuse here somewhat the soft-margin and hard-margin formulation from the support vector machine research e.g. [Scholkopf et al., 2001, 2000b; Tax and Duin, 20041, where it can be proved that v is an upper bound for the error on the target class. 5This picture might be misleading. In a vector space Rn,all representation objects lic in ( n- 1)-dimensional subspaces determined by all but one basis axes. E.g. in a 3D, the objects from R are placed on the zy-, zz- and yz-planes.
Figure 8.6 Linear programming data description for a dissimilarity representation D (orig) and its sigmoidal transformations D^s and D^{s2} for a theoretical banana class. Various boundaries are shown depending on the choice of s: s = 0.5σ, σ, 2σ, where σ is defined for the original distances D either as the averaged nearest neighbor distance (d^{nn}) or as the standard deviation (d^{std}). The number of support objects varies from 2-3 for the LPDD based on the original distances (marked by a dash-dotted line), to 6 for s = 2σ and up to 18 for s = 0.5σ. The upper plots refer to ν = 0 and the bottom plots to ν = 0.1.
Sketch of proof. The proof goes analogously to the proofs given in [Scholkopf et al., 2001, 2000b]. Intuitively, these proofs follow this reasoning: assume we have found a solution of Eq. (8.23). If ρ is increased slightly, the term Σ_i ξ_i in the objective function will change proportionally to the number of points that have non-zero ξ_i (i.e. the rejected target objects). At the optimum of Eq. (8.23), therefore, it has to hold that νN ≥ the number of outliers.
As before, nonlinear transformations of the dissimilarities can be used. The LPDD is trained on a Euclidean distance representation D of the 2D banana class, as well as on its sigmoidal transformations D^s and D^{s2}. The OCCs' boundaries are presented in Fig. 8.6. As can be observed, such an LP formulation offers flexible descriptions of the data boundary depending on the transformation.
Using outlier information. The LPDD can straightforwardly be extended to handle example outliers. This means that T will contain some outlier examples. The representation set R can also contain outliers. If the problem describes the targets against 'pure' non-targets (healthy
versus diseased people), we think that the instances of R should belong to the target class; otherwise the outliers from R may become support objects, hence objects which determine the decision. This point, however, needs to be investigated further. In the LPDD, the determined hyperplane in a dissimilarity space is attracted towards the origin and the objects are placed in the half-space below this hyperplane. In fact, they lie in a simplex with the main vertex coinciding with the origin and the other vertices resulting from the intersection of the hyperplane and the axes of this dissimilarity space. This is described by the constraint w^T D(t_i, R) ≤ ρ + ξ_i, where the ξ_i ≥ 0 account for possible errors, such that some targets can be found above the hyperplane. This is the place where outliers should lie. If some outliers are accepted as targets, this will lead to w^T D(t_i, R) ≥ ρ - ξ_i, assuming that the t_i are now outlier examples. An additional variable y_i ∈ {+1, -1} will denote targets by 1 and outliers by -1. Eq. (8.23) then remains the same, except that the constraint changes to y_i (w^T D(t_i, R)) ≤ y_i ρ + ξ_i. This constraint simply forces the known outliers (y_i = -1) to be placed in the correct half-space of the hyperplane.
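A minimal sketch (ours; fit_lpdd_labeled is an assumed name) of this labeled extension, reusing the linear-programming setup from the earlier LPDD sketch:

```python
# Sketch (ours): labeled-outlier extension of the LPDD. The constraint becomes
# y_i * w^T D(t_i, R) <= y_i * rho + xi_i, with y_i = +1 for targets, -1 for outliers.
import numpy as np
from scipy.optimize import linprog

def fit_lpdd_labeled(D, y, nu=0.1):
    N, n = D.shape
    y = np.asarray(y, dtype=float)            # +1 targets, -1 known outliers
    c = np.concatenate([np.zeros(n), [1.0], np.full(N, 1.0 / (nu * N))])
    # y_i * (w^T D(t_i, R) - rho) - xi_i <= 0
    A_ub = np.hstack([y[:, None] * D, -y[:, None], -np.eye(N)])
    b_ub = np.zeros(N)
    A_eq = np.concatenate([np.ones(n), [0.0], np.zeros(N)])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=np.array([1.0]),
                  bounds=[(0, None)] * (n + 1 + N), method="highs")
    return res.x[:n], res.x[n]
```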
Linear programming dissimilarity data description II. A linear programming formulation for novelty detection has also been proposed in [Campbell and Bennett, 2000]. The reasoning there starts from a feature space in the spirit of positive definite kernels K(S,S) based on a vector representation S = {x_1, ..., x_N}. The authors restricted themselves to (modified) RBF kernels, i.e. K(x_i, x_j) = exp{ -D(x_i, x_j)^2 / (2s^2) }, where D is either the Euclidean or the ℓ_1-distance. In principle, we will refer to RBF_p as the 'Gaussian' kernel based on the ℓ_p-distance. Here, to be consistent with our LPDD method, we rewrite their soft-margin LP formulation (a hard-margin formulation is then obvious by neglecting the slack variables) to include a trade-off parameter ν as follows:
min  (1/N) Σ_{i=1}^{N} ( w^T K(x_i, S) + ρ ) + (1/(νN)) Σ_{i=1}^{N} ξ_i
s.t.  w^T K(x_i, S) + ρ ≥ -ξ_i,   i = 1, 2, ..., N,
      w^T 1 = 1,
      w_i ≥ 0,  ξ_i ≥ 0,   i = 1, 2, ..., N.                        (8.24)
Unfortunately, ρ now lacks the interpretation given in the LPDD case. ν is a trade-off parameter relating different quantities, i.e. weighting the error contributions and the average classifier output. From our point of
view, K can be any similarity representation, moreover not necessarily a square one (in the same way as the LPDD is defined for general dissimilarity representations). So, for simplicity, we can denote this method as the Linear Programming Similarity-data Description (LPSD). Following the principles described above, one can consider an equivalent formulation for the LPDD. Including also the information on possible outliers (y_i = -1), the soft-margin LPDD-II (a hard-margin LPDD-II is then obvious) is obtained analogously, with the kernel values K(x_i, S) replaced by the dissimilarities D(t_i, R) and the constraints multiplied by the labels y_i.
Similarly to the LPDD, a sparse solution is obtained. Hence, also here the objects corresponding to non-zero weights are the support objects. The C_LPDD-II is defined identically to the C_LPDD in Eq. (8.22); the difference lies in the way the weights w are found during training. If only target objects are given, the hyperplane is determined such that its averaged output is attracted towards the origin. Hence, such a formulation may lead either to a narrow description of the target class (narrower than the LPDD) for a compact class, or to a wide description of the target class when there are examples lying further away than the main bulk of the data. See Fig. 8.5 for an illustration of such a case. However, when outlier examples are present, this might be advantageous, since the outliers influence the average output and, as a result, a hyperplane 'balanced' in-between the targets and outliers can be determined. Here, to be consistent with our dissimilarity approaches, we will focus on the dissimilarity-based OCCs. For remarks concerning the LPSD, see [Pekalska et al., 2003]. Only LP classifiers were presented here. All other methods described in Sec. 8.1.2 can be applied in dissimilarity spaces.
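For reference, the (modified) 'Gaussian' similarity consumed by the LPSD formulation above can be computed from any dissimilarity matrix; a minimal sketch (ours, with assumed names):

```python
# Sketch (ours): K(x_i, x_j) = exp(-D(x_i, x_j)^2 / (2 s^2)), built from any
# dissimilarity matrix D -- Euclidean, city block, or a general non-metric one.
import numpy as np

def rbf_similarity(D, s):
    D = np.asarray(D, dtype=float)
    return np.exp(-(D ** 2) / (2.0 * s ** 2))
```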
8.2.4 More issues on class descriptors
There are additional points to be discussed concerning the class descriptors, especially the LP classifiers. An important point refers to the support objects of the LPDD and the LPDD-II. There is an essential difference between such support objects and the similar concept of support vectors in the SVM terminology. Since we operate on dissimilarities, the chosen
Figure 8.7 One-class dissimilarity-based classifiers on the two-cluster data represented by the non-metric ℓ_{0.9}-distances D_{0.9}(T,T) and their sigmoidal transformation D^s_{0.9}. Neighborhood-based OCCs are designed on the original dissimilarities (since they are not influenced by sigmoidal transformations), while the other OCCs are built on the transformed versions. In the latter case, the slope parameter s takes one of the following values: s = 0.5d^{nn}, 0.8d^{nn}, d^{nn}, 1.5d^{nn}, 2d^{nn}, where d^{nn} is the averaged nearest neighbor distance. Although the boundaries of one-class LP classifiers are originally determined by hyperplanes in dissimilarity spaces, they are nonlinear in the input space due to the nonlinearity of the distance. Support objects are marked by squares. Note that the support objects of the LPDD-II tend to lie on the boundary, which is not necessarily true for the LPDD. Since the boundary of the k-CDD is determined by k objects, these are also marked by squares (see top row).
Figure 8.8 One-class dissimilarity-based classifiers for the uniform cloud with outliers. OCCs are built for the ℓ_1-distance representation D_1(T,T). Support objects are marked by squares. Note that the k-CDD may not be able to disregard outliers in the target class. Note also that the LPDD, determined by four support objects, defines a reasonably tight boundary around the cloud.
Figure 8.9 GMDD built on sigmoidal-I transformations of the ℓ_1-distance representation D_1, designed for the uniform cloud with outliers. Five objects are selected from T based on the mean-resemblance criterion (Sec. 8.2.2) for the representation set R ⊂ T; they are marked by squares. All training data points are assumed to be targets. The GMDD is trained with r_thr = 0.1 and is originally considered in embedded pseudo-Euclidean spaces defined by the transformed D^s_1(T,R). The plots show the resulting boundaries in the input space. From left to right, s takes the values 0.5d_{me}, d_{me}, 1.5d_{me}, 2d_{me} and 4d_{me}, where d_{me} is the average distance. 'err' in the plots refers to the effective error on the target set.
support objects are related to the dimensions of a dissimilarity space. The boundary of the OCCs is determined by a suitable hyperplane on which some objects are likely to lie. In general, such boundary objects are different from the support objects, although they may coincide. The boundary objects are in principle objects which are far away from the origin; hence they influence the hyperplane weights and thereby the support objects. This means that by removing support objects from the target class, the OCC's boundary may remain nearly unchanged (if there are other objects close to the removed support objects); however, by removing a far-away boundary object, other support objects can be chosen. This is in agreement with the SVM formulations, where support objects are also boundary objects. In general, all class descriptors mentioned here are able to uncover clusters in the data, as well as 'outliers' present in the target class, provided that proper parameters are specified. Consider two two-dimensional artificial data sets. The first set consists of two clusters of 15 points each. It will be denoted as the two-cluster data. These data are represented by a non-metric ℓ_{0.9}-distance. The second set contains one uniform, rectangular cluster, contaminated with three outlier points; in total, 50 points are given. It will be denoted as the uniform cloud with outliers data. For a dissimilarity representation, the ℓ_1-distance is used. (We have explicitly chosen non-Euclidean distances to show that any distance can work.) Now, the one-class classifiers are trained. Since the artificial data are two-dimensional, it is possible to draw the decision boundaries in the 2D input space, even if the OCCs are trained in (high-dimensional) dissimilarity spaces. Figures 8.7 - 8.11 show the boundaries of various dissimilarity-based class descriptors trained on square dissimilarity matrices D(T,T) or their sigmoidally transformed versions. In Fig. 8.7, the two-cluster data set and the decision boundaries of the trained OCCs are shown. The neighborhood-based OCCs are designed on the original dissimilarities D_{0.9}, since they are hardly influenced by concave non-decreasing transformations. The other OCCs are built on the sigmoidally transformed versions D^s_{0.9} and D^{s2}_{0.9}. Although the OCCs are trained on square dissimilarity representations, the LP classifiers offer sparse solutions by choosing a number of support objects, i.e. objects which determine the boundary and to which the dissimilarities must be computed in the testing stage. Basically, the number of support objects is related to the complexity of the boundary, which can be observed by comparing the leftmost and rightmost plots of the LPDD and the LPDD-II in Fig. 8.7. The k-CDD is also determined by a small number of objects, namely k objects, where k is
Figure 8.10 One-class soft-margin LP classifiers, trained with ν = 0.1, designed for the uniform cloud with outliers. All training data points are assumed to be targets. The data are described by the ℓ_1-distance representation D_1; here, however, its sigmoidal transformations are used. Ideally, the OCCs should disregard at most 10% of the points. The classifiers are determined in the dissimilarity spaces D^s_1 and D^{s2}_1. From left to right, s takes the values 0.5d_{me}, d_{me}, 1.5d_{me}, 2d_{me} and 4d_{me}, where d_{me} is the average distance. The plots show the resulting boundaries in the input space. 'err' in the plots refers to the effective error on the target set. Support objects are marked by squares; they belong to the OCCs.
Figure 8.11 One-class hard-margin LP classifiers designed for the uniform cloud with outliers. The data are described by the ℓ_1-distance representation D_1 and its sigmoidal transformation. Here, the three outliers are labeled as such in the training stage. As a result, hard-margin OCCs should reject them, as the target points lie in a relatively compact cloud. The classifiers are found in the dissimilarity spaces D^s_1. The plots show the resulting boundaries in the input space. From left to right, s takes the values 0.5d_{me}, d_{me}, 1.5d_{me}, 2d_{me} and 4d_{me}, where d_{me} is the average distance. 'err' in the plots refers to the effective error on the target set. Support objects are marked by squares. Both LP classifiers return the same support objects, hence the same boundaries. This follows since the target data form a compact cloud and the outliers are disregarded.
specified beforehand. (This is the difference with the LP classifiers, where the support objects are controlled by ν and result from the mathematical programming formulation.) On the contrary, the k-NNDD and the GMDD require all objects for the boundary construction. Two cases are considered for the artificial data with three outliers. First, all the data points are assumed to be targets and soft-margin LP classifiers with ν = 0.1 are trained. Then, these three outliers should possibly be ignored. This can be observed in some plots of Fig. 8.8 and Fig. 8.10, provided that a proper scaling parameter s of the sigmoidal transformations is used. Another possibility is to label the outliers appropriately and use them in the training of the LP classifiers (the other classifiers, the k-CDD, the k-NNDD and the GMDD, cannot directly incorporate such label information). So, in the training set, 47 points are labeled as targets and three points as outliers. Given that, it is sufficient to train hard-margin LP classifiers, since the remaining points form a compact cloud. The results are presented in Fig. 8.11. While the soft-margin LP OCCs trained on the targets seem to be highly influenced by the slope parameter s (Fig. 8.10), the hard-margin classifiers, trained by using also the outlier information, are much less so. When the LP classifiers are designed by treating all the points as targets, it is much harder for the LPDD-II to disregard the three outliers than for the LPDD; compare the boundaries of the LPDD and the LPDD-II in Fig. 8.8
Figure 8.12 One-class soft-margin LP classifiers designed for the uniform cloud with outliers. T contains 50 training points; five random points of T are assigned to the representation set R. The data are described by the ℓ_1-distance representation D_1(T,R); its sigmoidal transformations D^{s2}_1(T,R) are used for the OCCs. The classifiers are defined in dissimilarity spaces. The plots show the resulting boundaries in the input space. From left to right, s takes the values 0.5d_{me}, d_{me}, 1.5d_{me}, 2d_{me} and 4d_{me}, where d_{me} is the average distance. 'err' refers to the effective error on the target set. Support objects are marked by squares. Note that the support objects are examples of R, hence there can be at most five of them.
and Fig. 8.10. This is not surprising, since the boundary of the LPDD-II is determined by taking into account the averaged dissimilarity output, to which outliers significantly contribute. In such cases (where only the target data are provided, possibly containing 'outlier' examples), it seems more reasonable to use the LPDD. On the other hand, when outlier information is used for training, the LPDD-II might work better. In our case, however (Fig. 8.11), both the LPDD and the LPDD-II determine the same support objects, hence find the same boundary. They both provide a tight description around the uniform cloud and they do not seem to depend much on the parameter of the sigmoidal transformation. Additionally, Fig. 8.12 shows results for a rectangular dissimilarity representation D(T,R), in which just five points are randomly chosen from T for the set R. The results are obtained assuming that all the points constitute the target class. In such a case, the support objects come from R, so there can be at most five of them. Since the boundary relies on the dissimilarities to a few objects only, its flexibility is limited. The boundary changes only somewhat with growing parameter s of the sigmoidal
transformations; compare Fig. 8.12 with the two bottom rows of Fig. 8.10. It might therefore be useful to pre-select a representation set R smaller than the original training set T.
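Before moving to the real-data experiments, the sketch below (ours) illustrates how the two artificial data sets described above and their ℓ_p dissimilarity matrices can be generated; the exact cluster positions and spreads are assumptions, as the book does not specify them.

```python
# Sketch (ours): artificial data like the two sets above and their l_p dissimilarities.
import numpy as np

def lp_dissimilarity(X, Y, p):
    """Pairwise l_p 'distance' (sum_k |x_k - y_k|^p)^(1/p); non-metric for p < 1."""
    diff = np.abs(X[:, None, :] - Y[None, :, :])
    return (diff ** p).sum(axis=2) ** (1.0 / p)

rng = np.random.default_rng(0)
# two-cluster data: two clusters of 15 points each (positions are assumptions)
two_cluster = np.vstack([rng.normal([0, 0], 0.5, (15, 2)),
                         rng.normal([4, 4], 0.5, (15, 2))])
D09 = lp_dissimilarity(two_cluster, two_cluster, p=0.9)    # non-metric l_0.9

# uniform rectangular cloud (47 points) plus three outliers, 50 points in total
cloud = np.vstack([rng.uniform([0, 0], [4, 2], (47, 2)),
                   np.array([[8.0, 5.0], [-3.0, 4.0], [7.0, -2.0]])])
D1 = lp_dissimilarity(cloud, cloud, p=1)                    # city block l_1
```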
8.3 Experiments
In this section, we will present three experiments. They aim to study how the introduced one-class classifiers work in practice.
8.3.1 Experiment I: Condition monitoring
Fault detection is an important problem in machine diagnostics: failure to detect faults can lead to machine damage, while false alarms can lead to unnecessary expenses. As an example, we will consider the detection of four types of fault in ball-bearing cages, a data set from the Structural Integrity and Damage Assessment Network [Fault data, site]. Each data instance consists of 2048 samples of acceleration. After pre-processing with a discrete Fast Fourier Transform, each signal is characterized by 32 attributes, which is a sparse sampling. The data set consists of five categories: normal behavior, NB, described by measurements made on new ball-bearings, and four types of anomalies, A1-A4. See Appendix E.2 for further description. The experiments are performed in the same way as described in [Campbell and Bennett, 2000], making use of the same training set and independent validation and test sets; see Table 8.2.
Table 8.2 Fault detection data (training, validation and test sets; three of the sets contain 996 examples each).
Since there is no prior information available on suitable dissimilarity measures, three simple measures are used: Euclidean (ℓ_2-distance), city block (ℓ_1-distance) and the non-metric ℓ_{0.8}-distance. Hence, we analyze three different dissimilarity representations: D_{0.8}, D_1 and D_2. All these dissimilarity representations are linearly scaled, which basically corresponds to a change of 'unit' and is performed to bound large dissimilarities. Additionally, three different concave transformations are studied for each
representation. These are: a power function with parameter p, and the sigmoidal-I and sigmoidal-II transformations defined by the parameter s; see Sec. 8.2. Since neighborhood-based one-class classifiers are hardly influenced by such transformations, they are not applied there. One-class classifiers are built on (transformed) representations defined for the NB class, i.e. they are defined on D(T,R), where D stands now for a chosen dissimilarity representation, T is a training set of 913 examples of the NB class and R ⊆ T is a representation set. Two cases are considered: either R is equivalent to T (which means that an OCC is trained on a square dissimilarity matrix) or R consists of 20% of the target examples selected by the k-centers algorithm (which means that an OCC is trained on a rectangular dissimilarity matrix). The optimal values of either p (power transformation), s (sigmoidal transformation) or k (parameter of the k-CDD and k-NNDD) are determined on the validation set. This is a set of examples from the NB class and two outlier subclasses, A1 and A2. There is no unique way of determining a good parameter. In our case, we train one-class classifiers to reject not more than 1% of the target objects^6, so we automatically choose the parameter s (p or k) such that it corresponds to the smallest mean error averaged over the NB class and the two outlier subclasses on the validation set. If there are more parameters for which the same error is reached, the final parameter is chosen as their median value. In practice, one may wish to weight the error contributions depending on what is more costly, either a false alarm or a missed identification of a machine fault. One may also wish to keep the percentage of target rejection under a chosen value, which could lead to a different way of establishing the parameter value. Since we cannot decide which factors should be taken into account in such a decision, an automatic procedure based on the averaged error (computed both on the targets and outliers) seems appropriate to follow. To select a suitable s for the sigmoidal-I and sigmoidal-II transformations, s was considered in the range [0.1d_{avr}, 5d_{avr}], where d_{avr} is the average distance within the target class. k was selected as the best integer between 1 and 50, as judged on the validation set. The errors of the first kind (classifying targets as outliers, i.e. false alarms) and of the second kind (classifying outliers as targets, i.e. missed fault detections) for the OCCs built on the D_{0.8} and D_1 representations
^6 This statement is only approximately true. If ν = 0.01 for the LPDD, then 1% is the maximum error on the target class. This is, however, not guaranteed for the LPDD-II. In the case of other OCCs, a threshold r_thr is set to e.g. 0.01.
Figure 8.13 Eigenvalues in the embedding of D_{0.8}(T,T).
are shown in Tables 8.3 and 8.4. An important observation is that the ℓ_{0.8}-distance measure is more advantageous than both the metric ℓ_1- and ℓ_2-distances. The results for the Euclidean dissimilarity representations and their transformed versions are very bad (much worse than for the city block representation D_1), so only the results for D_{0.8} and D_1 are presented. Also, as expected, sigmoidal transformations offer more flexibility. Therefore, the performance of one-class classifiers built on the power transformations is not reported here, as the results are much worse for the D_1 and D_2 representations and somewhat worse for the D_{0.8} representation. Two factors are especially taken into account to analyze the results. These are the error on the target class, which should be kept small (possibly around 1%), and the error on the two outlier subclasses A3 and A4, which are novel to the classifiers used (the information on the two other outlier subclasses was used for setting up the parameters). The following conclusions can be drawn from our study and from Tables 8.3 and 8.4:
(1) In general, one-class classifiers perform significantly better on the D_{0.8} representation than on D_1 and, in turn, perform much better on D_1 than on D_2. The other general conclusion is that sigmoidal transformations offer more flexibility and contribute to better results of the GMDD and the LP classifiers than power transformations.
(2) The best overall performance is reached for the 1-NNDD on D_{0.8} trained with r_thr = 0.05; see Table 8.3. The 1-NNDD yields errors of 9.9% and 8.3% on the A3 and A4 outlier subclasses, respectively, while maintaining zero errors for the other outlier subclasses and an error of 1.4% for the NB class. The errors on D_1 increase to 12.1% and 9.8% for the subclasses A3 and A4 and to 1.6% for the normal class. In both cases, this performance is achieved based on the dissimilarities
Table 8.3 Ball-bearing data. Errors of the first and second kind of OCCs trained on D_{0.8}(T,R). The target class T describes normal behavior (NB); |T| = 913. R ⊂ T has 183 (20%) examples; R is selected by the k-centers algorithm, k = 183. R_e is the effective set of examples on which the OCCs rely. The optimal parameter s of the sigmoidal transformations or the optimal k for the k-CDD and k-NNDD is selected based on the performance on the validation set.
(Entries give, for OCCs trained on T×T and on T×R, the error on the NB target class and the errors on the outlier subclasses A1-A4. The blocks cover the k-CDD and k-NNDD on D_{0.8}, the GMDD on a sigmoidal-II transformation of D_{0.8}, the LPDD on sigmoidal-I and sigmoidal-II transformations of D_{0.8}, and the LPDD-II on a sigmoidal-I transformation of D_{0.8}.)
Table 8.4 Ball-bearing data. Errors of the first and second kind of various OCCs trained on D_1(T,R). The target class T describes normal behavior (NB); |T| = 913. R ⊂ T has 183 (20%) examples; R is selected by the k-centers algorithm, k = 183. R_e is the effective set of examples on which the OCCs rely. The optimal parameter s of the sigmoidal transformations or the optimal k for the k-CDD and k-NNDD is selected based on the performance on the validation set.
(Entries analogous to Table 8.3, for OCCs trained on the D_1 representation.)
to all 913 training examples. When the 1-NNDD is based on 183 objects only, its performance deteriorates, becoming worse than the one reached by the LP classifiers defined on fewer than 20 support objects. On the other hand, since the boundary determined by the 1-NNDD is very wide (see e.g. Fig. 8.3), a larger threshold (i.e. 0.05) on the proximity function should be used.
(3) The best LP performance is reached for the LPDD-II with ν = 0.01 trained on a sigmoidal-I transformation of D_{0.8}; see Table 8.3. The errors on the A3 and A4 outlier subclasses are 11.7% and 9.3%, respectively. The error on the NB class is 1.5%. Such results are obtained by using only 17 support objects. The best LPDD (keeping the target error small), based on 14 support objects, gives an error of 1.3% for the target class and errors of 15.8% and 13.6% for the above-mentioned outlier subclasses. Such performances are only somewhat worse than the results for the best 1-NNDD, while they are based on the dissimilarities to less than 2% of all training objects.
(4) Since our experiments are done in the same way as in [Campbell and Bennett, 2000], our results can be compared. In [Campbell and Bennett, 2000], a sparse linear programming formulation was proposed (from which our LPDD-II method is derived) for Gaussian kernels. The results on the test set reported there (and also re-created by us in [Pekalska et al., 2003]) are the following test errors: 1.3% for the NB class, 0% for A1, 46.7% for A2, 71.7% for A3 and 74.5% for A4. They are very bad in comparison to our LPDD results on a sigmoidal-I transformation of D_{0.8} or of D_1. We think that this is mainly caused by the use of the Euclidean distance, as the Gaussian kernel relies on it. This is supported by the facts that our LPDDs also perform badly on D_2 and that, when a radial basis function is defined on the city block distance (d_1), better results are obtained for the method in [Campbell and Bennett, 2000]; see [Pekalska et al., 2003].
(5) The LPDD may benefit from a representation set R smaller than the training set T. This can especially be observed for sigmoidal transformations of D_{0.8}, as shown in the upper rows of Table 8.3, where the test errors are about the same as or smaller than the ones obtained for complete dissimilarity representations. On the contrary, the LPDD-II determines only one support object and its performance deteriorates to about 90% error on the outlier class. A smaller R seems to be disadvantageous for the other OCCs as well.
(6) The GMDD and k-CDD do not perform well.
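For concreteness, the parameter-selection rule described in this experiment (scan s over [0.1 d_avr, 5 d_avr], keep the candidates minimizing the mean validation error over the NB class and the subclasses A1, A2, and take their median) can be sketched as follows; the function names and the grid size are our assumptions.

```python
# Sketch (ours) of the validation-based selection of the slope parameter s.
import numpy as np

def select_slope(candidates, error_fn):
    """error_fn(s) -> (err_NB, err_A1, err_A2) on the validation set (user-supplied)."""
    candidates = np.asarray(candidates, dtype=float)
    mean_errors = np.array([np.mean(error_fn(s)) for s in candidates])
    best = candidates[np.isclose(mean_errors, mean_errors.min())]
    return float(np.median(best))            # median of the tied best values

# d_avr = average dissimilarity within the target class (assumed precomputed)
# s_grid = np.linspace(0.1 * d_avr, 5 * d_avr, 50)
# s_opt = select_slope(s_grid, my_validation_errors)   # my_validation_errors: assumed
```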
Figure 8.14 2D linear approximate embeddings of the dissimilarity representations (left) or their sigmoidal-I transformations (right). Embedded spaces are defined on the training target objects D(R,R). The outliers from the validation sets are then projected there. Note the scale differences.
To explain better the differences among the one-class classifiers, we will focus on the data characteristics. As explained in Chapter 6, to understand the data, one may visualize their dissimilarity relations. Here, we simply find pseudo-Euclidean embeddings of each of the dissimilarity representations D_{0.8}(T,T), D_1(T,T) and D_2(T,T) (T is the training
target class) as well as of their best (in terms of the performance) sigmoidal-I transformations. The mappings are defined on the NB class and the remaining outliers from the validation set are projected afterwards. Approximate projections to 2D are shown in Fig. 8.14. Remember that these linear projections preserve, as much as can be revealed in two dimensions, the variance. The preserved variances are equal to 19.3%, 26.1% and 70.6% for D_{0.8}(T,T), D_1(T,T) and D_2(T,T), respectively. For the sigmoidal transformations they are somewhat smaller. Although the preserved variance is only large in the case of the Euclidean representation (70.6%), the first two eigenvalues of the embedding (which correspond to the variances) are significantly the largest in all cases; see e.g. Fig. 8.13, where the eigenvalues for D_{0.8}(T,T) are shown. Remember that for the D_2(T,T) representation, the resulting projection is equivalent to the PCA applied to data instances represented by their pre-processed 32 attributes, as discussed in Sec. 3.5.1. Although the projections of our dissimilarity data only roughly reflect the actual relations, they still allow us to get some insight. Analyzing the left plots in Fig. 8.14, one can immediately see that for the ℓ_{0.8}- and ℓ_1-distances, the target data (NB class) seem to form a rather compact cloud. The outliers are widely spread in-between and around the target class. So, the target class seems to lie among the outliers (and the overlap for the target class is very high). Judging visually, the ratio of the area of the target cloud to that of the outlier cloud is smaller for the D_{0.8} representation than for the D_1 representation. This simply suggests that the ℓ_{0.8}-distance offers better discrimination between the targets and outliers. On the contrary, for the Euclidean representation, the target cloud is very large in comparison to the outlier cloud. Hence, many outliers will likely be incorrectly assigned. Analyzing the right plots in Fig. 8.14, one can observe that they change the sizes of the target and outlier clouds and also shift their positions with respect to each other compared to the left plots. As a result, some parts become non-overlapping, and possibly better one-class classifiers can be built. Also the bad performances of the GMDD and the k-CDD can now be somewhat understood. The NB class seems to be a relatively compact cloud. As the k-CDD builds a bubble-like description around the k centers (see Fig. 8.3), it will not be beneficial for a single bulk. Since the GMDD still builds a relatively wide boundary around the data points and hence its flexibility is limited (due to the fact that it relies on the mean of the target class in a pseudo-Euclidean space; see Fig. 8.4), it will not be advantageous
Figure 8.15 Approximate embeddings of two dissimilarity representations, a derivative-based one (left) and D_SAM (right), computed on the diseased mucosa data. Embedded spaces are defined on the (training) target objects D(R,R). The outliers are later projected.
in the situation of a high overlap between the target and outlier examples. So, only flexible one-class classifiers which can build tight boundaries are of use. This explains the good performance of the LP classifiers and the k-NNDD.
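The 2D projections analyzed above (Figs. 8.14 and 8.15) are linear embeddings of square dissimilarity matrices. A minimal sketch (our illustration; projecting new validation objects onto the fitted space is omitted) of classical scaling, which is the usual way to compute such embeddings:

```python
# Sketch (ours): 2D linear embedding of a square dissimilarity matrix via classical
# scaling. For Euclidean D this reduces to PCA; for non-Euclidean D some eigenvalues
# become negative, giving a pseudo-Euclidean embedding.
import numpy as np

def embed_2d(D):
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                  # (indefinite) Gram matrix
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(-np.abs(evals))[:2]       # two dominant eigenvalues in magnitude
    coords = evecs[:, order] * np.sqrt(np.abs(evals[order]))
    return coords, evals[order]
```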
8.3.2 Experiment II: Diseased mucosa in the oral cavity
In this experiment, we will analyze the autofluorescence spectra acquired from healthy and diseased mucosa in the oral cavity; see Appendix D.4.5 for the data description. The measurements were taken at 11 different anatomical locations using six different excitation wavelengths. After pre-processing [de Veld et al., 2003], each spectrum consists of 199 bins. In total, 988 spectra were obtained for each excitation wavelength, 856 spectra representing healthy tissues and 132 spectra representing diseased tissues. The spectra are normalized so that they yield a unit area. Here, we will focus on a single excitation wavelength of 365 nm. In our study, all spectra are randomly split 50 times into training sets T and test sets T_te in the ratio 60:40. Each training set consists of both target and outlier examples, while the representation sets R ⊂ T contain only target objects. As a result, the sets have the following sizes: |R| = 514, |T| = 594 and |T_te| = 394 (337 healthy people and 57 diseased patients). One-class classifiers that use only the target data are trained on D(R,R). When outlier examples are incorporated into the classifiers, they are trained on D(T,R). In the testing stage, D(T_te, R) is used in both cases. Eight dissimilarity representations are considered for the normalized spectra. The first three dissimilarity representations are based on the ℓ_1-
Table 8.5 Diseased mucosa in the oral cavity. AUC measure (×100) for various OCCs. The training set T has both target and outlier examples, while R ⊂ T consists of the targets only; |R| = 514 and |T| = 594. Parentheses show the average number of support objects for the LP classifiers or |R_e|. The GMDD is defined by 10 objects found by the mean-resemblance approach. The results are averaged over 50 runs. The standard deviations of the AUC·100 means are less than 1.4 for the LPDD trained on D(T,R), and less than 0.6 in general.
(AUC·100 values are given per representation, for the original dissimilarities and their power and sigmoidal transformations, for: the LPDD and the LPDD-II with ν = 0.05 trained on D(R,R) and, with outliers used, on D(T,R); the GMDD with r_thr = 0.05; the k-CDD for k = 1, 5, 11, 21, 41; and the k-NNDD for several values of k, trained on D(R,R).)
distances D_1 computed between the Gaussian-smoothed spectra and between their first and second order Gaussian-smoothed derivatives, giving D_1, D_1^{der} and D_1^{2der}, respectively. The smoothing was always done with σ = 3 samples. The zero-crossings of the derivatives indicate the peaks and valleys of the spectra, so they are informative. The differences between spectra focus on the overlap area, the differences in their first derivatives emphasize the locations of peaks and valleys, while the differences in their second derivatives indicate the tempo of changes in the spectra. Also ℓ_{0.8} non-metric distances are used, again between the spectra and their Gaussian-smoothed derivatives, resulting in the representations D_{0.8}, D_{0.8}^{der} and D_{0.8}^{2der}, correspondingly. Note that spectra are characterized by a natural measurement order expressed in the connectivity between their neighboring wavelengths. Therefore, dissimilarity measures which can incorporate that fact might be beneficial; derivative-based measures take such information into account. Another dissimilarity representation, D_SAM, is based on the spherical geodesic distance d_SAM(x,y) = arccos(x^T y), which is actually the spectral angular mapper distance [Landgrebe, 2003]; see also Sec. 3.5.9. The last representation relies on the Bhattacharyya distance, a divergence measure between two probability distributions; see Sec. 5.2.2. This measure is applicable, since the normalized spectra, say s_i, can be considered as unidimensional histogram-like distributions. They are constant on disjoint intervals I_1, ..., I_{N'}, such that s_i(x) = Σ_{z=1}^{N'} h^i_z I(x ∈ I_z), where the h^i_z are non-negative and λ(I_z) is the length of I_z. The Bhattacharyya distance is then defined as d_BH(s_i, s_j) = -log( Σ_{z=1}^{N'} (h^i_z h^j_z)^{1/2} λ(I_z) ). In conclusion, all the dissimilarity representations considered here emphasize different aspects of the spectral data. Since in the problem of one-class classification there is always a trade-off between the false negative ratio (error on the target class) and the false positive ratio (error on the outlier class), and we just want to compare the methods, the AUC measure seems appropriate, as it provides an overall judgement; see also Sec. 8.1. Otherwise, we would need to fix a point of comparison, which is subjective. Table 8.5 shows the AUC measures for the LPDD, the LPDD-II, the GMDD, the k-CDD and the k-NNDD trained on six dissimilarity representations. The LPDD, the LPDD-II and the GMDD rely also on power and sigmoidal transformations of the representations, while the k-NNDD and k-CDD are built on the original dissimilarities. The parameter s of the sigmoidal transformations was set to s = d_{avr}, where d_{avr} is the average distance for the target class. Of course, this parameter is computed for each measure separately.
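As an illustration of two of the spectral measures defined above, here is a minimal sketch (ours, not the book's code). The function names are assumptions; for the Bhattacharyya distance, equal-width bins with λ(I_z) = 1/N' are an assumption as well.

```python
# Sketch (ours): spectral angular mapper and Bhattacharyya distances between spectra.
import numpy as np

def d_sam(x, y):
    # unit-norm scaling added here (our choice) so the inner product is a valid cosine
    x = x / np.linalg.norm(x)
    y = y / np.linalg.norm(y)
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))

def d_bhattacharyya(s_i, s_j, eps=1e-12):
    lam = 1.0 / len(s_i)                       # equal bin widths (assumption)
    h_i = np.asarray(s_i, dtype=float) / (np.sum(s_i) * lam)
    h_j = np.asarray(s_j, dtype=float) / (np.sum(s_j) * lam)
    bc = np.sum(np.sqrt(h_i * h_j)) * lam      # Bhattacharyya coefficient
    return -np.log(bc + eps)
```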
Since the ℓ_{0.8}-distance representations are somewhat more discriminative than the ℓ_1-distance representations, for the derivative-based measures the results for the latter are omitted. The following conclusions can be made by analyzing Table 8.5:
(1) The most discriminative dissimilarity measure is the ℓ_{0.8}-distance computed between the second order derivatives of the spectra (D_{0.8}^{2der}). The probabilistic Bhattacharyya distance is also good. The least discriminative dissimilarity measure is the geodesic spherical distance (D_SAM).
(2) The LPDD gives better results than the LPDD-II.
(3) The use of outlier information is beneficial for the LP classifiers. Both the LPDD and the LPDD-II perform significantly better than when they are trained on the target examples only. They, however, need more support objects for this.
(4) The results of the GMDD based on 10 objects, selected by the mean-resemblance procedure (see Sec. 8.2.2), are nearly the same as the results obtained by the GMDD trained on the complete set R (514 objects). This suggests that the target class is rather compact. Although the GMDD is not a flexible classifier, its performance is better than or equal to that reached by the LP classifiers trained on the target class. When compared to the 11-CDD (hence based on a similar number of objects), the GMDD works better. However, when a larger k is used, the k-CDD outperforms the GMDD.
(5) The best result for one-class classifiers trained on the target class is obtained for the 1-NNDD on D_{0.8}^{2der} (AUC of 88.2). When outlier information is used, the LPDD behaves similarly well; additionally, it allows for a significant reduction in computation. In the best case, the LPDD selects at most 44 support objects (out of 514 in the set R) for the sigmoidal-I transformation of the representation, reaching an AUC of 88.9. Alternatively, the LPDD selects 13 support objects for the power transformation, reaching an AUC of 88.4. The 1-NNDD relies on dissimilarities to all 514 objects.
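The evaluation protocol used in this and the next experiment (50 repeated random 60:40 splits, performance summarized by the AUC of the one-class score) can be sketched as follows; this is our illustration with assumed names, using the rank-sum form of the AUC.

```python
# Sketch (ours): repeated 60:40 splits and AUC of a one-class score
# (higher score = more target-like), via the Mann-Whitney rank statistic.
import numpy as np

def auc(scores_targets, scores_outliers):
    s = np.concatenate([scores_targets, scores_outliers])
    ranks = np.argsort(np.argsort(s)) + 1.0        # ranks; ties ignored for brevity
    n_t, n_o = len(scores_targets), len(scores_outliers)
    r_t = ranks[:n_t].sum()
    return (r_t - n_t * (n_t + 1) / 2.0) / (n_t * n_o)

def random_splits(n_objects, n_repeats=50, train_fraction=0.6, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_objects)
        cut = int(train_fraction * n_objects)
        yield perm[:cut], perm[cut:]               # training / test indices
```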
8.3.3 Experiment III: Heart disease data
In this experiment, we will analyze the heart disease data. Given information on ill and healthy patients, the goal is to detect the presence of a heart disease. There are 303 instances, of which 139 correspond to healthy patients. Since the data consist of mixed types: continuous, dichotomous
Table 8.6 Heart disease problem. AUC measure (×100) of one-class classifiers built on the Gower's distance representation. T is a training set of both target and outlier examples and R ⊂ T consists of the targets only; |R| = 84 and |T| = 183. Parentheses show either the average number of support objects for the LP classifiers or the effective number of objects from R determining the boundary. The results are averaged over 50 runs. The standard deviations of the AUC·100 means are less than 0.5.
(AUC·100 values are given for the LPDD and the LPDD-II with ν = 0.05, trained on D(R,R) and on D(T,R), on the original Gower's representation and its power (p = 0.5, 0.3) and sigmoidal-I/II (s = 0.5d_{avr}, d_{avr}) transformations, and for the k-NNDD and the k-CDD with r_thr = 0.1 trained on D(R,R) for a range of k.)
and categorical variables, the Gower's dissimilarity, as defined in Sec. 5.1, is computed. All data sets are randomly split 50 times into training sets T and test sets T_te in the ratio 60:40. The training sets consist of both target and outlier examples, while the representation sets R ⊂ T contain only the target examples; |R| = 84, |T| = 183 and |T_te| = 120 (55 healthy and 65 diseased patients) in each run. One-class classifiers which rely on the target information only are trained on D(R,R). If additionally outliers are incorporated into a classifier in the learning stage, OCCs are trained on D(T,R). In the testing stage, D(T_te, R) is used in both cases. The results are presented in Table 8.6. Since the number of diseased patients is larger than the number of healthy ones, we have also tried to design an OCC by assuming that the ill patients form the target class. The results, however, become worse, so they are not shown here. The identification problem defined by this data set is difficult, since the outliers cannot be easily distinguished from the target class. This suggests that the measurements, based on which the Gower's dissimilarity representation is derived, do not have enough discriminative power. Such a conclusion can be drawn because of the following facts:
- The LP classifiers need many support objects, around 60% of the target objects on average.
- The LPDD-II decreases its performance when it is trained using the outlier information. This suggests a high overlap of the outlier examples with the target class.
- The k-NNDD improves its performance with growing k, while the k-CDD does not.
- The GMDD (a weak classifier) outperforms the k-NNDD (which is often a very good classifier).
This suggests that the target class can be characterized by a single cloud (the 1-CDD performs best), but outliers tend to lie 'in-between' the targets. It seems that in such a case the GMDD may perform relatively well. When defined on a reduced set R consisting of 8 or 13 objects, it performs comparably to the best LPDD-II.
8.4 Conclusions
This chapter is devoted to the problem of one-class classification. It is important as it often arises in practical applications, such as health
diagnostics, machine condition monitoring or industrial inspection. The goal is to characterize and describe a single class, identified as the target class, such that resembling objects are accepted as targets and outliers (non-targets) are rejected. Such a detection has to be performed in an unknown or ill-defined context of alternative phenomena. The target class is assumed to be well defined and well sampled. The alternative outlier set is usually ill-defined: it is badly sampled (or even not present at all), with unknown and hard to predict priors. Since the outlier class is ill-defined, in complex problems an effective set of features discriminating between targets and outliers cannot be easily found. Hence, it seems appropriate to build a representation on the raw data. The dissimilarity representation, describing objects by their dissimilarities to the target examples, may be effective for such problems, since it naturally protects the target class against unseen novel examples. Boundary descriptors are examples of one-class classifiers that might be suitable to handle such problems. Here, dissimilarity-based classifiers are introduced. They are defined in three frameworks: based on neighborhood relations, in embedded spaces and in dissimilarity spaces. A number of OCCs are proposed, such as the GMDD, a simple OCC in an embedded space defined by the square distance to the class mean, and the LPDD and the LPDD-II, one-class classifiers defined as hyperplanes in a dissimilarity space. Efficient solutions can be obtained, meaning that these OCCs ultimately refer to a relatively small number of target objects to make the decision. Such objects are automatically detected for the LP classifiers, as they result from solving the linear programming optimization. In the case of the GMDD, they can be chosen as a specified fraction of the target examples. These OCCs are compared to the neighborhood-based ones: the k-CDD and k-NNDD. Three different problems are analyzed here: machine monitoring, lesion diagnostics and heart disease diagnostics. When outliers do not heavily overlap with the target class and some outliers are used for training, the LPDDs provide the best solutions as a trade-off between the performance and the computational aspect (the effective number of target objects which define the boundary). The k-NNDD, based on the average dissimilarity to the k nearest neighbors, is a good classifier and may outperform the LPDDs, yet it requires many more target objects for a good definition of its boundary. When there is a high overlap between the targets and outliers, the GMDD, as a weak classifier, may work well, since it relies on global information, i.e. the distance to the mean (in an embedded space),
instead of the local information, as e.g. the k-NNDD does. Concerning dissimilarity measures, the best measures in machine monitoring and lesion diagnostics are non-metric ℓ_{0.8}-distances between intermediate representations. This is an interesting point, since it supports our idea that non-metric dissimilarities can be beneficial for learning.
Chapter 9
Classification
Inanimate objects are always correct and cannot, unfortunately, be reproached with anything. I have never observed a chair shift from one foot to another, or a bed rear on its hind legs. And tables, even when they are tired, will not dare to bend their knees. I suspect that objects do this from pedagogical considerations, to reprove us constantly for our instability. "OBJECTS", ZBIGNIEW HERBERT
The challenge of automatic classification is to develop computer methods which learn to distinguish among a number of classes. Each class is represented by a set of examples. When an appropriate mathematical representation of objects is found, here based on dissimilarity representations, a decision rule is constructed. Usually, standard two-class classification problems are studied first, since multi-class problems are often solved by combining two-class discrimination functions. To construct a dissimilarity-based classifier, we will use a training set T, |T| = N, and a representation set R, |R| = n. R is a collection of prototype examples from T. In the learning process, a classifier is constructed on the N × n dissimilarity representation D(T,R), relating all training objects to all prototypes. The information on a set T_te of t new objects is provided by their dissimilarities to the examples from R, that is, as a t × n dissimilarity matrix D(T_te, R). Similarly as in Chapter 8, we interpret dissimilarity representations in three different ways: in pretopological spaces, in embedded spaces and in dissimilarity spaces. All these approaches, together with particular discrimination functions, are introduced and described in Chapter 4. Because there is an abundance of interesting questions for dissimilarity-based classification, we can only discuss some of the most intriguing problems. Basically, we will demonstrate that the k-nearest neighbor (k-NN) method can be outperformed by alternative classifiers built on dissimilarity representations, especially for small representation sets. When the dissim-
ilarity measure is discriminative and the classes are densely sampled, then in a close neighborhood of an object (measured by the given dissimilarity) there will be many objects of the same class. This is the reason why the k-NN rule is expected to perform well^1 for sufficiently large sample sizes. Thereby, it becomes our reference method. Other essential questions refer to the selection of an informative representation set from a given training set, the use of non-metric dissimilarity measures and their possible corrections to make them metric, and the usefulness of monotonic transformations. The results presented here come from our experiments conducted on various data sets. They are supported by the articles [Duin et al., 1997, 1998] and [Pekalska and Duin, 2000, 2001a, 2002a; Pekalska et al., 2002b; Pekalska and Duin, 2002c,b; Pekalska et al., 2004b, 2005b; Duin et al., 1999; Duin and Pekalska, 2002; de Ridder et al., 2002].
9.1 Proof of principle
This section considers simple decision rules in dissimilarity spaces and in embedded spaces to provide the 'proof of principle' that alternative dissimilarity-based classifiers are beneficial. It aims at explaining our way of thinking and at describing the general set-up of the experiments. It plays an introductory role for the subsequent sections, which describe classification problems in more detail. Our experiments will demonstrate that the trade-off between recognition accuracy and computational effort can be significantly improved by using a linear (or quadratic) classifier built in dissimilarity or embedded spaces, instead of the k-NN rule. Such a linear classifier is constructed from a training set described by the dissimilarities to the representation set. If this set is small, it has the advantage that only a small set of dissimilarities has to be computed for its evaluation, while it may still profit from the accuracy offered by a large training set.
9.1.1 NN rule vs alternative dissimilarity-based classifiers
The k-NN rule [Cover and Hart, 1967] assigns an object to the class most frequently represented among its k nearest neighbors. It is commonly practiced in pattern recognition, as it is a simple and intuitive approach. It does
^1 Under the assumption of sampling from the same probability distribution, the k-NN rule in a Euclidean space (with metric distances computed there) asymptotically reaches an error of at most twice the Bayes error.
not require any training, except for the selection of a suitable k. Conventionally, the k-NN rule is computed in a vector space and relies either on the (appropriately weighted) Euclidean or the city block distance, derived from feature-based representations. The k-NN method is known to be asymptotically optimal in the Bayes sense for metric distances [Hart, 1968; Devroye et al., 1996]. It can learn complex boundaries and generalize well, provided that an increasing set of training objects T is available and the hypervolumes of the k-neighborhoods approach zero. However, the classification performance of the k-NN method may differ significantly from its asymptotic behavior for a given finite training set (e.g. when the data points are sparsely sampled or have variable characteristics over the space). Many variants of the NN rule as well as many distance measures have been invented or adopted for feature-based representations to handle such situations. They often take local structure into account or weight the neighbor contributions appropriately; see e.g. [Hastie and Stuetzle, 1989; Lowe, 1995; Wilson and Martinez, 1997; Sanchez et al., 1998; Avesani et al., 1999; Paredes and Vidal, 2000; Domeniconi et al., 2002]. Such approaches are designed to optimize the parameters of a chosen distance measure in local regions of the feature space, or the number k of nearest neighbors. In this work we study dissimilarity representations derived either from sensor measurements or from other intermediate representations (e.g. string descriptions, shapes or feature spaces). Consequently, we cannot always naturally refer to an accompanying feature-based representation, since such a space may not exist or might not be given. The dissimilarity measure is designed to compare objects and, when derived, it is used to construct dissimilarity representations. Given a test set T_te, the k-NN rule makes a decision by ranking the dissimilarities D(T_te, T) to the training objects T and applying the voting mechanism. Although it is based on local k-neighborhoods, it is still computationally expensive, as dissimilarities to all training examples have to be computed. Moreover, its classification accuracy may be affected by the presence of noisy examples. On the other hand, when the training set is small, the k-NN rule potentially decreases its performance. In our approaches, we will define classifiers on the dissimilarities D(·,R), computed to the representation objects R ⊆ T. The k-NN rule becomes in fact a k-nearest prototype method; however, we will not make such a distinction in naming, always referring to the k-nearest neighbor method. Classifiers constructed in dissimilarity or embedded spaces become 'more global' in their decisions by making use of the complete R instead of k
386
T h e dissimilarity representation f o r p a t t e r n recognition
neighbors only. As dissimilarity information is captured in an appropriate vector space, many traditional classifiers can be adopted there. Moreover, siich decision rules can be optimized by a training set which is larger than the given representation set. As a result, classifiers in dissimilarity or embedded spaces may overcome the limitations of the NN rule. Many dissimilarity measures are based on sums of differences between (prc-processed) measurements. If such differences can be approximately described by the same distribution (which may be true for standardized feature-based data or for normalized image or spectra representations), their sum (or average) is approximately normally distributed thanks to central limit theorem [Billingsley, 1995; Feller, 19681. Hence, Bayesian classifiers assuming normal distributions, the (R)NLC or (R)NQC, (Regularized) Normal density based Linear or Quadratic Classifiers, as described in Sec. 4.4.2, should perform well in such dissimilarity spaces. In practice, even if the assumption on normality is violated, such classifiers tend to work well. They may perform much better than the k-NN method, especially when the representation set is small. By using weighted combinations of dissimilarities, they suppress the influence of noisy examples as well. Since the training can be done off-line, here we are only concerned with the computational effort needed for the evaluation of a new object. Given n examples in the representation set and n. dissimilarities computed, the additional complexity of the RNLC is O ( R )(products and sums), while the coniplexity of the RNQC is 0(n2)’.The 1-NN rule requires 0 ( n , )comparisons and the k-NN rule needs at least O ( n ) and at most O ( n log(n)) comparisons. Thereby, the k-NN rule might seem to be preferable3. However, our point is that the k-NN method requires a larger R than the RNLC/RNQC to reach the same accuracy. If the cost of determining dissimilarities is very high (which is true if dissimilarities are computed for data having a large arrioiint of measurements such as images or spectra, or derived in a template inatchirig by minimizing the matching cost over various possible ’We assume that the number of classes c is very small with respect to n = IRI,and that a c-class problem is solved by combining the result of c classifiers trained one-against-all. Hence O ( c n ) = O ( n ) and O(cn’) = O(n’). ‘The I;-NN method is usually applied to metric distances, often defined in vector spaces. To avoid expensive computation of distances, there has been interest in approximate arid fast N N search. Many algorithms have been proposed, usually making use of the triangle inequality. Example works can be found in [Berchtold e t al., 1998; Grother e t al., 1997; Mico e t al., 1996; Mic6 and Oncina, 1998; Moreno-Seco e t al., 2003; Rarriasubramanian and Paliwal, 2000]. As our study assumes general, possibly non-metric dissimilarities. we focus on the exact N N methods.
Classzfiication
387
transformations), the cardinality of R is crucial for judging the computational complexity. Therefore, we claim, that the RNLC can improve thc k-NN rule with respect to the recognition accuracy and computational effort. The same holds for tlie RNQC if R is small. The other approacli to dissimilarities relies on a linear embedding4 into a pseudo-Euclidean space R ( p ~ 4where ) , p 4 = m. Hence, the objects a,re represented as points in this space such that the pseudo-Euclidean distances between them reflect the original dissimilarities. Traditional discrimination functions built in (Euclidean) vector spaces can be adopted to make use of indefinite inner products. The details of such a construction were given in Sec. 3.5.3. The projection of a test object D ( t ,R) onto iR(P>q) requires O ( n ) operations and the evaluation of' a linear classifier needs O ( m ) operations (since we have an m-dimensional space), m < n., so the tot,al complexit,y is (?(nm); see also Sec. 3.5.5. Consequently, this approach might be more computationally expensive than the use of dissimilarity spaces if rr~is large. The projection is unsupervised, i.e. no class information is currently used in the embedding. The role of an embedded space is to spatially reflect tlie dissimilarity information. By building linear or quadratic classifiers there, a better performance may be reached than by the k-NN rule directly applied to the original dissimih-ities. In summary, the k-NN rule operates on the dissimilarities directly. Discrimination functions in dissiniilarity spaces treat D ( T , R) as 'input features', hence build their decision based on (non-)linearly weighted dissimilarities D ( . ,I?). Embedded spaces represent objects as points such that the distances are preserved. In this way (if the conipactriess hypothesis holds). we expect that the classes become relatively compact clouds of points. If the assumption of the true representation holds, then there would be no overlap between the classes, and vector-based classifiers can be constructed there. We will now present the results of two experiments. The first one shows the behavior of a few classifiers in the three frameworks discussed above, as applied to square dissimilarity representations, D ( T ,2"). Thc second experiment explores further the dissimilxity space approach.
+
4There exist many nonlinear embeddings (usually with distortion) into a Euclidean space. Some of them were used in Chapter 6 to visualize the dissimilarity data as 2D spatial configurations. The reason we focus on the linear embedding is twofold: computational aspect and a clear linear projection algorithm. Nonlinear mappings oftcn require more operations than the linear ones. We hope to study their usefulness for classification in future or
(1
388
9.1.2
T h e dzssamalaraty representation f o r p a t t e r n recognition
Experiment I: square dissimilarity representations
The following example illustrates the use and benefits of dissimilarity-based classifiers over the direct use of the k-NN rule. Two data sets are chosen for this purpose; see Appendix E.2 for deta,ils. The first data are the NIST handwritten digits [Wilson and Garris, 19921, consisting of 2000 images of ten evenly probable classes. The similarity measure based on deformable template matching, as defined in [Jain and Zongker, 19971 serves for building the non-metric dissimilarity representation. The second data refer to randomly generated polygons, consisting of 2000 polygons, evenly distributed over two classes of convex quadrilaterals and irregular heptagons. The polygons are compared by computing the Hausdorff distance, Def. 5.2, between their corners. In both cases, the entire data set is randomly split into the design set L of 1500 examples and the test set Tt, of 500 exaniples. Growing representation sets R (such that R ultimately becomes L ) are randomly chosen from the design set L . Hence, for a growing set R, the following classifiers are built on D ( R , R ) and tested on D(Tt,,R); see See. 4.4.2 for the classifier descriptions: (1) The k-NN rule is applied t o D(Tte,R) directly. (2) Two linear classifiers in a dissimilarity space D ( R ,R): the linear progranimirig classifier (LPC), Eq. (4.23), with the trade-off parameter of y = 1 arid the RNLC with a fixed regularization parameter of X = 0.01. The regularization is necessary, since otherwise t,he estimated covariance matrix becomes singular. Additionally. also the SQRC (strongly regularized quadratic classifier) is used for the NIST digits. ( 3 ) The Fisher linear classifier (FLD) in an embedded pseudo-Euclidean space. For the NIST digits, the dimension of the pseudo-Euclidean space is related to a fixed fraction of the preserved generalized variance, Sec. 3.5.4, hence it will grow with a growing R. For the polygon data, the dimension is fixed to 45. These are two different approaches, since the number of eigenvalues significantly different from zero, determined in the embedding process (hence estimating the intrinsic dimension) seems to be fixed for the polygon data, while not for the deformable template matching distance on the NIST digits.
The results are shown in Fig. 9.1. For comparison, the test results of 15, are also presented. These figures the best k-NN rule, Ic = 1 , 3 , makc clear that the alternative dissimilarity-based decision functions may perform well, much better than the best k-NN rule.
Classification
Zongker NIST digits 0.2r
1
B
5 0.151
--
389
Polygon data 0.2r
SRQC in a dissim space FLD in an embedded space
E l z 0.15 c l
5 g
01
01
%5 I
i i
5
005
U
4
00
500
O05L I
- i _ - l 1000 1500
IRI
Figure 9.1 Generalization error (averaged over 25 runs) of some decision functions trained in dissimilarity and embedded spaces, and the k-NN rule directly applied. The classifiers are trained on two dissimilarity representations: the Zongker N E T digits based on deformable template matching distance (left) and the Hausdorff distance between the polygon corners (right). In general, the standard deviations of the means are less than 0.007 and in the majority of cases, less than 0.003.
9.1.3
Experiment 11: the dissimilarity space approach
The experiments are conducted to compare the results of the Ic-NN rule and the RNLC and the RNQC built on dissimilarity representations. They are designed to observe and analyze the behavior of these classifiers in relation to different sizes of both representation and training sets. We are concerned with possible gains of using small representation sets R and large training sets. A small R is of interest, because of both storage and computational aspects to assure that the evaluation for new objects is cheap. The results presented here were published in [Pqkalska arid Duin, 2002al. Two different dissimilarity measures are studied for the NIST digit sets [Wilson and Garris, 19921, represented by 2000 binary images in total, 200 images per class. The measures are: the Euclidean distance between Gaussian-smoothed images (images are blurred to make the measure be somewhat robust against tilting and thickness) computed in a pixel-wise way and the modified Haiisdorff distance, Def. 5 . 3 computed between the contours. The experiments are performed 25 times for randomly chosen training and test sets for each representation set R under investigation. In a single experiment, each data set is randondy split into two equal-sized sets consisting of 1000 objects: the design set L and the test set' Tte. L serves for obtaining both the representation set R and the training set T . After R is chosen, a number of training sets of different sizes are then considered. First, T is identical to R and then it is gradually enlarged by adding
390
The dissimilarity representation f o r pattern recognition
~
~~
~~
~~
~
split the entire sct into the design set L and the test set Tt, dcfine a vector of the cardinalities T R for the representation set R for i = 1 to J r ~dol randomly select R C L of the cardinality r ~ ( i ) crrork-”(i) = test (k-NN, D(Tt,, R ) ) for 2 = i to l r ~ do l choose the training set T of the cardinality r n ( z ) such that T = R objects randomly selected (per class) from L\R train (RNLC/RNQC, D ( T ,R ) ) errorRNLC/RNQC(i> 2 ) = test (RNLC/RNQC, D ( T t e , R ) ) end end
+
Figure 9.2
Pseudo-code for a single experiment, in See. 9.1.3.
random objects until it becomes L. There are many ways of selecting the representation set R out of the design set, L ; some of them will be discussed in the subsequent sections. Here, wc do not study the best possible set R for the given problem, instead, we focus on illustrating our approach. Therefore, the representation objects are chosen randomly. Additionally, the condensed nearest neighbor (CNN) is uscd for the selection. In a single experiment, initially, a subset of the design set L is used for representation. Then it is increased gradually by randomly adding new objects until it is equivalent to tlie complete set L . In this way a number of representation sets of different sizes can be studied. The CNN criterion is based on the condensed nearest neighbor method [Hart, 1968; Devijver and Kittler, 19821 developed to reduce the computational effort of tlie 1-NN rule. The CNN method finds a subset of the training set so that the 1-NN rule gives a zero error when tested on the remaining objects. Here, the representation set R becomes the condensed set found on the design set L. In contrast to the random selection, cardinality of R is automatically determined by the CNN method and it is fixed in a single experiment. However, since tlie training sets differ in all experiments, the number of representation objects may vary. Therefore, the size of R is averaged over all runs when reported in Table 9.1. Both the R.NLC and the RNQC, assuming normal distributions with equal or different class covariance matrices respectively, are built for different training sets. The regularized versions are used to prevent the estimated covariance matrices from being singular (e.g. in the case of the RNLC, when IT1 approaches IRI).R.egularization takes care that the inverse operation is possible by emphasizing the variances with respect to the covariances;
Classification
39 1
see also Sec. 4.4.2. When (TI M ( R ( then , the estimation of the covariance matrices is poor. In such cases, different regularization parameters may significantly influence the performance of the RNLC/RNQC. For sufficiently large training sets, these matrices are well defined and no regularization is needed. In our experiments, the regularization parameters are fixed values of at most 0.01 for training sets such that IT1 IRI. Since they are not optimized, the results presented here might not be the best possible. The pseudo-code for a single experiment is schematically shown in Fig. 9.2. The following fixed choices of k , k =: 1 , 3 > 5 , 7and 9 are studied for the k-NN rule. Additionally, we also tried to optimize k by a leave-oncout procedure on D ( T ,T ) . However, tlie k determined in such a way was always found to be one of the fixed, odd k mentioned above. For both digit sets? the best k-NN test results are determined either for k = 1 or k = 3. In the experiments below we will report only the best test results for all studied values of k . Since the cardinality of R is automatically determined by the CNN criterion, the outer loop in t,he pseudo-code in Fig. 9.2 is superfluous. The training sets are chosen differently than in the case of a random selection. Since the classes are likely to be unequally represented in the found set R, the training set is constructed from R by adding objects, randomly selected from all remaining examples in L . The generalization errors are averaged over the experiments and are used to make the plots.
Results. Figure 9.3 shows generalization errors of tlie k-NN rule, directly applied, and the RNLC/RNQC in dissimilarity spaces. The k-NN results, marked by ’*’, are shown on the T, = nc line. The RNLC’s (RNQC’s) curves are lines of constant classification error (measured on indepcndent test sets) relating the sizes of the representation and training sets. Additionally, Table 9.1 summarizes the main results of our study. Given the fixed cardinality of R , the worst and the best results, depending on the training set size, are reported, for the R.NLC/RNQC. The CNN selection provides only a single set R of a fixed size. The k-NN rule versus the RNLC. When T and R are identical, the RNLC (with error curves starting on the rC = nCline in Fig. 9.3, left plots), generally yields a better performance than the equivalent k-NN rule based on the same R (compare also the k-NN results with the worst cases of the RNLC in Table 9.1). When T , is fixed (i.e. in the horizontal directions of Fig. 9.3), the classifiers yield the same computational complexity for an evaluation of new objects. However, larger training sets reduce the error
392
The dissimilarity representation for pattern recognition
Training size per class
(q)
( c ) RNLC on the mod.-Hausdorff repr.
Training size per class (nd
(d) RNQC on the mod.-Hausdorff repr.
Figure 9.3 Pixel-based NIST digit data set. The plots present generalization errors (averaged over 25 runs) for the blurred Euclidean dissimilarity representation (top) and for the modified-Hausdorff dissimilarity representation (bottom). The error curves arc the averaged generalization errors of the RNLC (left) and the RNQC (right) built in dissimilarity spaces. The k-NN results are indicated by '*'. All the representation sets are randomly chosen. If a horizontal line is drawn at a fixed value of r c , then its crossing points with the error lines determinate the number of training objects n, needed for specific performance. For instance, in subplot (c), given r, = 20, the RNLC needs nc % 30 training objects to reach the error of 0.15 and the RNLC needs n c = 95 objects to reach the error of 0.1. The k-NN error equals 0.17 for n, = r , = 20.
Classzfication
393
rate of the RNLC by a factor of 2 in comparison to the k-NN error (based on the same I?). For instance, in Fig. 9.3(a), we observe that the classification error of 0.18 is reached by the k-NN rule based on T, = 10 prototypes for which the RNLC offers a higher accuracy of = 0.16 if trained also with n, = 10 objects, reaching 0.09 when n, increases to 100. In other words, for a chosen representation set R (hence a fixed computational complexity for an evaluation of a new object) the RNLC error, with the increase of training size, decreases significantly t o the values that can only he obtained by the k-NN method if it is based on a much larger R. For instance, in Fig. 9.3(c), the RNLC built on T, = 10 prototypes (and the training set of n, = 100 objects) reaches an accuracy (an error of 0.12) for which the k-NN rule needs 40 objects in its representation set. The computational load with respect to the number of computed dissimilarities of the RNLC for the same classification accuracy is thereby reduced to 25%. Following the RNLC's curves of constant error, it can be observed that for large training sets much small representations sets are necded for the same performance. The RNLC may sonietimes demand only half the cornpiitational effort for the evaluation of new objects when compared t,o the k-NN method. Also, for the fixed, possibly large training set (i.e. iii the vertical directions of the considered figures), the RNLC constructed on a small R, might gain a similar or higher accuracy than the k-NN rule, but now based on the complete D ( T ,T ) . This is observed, e.g. in Fig. 9.3(a) for n, = 40. The k-NN method yields an error of 0.093 and the RNLC reaches a snialler error when trained on D ( T ,R) with R consisting of T , 2 20. Since the best k-NN results for both digit data sets are found for k = 1 or 3 [Pqkalska and Duin, 2002a], the results of the 1-NN rule based on the CNN criterion can be compared to the results of the k-NN rule based on a random selection of R. The former are better than the latter, probably because the CNN representation set is optimized for tlie I-". Also, as observed in Table 9.1 and in Fig. 9.3, the RNLC defined on tlie CNN representation set generalizes better than the RNLC defined on a random representation set.
The RNLC versus the RNQC. In general, the RNQC performs better than the RNLC for both dissimilarity data sets; compare the results in Fig. 9.3, left plots versus right plots. Since the RNQC relies on the class covariance matrices in a dissimilarity space. a larger number of samples is needed than for the RNLC to obtain reasonable estimates. The RNQC niay reach a worse accuracy than the RNLC for identical T and R. However,
The dissimilarity representation for pattern recognition
394
Table 9.1 Blurred NIST digit data set. Averaged generalization error (in %) with its standard deviation for three classifiers: the k-NN rule directly applied and the RNLC/RNQC in dissimilarity spaces. The representation set R is either randomly selected with rc objects per class or by the CNN criterion. The errors of the RNLC/RNQC refer either to the worst (left column) or to the best (right column) results achieved for a fixed T,
Euclidean dissimilarity representation Random selection k-NN RNLC 8.6 17.5 (0.4) 15.6 (0.3) 7.1 12.5 (0.3) 10.2 (0.1) 8.3 (0.2) 6.6 (0.2) 5.5 7.1 (0.2) 5.8 (0.1) 5.1 6.4(0.1) 5.1 (0.1) 5.0
rc 10 20 50 70 90
I (0.1)
(0.1) (0.1)
(0.1) (0.1)
RNQC 4.4(0.1) 19.0 (0.5) 10.3 (0.2) 4.6 (0.1) 5.6 (0.2) 4.7 (0.1) 5.0 (0.1) 4.6 (0.1) 4.6 (0.1) 4.6(0.1)
Modified-Hausdorff dissimilarity representation Random selection
k-NN 24.4 (0.4) 17.1 (0.2) 10.6 (0.2) 8.9 (0.1) 7.9 (0.2)
rc 10 20 50 70 90 rc
1
1-NN
RNLC 21.3 (0.3) 11.1 (0.2) 15.6 (0.3) 9.8 (0.2) 10.3 (0.2) 9.0 (0.2) 8.7 (0.2) 9.2 (0.2) 8.5 (0.2) 8.2 (0.2)
I
RNLC
I 34.9 21.2 9.9 8.2 8.2
RNQC (0.8) 8.0 (0.2) (0.5) 7.4 (0.2) (0.2) 7.2 (0.2) 7.2 (0.2) (0.2) 8.3 (0.2) (0.2) RNQC
following the curves of the RNQC’s constant error, both smaller representation arid training sets are needed for the same error when compared to the RNLC. The RNQC’s curves are simply much steeper than those of the RNLC. Thereby, the RNQC outperforms the RNLC for large training sets (and small R). The most significant improvement can be observed for a small R. For instance. the training set of n, = 100 examples allows the RNLC to reach the error of 0.049 when based on T , 2 70 prototypes, see Table 9.1. where the RNQC requires only between 5 and 30 prototypes for a. similar performance; see Fig. 9.3(c). When the largest training sizes are considered (the best results in Table 9.1) for the fixed set R, the error of the
Classification
395
RNQC decreases, yielding better results than the k-NN rule. Still, when the smallest errors of the RNLC and RNQC are compared, the RNQC generalizes better. Also, for the fixed training set T , i.e. in the vertical directions in Fig. 9.3, subplots (b) and (d), a small representation set R often allows the RNQC (trained on D ( T , R ) ) ,to reach a better performance than the k-NN rule based on D (T,T). 9.1.4
Discussion
Our experiments indicate that indeed a good classification performance can be reached by dissimilarity-based classifiers, an alternative to the k-NN method. Even if the classifiers are trained in n-dimensional dissimilarity space D ( R ,R ) determined by the dissimilarities to n prototypes, they may work better than the k-NN rule defined on the same R. The experiments focus further on the dissimilarit,y space approach and the role of the representation set R. They show that the RNLC constructed on dissimilarity representations D ( T ,R ) may significantly outperform the k-NN rule based on the same R. This holds for the RNQC as well, provided that each class is represented by a sufficient number of training objects (they are needed to estimate tlie class covariance matrices reliably). Sirice for tlie evaluation of new objects the computational complexity (here indica,ted by the number of prototypes) is an important issue, our experiments are done with such an emphasis. We have found out that for the fixed representation set, larger training sets improve the performance of the RNLC/RNQC. When such results are compared to the k-NN based on the same R , they are often better. Also, for the fixed training set T , smaller (than T) representation sets allow the RNLC/RNQC, trained on D ( T ,R ) , to gain a high acciiracy. When R is only somewhat smaller than T , such classification errors can be smaller than the ones reached by the k-NN based on the entire training set T , i.e. D ( T t e,T ) . The potentially good performance of the RNLC can be understood as follows. The RNLC is a weighted linear combination of the dissimilarities between an object II: and the prototypes. It seems practical to allow a number of representation examples of each class to be involved in a discrirniriation process. This is already offered by the k-" rule: however: this decision rille provides an absolute answer (due to a mechanism based on the majority voting). The k-NN method is sensitive to noise, so the k nearest neighbors found might not include the best representatives of a class to which an object should be assigned. The training process of the R,NLC. 11s-
396
The dissimilarity representation f o r p a t t e r n recognition
ing a larger training set T , emphasizes prototypes which play a crucial role during discrimination, but it still allows other prototypes to influence the decision. The importance of prototypes is reflected in the classifier weights. In this way, a classifier is built, which takes all prototypes into account. The RNQC includes also a sum of weighted products between pairs of dissimilarity vectors. In this way, interactions between the prototypes are emphasized. The RNQC is based on the class covariance matrices in a dissimilarity space, estimated separately for each class. Those matrices may really differ from class to class. Therefore, this decision rule might achieve a higher accuracy than the RNLC, where all class covariance matrices are averaged. However, a larger number of samples (with respect to the size of' R ) is required to obtain reasonable estimates for all covariance matrices, and, thereby, a good generalization ability of the RNQC.
9.2
Selection of the representation set: the dissimilarity space approach
In the dissimilarity space approach decision rules are functions of dissirriilarities to the selected representation objects (prototypes). Assuming that the entire dissimilarity representation D ( T .T ) is available, the question now arises how a small representation set R should be selected out of T to guarantee a good tradeoff between the recognition accuracy and the computational complexity. We know that a random selection of' prototypes may work wcll [Pqkalska and Duin, 2002a; Pekalska et al., 2002b, 2004b; Paclik and Duin, 2003b,a], as also indicated in the previous section. Here, we will analyze a number of systematic procedures. Since the selection of prototypes is usually investigated in the context of metric k-NN rules, before we move on. we will briefly discuss this point. The results presented here can also be found in [Pekalska et al., 2005bl. In the basic setup, the k-NN rule uses the entire training set as the representation set. hence R = T . Therefore, the usual criticism points at a space requirement to store the complete set T and a high computational cost for the evaluation of new objects. The k-NN rule also shows sensitivity to outliers, i.e. noisy or erroneously labeled objects. To alleviate these drawbacks, various techniques have been developed in feature spaces to tackle the problem of prototype optimization. Some research efforts have bcen devoted to this task; see e.g. [Hart, 1968; Dasarthy, 1994; SAnchez et ul.. 1997; Wilson and Martinez, 20001. From the initial prototypes (say, all training objects), the prototype optimization method chooses or con-
Classafication
397
structs a small portion of them such that a high classification performance is achieved. Two main types of algorithms can be identified: prototype generation and prototype selection. The first group focuses on merging the initial prototypes (i.e. the prototypes represented as vectors in a feature space are replaced e.g. by their average vector) into a small set of prototypes such that the performance of the k-NN rule is optimized. Examples of such techniques are the I%-meansalgorithm [Duda et al., 20011 or a learning vector quantization algorithm [Kohonen, 20001. The second group of methods aims at the reduction of the initial training set and/or the increase in the accuracy of the NN predictions. This leads to various editing or coridensing methods. Condensing algorithms try to determine a significantly reduced set of prototypes such that the performance of the 1-NN rule on this set is close to the one reached on the complete training set [Hart, 1968; Dasarthy, 1994; Wilson and Martinez, 20001. This is the consistency property [Dasarthy, 19941. Editing algorithms remove noisy samples as well as close border cases, leaving smoother decision boundaries [Devijver and Kittler, 1982; Wilson and Martinez, 20001. They aim to leave homogeneous clusters in the data. Basically, they retain all internal points, so they do not reduce the space as much as other reduction algorithms do. Usually, they are followed by condensing methods. Alt,hough the k-NN rule is often practiced with metric distances, there are problems when the designed dissimilarity measures are non-metric, such as the modified Hausdorff distance and its variants [Dubuisson and Jain, 19941, Mahalanobis distance between probability distributions [Duda et nl., 2001] or the normalized edit-distance [Bunke et al., 2001; Marzal arid Vidal, 1993; Vidal et al., 19951; see also Chapter 5. Such non-metric rneasures seem to naturally arise in template matching processes applied e.g. in computer vision [Dubuisson and Jain, 1994; Jacobs et al., 20001. If the dissimilarity measure is meaningful, the principle behind the voting among the nearest neighbors can be applied to non-metric dissimilarities and the I%-NNrule may work well; see e.g. [Pekalska et al., 2004bl or the subsequent sections. It is simply more important that the measure itself is discriniinative and describes the classes in a compact way than its strict metric properties. However, many traditional prototype optimization methods are not appropriate for non-metric dissimilarities, especially if no accompanying feature-based representation is available, as they can be bascd on the triangle inequality, for instance. Moreover, there are also situations, where the classes are badly sampled due to the problem characteristics as e.g. in
398
The dissimilarity representation f o r p a t t e r n recognition
machine or health diagnostics, or due to the measurement costs. In such cases, the k-NN rule, even for a large k and a very large training set will suffer from noisy examples. Yet, we think that much more can be gained when other discrimination functions, such as linear classifiers in a dissimilarity space, are constructed. In general, as pointed in the previous section, such classifiers make their decisions by averaging the information from a number of prototypes and they seem to be more robust against local distortions. 9.2.1
P r o t o t y p e selection m e t h o d s
The selection of a representation set for the construction of classifiers in a dissimilarity space serves a similar goal as the selection of prototypes to be used by the NN rule: minimization of a set of dissiniilarities to be measured for the classification of new incoming objects. There is, however, an important difference with respect to the demands. Once selected, the set of prototypes defines the NN classifiers independently of the remaining part of the training set. The selection of the representation set, on the other hand, is less crucial, as it will define a dissimilarity space in which the entire training set is used to train a classifier. For this reason, even a randomly selected representation set may work well [Pqkalska and Duin, 2002al. That is why, the random selection will serve as a basic procedure for comparing more advanced techniques. Sirriilar objects will yield a similar contribution to the representation. It may, thereby, be worthwhile to avoid the selection of objects with small dissimilarity values. Moreover, if the data describe a multi-modal problem, it may be advantageous to select objects related to each of the modes. Consequently, the use of procedures like vector quantization or cluster analysis can be useful for the selection of prototypes. Assumc c classes w1, w2, . . . , w,. Let T be a training set and let T,, denote the training objects of the class wi.Each method selects K objects for the representation set R. If the algorithm is applied to each class separately, then k objects per class are chosen such that ck = K . The following procedures will be compared for the selection of a representation set: Random, RandorriC, KCentres, ModeSeek, LinProg, FeatSel, KCentres-LP and
EdiCon.
Random.
A random selection of K objects from the training set T
RandomC. A random selection of k objects per class (equal class prior probabilities are assumed).
Classzfication
399
KCentres. This is a representation-based clustering procedure, dcscribcd in Sec. 7.1.2. For each class w,, this algorithm chooses a set RwLof k objects such that they are evenly distributed with respect to the dissimilarity information D ( T w zTwz). , Since the final result depends on the initialization, precautions are taken. To determine Rwz,we start from one center for the entire set D(TwT, TwL) and then more centers are gradually added. At any point, a group of objects belongs to each center. Rw7is enlarged by splitting the group of the largest radius into two and replacing its center by two other members of that group. This stops, when k centers are determined. The entire procedure is repeated 30 times, resulting in 30 potential representation sets. The final set Rut is the one which yields the minimal of tlie largest subset radii. The representation set R consists of all sets Rw,.
ModeSeek. For each class w, the mode-seeking algorithm [Cheng, 19951 looks for a set Rut consisting of the estimated modes of the class distribution, as judged with respect to D(Twc,Twz). The cardinality of Rut depends on the specified neighborhood size s. The larger the neighborhood, the smaller the resulting representation R U T .If a representation set of a particular size is sought, s is selected such that it generates the largest sct which is not larger than the demanded one. This algorithm is a clustering algorithm and it was introduced in Sec. 7.1.2. The procedures above niay be called unsupervised, in spite of tlie fact that they are used in a class-wise way. They aim at various heuristics, but they do not consider the quality of the resulting representation set in terms of the class separability. A standard procedure to do that is by feature select ion.
FeatSel. In traditional pattern recognition, the feature selection method determines an optimal set of K features according to some class separability measure. It is often done in the forward selection process [Jain and Zongker, 19971 by using eit,her the Mahalanobis distance or the leave-oneout 1-NN error. This standard approach is modified here to make use of a given dissimilarity representation. The entire dissimilarity matrix D ( T ,T ) is reduced to D ( T ,R ) by selecting an optimal set of K prototypes according to the leave-one-out 1-NN error. There is, however, a difference with respect to the standard feature selection procedure. Features are considered in a dissimilarity space, but the 1-NN error is computed on the given dissimilarities D ( T ,T ) directly, and not by the Euclidean distances derived from the given dissimilarity representation. The method is, thereby, fast as it is entirely based on comparisons and sorting. Ties can easily occur by
400
T h e dissimilarity representation for p a t t e r n recognition
the same number of misclassified objects for different representation sets. They are solved by selecting the set R for which the sum of dissimilarities is minimum.
LinProg. The selection of prototypes is here done automatically by training a properly formulated separating hyperplane f ( D ( z ,R ) ) = C,”=, iuj d ( z , p j ) + w o = wTD(z,R ) + wo in a dissimilarity space D ( T ,R ) . In general, R c T , but they can also be different. Here, we assume that R = T. The linear function is obtained by solving a linear programming problem, where a sparse solution is imposed by minimizing the tl-norni Iwjl. Such a minimization task is of thc weight vector w , llwill = C,”=, described in Scc. 4.4.2. We focus on the formulation (4.23). Many weights wi tend to be zero, as the found solution is sparse. The objects from the initial set R = T corresponding t o non-zero weights are the selected prototypes, i.e. the representation set R L P . Although the prototypes are determined to support a particular separating hyperplane, they can still be used by other discrimination functions. The choice of the tradeoff parameter y such as y = 1, see Eq. (4.23), seems to be reasonable for niany problems, so we fix it in our experiments. This prototype selection method is similar to a selection of features by linear progranirning in a standard classification task [Bradley et al., 19981. The important point to realize is that we do not have a control over the number of selected prototypes. This can be slightly influenced by varying the constant y (hence influencing the tradeoff between the classifier norm I1w 1 11 and the training classification errors), but not much. From the computational point of view, this procedure is advantageous for two-class problerns, since multi-class problems may result in a large set R L P . This occurs since differcnt prototypes are often selected by different classifiers when a multi-class classifier is derived in the one-against-all strategy or even more severely in the pairwise strategy. KCentres-LP. The KCentres algorithm is applied to a square dissimilarity representation D ( T ,T ) to pre-select a representation set R K C .This is then followed by a reduction based on the LinProg procedure applied to D ( T , R K c ) . In this way, the number of resulting prototypes can be somewhat influenced. Still, if RKC is not sufficiently large, the linear programming will make no reduction. Hence, this procedure reduces to the KCentres approach for a small R K C . EdiCon. An editing and condensing algorithm [Devijver and Kittler, 19821 is applied to the entire dissimilarity representation D ( T ,T ) ,resulting
Classijication
401
Table 9.2 Characteristics of the data sets used in experiments. a stands for the fraction of objects selected for training in each repetition ~
Data
Polydasth Polydistm NIS T-38 Zongker- 12
GeoSam GeoShape Wine Ecola-p08 ProDom Zongker-all
# classes
# objects per class (in total)
2 2 2 2 2 2 3 3 4 10
2 ' 2000 2 ' 2000 2 ' 1000 2 ' 100 2 . 500 2 . 500 59/71/48 143/77/52 878/404/271/1051 10 ' 100
N
per class
0.25 0.25 0.10 0.50 0.50 0.50 0.60 0.60 0.35 0.50
in a representation set R. Editing takes care that the noisy objects are first removed so that the prototypes can be chosen to guarantee good performance of the 1-NN (k-NN) rule. Similarly as in the case of the LinProg, the nuniber of prototypes is automatically determined. 9.2.2
Experimental setup
If a good dissimilarity measure is found. and a training set is sufficiently large and representative for the problein at hand, then the k-NN rule (based on R = T ) is expected to perform well. In other cases. a better gencralization can be achieved by a linear or quadratic classifier built in dissimilarity spaces. The weights of such decision rules are optimized on a training set and large weights (in magnitude) emphasize prototypes which are essential for discrimination. In the previous section, as well as in our studies [Pekalska and Duin, 2002a; Pekalska et al., 2002b, 2004b], we have found out that the linear and quadratic normal density based classifiers. the NLC and NQC, respectively, perform well in dissimilarity spaces. Experiments are conducted to compare various prototype selection methods for the classification in dissimilarity spaces. Smaller representation sets are of interest, because of a lower complexity for both representation and evaluation of new objects. Both linear (the NLC) and quadratic (the NQC) classifiers are considered in dissiniilarity spaces. Here, we will present only the results for the NQC, since it generally performs better than the NLC. In higher-dimensional dissimilarity spaces, i.e. for larger representation sets, the NQC is, however, cornputationally more expensive
402
The dissimilarity representation f o r pattern recognition
Table 9.3 Properties of the data sets used in experiments. The following abbreviations are used: M - metric, E - Euclidean, nM - non-metric, nE - non-Euclidean. The values ~-2:~ and r:: indicate the deviations from the Euclidean behavior. as defined in Eq. (9.1) and T-,","Idescribes the percentage of disobeyed triangle inequalities. Data
Dissimilarity
Property
Polydisth Polydistrn NIST-38 Zongker- 12 GeoSarn GeoShape Wane Ecoli-pO8 ProDom Zongker-all
Hausdorff Mod. Hausdorff Euclidean Template-match SAM [Landgrebe, 20031 Shape el Euclidean distance e0.8 distance Structural Template-match
M, nE nM E nM M,nE M , nE E nM nM nM
r$K [%] 25.1
11.0 0.0 13.3 0.1 2.6 0.0 13.4 1.3 38.9
7.2 0.0 24.7 0.9 35.0
0.00 0.00 3.84 10-9 0.41
than tlie NLC. Since we decided to compare all selection strategies by the performance of a single classifier, as a result, the LinProg was simply used for thc selection of R and not as a discrimination function. Otherwise it would not be comparable to the performance of tlie NQC defined on other representation sets. In each experiment, each data set is divided into a training set T and a test set Tt?. The NQC is trained on the dissimilarity representation D ( T ,R ) and tested on D ( T t e ,R ) . R c T is a representation set consisting of K prototypes. They are chosen according to a specified criterion, as described in Sec. 9.2.1. The 1-NN and the k-NN results defined on the entire training sct (hence tested on D(Tt,, T ) are provided as reference. Also, as a comparison, the k-NN rule is directly applied to D(Tte,R), with R selected by the KCentres algorithm and to the Euclidean distances computed in the representation D(Tt,, R). (This corresponds to the k-NN performed in the dissiniilarity space). The k-NN rule optimizes k over the training set T in the leave-one out manner [Duin et al.. 2004bl. Specification of the data sets. In all experiments, the data sets are divided into training and test sets, whose sizes are reported in Table 9.2. We choose a number of problems possessing various characteristics: defined by both metric (Euclidean or non-Euclidean) and non-metric dissimilarity measure$, as well as concerning small and large sample size problems. Seven data sets are used: randomly generated polygons, NIST scanned digits, geophysical spectra, proteins and their localization sites and wine types,
Classzfication
Approximate embedding of Poldisth
**
~~~
~
403
-
Eigenvalues of Poldisth 2OOr
~~
100, 1501 f
501
- 5 0 0 L
Approximate embedding of Poldistm
_i
~
200
400
600
~~
800
1000
Eigenvalues Poldistm - __ of
50-
20 I
I 200
n
400
600
800
1000
Figure 9.4 Left: approximate 2D embeddings of dissimilarity representations D ( T ,T ) for the polygon data. Right: the eigenvalues derived in the emhedding process.
resulting in ten dissimilarity representations (two different measures are considered for some data sets). The data sets refer to two-, three-, fourand ten-class classification problems. All sets are described in Appendix E. If the dissimilarity d is Euclidean, then the N x N dissimilarity representation D ( T ,T ) can be perfectly embedded in a Euclidean space. This is equivalent to stating that the Gram matrix G = - i J D * 2 J , 0"' = (d:J) arid J = I-illT,is positive semidefinite i.e. all its eigenvalues are nonnegative. A non-Euclidean representation D can be embedded in a pseudo-Euclidean space. The configuration X is determined in this space by an eigendecomposition of the Gram matrix G as G = QAQT, where A is a diagonal matrix of decreasing positive eigenvalues followed by decreasing (in magnitude) negative eigenvalues and then zeros, and Q is an orthogonal matrix of the corresponding eigenvectors. X is found as X = QmlAml $ , where m corresponds to the number of non-zero eigenvalues. See Sec. 3.5 for details. Let the eigenvalues be denoted by A's. Hence, the magnitudes of negative eigenvalues indicate the amount of deviation from the Euclidean be-
The dissimilarity representation for p a t t e r n recognition
404
havior. This is captured by the following indices:
r;:, is the ratio of the smallest negative eigenvalue to the largest positive one, while rpz describes the contribution of negative eigenvalues. Additionally, an indication of the non-metric behavior can be expressed by the percentage of disobeyed triangle inequalities, rr",". Table 9.3 provides suitable information on the Euclidean and metric aspects of the measures considered. The Hausdorff representation of the polygon data are strongly non-Euclidean. The modified Hausdorff representation of t>hepolygon data as well as the template-matching representation of the digits data are moderately non-Euclidean and non-metric. Concerning the geophysical data, the shape dissimilarity representation is slightly non-Euclidean, while the SAM representation is nearly Euclidean. Both are metric. For the Ecoli data, the non-metric &s-distance representation is used. ProDom representation is slightly non-metric and slightly non-Euclidean. The remaining two data sets: NIST digits and Wine have Euclidean representations. To visualize relations in our data, two-dimensional approximate embeddings of dissimilarity representations are found. They rely on linear projections from the corresponding Gram matrices, as sketched above (details are in Sec. 3 . 5 ) . The partia,l sum of the first two largest eigenvalues with respect to the total sum of all absolute values indicates how much of' the original dissimilarities is reflected in the projections. This is presented in Figs. 9.4 9.9. There we also show all eigenvalues of the Gram matrices (derived from the dissiniilarity matrices), hence the deviation from the Euclidean behavior can be visually judged. The number of dominant eigenvalues indicates the intrinsic dimension of a problem. The approximate embeddings are used for the purpose of exploratory data analysis. As judged from two-class problems, Figs. 9.4 9.9, the polygon data seem the most complex, while the Zongker-12 data seem the easiest. On the other hand, the ten-class Zongker-all data are the most complex. ~
~
9.2.3
Results and discussion
The results of our experiments are presented in Figs. 9.10 9.16. They show generalization errors of the NQC as a function of the number of prototypes
Classification
r
405 Eigenvalues of NIST-38
Approxmate embedding of NET-38 _-_*_ ___ ~
* .:**
*I
hU
=r
i
t'
*::**O
400
-
/i
OO
Approximate ernbedding of Zongker-12 ~
_
_
100
50
150
2 I0
-
Eigenvalues of Zongker-12
~
12r ~ -
~
I
101
I
I
------
Approximate embedding of Zongker
*i
lor
Eigenvalues of Zongker
__
~
~-
Figure 9.5 Left: approximate 2D emheddings of dissimilarity representations D ( T ,T ) for the NIST data. Right: the eigenvalues derived in the embedding process.
chosen by various selection methods. These error curves are compared to some variants of the NN rule. Note that in order to emphasize a small number of prototypes, the horizont,al axis is logarithmic. The prototype selection methods mentioned in the legends are explained in See. 9.2.1. Concerning the NN methods, the following abbreviations are used. The 1-"-final and the k-"-final stand for t,he NN results obtained by using the entire training set T , hence such errors are plotted as horizontal
T h e dissimilarity representation f o r p a t t e r n recognition
406
Approximate embedding of GeoSam ~~~
0
4 3' 2~
1~
,
I
-
1 i
O 105
Approximate embedding of GeoShape
b
200
300
400
500
Eigenvalues of GeoShape '
/
I
15
5
,
I
0 .-
Figure 9.6 Left: approximate 2D embeddings of dissimilarity representations D ( T , T ) €or the geophysical spectra data. Right: the eigenvalues derived in the embedding process.
lines. They are our reference. k-NN is the k-NN rule directly applied to D ( T ,R ) ,while the k-NN-DS is the Euclidean distance k-NN rule computed in D ( T ,R ) dissimilarity spaces (this means that a new Euclidean distance representation is derived from the vectors D ( z , R ) ) . In both cases, the representation set R is chosen by the KCentres algorithm. EdiCon-1-NN presents the 1-NN result for the prototypes chosen by the editing and condensing (EdiCon) criterion. The optimal parameter k in all the k-" rules used is determined by the niinimization of the leave-one-out error on the training set. Sometimes, k is found to be 1 and sometimes, k is a different value. The performance of all selection procedures mentioned in the legends (Random to EdiCon) is based on the NQC results in a dissimilarity space defined by the selected prototype sets R. Consequently, only the reduced set of dissimilarities have to be derived for testing, while the methods indirectly profit from the availability of the entire training set T . To enhance the interpretability of our results, the following patterns are
Classafication
--
407
Approximate embedding of Wine ~
400
300~.
0 .I*
**,
8
.* *
*d** * 10
Figure 9.7 Left: approximate 2D embedding of the dissimilarity representation D ( T ,T ) for the Wine data. Right: the eigenvalues derived in thc embedding process.
Eigenvalues of Ecoli-pO8
Approximate embedding of Ecoli-p08 120--
~~
~
~
* *
0 0 0
loot
*
* +*'
*z
~
80
:I.
**
*
*
20
O l -zoo
50
~
100
1
150
Figure 9.8 Left: approximate 2D ernbedding of the dissimilarity reprcseritatiori U ( T ,T ) for the Ecole-pO8 data. Right: the eigenvalues derived in the ernbcdding process.
104
Approximate embedding of ProDom 1
I
a-,
Eigenvalues of ProDom --
4
0
-600 ZOO 400
800
Figure 9.9 Left: approximate 2D embedding of the dissimilarity representation D(T,T ) for the ProDom data. Right: the eigenvalues derived in the embedding process.
The drssimilarity representation f o r pattern recognition
408
Polydisth, #Objects: 1000, Classifier NQC
+
v -+
f)
+
-+
k-NN-DS EdiCon-1-NN Random* KCentres * FeatSel * KCentres-LP "
_ _ _ _ _ - ---
Number of prototypes Polydistm, #Objects 1000; Classifier NQC
k-NN-DS EdiCon-1-NN Random' -+ RandomC' + ModeSeek' * KCentres * + Featself -8- KCentres-LP * +
7
-+
Ok
6 -0
30 40 55 70 I00 140 260 Number of prototypes
Figure - classification error of the NQC* and the k-NN - 9.10 Polygon data. Average classifiers in dissimilarity spaces as well as of the direct k-NN rule as a function of the number of selected prototypes. "
I
used in the plots. The supervised methods, the KCentres-LP and FeatSel are plotted by continuous lines, the unsupervised selections are plotted by dash-dotted lines and the random methods are plotted by dashed lines. Our cxperiments are based on M repetitions, i.e. M random selections of a training set. M = 10 for the Prodom and Zonglcer-all dissimilarity data and M = 25, otherwise. The remaining parts of the data are used for testing. Different selection procedures use the same collections of the training and test sets. The averaged test errors are shown in the figures. To maintain the clarity of the plots, we do not present the resulting standard
Classification
409
NIST-38; #Objects. 200; Classifier. NQC 10 5
-8. 10 g
95
v
2 9 & 85 5
c
g c
k-NN-DS EdiCon-l-NN
+
-
T
8
* KCentres *
7.5
+
: : 7 65
+
%
FeatSel' KCentres-LP *
---_-_-
% 6 9 $ 55 4
5 4.5
2
3
4
6
8 10 14 20 30 40 Number of prototypes
60
100
Zongker-12; #Objects, 200, Classifier NQC
+
-7
-+
0 2
3
4 5 6
8 10 14 20 30 40 Number of prototypes
60
k-NN-DS EdiCon-l-NN Random' RandomC*
100
Figure 9.11 NIST digit data. Average classification error of the NQC* and the k-NN classifiers in dissimilarity spaces as well as of the direct k-NN rule as a function of the number of selected prototypes.
deviations. In general, the standard deviations vary between 3% and 7% of the averaged errors. Fig. 9.10 shows the results for the two dissimilarity measures derived from the same set of polygons. Remember that the Polydisth is metric and Polydistm is not. The first striking observation is that in spite of its non-metric behavior, the Polydistm results are better: lower NN errors, less prototypes needed to yield a good result. Just 20 prototypes out of 1000 objects are needed to obtain a better error than found by the NN rules. In the k-NN classifiers, the average optimal k appeared to be 127 (Polydisth)
T h e dissimalarity representation f o r p a t t e r n recognition
410
GeoSam; #Objects 500, Classifier: NQC 22 '
k-NN-DS EdiCon-l-NN Random' -0RandomC* + ModeSeek' * KCentres * FeatSel * + KCentres-LP +
7
-+
0
3
4
6
8 10 14 20 30 4050 70 100 150 Number of prototypes
GeoShape, #Objects: 500, Classifier NQC
---+
k-NN-DS
v EdiCon-l-NN
*
*
3
4
6
Random' RandomC' Modeseek' KCentres ' FeatSel * KCentres-LP '
8 10 14 20 30 4050 70 100 150 Number of prototypes
Figure 9.12 Geophysical data. Average classification error of the NQC* and the k-NN classifiers in dissimilarity spaces as well as of the direct k-NN rule as a function of the number of selected prototypes.
or 194 (Polydsstm,). These large values correspond to the observation made before in relation to the scatter plots (Fig. 9.4) that this is a difficult data set. Nevertheless, in the case of the Polydistm data, the linear programming technique finds a small set of 55 prototypes for which the NQC error is very low (0.4%). The systematic procedures KCentres (KCentres-LP) and FeatSel perform significantly better than the other ones. The feature selection is also optimal for small representation sets. Notice also the large difference between the two results for editing and condensing. They are based on the samc sets of prototypes, but the classification error of the 1-NN rule (in
Classification
411
Wine, #Objects: 108; Classifier. NQC
+
v -+
-0+
+
3
4
8 10 14 20 2530 Number of prototypes
5 6
40
k-NN-DS EdiCon-I-NN Random' RandomC' Modeseek* KCentres *
54
Figure 9.13 Wine data. Average classification error of the NQC* and the k-NN classifiers in dissimilarity spaces as well as of the direct k-NN rule as a function of the number of selected prototypes.
Ecoli-pO8; #Objects: 165; Classifier: NQC
4' 3
"
4
5
'
7
I
10 14 20 30 Number of prototypes
45 60 75
Figure 9.14 Ecoli-pU8 data. Average classification error of the NQC* and the k-NN classifiers in dissimilarity spaces as well as of the direct k-NN rule as a function of the number of selected prototypes.
fact a nearest prototype rule), EdiCon-1-NN, is much worse than of the NQC, the EdiCon, which is trained on D ( T ,R). This also remains true for all considered problems, as can be observed in other plots. Fig. 9.11 shows the results for two of the NIST digit classification problems. The NIST-38 data set is based on a Euclidean distance measure,
412
The dissimilarity representation f o r pattern recognition
Prodom, #Objects: 913; Classifier. NQC
k-NN-DS EdiCon-1-NN Random' -* RandomC* + Modeseek* * KCentres' + FeatSel * + KCentres-LP ' +
v
-+
0" 8 10
'
14
'
20
I
30 40
60
100 140 200
320
Number of prototypes
Figure 9.15 Four-class ProDom problem. Average classification error of the NQC* and the k-NN classifiers in dissimilarity spaces as well as the direct k-NN as a function of the number of selected prototypes. The result for the LznProg is not visible, since it finds a representation sct of 491 objects.
whilc the Zongker-I2 relies on a nori-metric shape comparison. The k-NN classifier does not improve ovcr tlie 1-NN rule, indicating that the data set sizes (100 objccts per class) are too small to model the digit variabilities properly. Again. the systematic procedures do well for small representation sets, but they are outperformed by the KCentres routine for a larger riurriher of prototypes. The KCentres method distributes the prototypes evenly over the classes in a spatial way, that is related to tlie dissimilarity information. For small training sets (here 100 examples per class), this may be a better than an advanced optimization. Fig. 9.12 presents the results for the two dissimilarity representations of the geophysical data sets. From other experiments it is known that they are highly multi-modal, which may explain good performance of the ModeSeek for the GeoShupe problem and the KCentres for the GeoSam problem. Editing and condensing does also relatively well. Feature selection works also well for a small number of prototypes. Overall, the linear programming yields good results. Recall that we take the KCentres results as a start (except from the final result indicated by tlie squarc marker that starts from the entire training set), so tlie KCentres curve is for lower numbers of prototypes underneath it. In this problem we can hardly improve over the NN performance, but still need just 5% - 10% of the training set size for prototypes. In the next subsection, however, it is shown that these results
Classification
413
can still be significantly improved by modifying the dissimilarity ineasiire. So far, two-class classification problems have been discussed. To illustrate what may happen in multi-class situations, tjhe following problems are also considered: the three-class Wine and Ecoli data, the four-class ProDom data and the ten-digit Zonyker-all data. Although the Wine and Ecoli data are originally represented by features, their tp-distance representations can be used to show our point. In all experiments with the NQC, a small regularization, X = 0.01, is used; see Sec. 4.4.2. A regularization is necessary since for large representation sets, the number of training objects per class is insufficient for a proper cst,irnat#ionof the class covariance matrices. For instance, 100 training examples per class are used for the Zongker-a,ll data. The results for R with more than 100 prototypes are based on the NQC trained in more than 100 dimensions. Tlie peak for exactly 100 prototypcs, see Fig. 9.16, upper plot, is caused by a dimension resonance phenonicnon that has been fully examined for the linear normal density based classifier in [Raudys and Duin, 19981. When a larger regularization is used in this case, the NQC performs much better, as observed in the bottom plot of the same figure. Fig. 9.13 shows the results for the Euclidean representation of tlie Wine data. The ModeSeek seems to work the best, howevcr sincc the number of test objects is small (70 in total), all the selection procedures behave similarly for more than 10 prototypes. The latter observation also holds for the Ecoli-pO8 data, as observed in Fig. 9.14. The number of test objects is also sniall (107 in total). Here, however, tlie NQC does not iniprove ovcr the k-NN on the complete training set. Still, 20 (or less) prototypes are needed for reaching the same performance. Fig. 9.15 illustrates the study on prototype selection for the ProDom data. The data are multi-modal, as it can be judged from the 2D approximate embedding shown in Fig. 9.9. Some of the modes seem to be very small, possibly corresponding to outliers. ModeSeek inay focus on such examples, and perform worse than the class-wise random selection. Tlie KCentres and tlie FeatSel methods perform the best. For 100 (an more) prototypes, the NQC reaches the error of the k-NN on a complete training set, however, it does riot improve it. This might be partly caused by unequal class cardinalities and too-small regularization parameter. The Zongker-all data are highly non-Euclidean and non-metric. When a proper regularization (A = 0.05) is used, the NQC significantly outperforms the best k-NN rule. However, when the size of the representation set is too large (450 prototypes in hottorn plot), the NQC starts to suffer.
The dissimilarity representation f o r p a t t e r n recognition
414
Zongker-all, #Objects. 1000, Classifier: BayesNQ 1-NN-final k-NN-final k-NN k-NN-DS EdiCon-1-NN Random ' RandomC * ModeSeek * KCentres ' FeatSel * KCentres-LP * LinProg * EdiCon *
?b
I 4 20
30
50 7 0 I00 I50 Number of prototypes
250
5lO
Zongker-all; #Objects: 1000; Classifier: BayesNQ 201,
i
'
,
4 1 ' " 10 14 20
,
1-NN-final k-NN-final k-NN k-NN-DS EdiCon-1-NN Random * RandomC * ModeSeek * KCentres * FeatSel * KCentres-LP * LinProg * EdiCon *
I
30 50 70 100 150 220 350 540 Number of prototypes
Figure 9.16 Ten-class Z o n g k e r problem. Average classification error of the NQC, with the regularization of X = 0.01 (upper plot) and X = 0.05 (bottom plot), and the k-NN classifiers in dissimilarity spaces as well as of the direct k-NN rule as a function of the number of selected prototypes.
Only 3% of the training examples allow this decision rule t o reach the same performance as the k-NN rule on the entire training set. In general, the KCentres works the best,. Edited and condensed set seems to give a good representation set, as well. Additional observations are of interest for multi-class problems. First, in contrast to the two-class problems, a suitable regularization is necessary, since it can significantly influence the performance of the NQC. If the regularization is appropriate, a significant improvement over the k-NN re-
Classzjication
415
sults on the complete training set may be found by the use of a regularized
NQC. Next, as in the two-class problems we find that just 3% - 12% of the training set gives a sufficient number of prototypes for the NQC to reach the same performance as the k-NN rule. Like before, systematic selections of prototypes perform best. Finally, the EdiCon works well and tends to determine less prototypes than the LinProg. In summary, systematic selections perform better than the random selection, but the differences are sometimes small. The way we have ranked the algorithms in the legends from the Random to the KCentres-LP selections, roughly corresponds to the way they globally perform over the set of conducted experiments. In the future also pairwise strategies can he explored Pekalska et al. [2005a]. Concave transformations of dissimilarity representations. Concave transformations of dissimilarity representations may improve the discrimination properties between the classes, when linear or quadratic classifiers are used in dissimilarity spaces. An example can be given by the e z p ( - $ ) ) - 1 applied to the sigmoidal transformation fsigm(z)= 2 / ( l square dissimilarities in an element-wise way. The transformed representation becomes then DSigln= (fsigm(d:j)). A nonlinear transformation is applied to square dissimilarities, which significantly changes the original dissimilarities. Since the sigmoidal transformation is monotonically increasing, the k-NN rule performs identically as for the original dissimilarities. To illustrate possible benefits of a sigmoidal transformation, ail experiment for the GeoSnm representation is performed for a fixed number of prototypes, that is K = 20 and K = 60. We can observe in Fig. 9.12, top row, that the best average performance of the NQC is approximately 10% for 20 prototypes. When a suitable parameter s of the sigmoidal transformation is chosen; the best average performance of the samc classifier is 696, which can be improved to 4% when 60 prototypes are considered. This can be seen in Fig. 9.17. So, the gain in performance is significant. The k-NN error based on the entire training set T of 500 objects (hence tested on D(Tt,, T)) is 9.6%. Note, however, that such nonlinear transformations do not inimediately guarantee improved performance. It is simply related to the discriminative properties of the dissirriilarity measure used. The parameter s was investigated in the range of [0.5dm,, 10d,l,,], whcre d,, is the average distance of the original representation D ( T , T ) . The best classification accuracy is reached for s M 3dm,. It may be obscrved, however, that a specific choice of s is not very crucial. For a rangc of possible
+
416
The dassamzlarzty representatton f o r p a t t e r n recognataon
-4 1
t . .....
-
~
G S i g m Random-60] Sigm KCentres-60 Orig Random-60 Orig KCentres-60
m
2
2t 0
05 1 15 Parameters of the sigmoid transformation
2
OO
05 1 15 Parameters of the sigmoid transformation
2
Figure 9.17 GeoSam: classification error (averaged over 25 runs) of the NQC in a dissimilarity space based on 20 prototypes (left) and 60 prototypes (right) chosen either randomly or by thc KCentres algorithm. The prototypes are selected for both the original dissimilarity reprcscntation (Orig) and its sigmoidal transformation (Sigm) as a function of the parameter s. The horizontal lines correspond to the classification errors for the original representations based either on 20 or 60 prototypes. The standard deviations of the means arc less than 0.5%. The k-NN error defined on the training set T of 500 examples, i.e. derived from D ( T t e ,T ) is 9.6% for the GeoSam. Since the sigmoidal transformation is monotonic, the k-NN results remain unchanged.
values of s , a significant performance improvement is achieved compared to the original representation. The NQC defined on the representation set R sclccted by thc KCeritres algorithm performs somewhat worse than in the case of a randomly selected R. The iritercsting point is that the transformed dissimilarity representations arc strongly non-metric and non-Euclidean. When s is very small. however, then Dslgmis nearly metric and nearly Euclidean. For s t [dm,.4d,,], on average 70.8% of triangle inequalities are disobeyed. The dcviation of tlic Euclidean bchavior is on average T$ = 28.2 and r,,n E - 30.5. which suggests large negative eigenvalues of the corresponding Gram matrices.
9.2.4
Conclusions
Prototype selection is an important topic for dissimilarity-based classification. By using a few, but well chosen prototypes, it is possible to achieve a better classification performance in both speed and accuracy than by using all training samples. Usually, prototype selection methods are investigated in the context of the metric k-NN classification considered for feature-based representations. In our proposal, a dissimilarity representation D ( T ,T ) is
Classification
417
interpreted as a vector space, where each dinlension corresponds to a dissimilarity to an object from T . This allows us to construct traditional decision rules, such as linear or qiiadra.t>icclassifiers on such representations. Hence, the prototype selection relies on the selection of the representation set R c T such that the chosen classifier performs well in a dissiinilarity space D ( . , R ) . Since the classifier is then trained 011 D ( T ,R ) , a better accuracy can he reached than by using the k-NN rule defined o n the set R. Various random and systematic selection procedures have been empirically investigated for the normal density based quadratic classier (NQC) built in dissimilarity spaces. The k-NN method, defined both on a complete training set T and a representation set R is used as a reference. The following concliisioiis can be made from our study with respect to tlie investigated data sets:
(1) By building the NQC in dissiniilarity spaces jnst a very srnall number of prototypes (such as 3% - 12% of t,he training size) is needed tjoobtain a similar performance as the k-NN rule on the entire training set. (2) For large representation sets, consisting of? for instance 20%) of the training examples. significantly better classification results arc obtained for the NQC than for the best k-NN. This holds for two-class problems and not necessarily for multi-class problems, urilcss a suitable regularization parameter is found. ( 3 ) Overall, a systematic selection of prototypes does better than a ram dom selection. Concerning tlie procedures which have a control over the number of selected prototypes, the KCentres procedure performs well, in general. In other cases, the linear programming performs wcll for two-class problems, while editing and coritlensirig sets should be preferred for multi-class problems. In our investigation, multi-class problems are more difficult a s they need a proper regularization for the NQC discrimination function. Moreover, this classifier becomes coniputatiorially more expensive. Therefore, there is a necessity to study more suitable classifiers arid other prototype selection techniques for multi-class problems. 9.3
Selection of the representation set: the embedding approach
In the ernbedding approach, one considers an embedding of the syriirnetric dissimilarity data D ( T ,T ) into a k-dimensional pseudo-Euclidean space
418
The dissimilarity representation f o r p a t t e r n recognation
& = R ( P > 4 ) , k = p + q such that the original dissimilarities are perfectly preserved. In this process, however, many dimensions can turn out to be non-informative since the variance in the data are close to zero. The variances of the projected data are specified by the eigenvalues derived in the embedding; see Secs. 3.5.3 3.5.6 for details. In fact, one determines the dinierision m = p'+q' based on the number of dominant eigenvalues, i.e. the ones which are significantly different from zero. The remaining k - m dimensions are simply neglected as corresponding to noise and non-significant information. If m is much smaller than N = ITI, then the question arises whether N objects are necessary to determine the m-dimensional space. In fact, only ( m 1) objects can define a linear space: one object will serve as a reference to the origin and m objects will correspond to the basis vectors. This is computationally attractive, since only dissimilarities to these ( m 1) objects need to be computed. ~
+
+
9.3.1
Prototype selection m e t h o d s
The task can now be formulated as follows. Given the rcpresentation X in R(p,q) that preserves the original dissimilarities, choose the representation set R of m S 1 objects such that the projection defined by R, (hence the space defined by D ( R ,R) with the remaining T\R objects projected later t o this space) gives a configuration which is close to X (according to some criterion). A set R. spanning the space R" = IW(P>q)such that RTnis defined by r n leading principal axes, might not, however, exist. To avoid an intractable search over all possible subsets. an error measure between the approximated and original configurations can be defined to be minimized, c.g. in a greedy approach [Verma, 19911. Here, our ultimate goal, however, is not thc best approximation of the given configuration X , but, good classification results in an embedded space. In fact. R should be chosen such that the discrimination between the classes is preserved or even improved. The following procedures are considered for the selection of R: Random, KCentrcs. MaxProj. APE. LAE, Pivots and NLC-err, as explained below.
Random.
m!+1 objects are randomly chosen from all training objects.
KCentres. m + l ccnter objects are chosen such that they minimize the maximum of the dissimilarities over all training objects to their nearest neighbors; see also Scc. 9.2.1.
419
Classz$catzon
Note that the two procedures above do not guara,rit>eea. faithful representation of the origindly embedded X . The procedures below focus more on this aspect. We start our reasoning from X , whose nieari vector coincides with the origin. To simplify the approach, the origin of the embedded space will now be fixed to the projection of the object po which is the closest to the origin. Such an object is easily detected as the one whose average square dissimilarity to T is the srnallest [Goldfarb, 1985; Pekalska and Duin, 2002s; Pekalska et al., 2002bl. Having determined p o . the entire configuration X is shifted to the new origin. So, since now on, X refers to a shifted configuration. Starting from PO,objects are now successively added in each step until m f l objects are found. In each step, an object is selected that minimizes a specified criterion. This does not guarantee the overall optimal solution, however, it guarantees the best immediate solution. Let Ro = { P O } arid let Rj-1 be the representation set after the (j-1)t h step.To assure that the chosen objects are linearly iridepenclent and to make the selection a feasible process, in the j - t l i step, oiily h I objects Z j = { z i . . . . , &} C T\Rj-1 with the largest (in magnitude) projections on the j - t h principal axis are pre-selected to be tested against the specified criterion. M is assigned to e.g. 10% of the training size. This holds for all criteria introduced below.
MaxProj. In each step, this criterion chooses an object yielding the largest (in magnitude) projection on the j - t h dimension. Average Projection Error (APE). Let €,-I b t a .7-diinensional subspace of the complete ernbedded space & = R(P>q)( p q = k ) , where &]-l is determined by R,-1 = { p o , p l , . . . , p J - l } . Based on the properties of the inner products and the embedding, and given that po is projected as X I at the origin, the square pseudo-Euclidean distance between a vector x, ER'"and its projection x?-' onto the approximation error can be expressed as:
+
where g,(n)is the i-th column of the cross-Gram matrix G(") arid G is the Gram matrix, where both G and G ( n )refer to the representations in &j defined by pairwise dissimilarities between j tl objects (i.e. t>lieorigin arid
420
T h e dissimilarity representation f o r pattern recognataon
the b a ~ i s ) ~Having . chosen the set R,-1 = { p o , p l , . . . , p 3 - l } , in the 3-th step, an object z E 2 3 is selected as p , such that the average projection error eapr(x,) onto the space E,, defined by {RJ-l, x} (hence €, is determined by projecting D([R,-1,2], [R7-l,z ] ) )is the smallest.
Largest Approximation Error (LAE). Having chosen the set R,-l = { p o , y 1 , . . . ,p,-1}, in the j - t h step, an object ~ € 2 3 is selected as p3 as the onc which yields the largest approximation error Eq. (9.2) of' z onto the space EJ-', defined by R,-l. Since in the first step, the inner products cannot be defined yet, eapr(x,)is assumed to be equal to d 2 ( p , , p o ) , where p o is the object closest to the origin in the embedded space, as described before. NLC-err. Starting from R = { P O } , in the j - t h step, an object x E ZJ is selected as p , as the one for which the embedded configuration X , of D ( T ,R 3 ) allows for reaching the smallest 5-fold cross-validation error of the NLC (or other chosen classifier). In case of ties. an object with the largest projection on the j - t h axis is chosen. Pivots. Choose 7 times two pivot objects as described in the FastMap algorithm iii Src. 3.6.1. The above criteria select the representation set R as appropriately defined by R,. Their results can be judged by various measures. For instance, to see how much distortion was introduced by the approximation step (hence the selection of R),the mean square error between the original and approximated dissimilarities can be computed. Another possibility is the computation of the average between-class square distance to the average within-class square distance, again on both original and approximatcd 'Givcn a symmetric matrix D ( R , R ) ,a linear embedding into & = W m = RQ(P'r4') can be constructed such that the origin coincides with the vector represcntation of e.g. XI. Since by our assumption IIxzIIz = llxz - 011; = dz(xi,xl)= d2(pz,po)holds, then the Gram matrix (a matrix of inner products) G = { g z J } for the vector representation {xl.xa,. . . , xn} E € is expressed by using the pseudo-Euclidean distances as gzJ = (xz,x,,) = -1 [d2(xirx3)d'(x,,x~) d2(xj,x1)]. By the eigendecompo~
sition of G = &AQT =
~
(QIAl~)Jpq(QIAl~)T, X can be represented in the space &
as X = Q T Y L IA,, 14, where m reflects the number of eigenvalues, significantly different from zero. Novel objects D ( n ) = D ( T ( n ) , R )are then orthogonally projected onto & as X ( " ) . Based on t,he matrix of inner products G ( n ) = {gt,"'} consisting of g ( n ) = LJ
-;[~'(x:~),x~)- d2(x,,x1) d2(x4n),xl)], X ( ' L )is given by X ( n ) = G ( " ) X I A I - J p q or X ( " ) = G ( " ) G - ' X . This is similar the projection presented in Sec. 3.5 with the difference that a specified object is mapped to the origin. ~
42 1
Classi~cation
Table 9.4 Dissimilarity data sets used in the experiments. T and Tte correspond to the training and test sets, respectively. 1 . I stands for the set cardinality. A rough estimation of the effective intrinsic dimension I D relies on the number of significant eigenvalues in the embedding of D ( T ,T ) ,while ID refers to the number of indicative dimensions, in general.
Polydistm NIST-38 Zongker-12 GeoSam GeoShape ProDom
Dissimilarity
Property
Hausdorff Mod. Hausdorff Euclidean Template-match SAM [Landgrebe, 20031 Shape C 1 Structural Template-match
M, nE nM E nM M,nE M ,nE
nM nM
5 6
10
18 10
80 80
dissimilarities. It gives an indication on the class separability. Since, in fact, our purpose is the classification task, it is not crucial that the distances are well preserved when the classification performance is good. For this reason, we focus on the resulting classification error.
9.3.2 Experiments and results Most of the data sets that are used in our study are the one analyzed for prototype selection methods in the dissimilarity space approach; see Sec. 9.2.2. The experiments are performed M = 25 times for two-class data and M = 10 times for multi-class data, and the results are averaged. In each run, data sets are randomly split into the training and test sets, as indicated in Table 9.2. In each experiment, m + l prototypes are either directly selected by the Random or the KCentres approaches, or based on the dissimilarity matrix D ( T ,T ) . First the complete k-dimensional representation of D ( T ,T ) is found and then the set R of , m + l objects is chosen according to a specified criterion. Next, the approximated space, dcfined by objects from R is determined (i.e. the mapping based on D ( R ,R ) only), where additional T\R objects are projected. The NLC (equivalent to the FLD for two equally probable classes) is then trained both in the reduced and approximated spaces and the generadizatjionerror is computed for thc test set. Here, we h w e decided for a fixed and simple classifier, the NLC; although, in some cases it is not the best choice. As a reference, thc rcsults of the 1-NN and the best k-NN rule on the entire training set T , i.e. determined by D(T',, T ) ,are provided.
422
T h e dissamilaraty representation f o r p a t t e r n recognation
Poldisth; #Objects 1000; Classifier: NLC
"5
7
10 14 20
30
m
50 70 100
200
4- Random
--t = = Kcentres
+ MaxProl + APE +
LAE
* PlVOtS s NLC-err - 1-NN k-NN
Figure 9.18 Polygon data: classification error (averaged over 25 runs) of the NLC in an m-dimensional embedded space as a function of m for the Polydisth data (left) and the Polydistrn data (right). Except for 'ALL', other criteria choose a representation set R of m + l objects, which serves for the determination of an embedded space and training the NLC there. 'ALL' stands for the NLC results, where the m-dimensional embedded space is found by using all training objects. The N N results are based on R = T , i.e. 600 objects and are given as a reference. For the P o l y d i s t h data, the error curve corresponding to the random selection is not visible, since it lies above the given scale. Additionally, also the classification error of the NQC for the Polydasth is shown for R chosen as pivot objects or by hlaxProj criterion.
The results of our experiments are presented in Figs. 9.19 9.22. The standard deviations are not shown there to maintain the clarity of the plots. In general. thc standard deviations vary from 2% to 7% of the averaged classification errors. The number m of important dimensions (hence an indication on the cardinality of the set R, since lRI = m + l ) is related to ~
423
6'
8
10 12
15 rn
20
25
30
40
Zongker-12, #Objects 200, Classifier NLC
++ Random
1+
Kcentres MaxProj _..........._ + APE + LAE * P1vots ++ NLC-err - . ALL - - 1-NN k-NN -t
b 525
-
~
L
a, 0)
7J
m
g 0. Q
5
7
10
20
14
30
40
rn
Figure 9.19 NIST data: classification error (averaged over 25 runs) of the NLC in the m-dimensional embedded space as a function of m for the NIST-38 data (left) and the Zongker-12 data (right). All, but 'ALL' criteria choose a representation set R of 7n+l objects, which serves for the determination of a n embedded space and training the NLC there. 'ALL' stands for the NLC results, wherc thc m-dimensional embedded spare is found by rising all training objects. The NN results are based on R = T , i.e. 200 objects and arc given as a reference. Note that scale differences.
complexity of the given classification problem. This is somewhat related to the intrinsic dimension. As observed in Figs. 9.4 9.6, every dissimilarity probleni has a different intrinsic dimension m (determined by significant ' can eigenvalues in the embedding). By a visual judgment, the estirriat ions be made; see Table 9.4. So, ideally, our selected representation set R could consist of m + l objects. This, however, might not be sufficient. simply, because an approximation is made by using only the set R (instead of' T ) to ~
424
T h e dissimilarity representation f o r pattern recognition
define an ernbedded space. Moreover, the additional difficulty arises when the classes are riot linearly separable. If the linear classifier is not adequate for the embedded configuration (because the boundary is e.g. quadratic) , the classification error might be large. So, the choice of the representation set as well as the discrimination function plays a significant role in solving the classification task for the given R. In our study, the NLC has been selected, which might not be optimal. As observed before in Figs. 9.4 - 9.9, the following observations are important for the embedded space approaches: 1. Both Polydisth and Polydistm dissinlilarity data are strongly nonEuclidean. The intrinsic dimension is snmller for the Polydistm. tkian for the Polydisth. Also, as judged from the 2D spatial niaps, the classes for these problems are overlapping (in a two-dimensional approximate embedding space), yet, they are more compact for the Polydistm than for the Polydisth. For the Polydisth embedding, the classes may seem to be uniformly distributed. 2. The NIST digit,s appear to be linearly separable as shown in Fig. 9.5 for the 2D approximate embeddings. The intrinsic dimension is srnall for the NIST-38 case, while larger for the Zongker-12 data. 3 . The multi-modality of the geophysical data can be observed in cluster tendencies that are visible for the 2D approximate ernbeddings. Both sets seem to have a low intrinsic dimension. 4. The ProDorn data are nearly Euclidean.
Concerning classification performance in embedded spaces, the Polydisth problem is more difficult than the Polydistm problem; see Fig. 9.18. Indeed, the Polydistm classes are linearly separable and the effective intrinsic dimension is small. The NLC, based on all objects gives nearly a, zero error. The same can be achieved for 41 prototypes in the representation set R. Only 14-20 objects in the set R, chosen in a systematic way, make NLC, perform better than the best k-NN rule defined on R = 600 objects. Since there is a ‘big gap’ between the NLC error curve in an embedded space defined by all training examples and the NLC error curves in an embedded space determined by some prototypes only, we tend t,o think that the NLC might be not the most suitable classifier for this problem. Additionally, the NQC error curve is presented for the MaxProj and Pivots selection criteria iw: they give the best results. The generalization error decreases, however it does not improve over the NLC result found in an embedded space defined by all objects. The representation set R of 70
Classification
425
GeoSam; #Objects 500; Classifier: NLC
.
38 10
14
20
L .-
30 40
~~
60
90
150
60
90
150
rn
“8
10
14
20
30 40 rn
Figure 9.20 Geophysical data: classification error (averaged over 25 runs) of the NLC in the m-dimensional embedded space as a function of m for the GeoSam data (left) and the GeoShape data (right). All, but ‘ALL’ criteria choose a representation set R of m + l objects, which serves for the determination of a n embedded space and training the NLC there. ‘ALL’ stands for the NLC results, where the m-dimensional embedded space is found by using all training objects. The NN results ar? based on R = T , i.e. 500 objects and are given as a reference. Additionally, also the performance of the NQC is shown for the GeoShape and R chosen by the KCentres procedure.
objects chosen by the Pivots or by the MaxProj method allows the NLC to reach a similar performance as the k-NN based on all training objects. Note also that the NLC-err criterion should be preferred selects for small representation sets, however. As observed in Fig. 9.19, the NLC in an embedded space defined by 10 prototypes for the NIST-38 data and defined by 5 prototypcs for the Zongker-I2 data outperforms the best k-NN defined on all (200) training
426
The dissimilarity representation for pattern recognition
Prodom; #Obiects 913: Classifier: NLC
Figure 9.21 Four-class Prodom data: classification error (averaged over 25 runs) performance of the NLC in the m-dimensional embedded space as a function of m . All, but ‘ALL’ criteria choose a representation set R of m + l objects, which serves for the determination of an embedded space and training the NLC there. ‘ALL’ stands for the NLC results, where the m-dimensional embedded space is found by using all training objects. The N N results are based on R = T , i.e. 913 objects and are given as a reference. The lack of a proper regularization in the NLC makes some of the error curves grow up.
Zongker-all; #Objects 1000; Classifier: NLC
m
Figure 9.22 Ten-class Zongker data: classification error (averaged over 25 runs) performance of the NLC in the m-dimensional embedded space as a function of m. All, but ‘ALL’ criteria choose a representation set R of m+1 objects, which serves for the determination of a n embedded space and training the NLC there. ‘ALL’ stands for the NLC results, where the m-dimensional embedded space is found by using all training objects. The NN results are based on R = T , i.e. 1000 objects and are given as a reference.
Classzjication
427
objects. The Zonglcer-22 problem is linea,rlyseparable and the NLC defined on IRl = 20 objects reaches a nearly zero error for the Pivots and tlie LAE selection methods. The prototype selection procedures also seem to work well to fit the NIST-38 data, since both systeniatic and random approaches allow one to reach an accuracy close to the one reached by the NLC in an embedded space based on all training examples. From our earlier observations, we already know that the geophysical data are multi-modal. This means that, a linear chssifier in an embedded space will not fit the problem well. Yet, as observed in Fig. 9.20. the classes can be reasonably separated for the GeoSani problem. The representation set R of 30 examples defines an embedded space such that the NLC constructed there outperforms the best k-NN rule based on 500 training objects. However, for the GeoShupe problem, the NLC performs much worse than for the GeoSam. In fact, the NLC does not outperform the best 1-NN rule. This becomes, however, possible, when a quadratic classifier is used (see Fig. 9.20, right) for the KCentres criterion. In four-class Prodom problem, Fig. 9.21, some error curves grow with the increasing m. This is the side-effect of the lack of proper regularization in the NLC. The KCentres and the APE criteria seem to work well, however, in this case, the k-NN rule based on all training examples is tlie best. Concerning the ten-class Zongker problem, Fig. 9.22, at least 120 objects in the representation set R are needed such that the NLC in an ernbedded space defined on D ( T ,R) outperfornis the k:-NN. All in all, there is no single selection method that works the best for all m (which is also the size of the representation sets). For small representation sets, the NLC-err, the supervised selection based on the crossvalidation NLC error in an embedded space is always the best. This is riot surprising, since an embedded space is chosen to guarantee the best NLC performance. However, for larger representation sets, this method may b e come significantly worse than the other systematic selection procedures. The KCcntres approach seems to be good for multi-modal problems (thc GeoSam, the GeoShupe and the-class Zongker data), since the found prototypes represent the clusters. The two methods that especially focus on the preservation of the original ernbedded configuration, i.e. the APE and the LAE, are not significantly better than the other approaches. This again may suggest that the goal of classification slioiild determine the way the objects are chosen for R. In principle, all systematic approaches considered here may work well. The random selection, although not best, but it is also never the worst.
428
The dzssimilarity representation for pattern recognition
In comparison to the prototype selection methods investigated in the dissimilarity space approach, Sec. 9.2, somewhat different conclusions can be drawn with respect to the specific data (compare plots in Sec. 9.2.3 with the plots in the current section). The GeoSam is judged as an easier problem than in the dissimilarity space approach, while the GeoShape, the other way around. Also, the Polydisth problem seems to be better attacked by the dissimilarity space approach, while the NIST-38 can be better discriminated in an embedded space. Such observations indicate that both dissimilarity and ernbedding space approaches should be studied for choosing the best recognition strategy.
9.3.3
Conclusions
Important conclusions can be drawn from our study on dissimilarity data embedded in pseudo-Euclidean spaces. First of all, the NLC, built in an embedded space defined by all training objects can significantly outperform the k-NN rule. Secondly, a representation set R of less than 20% of the training size can be selected, on which the approximated space is defined. In such an approximated embedded space, the NLC can reach the same or even a much higher accuracy than the best k-NN rule based on all training objects (this holds for the GeoShape provided that the NQC is considered instead). Thirdly, the KCentres procedures work well for multi-modal data. For a small number of prototypes and a non-separable classification problem, the criterion based on the classification error (here, the NLC-err) should be recommended. Finally, we have observed that similarly as in the dissimilarity space approach, a random selection is also beneficial. In this study, m+l objects were used to define an m-dimensional approximated embedded space. It is also possible to use more objects to define t h t space. This remains an issue for further research.
9.4
On corrections of dissimilarity measures
We do not require metric properties of a dissimilarity measure d in the dissimilarity space approach or in the embedding approach. (However, d should be nonnegative and should obey the reflexivity condition, Def. 2.38.) We demand that the compactness hypothesis is fulfilled by dcsigning a measure which yields small values for objects that share many commonalities. This guarantees that such a measure is meaningful for the problem, i.e. the classes of objects will have a compact description. Ideally, we would like
Classification
429
to guarantee a true representatzon which requires that by a comparison of dissimilar objects, a large dissimilarity value is obtained. Although our approaches to dissimilarity representations can handle arbitrary measures, an open question refers to possible benefits of correcting the measure to make it metric or even Euclidean [Courrieu, 2002; Roth et al., 20031. Metric or Euclidean distances can be interpreted in appropriate spaces, which posses many useful algebraical properties and where an arsenal of discrimination functions exists. This might also be interesting for the Ic-NN rule. sirice metric properties allow for a construction of a faster approximation rule; see e.g. [Moreno-Seco et al., 20031. Here, we investigate ways of making a dissimilarity measure either ‘more’ Euclidean or ‘more‘ metric and the influence of such corrections on the performance of some decision rules. We will experimentally show that the corrected measures do not necessarily guarantee a better discrimination. These results can also be found in [Pckalska et al., 2004bI.
9.4.1
Going more Euclidean
We know from Sec. 3.5 that the Gram matrix G = -i.JD*2.J is positive semi-definite (psd) iff a symmetric distance matrix D is Euclidean. Consequently, if G has p positive and q negative eigenvalues, D is nonEuclidean and a perfectly embedded Euclidean configuration X cannot be constructed. However, D ca,n be corrected such that it becomes Euclidean, which is equivalent to making the corresponding Gram matrix G psd. Possible approaches to address this point were discussed in Sec. 3.5.2. Here, they are briefly mentioned: 0
0
Clipping. Only p positive eigenvalues are considered yielding a p1 dimensional configuration X = Q pA;. Now, after neglecting the negative contributions, the resulting Euclidean representation overestimates the actual dissimilarities. Adding 2 7 . There exists a positive 7 2 -Amin, where Amirl is the smallest (negative) eigenvalue of G, such that Dz, = [D*2 2 7 ( 1 l T - 1 ) ] * ; is Euclidean [Gower, 1986; Pekalska et al., 2002bl. This means that the corresponding G, is positive definite. In practice. the eigenvectors of G and G, are identical, but the value r is added to the eigenvalues, giving rise to the new diagonal eigenvalue matrix A, = Ak 71.The original dissimilarities are distorted significantly if T is large. Adding n. There exists a positive n 2,,,,A, where, , ,A is defined in Theorem 3.19, such that such that D , = D rc. (llT-I)is Euclidean.
+
+
0
+
430
The dissimilarity representation for pattern recognition
Table 9.5 Non-Euclidean and non-metric aspects of dissimilarity representations used ~ T~~~ ~ and , c indicate the smallest and for experiments in Sec. 9.4.1. The ranges of T largest values found for D ( R , R), where IRl varies between 30 - 500 or 10 - 200 for the digit and polygon data, respectively. As a reference, the last two columns present the average and maximum dissimilarity for the complete data.
1.2
3.1
0.7
The corresponding Gram matrix G, has the eigenvalues and eigenvectors which are different than these of the original Gram matrix G . Power or Sigmoid transformation. There exists a parameter p such that D, = (,y(d,j; p ) ) is Euclidean for a concave function g defined as y(x) = 9 with p < 1 or as a sigmoid g(z) = } - 1 [Courrieu, 20021. In practice, p is determined by a trial and error.
l+exz+4
These approaches transform the problem such that a Euclidean configuration can be found. It is, however, still possible that the applied corrections are less than required for imposing the Euclidean behavior. In such cases. the measure is simply made ‘more’ Euclidean (hence, also ‘more’ metric), since the influence of negative eigenvalues become smaller after proper transformations. An additional point to realize is that in case of approximate ernbeddings of a fixed dimension the spaces derived from D and DzT will differ. This is caused by the fact the dimensions corresponding to the negative eigenvalues become now the less important (by adding 7 to all eigenvalues, the negative ones become the closest to zero) in the latter case, so they will riot be selected. So, if the negative eigerivalue contributions are large, the corresponding cigcnvectors will reprcsent the space obtained from D . This means that the spaces obtained from an approximate enibedding of the original dissimilarity data and the corrected ones are very difkrcnt if the dissimilarity measure is highly non-Euclidean. 9.4.2
Experimental setup
Five dissimilarity data are used in our study; see Appendix D.4.5 for details. The first two sets refer to the dissimilarity representations built on the contours of pen-based handwritten digits [Hettich et al., 19981. The digits
Classzfication
43 1
are represented by strings of vectors between the coritour points for which an edit distance with a fixed insertion and deletion costs and with some substitution cost is computed. The substitution costs such as ari angle arid a Euclidean distance between the vectors lead to two different representations [Bunke et al., 20011, denoted as Pen-dist and Pen-angle, respectively. Both measures are non-Euclidean and non-metric. Here, only a part of the data consisting of 3488 examples, is considered. The values are also scaled by a coristant t o bound the dissimilarities. The digits are unevenly represented; the class cardinalities vary between 334 and 363. Another dissimilarity data set consisting of 2000 examples evenly distributed over ten classes describes the NIST digits [Wilson and Garris, 19921. Here, the dissimilarity measure based on deformable template matching [Jain arid Zongker, 19971 is used. The data arc referred as the Zongker dissimilarity data. The last two representations are derived for randomly generated polygons. They consist of convex quadrilaterals and irregular heptagons. The polygons are first scaled and then the Hausdorff and modified Hausdorff distances, defined in Sec. 5.5, between their vertices are computed, yielding the Polydisth, and the Polydistrra dissimilarity data. The two classes are equally represented by 2000 objects. If the dissimilarity d is Euclidean, then for a symmetric D = (di,,). all eigerivalues A, of the corresponding Gram matrix G are rion-negative. Hence, the magnitudes of negativc eigerivalues show the deviation from the Euclidea,n behavior. An indication of such a deviation is given by T n E = J X m i n J / X r m M ,100, x that is the ratio of the smallest negative eigerivalue rr61,L to the largest positive one. The overall contributiori of negative eigenvalues can be estimated by r:: = C,”=,JA71’ 100. Both thcse indices come from Eq. (9.1). Any symmetric D can also be made metric by adding a suitable value c to all off-diagonal elements of D . Such a constant can be foiind as c = maxp,q,t/dPn+dPt-dqtI. A smaller value imposing a niet>ric behavior of D was determined by us in a binary search. Table 9.5 provides suitable information on the Euclidean arid metric aspects of the measures considered. The following observations can be made: 0
0
The Pen-angle data set is moderately non-Euclidean and iiearly metric. The PerL-dzst data set is both moderately non-Euclidean and nori-metric.
The Zongker data set is highly non-Euclidean and highly non-metric. Thc Polydisth data set is highly non-Euclidean, yet metric. The Polydistm data set is moderately non-Euclidean and slightly nonmetric.
432
T h e dissimilarity representation f o r p a t t e r n recognition
The experiments are repeated 50 times for the representations sets of various cardinalities and the results are averaged. The representation objects are randomly selected. The cardinality lRI varies from 3 to 50 examples per class (ten classes) for the digit data sets and from 5 to 100 examples per class (two classes) for the polygon data. For each 1R1, two cases for the training set T are considered: T = R and T consisting of 100 or 200 objects per class for the digit and the polygon dissimilarity representations, respectively. In the latter case, the ratio of ITI/IRI becomes smaller with the growing (R(. The test sets consist of 2488, 1000 or 3600 examples for the pen-digit, NIST digit and polygon data, correspondingly. For each dissimilarity representation, the k-NNrule is considered, as well as the linear discriminant , the NLC, built in both embedded and dissimilarity spaces. The embedding is derived from D ( R ,R), but additional objects T\R, if available, are projected there and used for constructing the classifiers. To denoise the data and avoid the curse of dimensionality, the dimension of the embedded space was fixed to 0.31R1, so the dimensions corresponding to insignificant (small in magnitude) eigenvalues are neglected. Also the principal coniponerit analysis (PCA) [Fukunaga, 1990; Duda et al., 20011 was applied in the dissimilarity space D ( . , R ) to reduce the dimension to 0.3jR1. In both cases, although the dimensions are reduced, the spaces are still defined by all representation objects. 9.4.3
Results a n d conclusions
Adding a constant t o the dissimilarities or applying a concave transformation preserves their order, hence it does not influence the behavior of the k-NN rulc. However, during clipping (where all negative eigenvalucs are neglected in the embedding process), the recomputed Euclidean distances non-monotonically differ from the original ones, hence the k-NN rule will behave differently. Also both embedded and dissimilarity spaces change, so a linear classifier will change as well6. In our experiments, we study the influence of such corrections on the given measures for various representation sets R. For this purpose, a proper K and a proper T guarantecing the Euclidean behavior are determined. Two concave transformations are also additionally considered: the square root (which makes the dissimilar"Adding a constant is not worth doing in a dissimilarity space, since a constant shift is then applied to all D,, , but the self-dissimilarity stays the same, that is D,, = 0. Because of that, the classifier performance is expected t o stay the same or worsen somewhat. On the other hand, if we apply the shift s to all dissimilarities, the constructed classifiers should be the same, since all vectors D ( . ,R ) are shifted by the same vector s l .
433
Classafication
NLC; ps.-Euclid. space; T=R; dim=O.3IRI
"
10
20 30 (RJper class
40
NLC, PCA-dissim space T=R, dim=O 31R/
__
"
50
NLC; ps.-Euclid. space; RcT; ITI=100; dim=O.3IRJ
20 30 JR/per class
10
40
50
NLC; PCA-dissim. space; R cT: (TI=100; dim=O.3IRI 8-
7-6
8
~5:c 0 tl4-
U
F3aJ
2 2~ I
-
0
30 ' IR1 per class
20
40
50
SRQC; ps.-Euclid. space: R c T ; ITI=100; dirn=O.3IRI
OL
20 30 IRI per class
40
50
40
50
SRQC, PCA-dmm space, RcT ITI=100 dlm=O 31RI
O
-2
10
20 30 /RI per class
- 0
L
-
10
20
30 IRI per class
40
1 50
Figure 9.23 Pen-angle dissimilarity data. Classification error of the NLC (top and middle rows) and the SQRC (bottom row) as a function of the number of representation objects. The error is estimated over 50 runs. The standard deviation of the mean error reaches on average 0.3%.
T h e dissimilarzty representatzon f o r p a t t e r n recognztion
-
NLC; ps.-Euciid. space: T=R: dim=O.3IRl
NLC: PCA-dissim. soace: T=R: dim=O.3IRI
1 10
20
30
IRI per class
40
30
20
40
50
IRI per class
NLC: ps.-Euclid. space: R d ; ITI=100; dim=O.3IRI
NLC; PCA-dissim. space: RT ITI=100; dlm-0 31RI
~
~~~
-1
I~
10
20 30 JRIper class
40
SRQC: ps.-Euclid. space; R cT; IT/=100; dim=O.3IRI
O1
10
Figure 9.24
30 /RI per class
20
40
ol-
o;
-
10
50
-
A
L A - --A
30 IRI per class
20
40
50
SRQC; PCA-dissim. space; RcT; IT/=100; dim=0.3JRI
I O1
1'0
20
30
40
50
JRIper class
Pen-dzst dissimilarity data. Classification error of the NLC (top and middle
rows) arid the SQRC (bottom row) as a function of the number of representation objects.
The error is estimated over 50 runs. The standard deviation of the mean error reaches on average 0.3%.
Classzfication
NLC; ps.-Euclid. space; T=R: d1m=0.3(RI
clip 5-NN
NLC; PCA-dissim. space; T=R; dim=O.3IR/
I 30 IRI per class
10
20
40
435
50
NLC; ps.-Euclid. space: R cT; ITI=100; dirn=O.B(R(
'1
-.- 1-NN ,
cl!DS-"]
10
cljp 5-NN
10
I
! 50
NLC; PCA-dissim space, R cT; lTI=100; dirn=0.3lRl
L
SRQC; ps.-Euclid. space; R cT; ITI=100; dim=O.3IR
~
20 30 IRI per class
10
/RI per class
, 40
4- 5-NN
A
-
20 30 IRI per class
~
40
50
SRQC, PCA-dissim space, R c T /TI=100 d i m 0 31RI
L c l i p
5%"
1
, 20 30 /RI per class
I -L
20
30
/R(per class
40
50
10
40
50
Figure 9.25 Zongker dissimilarity data. Classification error of the NLC (top and middle rows) and the SQRC (bottom row) as a function of the number of representation objects. The error is estimated over 50 runs. The standard deviation of the mean error reaches on average 0.3%.
T h e dassimilarity representation for pattern recognitaon
436
NLC, ps.-Euclid space: T=R; dim=O.3IRI 14;
CllP& I -
20
40
60
IRI per class
80
100
NLC: ps.-Eucltd. space; R 6;ITI=100; dim=O.3IRI 14r
NLC; PCA-disstm. space; R cT; /T/=100; dim=0.3/RI
4 orig
add K add 27 S W
sigm cllp I-NN
40
60
80
100
IRI per class
Figure 9.26 Polydisth dissimilarity data. Classification error of thc N L C as a function of the number of representation objects. The error is estimated over 50 runs. The standard deviation of the mean error reaches on average 0.3%.
ity measures closer to Euclidean, yet still non-Euclidean) and the sigmoid with thc slope s being the average dissimilarity between the representation objects. The measures are non-Euclidean, but less than the ones originally given as judged by the magnitudes of negative eigenvalues in the linear cmbeddings. The results of our experiments compare the averaged performance of thc NLC and the 1-NN rule and the best k-NN rule (if k > 1). They are presented in Figs. 9.23 9.27. The standard deviations (for all the data) reach on average 0.3% and maximally 0.8 - 1.4% for very small R. Additionally, also the performance of the SRQC is shown in Figs. 9.23 9.25. This is a strongly regularized quadratic classifier, defined in Sec. 4.4.2. We used the regularization of 0.2. We present these results to indicate that -
Classification
NLC; ps.-Euclid. space; T=R; dim=O.3IRI
(R( per class
80
437
NLC; PCA-dissim. space; T=R; dim=O.3IRI
100
NLC; ps.-Euclid. space; RcT; ITI=100; dim=O.3IRI
NLC; PCA-dissim. space; R c T ; ITI=100; dim=O 31Rl
IR1 per class
Figure 9.27 Polydistm dissimilarity data. Classification error of the NLC as a function of the number of representation objects. The error is estimated over 50 runs. The standard deviation of the mean error reaches on average 0.3%.
a more complex classifier can reach a better accuracy than the linear one. The legends refer to the following transformations: orig - the original dissirriilarities; no transformation is applied. add K / add 27 - a constant value is added to the off-diagonal dissiiriilarities; D ( R ,R) becomes Euclidean. sqrt/sigm - a square root or a sigmoidal transformation of the dissimilarities; D ( R ,R ) becomes ‘more‘ Euclidean. clip - only positive eigenvalues from the linear embedding are used to derive the Euclidean distance representation. The k-NN rules are directly applied to the test dissimilarity representation D(Tt,,R) to compute the classification error. The ‘clip k-NN’ rules are applied to Euclidean distances derived from the ’clipped’ version of a linear
438
T h e dissimilarity representation for p a t t e r n recognition
embedding of D . This is obtained by taking only the positive eigetivalues and neglecting tlie negative eigenvalues.
Conclusions. The results of our experiments are presented Figs. 9.23 9.27. The following conclusions can be made from their analysis: 1. The correction of D based on adding 27 to all square dissimilarities different than the self-dissimilarities yields worse results than by adding n to the dissimilarities, while the NLC is trained in the corresponding embedded spaces. The former results are missing on some plots since they are worse than the chosen scale. 2. The NLC and the SRQC in tlie (corrected or not) dissimilarity spaces perform similarly or better than in the pseudo-Euclidean spaces. It can be observed by comparing right and left plots in all figures. 3. For large T and small R c T , the NLC and the SRQC in both the ernbedded and dissimilarity spaces (original or transformed by the square root or sigmoidal transformation) significantly outperforms the k-NN and the clipped k-NN rules. This can be observed in the bottom rows in all figures. For T = R , this phenomenon is much less pronounced; the k-NN might even become somewhat better the alternative classifiers, as seen for thc Pen-mgle data in Fig. 9.23, bottom row. 4. Concave trailsformations (here the square root arid the sigmoid function) have minor effect with respect to the original dissimilarities, when the NLC or tlie SR.QC are built in dissimilarity spaces. On the contrary, these classifiers deteriorate their performance while they are constructed on the .clipped' Euclidean distance spaces. 5 . Concave t'ransformations of the dissimilarities seem beneficial for the NLC and the SRQC in the corresponding pseudo-Euclidean spaces. These classifiers may perform better in such spaces than in embedded spaces derived from the original dissimilarities or in the Euclidean spaces obtained from the embedding of otherwise corrected dissimilarities. Interestingly, the results of the NLC and the SRQC in the original dissirnilarity spaces are comparable or even better. In general, the square root transformation seems to work well.
If for small representation sets the k-NN is far from optimal, linear (quadratic) classifiers built in both embedded or dissimilarity spaces can significantly outperform the k-NN rule. Concave transformations of dissimilarities are somewhat beneficial for the classifiers built in embedded spaces, however, they may have no essential effect in dissimilarity spaces (as ,judged from right plots in all figures). None of the transformations
439
Classification
considered here allows the NLC and the SQRC for reaching a considerably better performance than reached in original dissimilarity spaces. Thereby, we conclude that the potential advantages of the imposed Euclidean behavior are doubtful, that is they cannot be always guaranteed. It is more important that the measure itself' describes compact classes than its strict Euclidean or metric properties. This can be influenced by concave transformations which aim at diminishing the relative effect of large dissimilarities and not by making them really Euclidean7.
9.5
A few remarks on a simulated missing value problem
We think that dissimilarity representations are suitable for haiidlirig missing value problems. In order to study their applicability for that purpose, a missing value problem has been simulated for the recognition of the N E T digits 3 and 8 [Wilson and Garris: 19921. Here, images resampled to a 16x16 raster are studied. To analyze the performance of classifiers as a function ofthe number of missing values, the images of 3 and 8 have been raridonily corrupted. The level of corruption (degradation) is governed by a probahility P that a particular image pixel is unknown. Four different degradation levels are used in our experiments, i.e. P = {0.0,0.2,0.4,O.G}; see Fig. 9.28. Because the images are binary, the niissirig values can be just assigned t o the background pixels. This is in agreement with one of' the approxhes to
....
Y .
No degradation; P
=0
P = 0.2
P = 0.6
Figure 9.28 Simulation of a missing value problem by degradation. Degradation of 16 x 16 binary images of digits 3 and 8. The level of degradation is governed by the probability P that an individual pixel is set to background.
7Note that the beneficial effect of a nonlinear transformation of dissiniilarities for a random prototype selection and the NQC trained in transformed dissimilarity spaces has already been observed in Fig. 9.17.
440
T h e dissimilarity representation for pattern recognition
the missing value problem, where the unknown value becomes either the average or the most common value among all other present values. The usual way of computing dissimilarities on the binary data is to construct a similarity measure first and then to transform it to the corresponding distance. For the binary objects i and j the similarity measures are often based on the variables a , b, c and d reflecting the number of elementary matches between the objects, as explained in Sec. 5.1. Three dissimilarity measures were chosen for the analysis: Jaccard, d,j = U+b+C (Eu-
&
-4
(non-Euclidean metric) and clidean); simple matching, d i j = 1 ud-be Yule, d,ij = 1 - ad+bc’ (non-metric); see also Table 5.2. The Jaccard measure is of interest, since it is the overlap ratio excluding all non-occurrences, and, thereby, disregarding the information on matches between the background pixels. On the contrary, the simple matching measure describes the proportion of the matches with respect to the total number of pixels. Hence, it counts the matches between the background pixels, where some of them are in fact the unknown value. The Yule dissimilarity is a cross-product ratio. Our aim is to compare the behavior of the classification methods on these dissimilarities. For each level of degradation, complete distance representations were computed. We assume that the training and the test sets are dcgraded in a similar way. A training set of a fixed size of 100 samples per class was randomly chosen. All the classifiers are tested on an independent test set of 500 samples per class. The testing procedure is repeated 20 times and the results are averaged. Both the training and testing sets have now the fixed sizes and the varying quantity is the level of image degradation. The Fisher linear discriminant (FLD), Sec. 4.5, is trained in embedded spaces: the Euclidean space created by the restriction of the complete pseudo-Euclidean embedding, pseudo-Euclidean embedded space and the corrected Euclidean space (Sec. 3.5.2). All the spaces are retrieved with a large dimension corresponding to the 99.9% of the preserved variance; see Sec. 3.5.4. In the dissimilarity space approach, the following classifiers were used: the RNLC with the regularization of X = 0.01 and both sparse and non-sparse linear programming classifiers (LPC) built on the entire dissimilarity representations D ( T ,T ) , Eqs. (4.22) and (4.23), respectively. The sparse LPC selects, in fact, its own representation set. The NLC was also built on the representation D ( T ,R ) with R consisting of 25% randomly chosen objects out of T .
Classzjication
441
Jaccard distance
O
L
0
d
0.1
0.2 0.3 0.4 0.5 Probability of pixel degradation
Jaccard distance
0.6
J- 0'
Simple matching coefficient --
0.1 0.2, 0.3 0.4 05 Probability of pixel degradation
0.6
Simple matching Coefficient
~
u 0.1 0.2 0:3 0.4 05 0:s -
0.1
012 0:3 014 0:5 Probability of pixel degradation
0!6
0'
Probability of pixel degradation
Yule distance
a
Yule distance
v----
0.02'
J '
O
h
-
0:2 0:3 0:4 015 Probability of pixel degradation
Embedded spaces
0.6
0'
0.1 0.2 0.3 0.4 0.5 Probability of pixel degradation
016
Dissimilarity spaces
Figure 9.29 Comparison of classification approaches in embedded spaces (left) and in dissimilarity spaces (right) on three different dissimilarity representations: Jaccard (top), simple matching coefficient (middle) and Yule (bottom). The standard deviations of the averaged results are less than 0.2% for the degradation level P 5 0.2 and less than 0.4% for the larger P .
442
T h e disszmzlarity representation for p a t t e r n recognition
Fig. 9.29 presents the generalization error rate as a function of the increasing data degradation for tlie Jaccard, sirnplc matching and Yule measures arid three approaches: the l-NN rule, the embedding approach and the dissimilarity space approach. The following conclusions can be drawn: 1. The pcrforniance of all considered decision rules deteriorate with the increasing corriiption (missing information) level. Still, the best decision rilles rmch the error of 80/0-10% for P = 0.6, while the 1-NN rule reaches the error of = 18% for the same degradation lcvcl. 2. Most of tlie linear classifiers, both in the cmbcdded and dissimilarity spaces, outperform the 1-NN rule. They are also more robust against tlie missing values. Comparing all results, the 1-NN rule deteriorates tlie most. 3 . The NLC iii a dissimilarity space defined by R based on 25% randomly chosen training examples often yields worsc results than the other classifiers (right column of Fig. 9.29). 4. On avcragc, the Jaccard distance allows for a better separability of classes than the Yule and simple matching distances. Two methods give identical errors: the R N L C arid tlie LPC (both with R = T ) in dissimilarity spaces. They also achieve the smallest overall errors, which for the non-degraded images equals 1.7%.
As a reference, we will report tlie best results for other, more sophisticated representations based on the Euclidean distance between the Gaussiansinoothed 128 x 128 images and the modified-Hausdorff between the digit contours. For the training set of 100 objects per class, the best linear classifiers in the embedded and dissimilarity spaces reach = 4% for the Euclidean representation and 6% for the modified-Hausdorff representation, while the 1-NN error is = 6% for both of them; see also [Pckalska et al., 2002bl. It) is intcrcsting that a simple distance measure (like Jaccard), operating on binary images of digits, outperforms the modified-Hausdorff dissimilarity, conipiited on the contours. A possible explanation is that the Eiiclitlean and modified-Hausdorff dissimilarities are computed on the original 128 x 128 images, while in tlie first case, the images were rescaled to a lower raster and by this, the digits became aligned. The binary dissimilarity incasurcs are also considerably robust against the data degradation. The FLD in an ernheddcd space and the R N L C and the L P C in dissimilarity spaces applied to the degraded images at the level of P = 0.2 still perform comparably to tlie best results on tlie Euclidean or modified-Hausdorff dis-
distances. This still remains true for the degradation level of P = 0.4 and the Jaccard distance representation. In summary, we conclude that the presented binary dissimilarity measures, especially the Jaccard distance, are robust against missing (corrupted) information when the classifiers are built in the embedded or dissimilarity spaces. Among the classifiers considered, the 1-NN rule shows the highest sensitivity to data degradation, which is to be expected due to its sensitivity to noisy examples. For imperfect dissimilarity measures, the 1-NN method can be outperformed by more sophisticated classifiers that take a number of representative objects into account and thus become more global in their decisions.
9.6 Existence of zero-error dissimilarity-based classifiers
In the statistical approach to pattern recognition, numerical features are used to describe objects as vectors in a vector space. Usually, such features are reduced descriptions of objects. Some (significant) information is lost and, as a consequence, essentially different objects may be represented by the same vectors in the feature space. If this occurs for objects of different classes, the classes overlap. There is no way of distinguishing such objects in the feature space and, thereby, any recognition scheme based on such a feature representation has a non-zero classification error. As a result, an error-free recognition system is even asymptotically (for infinite training sizes) impossible. To handle this, traditional statistical classifiers estimate the class probability density functions and build the decision rules by minimizing the estimated class overlap. A dissimilarity-based approach to pattern recognition is based on dissimilarities computed between pairs of objects, while making use of their biological variability in the training set (which is observed by the variations in the dissimilarities). If the dissimilarities are directly found on the raw measurements (which contain all significant information on the objects), the loss of information by the reduction to features may be avoided. Under some circumstances, an assumption of a zero-error classification (hence no class overlap) holds for dissimilarity representations. We will discuss when such an assumption holds. This section relies on [Duin and Pekalska, 2002].
The NN rule is often applied to dissimilarity data, usually metric distances. In such a case, the training set T can be used for the selection
of prototypes R, but once R is chosen, the remaining objects of T are not used for training. Other decision rules constructed either in embedded or dissimilarity spaces make use of all training objects. They may demand fewer prototypes than the NN rule for reaching the same performance and, thereby, a smaller computational complexity. As mentioned above, under certain conditions, the class overlap related to the use of feature spaces can be avoided by the use of dissimilarities. The question arises whether it is possible to build classifiers that exploit this in practice. In other words, whether we can construct classifiers that have asymptotically (for increasing training set sizes) a zero classification error. Note that for non-overlapping classes (and metric dissimilarities), the asymptotic error of the 1-NN rule is zero [Devroye et al., 1996]. This may be, however, impractical to reach, since it may demand an infinite training set to be stored and handled.

9.6.1 Asymptotic separability of classes
If the dissimilarity measure is zero if and only if the corresponding objects are identical, and if real objects can be unambiguously labeled, then the class overlap may be avoided. This assumption can be exploited by trying to construct zero-error classifiers [Duin and Pekalska, 2002], which should make use of the property of non-overlapping classes and define the decision function in the 'gap' between them. In fact, this implies that the 1-NN rule will constitute such a zero-error classifier. It may demand, however, a very large training set. As classifiers in both dissimilarity and embedded spaces appear to be much more efficient than the 1-NN rule by requiring a small number of prototypes for their construction, the question arises whether these classifiers may also have an asymptotic zero-error.

Assumptions. The discussion is based on the following assumptions:
(1) Real, physical classes of objects are separable, i.e. there is no physical object that belongs to more than one class.
(2) Raw measurements of objects are such that this separability is maintained. One way to inspect this is to let the objects be labeled by humans based on the measurements (e.g. a video screen that displays the object image to be used for further processing). The possibility of labeling objects correctly should still exist after scanning and display.
(3) The dissimilarity measure d(x, y) between the objects x and y constructed on their raw measurements (e.g. scanned images) is such that d(x, x) = 0 and d(x, y) ≥ δ > 0 if x and y belong to different classes.
This assumption states that there exists a 'gap' of size δ between the classes: the objects of different classes have a dissimilarity of at least δ.
(4) The raw measurements of objects x and y are continuous functions of the parameters θ that influence their generation (e.g. lighting conditions, small rotations or sensor deviations). Hence, the dissimilarity d(x(θ), y(θ)) is continuous in θ. The noise is such that for any two measurements x and x' of the same physical object d(x, x') < δ holds.
(5) The digitalization of the measurements and, thereby, the computer representation of the objects is such that the minimum class gap is preserved.

In general, the role of a dissimilarity measure is to capture the notion of nearness (closeness) between objects; it should be small for similar objects, and possibly large for distinct objects. Consider now a set of examples X. A possible formalization of the notion of closeness between them can be achieved by the use of neighborhoods, i.e. a collection of subsets of X for each element x ∈ X. Neighborhoods provide a general tool for describing relations between the elements of X. Such neighborhoods can be defined by the use of dissimilarities as a special case; see Secs. 2.3 and 2.4 for details. The ε-ball neighborhood of x is given as B_ε(x) = {y ∈ X : d(x, y) < ε}. The nested neighborhood basis becomes E(x) = {B_ε(x) : ε ≥ 0} and the space (X, E) is pretopological; see Theorem 2.13. Elements of each neighborhood show a specific level of similarity and in practical applications only neighborhoods for some chosen, data-dependent values of ε can be considered. Since, later on, we want to define classifiers on finite sets, we will restrict ourselves to a local basis. The neighborhood basis of x, E_{ε_x}(x), is the set of all y which belong to the ε_x-ball centered at x, i.e. E_{ε_x}(x) = B_{ε_x}(x) for some specified ε_x > 0. Note that ε_x may depend on x. ε_x is chosen such that there exists a distinct object x' in the same class for which d(x, x') < ε_x holds (e.g. ε_x = 1.0001 · d(x, nn(x)), where d(x, nn(x)) is the dissimilarity to the nearest neighbor of x). Consequently, N ∈ P(X) is a neighborhood of x if E_{ε_x}(x) ⊆ N, and the neighborhood system N(x) is the collection of all neighborhoods of x. Consider now two classes of objects, denoted as ω_1 and ω_2.
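As a rough illustration of these definitions, the sketch below builds the ε_x-ball neighborhood for every object of a finite set directly from a precomputed dissimilarity matrix, setting ε_x slightly above the nearest-neighbor dissimilarity of x as suggested in the text. It is only a sketch: the function name, the default factor 1.0001 and the toy matrix are ours.

```python
import numpy as np

def neighborhood_basis(D, factor=1.0001):
    """For each object x (row of the square dissimilarity matrix D), return
    the eps_x-ball B_{eps_x}(x) = {y : d(x, y) < eps_x}, where
    eps_x = factor * d(x, nn(x)) and nn(x) is the nearest neighbor of x
    (x itself is excluded when searching for nn(x))."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    basis = []
    for i in range(n):
        off_diag = np.delete(D[i], i)          # dissimilarities to the other objects
        eps_i = factor * off_diag.min()        # just above d(x, nn(x))
        members = np.where(D[i] < eps_i)[0]    # indices of y with d(x, y) < eps_x
        basis.append(set(members.tolist()))
    return basis

# toy example: three objects, the first two close to each other
D = np.array([[0.0, 0.2, 1.0],
              [0.2, 0.0, 0.9],
              [1.0, 0.9, 0.0]])
print(neighborhood_basis(D))   # -> [{0, 1}, {0, 1}, {1, 2}]
```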
Observations. Based on our assumptions, the following observations can be made:
(1) For a sufficiently small positive ε_x and any object x in the class ω_i, there exists a distinct object in the same class such that the dissimilarity between them is smaller than ε_x: ∀_{x∈ω_i} ∃_{ε_x>0} ∃_{y∈ω_i} (y ≠ x ∧ d(x, y) < ε_x); i = 1, 2.
(2) ∀_{x∈ω_i} ∃_{y∈ω_i} ∀_{N∈N(x)} (y ≠ x ∧ y ∈ N); i = 1, 2.
(3) The neighborhood basis of all x in ω_1 contains no elements of ω_2, that is ∀_{x∈ω_1} ∀_{y∈ω_2} y ∉ E_{ε_x}(x) ∧ x ∉ E_{ε_y}(y), and vice versa.
(4) All x from the class ω_1 have a neighborhood that contains no elements of the class ω_2: ∀_{x∈ω_1} ∃_{N∈N(x)} N ∩ ω_2 = ∅. Equivalently, ∀_{x∈ω_2} ∃_{N∈N(x)} N ∩ ω_1 = ∅.

This brings us to the existence of a rule that correctly assigns each x ∈ ω_1 to the class ω_1 and each x ∈ ω_2 to the class ω_2. The objects outside ω_1 ∪ ω_2 will be mainly rejected. The ones sufficiently close (in terms of neighborhoods) either to ω_1 or ω_2 will be assigned to these classes. All objects from the classes ω_1 and ω_2 will, however, be correctly classified. This is a zero-error classifier (with a rejection option), provided that we only deal with the objects from either ω_1 or ω_2.
Theorem 9.1 Assume two classes ω_1 and ω_2. The following decision rule correctly classifies x for any x ∈ ω_1 or x ∈ ω_2:
1. If ∃_{N∈N(x)} (N ∩ ω_1 = ∅ ∧ N ∩ ω_2 = ∅), then reject x,
2. else if ∃_{N∈N(x)} N ∩ ω_2 = ∅, then assign x to the class ω_1,
3. else if ∃_{N∈N(x)} N ∩ ω_1 = ∅, then assign x to the class ω_2,
4. else reject x.

Proof. Suppose that x ∈ ω_1. By observation (1), the following holds: ∀_{N∈N(x)} ∃_{y∈N} (y ≠ x ∧ y ∈ ω_1). Hence, N ∩ ω_1 ≠ ∅ for every N and, consequently, rule 1 does not apply. However, by observation (3), ∃_{N∈N(x)} N ∩ ω_2 = ∅, which means that rule 2 applies. As a result, x is assigned to the class ω_1. Assume now that x ∈ ω_2. Rule 1 does not apply, since all of its neighborhoods contain elements of ω_2, and neither does rule 2. As a consequence, rule 3 applies and x is classified as a member of the class ω_2. □
Theorem 9.1 just shows that an error-free classifier exists. It does not describe how such a decision rule may be constructed based on a finite set of training examples. Rule 1 above should take care that the objects not belonging to one of the two classes, x ∉ ω_1 ∪ ω_2, are rejected. Objects x ∉ ω_1 ∪ ω_2 having sufficiently small dissimilarities to the objects of at least one of the classes will, however, also be classified either as ω_1 or ω_2. This does not contradict the theorem as it considers the elements x ∈ ω_1 ∪ ω_2 only. In other words, rule 1 rejects all objects that belong neither to the closure of ω_1 nor to the closure of ω_2. Rule 2 assigns x to ω_1 if x does not belong to the closure of ω_2. Rule 3 assigns x to ω_2 if x does not belong to the closure of ω_1.
Figure 9.30 Digits misclassified by the NN rule (top row), their nearest neighbor of the '3'-class (middle row) and their nearest neighbor of the '8'-class (bottom row) for the Hamming-NIST-38 data.
Rule 4 rejects x if it belongs to both closures. Furthermore, for each subset A of ω_1, the closure cl(A) of A will be classified as ω_1 (objects which belong to the border of ω_1 have neighborhoods with no elements of ω_2 and vice versa). So, the classes become closed sets.
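A finite-sample caricature of this decision rule can be phrased in terms of an ε-ball around a query object, computed with respect to a labeled reference set: if the ball misses both classes, reject; if it misses exactly one class, assign the other; if it hits both, reject. This is only an illustration under the assumptions above — the choice of ε, the reference set and all names are ours — and on a finite sample it degenerates to a nearest-neighbor-like decision.

```python
import numpy as np

def neighborhood_rule(d_query, labels, eps):
    """Classify one query object from its dissimilarities d_query[i] to a
    labeled reference set (labels[i] in {1, 2}).  The eps-ball around the
    query plays the role of a neighborhood N in Theorem 9.1.
    Returns 1, 2 or 'reject'."""
    d_query = np.asarray(d_query, dtype=float)
    labels = np.asarray(labels)
    in_ball = d_query < eps
    hits_1 = np.any(in_ball & (labels == 1))
    hits_2 = np.any(in_ball & (labels == 2))
    if not hits_1 and not hits_2:
        return 'reject'      # rule 1: the ball misses both classes
    if not hits_2:
        return 1             # rule 2: a neighborhood disjoint from class 2 exists
    if not hits_1:
        return 2             # rule 3: a neighborhood disjoint from class 1 exists
    return 'reject'          # rule 4: the query lies in both closures

labels = np.array([1, 1, 2, 2])
print(neighborhood_rule([0.1, 0.3, 0.9, 1.1], labels, eps=0.5))   # -> 1
print(neighborhood_rule([2.0, 2.1, 1.9, 2.2], labels, eps=0.5))   # -> 'reject'
```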
Experimental investigation. Consider the binary images of the digits 3 and 8 from the NIST database [Wilson and Garris, 1992]. A Hamming distance (Sec. 5.3) representation Hamming-NIST-38 is derived between the 32 x 32 re-sampled images. The first question that arises is whether this set fulfills the assumptions formulated above. All the nearest neighbor relations are checked for this purpose. In Fig. 9.30, the objects misclassified by the NN rule together with their nearest neighbors in both classes are presented. For some objects, it may be concluded that they are badly segmented as they contain isolated dots. As a consequence, they do not fulfill assumption 4. Object representations based on segmentation errors are not expected to have close neighbors. In a practical situation, they may be removed from the training set. New objects having such defects are, thereby, expected to be misclassified. For practical problems, it might therefore be difficult to construct a zero-error classifier. Figure 9.32 shows the distance values to the nearest neighbors in the Hamming distance representation for a part of the data. Note that the nearest neighbor belongs to a different class in very few cases. This causes a classification error. The total leave-one-out 1-NN error estimate is 1.85%. The figure, however, suggests that except for a few cases, a gap between the classes exists. In the following experiments, we will try to construct classifiers in this area. We will use a fixed training set of 500 objects per class. The remaining 500 objects per class are used for testing. The following classifiers are considered: the Fisher linear discriminant (FLD) in a dissimilarity space and the FLD in an embedded space. They are both defined on systematically growing representation sets chosen from the training set T. Starting from a few objects in the set R, we proceed iteratively. In each step, the FLDs are trained and the training object that
Figure 9.31 Performance of the FLD in a dissimilarity space (top row) and in an embedded space (bottom row) as a function of the cardinality of R per class for the Hamming representation of the NIST-38 digits (left) and the Polydistm data (right). R is a systematically growing subset of T. Both training and test errors are shown. The 1-NN error on the representation set is given as a reference. Note the scale differences.
is the closest to the current decision boundary is added to R. For the construction of an embedded space, the eigenvectors corresponding to the largest eigenvalues, jointly explaining 70% of the generalized variance, are used. The classifiers are trained on D(T, R). Classification errors computed on the training and test sets are considered as functions of the cardinality of R. These error curves can be observed in Fig. 9.31. The test errors of the 1-NN rule applied to D(T_te, R) are also plotted there. This figure shows that a zero-error classifier can be constructed for the training set, but it appears difficult to obtain such a result also for the test set. Since assumption 4 is not fulfilled for these
Figure 9.32 Scatter-plot of the nearest neighbor distance values for both classes for the Hamming-NIST-38 data.
data (some nearest neighbors belong to different classes), this might be an indication of its importance. To judge it fully, however, other experiments are needed. Note the instability of the results in case of the embedding procedure (bottom plots) for small representation sets. Additionally, we consider the two-class Polydistm polygon data of randomly generated convex quadrilaterals and irregular heptagons; see also Sec. 9.2.2. Care is taken that heptagons do not degenerate to quadrilaterals. In our experiments with growing representation sets, 500 objects per class are used for training and the remaining 1500 objects per class are used for testing. The results are shown in the right plots of Fig. 9.31. It can be observed that a zero-error classifier can be constructed for the training objects in a dissimilarity space. Although the generalization error oscillates in the neighborhood of zero, a perfect discrimination cannot be reached for these test data. It should therefore be concluded that an error-free classification on an independent test set is hard to obtain.
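The growing-representation-set procedure used in these experiments can be mimicked with a simple greedy loop: train a linear classifier in the dissimilarity space D(T, R), add the training object closest to the current decision boundary to R, and repeat. The sketch below only illustrates that selection mechanism for a two-class problem; it uses scikit-learn's linear discriminant analysis as a stand-in for the FLD, and the function and parameter names are ours.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def grow_representation_set(D_train, y, n_init=2, n_final=20):
    """D_train: square dissimilarity matrix between training objects,
    y: two-class labels.  Returns the indices of the selected prototypes."""
    n = D_train.shape[0]
    rng = np.random.default_rng(0)
    R = list(rng.choice(n, size=n_init, replace=False))   # a few random prototypes to start
    while len(R) < n_final:
        X = D_train[:, R]                                  # dissimilarity space defined by R
        clf = LinearDiscriminantAnalysis().fit(X, y)
        scores = np.abs(clf.decision_function(X))          # distance-like value to the boundary
        scores[R] = np.inf                                 # never pick an object twice
        R.append(int(np.argmin(scores)))                   # closest training object to the boundary
    return R
```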
Further considerations. An attempt is now made to construct a zero-error classifier more carefully. The Convex polygon data set is considered. It represents two classes of convex polygons: pentagons (class ω_1, based on t = 5 points) and heptagons (class ω_2, t = 7). For the generation of a polygon, t vertices (points) are first regularly positioned on the unit circle, i.e. the distances between two consecutive vertices are equal. Next, two-dimensional noise is added to each vertex to perturb the polygons; see Appendix E.1. A training set of 2 · 100 polygons and a test set of 2 · 1000 polygons are generated. Two dissimilarity measures are studied: the Hausdorff and the modified-Hausdorff distances, as defined in Sec. 5.5.
Figure 9.33 Error curves for the 1-NN rule and the exponential classifier for the discrimination on the Convex polygon data represented by the Hausdorff (left) or modified-Hausdorff distances (right).
The distance of a polygon to itself is zero and it is positive for any pair of non-identical polygons. Distances vary in a continuous way with the changes in the vertex positions. The classifier defined in Theorem 9.1 can be described as a continuous function of dissimilarities computed to a finite set of objects. Any object can be correctly classified by an appropriate rule based on nearest neighbors. The function f(d(z, ·); σ) = Σ_{x∈ω_1} exp(−d(z, x)/σ) − Σ_{x∈ω_2} exp(−d(z, x)/σ) is a continuous decision function assigning an object z to the class ω_1 iff f(d(z, ·); σ) > 0 and to the class ω_2 iff f(d(z, ·); σ) < 0. It performs the same classification. It classifies any object correctly if σ is sufficiently small.
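For completeness, a minimal sketch of this exponential decision function is given below; the reference set, its labels and the value of σ are inputs the user has to supply, and the function name is ours.

```python
import numpy as np

def exponential_classifier(d_query, labels, sigma):
    """Evaluate f(d(z, .); sigma) = sum_{x in class 1} exp(-d(z, x)/sigma)
                                  - sum_{x in class 2} exp(-d(z, x)/sigma)
    for one query object z.  Positive -> class 1, negative -> class 2."""
    d_query = np.asarray(d_query, dtype=float)
    labels = np.asarray(labels)
    kernel = np.exp(-d_query / sigma)
    f = kernel[labels == 1].sum() - kernel[labels == 2].sum()
    return 1 if f > 0 else 2

# the smaller sigma, the more the decision is dominated by the nearest object
labels = np.array([1, 1, 2, 2])
print(exponential_classifier([0.5, 0.8, 0.2, 0.9], labels, sigma=0.05))  # -> 2
```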
Table 9.6 Estimated values of ε and δ: 0.360, 0.207, 0.170, 0.225.
Random subsets of m polygons per class (2 ≤ m ≤ 100) are drawn from the training set, resulting in a 2m x 2m dissimilarity representation. A sigmoidal classifier is then trained. Test polygons are classified on the basis of their 2m dissimilarities to the training objects. The experiment is repeated 20 times (with different random subsets of the same training set). The averaged errors are presented in Fig. 9.33 for the Hausdorff and modified-Hausdorff distances for some chosen values of the scaling parameter σ. Errors are compared with those of the 1-NN classifier. It shows that the linear classifier may perform better, in agreement with our earlier findings [Pekalska and Duin, 2002a,c; Pekalska et al., 2002b], and that zero-error classifiers on the test sets are found for small representation sets.
Remarks.
The overlap of pattern classes may be avoided by a dissimilarity representation constructed from the data if the assumptions listed in Sec. 9.6.1 are fulfilled. We showed that linear classifiers in dissimilarity spaces can outperform the nearest neighbor rule, even for large training sets for which a good performance of the NN-rule may be expected. Although the classes are separable, we did not always succeed in constructing a zero-error solution for the test set. This result certainly depends on the distance measure and on the cardinality of R in relation to the chosen classifier. However, at the moment a suitable gap is constructed, a zero-error classification is possible. The challenge we see for the future is to construct more locally sensitive classifiers that need just a fraction of the training examples for the representation set. Further research is needed to find out how distance measures may be constructed such that the potentially zero-error result can be obtained in practice.
9.7 Final discussion
This chapter discusses classification aspects of dissimilarity representations. Dissimilarity measures with different properties have been analyzed for
this purpose: Euclidean, non-Euclidean metric and non-metric measures. Our approaches to dissimilarity representations can handle non-metric measures as well. In our experiments, we demonstrated that simple linear or quadratic classifiers constructed either in dissimilarity or embedded spaces can significantly outperform the 1-NN rule for small representation sets, irrespective of the properties of the dissimilarity measure. We argue that it is in fact more important that the measure itself is discriminative for the problem than that it has metric or Euclidean properties. Various prototype selection procedures have been studied for both approaches, indicating that systematic procedures, choosing prototypes in a supervised way by making use of the label information, are beneficial, especially for small numbers of prototypes. Selection based on the KCentres procedure can be considered a good approach, since it is fast and works well on average. Also the prototypes (support objects) chosen by the sparse LP formulation are candidates for a good representation set in dissimilarity spaces. To gain control over the number of prototypes selected, the KCentres-LP method can be considered, as it combines the advantages of both procedures. In embedded spaces, the prototypes chosen as the ones which yield the largest approximation error may be an alternative selection to the KCentres. Additionally, we observed that for representation sets consisting of 20% of the training objects, a random selection is advantageous in both dissimilarity and embedded spaces. In conclusion, our results encourage exploring meaningful dissimilarity information in new, advantageous ways, an example of which are our proposals. Under certain constraints on the unambiguous labeling of objects and the properties of the dissimilarity measure, the 1-NN rule will allow for zero-error recognition. This, however, might require very large training sets, and hence is infeasible in practice. The study of proper dissimilarity measures and suitable domain-based classifiers (which are like the 1-NN rule, but less local in their decisions) is open for further research.
Chapter 10
Combining
What is a committee? A group of the unwilling, picked from the unfit, to do the unnecessary.
RICHARD HARKNESS IN THE NEW YORK TIMES, 1960
Well-performing pattern recognition systems may be designed by combining information from different sources or individual learning strategies. The basic idea (and assumption) is that an assembly of experts (say, classifiers or approaches) tends to make a better decision than a single expert. This can be expected if the experts are different, possibly independent in their opinions, i.e. if their decisions are based on different principles. In classification, it means that the sets of misclassified examples should differ among the classifiers such that if an individual classifier makes a mistake, the others are able to correct it. Therefore, instead of relying on a single strategy, all suitable strategies are used for the derivation of the final consensus. Combining is usually done to increase the efficiency and/or accuracy of classification systems. The former can be done by designing hierarchical combination rules, where simple and computationally inexpensive classifiers are used first for the recognition of non-difficult objects, and more advanced classifiers are applied to more specific cases later on. To increase the performance (hence also the robustness), care should be taken that the individual (base) classifiers differ. This can be achieved, for instance, by using various feature-based representations or different training sets, e.g. sampled versions of the original one [Breiman, 1996a; Ho, 1998; Skurichina, 2001]. An important study on classifier diversity measures is that of Kuncheva; see e.g. [Kuncheva and Whitaker, 2003, 2002; Kuncheva et al., 2002]. Two basic classifier combination scenarios can be distinguished. In the first case, the individual classifiers are designed on the same representation or its various subsets. The classifier outputs can be interpreted e.g. as fuzzy membership values, evidence values or posterior probabilities, or transformed as such. In the probabilistic framework, classifiers can be
assumed to estimate the same posterior probability. Practically, it means that classifier ensembles are constructed in the same feature space (having the same type of features) or based on the same dissimilarity description. In the second scenario, the classifiers are built on different representations derived from physically different types of measurements (sensors), e.g. audio- and video-related representations for biometrical identification, or from a focus on different aspects of raw measurements, e.g. representations defined for shape and color characteristics in images. Since classifiers operate in different measurement spaces, the estimated posterior probabilities do not refer to the same principal value. A common theoretical framework for classifier combination is discussed in [Kittler et al., 1998], where fixed rules such as the sum rule, product rule, min rule, max rule, median rule and majority voting are derived for general cases. Fixed combining rules operate on the classifier outputs and use some strategy (like a sum) for the final decision. Alternatively, the classifier outputs can be considered as new features on which a final output classifier is trained [Duin, 2002]. Many combination schemes have been proposed in feature spaces, and it has been experimentally demonstrated that some of them consistently outperform the single best classifier, see e.g. [Kuncheva, 2004; MCS00, 2000; MCS02, 2002]. So we do not aim to develop new approaches, but rather focus on the specific type of representations that we are dealing with. A learning problem can then be approached by using a set of classifiers on a chosen dissimilarity representation. In practice, however, it is advantageous to use various dissimilarity measures focusing on different data aspects or even different measurement data, especially if the information provided is complementary. This leads to various dissimilarity representations. As discussed above, we can not only combine various classifiers designed on a single representation, but also classifiers built on different representations. One may go even a step further and combine not the classifiers, but the representations themselves, which we proposed in [Pekalska and Duin, 2001b; Pekalska et al., 2004a]. Discriminative properties of different¹ representations can be enhanced by a proper fusion. In this chapter, we study both combined dissimilarity representations on

¹We want to emphasize that by different representations, we mean not only mathematically different formulations, but more importantly, representations derived by using different principles or different characteristics of the given phenomenon, or simply different measurements. For instance, in the support vector machine research, one may consider kernels with various nonlinearity aspects.
which a single classifier is trained, as well as fixed and trained combiners applied to the outputs of the base classifiers, trained on single representations. An experimental assessment is performed for one-class and two-class classification problems, as also published in [Pekalska and Duin, 2001b; Pekalska et al., 2004a]. Our results show that both combining approaches significantly improve classification performance compared to the results achieved by the best single classifiers. Even in terms of computational cost, the use of combined representations might be more advantageous. In the process of combining classifiers, variability between base classifiers is essential for constructing a robust ensemble. Although various measures and many combining rules have already been suggested, the problem of designing optimal combiners is still heavily studied. The diversity between the base classifiers is therefore important. We propose to analyze the conceptual dissimilarity representation describing the pairwise diversity between the classifiers, judged e.g. by their disagreements. As a visualization tool for analyzing the differences between base classifiers and as an argument for the selection of good combining rules, we suggest using the classifier projection space obtained as a spatial configuration of the diversities. This idea relies on [Pekalska et al., 2002a].
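As a small illustration of the fixed rules mentioned above (mean/sum, product, min, max, median and majority voting), the sketch below combines the estimated posterior probabilities of several base classifiers for a single object; the function name and the toy numbers are ours.

```python
import numpy as np

def fixed_combiners(P):
    """P has shape (n_classifiers, n_classes): each row holds one base
    classifier's estimated posterior probabilities for a single object.
    Returns the class index chosen by several fixed combining rules."""
    votes = np.argmax(P, axis=1)                       # crisp decision of each base classifier
    return {
        'mean':    np.argmax(P.mean(axis=0)),
        'product': np.argmax(P.prod(axis=0)),
        'min':     np.argmax(P.min(axis=0)),
        'max':     np.argmax(P.max(axis=0)),
        'median':  np.argmax(np.median(P, axis=0)),
        'vote':    np.bincount(votes, minlength=P.shape[1]).argmax(),
    }

P = np.array([[0.7, 0.3],     # three classifiers, two classes
              [0.6, 0.4],
              [0.3, 0.7]])
print(fixed_combiners(P))     # most rules pick class 0 here
```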
10.1 Combining for one-class classification
As studied in Chapter 8, a one-class classification (OCC) problem is characterized by the presence of a target class. Additionally, outlier (non-target) examples may be provided, although they are known to be non-representative or to have unknown priors². Since the outlier class is ill-defined and since in complex problems it is hard to find an effective set of features for the discrimination between targets and outliers, it seems appropriate to build a representation on the raw data. The dissimilarity representation, describing objects by their dissimilarities to the target examples, may be useful for such problems, since it naturally protects the target class against unseen novel examples. Optimal representations and dissimilarity measures cannot be found if one class is provided and the other is missing or badly sampled. On the other hand, when analyzing a particular phenomenon, one can capture the model knowledge by various dissimilarity representations describing different problem characteristics. In this way

²Remember that standard two-class classifiers should be preferred if the non-target class is well represented.
each additional representation may incorporate useful information and the problem is tackled from a wider perspective. Moreover, it seems logical to follow this route if no convincing arguments exist to prefer one dissimilarity measure over another. Combining one-class classifiers becomes, thereby, a natural technique needed for solving ill-defined (or unbalanced) detection problems. Although such problems often appear in practice, representative standard data sets do not yet exist. Our procedures here are not intended for general multi-class problems for which other, more suitable, techniques exist. Our methodology is applicable to difficult problems where the target examples are provided with or without additional outlier examples. For that reason, the effectiveness of the proposed procedures is illustrated with just a single, yet complex, application. The aim is to detect diseased mucosa in an oral cavity from autofluorescence spectra.
10.1.1 Combining strategies
As before, dissimilarity representations D(T, R) will be interpreted in three learning frameworks: the pretopological approach, where the dissimilarity values directly denote the neighborhoods, the embedding approach, which builds on an embedded pseudo-Euclidean configuration, and the dissimilarity space approach, where the features are defined by the dissimilarities to particular representative objects. See Secs. 4.3 and 4.4 for more details. Here, we assume that the representation set R consists of target objects only. The following OCCs are considered as examples of the three approaches: the nearest neighbor data description (NNDD), defined in Sec. 8.2.1, the generalized mean-class data description (GMDD), introduced in Sec. 8.2.2, and the linear programming dissimilarity data description (LPDD), described in Sec. 8.2.3. To study the behavior of one-class classifiers, we will make use of a ROC curve, which is a function of the true positive (target acceptance) ratio versus the false positive (outlier acceptance) ratio. To compare the performance of various classifiers when misclassification costs are not exactly known, we will use the AUC measure; see Sec. 8.1 for details. In our experiments, we will present the AUC performance as AUC·100. Two approaches are compared within this medical application. The first one focuses on combining dissimilarity representations into a single one, while the second approach considers a combiner operating on the outputs of the OCCs.
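Since the AUC is used throughout the experiments that follow, a minimal sketch of its computation from one-class classifier outputs is given below; it uses the Mann–Whitney formulation and assumes, as a convention of ours, that lower scores indicate more target-like objects.

```python
import numpy as np

def auc_from_scores(target_scores, outlier_scores):
    """Area under the ROC curve via the Mann-Whitney statistic.
    Convention: a lower score means 'more target-like', so the AUC is the
    probability that a random target scores lower than a random outlier
    (1.0 = perfect separation, 0.5 = chance level)."""
    t = np.asarray(target_scores, dtype=float)
    o = np.asarray(outlier_scores, dtype=float)
    # pairwise comparisons; ties count for one half
    wins = (t[:, None] < o[None, :]).sum() + 0.5 * (t[:, None] == o[None, :]).sum()
    return wins / (len(t) * len(o))

print(auc_from_scores([0.1, 0.2, 0.3], [0.4, 0.5]))   # -> 1.0
print(auc_from_scores([0.1, 0.5], [0.2, 0.4]))        # -> 0.5
```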
Combined representations. Given various feature spaces, or in other words, representations, one usually combines the outputs of a number of classifiers trained on different representations. Learning from distinct dissimilarity representations can also be realized by fusing them into a new representation, and then training a single one-class classifier. As a result, one hopes to obtain a more powerful representation, which will enable a better discrimination. Suppose that K representations D_s^(r)(T, R), r = 1, 2, ..., K are given, all of them based on the same representation set R. We assume that the dissimilarity measures are bounded in similar ranges, since they are scaled appropriately by non-decreasing functions f_r (such as linear, logarithmic or sigmoidal functions), i.e. D_s^(r)(T, R) = f_r(D^(r)(T, R)). This step is important, since only then can the dissimilarity values be related to each other; otherwise, instead of comparing the direct values, we would need to compare the corresponding percentiles. A series of dissimilarity representations can be combined into D_comb in the following ways:
(10.1)
The nonnegative weights α_r are additionally used to emphasize the importance of the measures considered in different ways. Ideally, they should be learned for the problem at hand. If an OCC is built by using both target and outlier examples, the importance of each representation can be weighted by its overall performance (the AUC measure) on the training data (or the validation data, if available). Knowing the AUC measures a_i, i = 1, 2, ..., n in a training (or validation) stage, dissimilarity representations can be weighted by their normalized versions α_i = a_i / Σ_{j=1}^{n} a_j. If only the target examples are provided for training or if there is no a priori knowledge, all the weights α_r are assumed to be equal. Dissimilarity representations are combined into one representation by using fixed rules, usually applied when the outputs of two-class classifiers are combined. The reason behind the use of a combined representation is the fact that such a series of representations can be interpreted as a
collection of weak classifiers, providing support in favor of a particular object. So, a weak classifier is understood as a dissimilarity D_s^(r)(·, p_i) to a particular object p_i. In contrast to probabilities, a small dissimilarity value D_s^(r)(t_j, p_i) is an evidence of a good 'performance', indicating here that the object t_j is similar to the target p_i. In general, different dissimilarity measures focus on different aspects of the data. As a result, each of them estimates a proximity D_s^(r)(x, p_i) of an object x to the target p_i in its own way, so to say, by using partial knowledge. Combining such estimates by fixed rules is recommended [Kittler et al., 1998]. So, D_avr yields an average proximity estimator. When dissimilarity measures make independent estimations of object-to-object proximities (e.g. when two measures are built, one using the statistical properties of an object and the other using its structural properties), the product combiner is of interest. Logically, both D_avr and D_prod should integrate the strengths of various representations. Here, D_prod is expressed by a logarithmic transformation of the product of dissimilarities so that very small numbers (hence numerical inaccuracies) can be avoided when close-to-zero dissimilarities are multiplied. The min operator chooses the minimal dissimilarity value D_s^(r)(x, p_i), r = 1, ..., K, hence the maximal evidence for an object x resembling the target p_i. The max operator works the other way around.
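A minimal sketch of such element-wise fusion is given below, assuming the K representations have already been scaled to comparable ranges; the exact form of Eq. (10.1) — in particular the constant added inside the logarithm of the product combiner — is our illustrative choice, and the weights could, for instance, be the normalized AUC values mentioned above.

```python
import numpy as np

def combine_representations(D_list, weights=None, eps=1e-6):
    """D_list: K scaled dissimilarity matrices of identical shape, e.g. D_s^(r)(T, R).
    Returns element-wise average, log-product, min and max combinations."""
    D = np.stack([np.asarray(d, dtype=float) for d in D_list])   # shape (K, n, m)
    K = D.shape[0]
    w = np.full(K, 1.0 / K) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    D_avr  = np.tensordot(w, D, axes=1)                  # weighted element-wise average
    D_prod = np.tensordot(w, np.log(D + eps), axes=1)    # log of the product, avoids underflow
    D_min  = D.min(axis=0)                               # maximal evidence of resemblance
    D_max  = D.max(axis=0)                               # minimal evidence of resemblance
    return D_avr, D_prod, D_min, D_max
```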
Combining classifiers. One-class classifiers are in practice realized by a proximity function f_prox(x, ω_T) of an object x to the target class ω_T. To decide whether an object belongs to the target class or not, the threshold γ on f_prox should be determined. A standard way is to supply a fraction r_fn of (training) target objects to be rejected by the OCC (a false negative ratio). So, γ can be set up such that ∫ I(f_prox(x, ω_T) > γ) dp(x) = r_fn, where p is some measure; see also Sec. 8.1. One usually combines classifiers based on their posterior probabilities. However, OCCs do not directly estimate the posterior probabilities, since they rely on the information on a target class only. Moreover, the soft (proximity-related) outputs of the OCCs trained on different representations might not be comparable. One possibility is to convert such proximities (e.g. distances to the class boundary) to estimates of probabilities. This can be achieved by the heuristic mapping p̂(ω_T|x) = exp{−f_prox(x, ω_T)/s}, where s is a parameter to be fitted based on the training set, as proposed in [Tax and Duin, 2001]. Note that 1 − p̂(ω_T|x) is a probability estimate that x is an outlier. Consequently, standard fixed combiners, such as mean, product and majority voting, can be considered.
Additionally, the raw or transformed proximity outputs can be further used as features for training the final OCC combiner.
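The sketch below illustrates the conversion p̂(ω_T|x) = exp{−f_prox(x, ω_T)/s} followed by a few fixed combiners for a single object; the way the acceptance threshold is handled here is a simplification of ours (in the text it would be derived from the target rejection fraction r_fn), and all names are ours.

```python
import numpy as np

def occ_to_probability(f_prox, s):
    """Heuristic mapping of a proximity output f_prox(x, target) to an
    estimate of p(target | x); 1 - p is the estimated outlier probability."""
    return np.exp(-np.asarray(f_prox, dtype=float) / s)

def combine_occ_outputs(f_prox_list, s_list, threshold=0.5):
    """f_prox_list[r]: proximity output of the r-th OCC for one object.
    Returns accept/reject decisions of the mean, product and voting rules."""
    p = np.array([occ_to_probability(f, s) for f, s in zip(f_prox_list, s_list)])
    accept_each = p > threshold
    return {
        'mean':    p.mean() > threshold,
        'product': p.prod() > threshold ** len(p),   # our simplified product threshold
        'vote':    accept_each.sum() > len(p) / 2,
    }

print(combine_occ_outputs([0.1, 0.4, 2.0], [1.0, 1.0, 1.0]))
```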
10.1.2 Data and experimental setup
The data consist of autofluorescence spectra acquired from healthy (target) and diseased (outlier) mucosa in the oral cavity. The measurements were taken by using six different excitation wavelengths: 365, 385, 405, 420, 435 and 450 nm. After preprocessing [de Veld et al., 2003], each spectrum consists of 199 bins. In total, 856 and 132 spectra representing healthy and diseased tissue, respectively, were obtained for each excitation wavelength. This means that one deals with six different measurement data sets, M_1, ..., M_6, corresponding to the six excitation wavelengths. The spectra are normalized so that they yield a unit area; see also Appendix E.2. The measurement sets are divided into training and test sets in the ratio of 60:40 with respect to both the target and the outlier class. So, there are 594 training (514 target and 80 outlier) examples and 396 test (342 target and 52 outlier) examples. Two cases are investigated here: combining various dissimilarity representations derived for the spectra of a single excitation wavelength of 365 nm (experiment I) and combining representations derived for all excitation wavelengths (experiment II). In both experiments, the combined representations and the combined classifiers are used. The basic difference between these lies in the use of single measurement data or multiple measurement data. So, in experiment I, the derived dissimilarity representations differ with respect to the measure applied to the data M_1, while in experiment II, the computed dissimilarity representations basically differ with respect to the data sets M_1, ..., M_6, so in fact a single measure might be used for their computation. Hence, all combining scenarios (combining classifiers or combining representations, each considered on the same or different measurement data sets) are captured in our experiments. Five dissimilarity representations are used for the normalized spectra in experiment I (the wavelength of 365 nm); see also Sec. 8.3.2, where some of these representations were already investigated. The first three dissimilarity representations are based on the l_1 (city block) distances computed between the smoothed spectra themselves (D_1) and their first and second order Gaussian-smoothed (σ = 3 samples) derivatives (D_1^der and D_1^2der, respectively). D_SAM is based on the spherical geodesic distance, also known as the spectral angular mapper [Landgrebe, 2003], d_SAM(x, y) = arccos(x^T y).
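Two of these measures are easy to sketch from their definitions: the city-block distance between Gaussian-smoothed (derivatives of) spectra and the spectral angular mapper arccos(x^T y). The code below is an illustration only; it relies on SciPy's 1-D Gaussian filter, and the unit-length normalization inside the SAM function is our assumption about how the inner product is kept within [-1, 1].

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def l1_derivative_distance(s1, s2, sigma=3.0, order=1):
    """City-block (l1) distance between Gaussian-smoothed derivatives of
    two spectra; order=0 compares the smoothed spectra themselves."""
    d1 = gaussian_filter1d(np.asarray(s1, float), sigma=sigma, order=order)
    d2 = gaussian_filter1d(np.asarray(s2, float), sigma=sigma, order=order)
    return float(np.abs(d1 - d2).sum())

def sam_distance(s1, s2):
    """Spectral angular mapper: the angle arccos(x^T y) between two spectra,
    after scaling them to unit length (our normalization choice)."""
    x = np.asarray(s1, float); x = x / np.linalg.norm(x)
    y = np.asarray(s2, float); y = y / np.linalg.norm(y)
    return float(np.arccos(np.clip(x @ y, -1.0, 1.0)))
```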
Table 10.1 Diseased mucosa in the oral cavity. Experiment I: AUC measure (AUC·100), averaged over 30 runs, derived either for the OCCs built on the combined dissimilarity representations or for fixed and trained combiners applied to the OCC outputs. All dissimilarity representations are considered for a single measurement data set (the excitation wavelength of 365 nm). SO denotes support objects. Standard deviations of the average AUC values are given in parentheses.
OCCs trained on a single dissimilarity representation (rows D_1, D_1^der, D_1^2der, D_SAM, D_BH; columns C_3-NNDD, C_GMDD, C_LPDD with |SO|, C_LPDD^out with |SO|).
Ia. Combined dissimilarity representations: OCCs trained on D_comb (rows D_avr, D_prod, D_min, D_max).
Ib. Fixed combiners applied to the OCC outputs from D_1-D_BH:
Comb.    C_3-NNDD    C_GMDD      C_LPDD      C_LPDD^out
Mean     98.0 (0.2)  94.4 (0.4)  90.7 (0.6)  93.8 (0.3)
Prod     98.0 (0.1)  81.3 (0.6)  87.8 (0.5)  91.1 (0.3)
Min      93.3 (0.2)  91.0 (0.3)  88.8 (0.4)  92.0 (0.3)
Max      89.6 (0.4)  79.0 (0.5)  74.1 (0.6)  81.9 (0.4)
Voting   98.3 (0.1)  95.9 (0.2)  95.5 (0.2)  97.0 (0.2)
Ic. Trained combiners (LPDD, 5-means, Parzen) applied to the LPDD outputs.
Table 10.2 Diseased mucosa in the oral cavity. Experiment II: AUC measure (AUC·100), averaged over 30 runs, of single one-class classifiers built on dissimilarity representations for six measurement data sets (six excitation wavelengths). Only the worst and the best AUC values are shown. 'ALL' refers to the results on all 6 x 3 (six wavelengths and three measures) dissimilarity representations. The number of support objects in the LPDDs varies between 2 and 7.
OCCs trained on single representations for the measurement data M_1-M_6 (rows C_3-NNDD, C_GMDD, C_LPDD, C_LPDD^out; columns D_1, D_1^der, D_1^2der and ALL). Entries are worst-best AUC·100 ranges over M_1-M_6, e.g. 80.9-84.8 (0.5) for the C_3-NNDD on D_1.
D_BH is based on the Bhattacharyya distance, a divergence measure between two probability distributions, as defined in Sec. 5.2.2. This measure is applicable, since the normalized spectra can be considered as unidimensional histogram-like distributions. As a result, all dissimilarity representations emphasize different aspects of the spectra. In experiment II, dissimilarity representations are derived for the six measurement data sets M_1-M_6. Only the first three measures D_1, D_1^der and D_1^2der described above are used. As mentioned in the previous section, three base one-class classifiers are considered: the nearest neighbor data description, the generalized mean-class data description (GMDD) and the linear programming dissimilarity data description (LPDD). The classifiers will be denoted as C_3-NNDD, C_GMDD and C_LPDD. Additionally, since the LPDD is able to incorporate the information on outlier examples, if they are used, the resulting classifier will be denoted as C_LPDD^out. To describe the experiments more clearly, the following division is introduced:

Ia or IIa denotes the experiments with the combined representation D_comb for which a single OCC is trained. The dissimilarity representations are first scaled by the largest training value and then combined into D_comb according to Eq. (10.1). Although the weights were estimated based on the AUC performance on the training set (using outlier objects), they showed little variability. So, for simplicity, equal weights are assumed (we have also found experimentally that the results for a weighted average are not significantly different from those of a usual average). In experiment II, for each measure considered, six dissimilarity representations are combined over the various measurement data M_1, ..., M_6 and, in the end, all 18 dissimilarity representations (three measures and six data sets) are combined as well.

Ib or IIb denotes the experiments with the fixed combiners applied to the OCC outputs. The OCC outputs are first converted to estimates of posterior probabilities, as described in Sec. 10.1.1, and then the traditional mean, product, min, max and voting rules are used for the final decision.

Ic or IIc denotes the experiments with the trained combiners applied to the OCC outputs. Here, we prefer to proceed with the exact (proximity-related) OCC outputs. To design a trained combiner, we focus on the LPDDs as the base classifiers. Let us denote, for convenience, the dissimilarity representations as D^(r), r = 1, ..., 6. Each LPDD is determined by a hyperplane H^(r) in a dissimilarity space D^(r)(T, R). The distances to
the hyperplane are realized by weighted linear combinations of the form d_H^(r)(t_i) = Σ_j w_j^(r) D^(r)(t_i, p_j) − ρ^(r). As a result, one may construct an n x 6 dissimilarity matrix D_H = [d_H^(1)(T), ..., d_H^(6)(T)] expressing the non-normalized signed distances between the n training objects and the six base classifiers. Hence, again an OCC can be trained on D_H. This means that an OCC becomes a trained combiner now, retrained by using the same training set (ideally, an additional validation set should be used). The LPDD can be used again as a combiner, as well as some other feature-based OCCs. (Although the values of D_H become negative for the targets and positive for the outliers, they are bounded, so the LPDD can be constructed based on the same principles.) Two other standard data descriptions (OCCs) are used, where the proximity of an object to the target class relies either on the distances to chosen k-mean vectors (k-means) or on a density estimation by Parzen kernels [Tax, 2003] (Parzen), respectively. They interpret the LPDD outputs in a vector space. The experiment itself (I or II) decides whether single or multiple measurement data are used.
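A sketch of how such a matrix D_H of signed distances could be assembled from per-representation LPDD weights is given below; the exact form of the hyperplane output, the variable names and the absence of any normalization are our assumptions.

```python
import numpy as np

def signed_lpdd_distances(D_list, w_list, rho_list):
    """Build the n x K matrix D_H of (non-normalized) signed distances
    d_H^(r)(t_i) = sum_j w_j^(r) D^(r)(t_i, p_j) - rho^(r), which can then
    serve as input for a trained combiner.  D_list[r] is the n x m
    dissimilarity matrix of the training objects to the prototypes in
    representation r; w_list[r] and rho_list[r] are that LPDD's weights
    and threshold."""
    columns = [np.asarray(D, float) @ np.asarray(w, float) - rho
               for D, w, rho in zip(D_list, w_list, rho_list)]
    return np.column_stack(columns)   # negative for targets, positive for outliers
```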
10.1.3 Results and discussion
The following observations can be made from experiment I; see Table 10.1. Both an OCC trained on the combined representations (Ia) and a fixed or trained combiner applied to the OCC outputs (Ib and Ic) improve the AUC measure of each single OCC trained on the considered dissimilarity representations D_1, D_1^der, D_1^2der, D_SAM and D_BH. Concerning the combined representations (Ia), the element-wise average and product combiners perform better than the min and max operators. The 3-NNDD seems to give the best results; they are somewhat better than the ones obtained from the GMDD and the LPDD trained on D_comb(T, R). However, in the testing stage, both the 3-NNDD and the GMDD rely on computing dissimilarities to all 514 objects of the representation set R, while the LPDD is based on at most 16 support objects (see #SO in Table 10.1; the support objects are determined during training). Hence, the LPDD can be recommended from the computational efficiency point of view. The fixed and trained combiners on the OCC outputs perform well. In fact, the best overall results for the base OCCs considered (the 3-NNDD, the GMDD and the LPDD) are reached for the fixed voting combiner. However, combiners require more computations; first individual OCCs are
Table 10.3 Diseased mucosa in the oral cavity. Experiment IIa: AUC (·100) measure, averaged over 30 runs, derived for one-class classifiers built on the combined dissimilarity representations. The representations are combined over the six measurement sets M_1-M_6 of autofluorescence spectra (related to six excitation wavelengths) and a fixed dissimilarity representation. 'ALL' refers to the results on all 6 x 3 (six wavelengths and three measures) representations. SO denotes support objects. Standard deviations of the average AUC values are given in parentheses.
Combined representations: OCCs trained on D_comb combined over M_1-M_6.
trained on single representations and the final combiner is applied in the end. If some outliers are available for training the LPDD C_LPDD^out, then the testing stage becomes inexpensive, as it relies on the computation of the dissimilarities to 27 objects (the sum of the support objects found for each representation separately). Experiment II considers different measurement data sets. The following observations can be made from the analysis of Tables 10.2-10.4. The AUC performance of single OCCs is significantly improved (by more than 10%) by the combining approaches. Both an OCC trained on the combined representations (IIa) by the average and product, and a fixed (IIb) or trained (IIc) combiner on the OCC outputs help in this case (compare the results in Table 10.2). Since the spectra derived from various wavelengths
Table 10.4 Diseased mucosa in the oral cavity. Experiment IIb: AUC (·100) measure, averaged over 30 runs, derived for fixed and trained combiners applied to the outputs of single one-class classifiers. The representations and classifiers are combined over the six measurement sets M_1-M_6 (related to six excitation wavelengths) and a fixed dissimilarity representation. 'ALL' refers to the results on all 6 x 3 (six wavelengths and three measures) representations. The number of support objects in the LPDDs is on average 6 for single representations and equals 13 and 16 for the LPDD and LPDD-II, respectively, when all 18 representations are combined. Standard deviations of the average AUC values are given in parentheses.
Fixed combiners applied to the OCC outputs.
describe different information, an OCC built on their combined representation (where a single measure is used to derive dissimilarity representations over the six measurement data sets) reaches a somewhat better AUC performance than an OCC built on the combined representation (where various dissimilarity measures are used to define the representations) considered for a single wavelength. This consistent behavior can be observed by comparing the results of the experiments Ia (Table 10.1) and IIa (Table 10.3).
The fixed voting rule applied to the OCC outputs (IIb) gives mostly the overall best results (an exception holds for the dissimilarity representation D_1 and the LPDDs as base classifiers). The trained combiners (IIc) on the LPDD outputs are somewhat worse (possibly due to overtraining) than the fixed voting combiner; however, they are similar to the results of the mean combiner. From the computational point of view, either an LPDD trained on the combined dissimilarity representation (IIa) or a fixed voting combiner on the LPDD outputs (IIb) should be preferred. By using all six measurement data sets and three dissimilarity measures (18 representations in total), all the combining procedures give nearly perfect performance, i.e. mostly 99.5% or more. These results are presented in the column denoted as 'ALL' in Tables 10.3 and 10.4.
10.1.4 Summary and conclusions
We studied approaches for detecting one-class phenomena based on a set of training examples, performed in an unknown or ill-defined context of alternative phenomena. Since the proximity of an object to a class is essential for such a detection, dissimilarity representations can be used as the ones which focus on the object-to-target dissimilarities. When considering a number of different dissimilarity measures, the problem can be described more accurately by combining various representations. Three different one-class classifiers (OCCs) are used: the NNDD (based on the nearest neighbor information), the GMDD (a generalized mean classifier in an underlying pseudo-Euclidean space) and the LPDD (a hyperplane in the corresponding dissimilarity space), which offers a sparse solution. The additional advantage of using an LPDD is that a sparse solution is obtained, which means that in the testing stage dissimilarities to only a few objects need to be computed to make a decision. Dissimilarity representations directly encode evidence for objects which lie in close or far neighborhoods of the target objects. Hence, they can naturally be combined (after a proper scaling) into one representation, e.g. by an element-wise averaging. This is beneficial, since ultimately only one OCC needs to be trained. From our study on the detection of diseased mucosa in the oral cavity, it follows that dissimilarity representations combined either by average or product have a larger discriminative power than any single one. We also conclude that combining information of representations derived for spectra of different excitation wavelengths is somewhat more beneficial than using only one fixed wavelength, yet different dissimilarity
measures. In the former case, all the OCCs on the combined representations performed about the same, while in the latter case, the LPDD trained on the targets only seemed to be worse. The fixed OCC combiners have also been applied to the outputs of single OCCs. The overall best results are reached for the majority voting rule. The trained OCC combiners, applied to the outputs of single LPDDs, performed well, yet worse than the voting rule. Concerning the computational issues, either the LPDD built on the combined representations or the voting combiner applied to the LPDD outputs are recommended. Further studies on new problems need to be conducted in the future.
10.2 Combining for standard two-class classification
Selecting a good dissimilarity measure becomes an issue for the classification problem at hand. When considering a number of different possibilities for building a dissimilarity representation, there might be no convincing arguments to prefer one measure over another. Therefore, an interesting question is whether combining dissimilarity representations is beneficial. As in the one-class classification, two combining possibilities are investigated here. In the first case, the base classifiers (the NLC or the NN rule) are found on each dissimilarity representation and then combined into one decision rule. If the representations differ in character, a more powerful decision rule may be constructed by their combining. In the second case, instead of combining classifiers, representations are combined to create a new representation for which only one classifier has to be trained. Our experiments are conducted on a few dissimilarity representations derived for the NIST handwritten digit set. They demonstrate that when the dissimilarity representations are of a different nature, a much better classification performance can be reached by their combination than by the use of the individual representations only.
10.2.1 Combining strategies
To construct a decision rule on dissimilarities, the training set T of cardinality N and the representation set R of cardinality n will be used. In the learning process, a classifier is built on the N x n dissimilarity matrix D(T, R). The information on a test set T_te of t new objects is given by their dissimilarities to R, i.e. as a t x n matrix D(T_te, R). Two classifiers are used: the NLC (normal density based linear classifier) in a dissimilarity
space and the 1-NN rule directly applied to the dissimilarities. Assume that we are given the representation set R and K different dissimilarity representations D^(1)(T, R), D^(2)(T, R), ..., D^(K)(T, R). Our idea is to combine base classifiers constructed on distinct representations. It is important to emphasize that the dissimilarity representations should have a different character; otherwise they convey similar information and not much can be gained by their fusion. Two cases are considered here. In the first one, a single NLC is trained in each dissimilarity space D^(i)(T, R), i = 1, 2, ..., K separately and then all of them are combined. In the second case, the 1-NN rule is also included. The 1-NN rule and the NLC differ in their decision-making process and their assignments. The 1-NN method operates on the dissimilarity information in a rank-based way, while the NLC approaches it in a feature-based way. Although for small representation sets the recognition accuracy of the 1-NN method is often worse than that of the NLC in a dissimilarity space [Pekalska and Duin, 2002a; Pekalska et al., 2002b, 2004b], still better results may be obtained when both types of classifiers are included in the combining scheme. In our approach, we will limit ourselves to the fixed rules operating on posterior probabilities. For the NLC, the posterior probabilities are based on normal density estimates, while for the 1-NN method, they are estimated from distances to the nearest neighbor of each class [Duin and Tax, 1998]. Another approach to learning from a number of distinct dissimilarity representations is to combine them into a new one and then train a single classifier. As a result, a more powerful representation may be obtained, allowing for a better discrimination. The first method for creating a new representation relies on building an extended representation D_ext, in matrix notation given by:
D_ext(T, R) = [ D^(1)(T, R)   D^(2)(T, R)   ...   D^(K)(T, R) ].        (10.2)
It means that a single object is now characterized by Kn dissimilarities from K various representations, but still related to the same representation objects. The requirement of having the same prototypes is not crucial; however, for the sake of simplicity, we will keep R fixed. In the second method, all distances of different representations are first scaled appropriately by non-decreasing functions f_i, i.e. D_f^(i)(T, R) = f_i(D^(i)(T, R)), i = 1, ..., K, to guarantee that they all take values in a similar range. This is necessary, since otherwise the dissimilarity values coming from different representations could not be directly compared.
(Three histogram panels: Blurred-Mod.Hausd., Blurred-Hamming and Mod.Hausd.-Hamming; horizontal axes from 0 to 0.8.)
Figure 10.1 Spearman coefficients (top) and traditional correlation coefficients (bottom) used for pairwise comparisons of the dissimilarity representations.
The combined representation D_comb is then created, e.g. by computing their weighted average D_avr(T, R) = sum_{i=1}^{K} alpha_i D_f^(i)(T, R), or in any other way as presented in Eq. (10.1). Some other possibilities for building a combined kernel for the support vector classifier are discussed in [de Diego et al., 2004; Munoz et al., 2003].
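A minimal sketch of the two combined representations discussed above is given below. It is an illustration only, not the authors' code; the K dissimilarity matrices are assumed to be available as NumPy arrays of equal shape, the scaling by the average value follows the description in the text, and the function name and the equal-weight default are illustrative choices.

```python
import numpy as np

def combine_representations(d_list, weights=None):
    """Combine K dissimilarity representations D^(i)(T, R), each of shape
    (N, n), into an extended and an averaged representation."""
    # Scale every representation so that its values lie in a comparable range.
    scaled = [d / d.mean() for d in d_list]
    # Extended representation: horizontal concatenation, Eq. (10.2).
    d_ext = np.hstack(scaled)                            # shape (N, K*n)
    # Averaged representation: weighted mean over the K representations.
    if weights is None:
        weights = np.ones(len(scaled)) / len(scaled)
    d_avr = sum(w * d for w, d in zip(weights, scaled))  # shape (N, n)
    return d_ext, d_avr
```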
10.2.2 Experiments on the handwritten digit set
To investigate the combining procedure, a two-class classification problem between the NIST handwritten digits 3 and 8 [Wilson and Garris, 1992], originally represented as 128 x 128 binary images, is considered; see also Appendix E.2. Three dissimilarity measures are used: Hamming, modified-Hausdorff and 'blurred' Euclidean, resulting in the representations D_H, D_MH and D_B, respectively. The Hamming distance counts the number of pixels which disagree. The non-metric modified-Hausdorff distance, Def. 5.3, is found useful for template matching purposes [Dubuisson and Jain, 1994]. To design D_B, images are first blurred (smoothed) with a Gaussian kernel with a standard deviation of 8 pixels. Then the Euclidean distance is computed between such blurred versions. Such a smoothing process is meant to make the distances more robust against small tilting, shifting and changes in thickness. The resulting distances are called 'blurred' Euclidean. Each of the dissimilarity measures uses the image information in a particular way: binary information, contours or blurring. From the process of
the construction, it follows that our dissimilarity representations differ in properties. To prove, however, their different characteristics, the Spearman rank correlation coefficient is used to rank the distances computed to each prototype. For two variables X and Y, the Spearman rank correlation (in fact the classical Pearson correlation coefficient r(X, Y) = cov(X, Y) / sqrt(var(X) var(Y)) computed on the ranks [Krysicki et al., 1995]; the simpler formula below is used for the computation) is computed as:

r_S(X, Y) = 1 - 6 sum_{i=1}^{N} (R_i^X - R_i^Y)^2 / (N (N^2 - 1)),        (10.3)

where N is the number of values in both variables and R_i^X, R_i^Y are the i-th ranks. Basically, we want to show that the ranks differ between the representations. Therefore, for each pair of the representations, D_H-D_MH, D_MH-D_B and D_B-D_H, the Spearman coefficients between the dissimilarity ranks to all the representation objects are computed. For instance, for the pair of D_B and D_MH, the Spearman coefficients are computed between D_B(., p_i) and D_MH(., p_i) for every p_i in R. Histograms of their distributions are shown in Fig. 10.1. The coefficients vary between -0.05 and 0.4, where most of them are smaller than 0.3, which implies that the ranks differ significantly. This suggests that the 1-NN rule will behave differently on each representation. The traditional Pearson correlation coefficient is used to check whether the dissimilarity spaces of the individual representations (and, therefore, linear classifiers built there) are different (high positive values indicate a linear correlation). Such correlation values are higher than those given by the Spearman rates, since now the vectors of dissimilarities are considered, which cannot completely vary from one representation to another. On average, the correlations are found to be: 0.56 between the blurred and modified-Hausdorff representations, 0.39 between the blurred and Hamming representations and 0.28 between the modified-Hausdorff and Hamming representations; see Fig. 10.1. Most coefficients are smaller than 0.7; thereby, they indicate only weak linear dependencies. Consequently, we can say that our dissimilarity representations differ in character.

The experiments are performed 30 times and the results are averaged. In a single experiment, the data, consisting of 1000 objects per class, are randomly split into two equally-sized sets: the design set L and the test set T_te. Both L and T_te contain 500 examples per class (so 1000 objects in total). The test set is kept constant, while L serves for the selection of training sets of various sizes. These are T_1, T_2, T_3 and T_4 = L with the following cardinalities per class: 50, 100, 300 and 500, respectively. For each training set, the experiments are conducted for an increasing representation set R. Here, for simplicity, R is chosen to be a random subset of the training set, where both classes are equally represented. In each run, every training dissimilarity representation D^(i) is scaled by its average dissimilarity, which also serves for scaling the test dissimilarity representation. This is necessary to guarantee that the dissimilarities express similar values.
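The per-prototype comparison used above to contrast the representations can be illustrated with a short SciPy sketch; the array names and shapes are illustrative, with d_a and d_b standing for two dissimilarity representations of the same objects with respect to the same prototypes.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

def column_correlations(d_a, d_b):
    """Compare two dissimilarity representations column by column.

    For every prototype (column), the Spearman coefficient compares the
    rankings of the dissimilarities, while the Pearson coefficient compares
    the raw dissimilarity vectors, as in Fig. 10.1."""
    rho_s, rho_p = [], []
    for j in range(d_a.shape[1]):
        rho_s.append(spearmanr(d_a[:, j], d_b[:, j])[0])
        rho_p.append(pearsonr(d_a[:, j], d_b[:, j])[0])
    return np.array(rho_s), np.array(rho_p)

# Example usage with random stand-ins for D_B(T, R) and D_MH(T, R):
# d_b, d_mh = np.random.rand(1000, 50), np.random.rand(1000, 50)
# spearman_coeffs, pearson_coeffs = column_correlations(d_b, d_mh)
```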
10.2.3 Results
Training sets of different sizes are considered to investigate the influence of the training size on our combining approaches. The results are presented in Figs. 10.2 and 10.3. All plots show curves of the generalization (test) error averaged over 30 runs. Each error curve is a function of the cardinality of the representation set R, where R is a random subset of T, not larger than half of the training size. Since our goal is to improve the performance of the NLC and 1-NN by combining, all the results are presented with respect to their performance on the single representations. Considering single classifiers, it appears that the NLC consistently outperforms the 1-NN rule for the training sets T_1-T_4. The best results of a single NLC are reached on the blurred Euclidean dissimilarity representation. Fig. 10.2 presents the generalization errors for the NLC in dissimilarity spaces. It shows the error curves obtained for three individual NLCs combined by the mean and product rules and the error curves of a single NLC operating on a combined dissimilarity representation constructed from D_B, D_MH and D_H. Two cases are considered here for the latter: the extended representation D_ext, Eq. (10.2), and the average representation D_avr with equal weights (other combined representations, as mentioned in Eq. (10.1), give worse results). To keep the total number of prototypes, i.e. the dimension of the dissimilarity space, the same, if D_avr is defined on |R| objects, the extended representation D_ext is in fact based on |R|/3 different objects. Hence, it is more important for D_ext than for D_avr to have good prototypes selected. From Fig. 10.2, we can conclude that the product combiner is better than the mean combiner. (Other fixed combiners have also been considered, but they were not better than the product combiner.) Also, for smaller training sets (T_1 and T_2) and smaller representation sets, the product combiner is the best. For larger training sets (T_3 and T_4) and larger representation sets, a single NLC on D_avr performs similarly to or better than the product combiner.
(Four panels: Training set T_1, T_2, T_3 and T_4; horizontal axes: total number of prototypes; curves: Prod 3xNLC, Mean 3xNLC and the NLC on the combined and single representations.)
Figure 10.2 Combining in the NIST-38 problem: the averaged classification error (in %) of the combined NLC (by product or mean) and of a single NLC on the combined representations (D_avr or D_ext) as a function of the total number of prototypes. Three dissimilarity representations are combined: D_H, D_MH and D_B. The results of the NLC trained on single representations are plotted in dots. If there are less than three such curves in a plot, it means that the errors are larger than the presented scales. The best performance of the NLC is achieved for D_B. The standard deviations of the presented results are: T_1: 0.23% on average and maximum 0.58%, T_2: 0.19% on average and maximum 0.56%, T_3: 0.20% on average and maximum 0.44% and T_4: 0.09% on average and maximum 0.51%. Note the scale differences.
The performance of the NLC on D_ext seems to suffer either from little variability among the prototypes or from not sufficiently discriminative prototypes when the training set is small. It, however, improves for larger training sets. In the latter case, for an appropriate number of prototypes, it may be as good as the product combiner. Fig. 10.3 presents the generalization errors for the 1-NN rule, obtained for combining three individual 1-NN classifiers by the mean and product rules and the error curves of a single 1-NN built on a combined dissimilarity representation.
(Four panels: Training set T_1, T_2, T_3 and T_4; horizontal axes: total number of prototypes; curves: NN on D_avr, NN on D_ext, Prod 3xNN, Mean 3xNN and the 1-NN on single representations.)
Figure 10.3 Combining in the NIST-38 problem: the averaged classification error (in %) of the combined 1-NN rules (by product or mean) and of a single 1-NN on the combined representations (D_avr or D_ext) as a function of the total number of prototypes. Three dissimilarity representations are combined: D_H, D_MH and D_B. The results of the 1-NN trained on single representations are plotted in dots. If there are less than three such curves in a plot, it means that the errors are larger than the presented scales. The best performance of the 1-NN is mostly achieved for D_B. The standard deviations of the presented results are: T_1: 0.38% on average and maximum 1.18%, T_2: 0.31% on average and maximum 1.72%, T_3: 0.21% on average and maximum 1.08% and T_4: 0.19% on average and maximum 1.22%. The largest standard deviations appear for the mean and product combiners. Note the scale differences.
Operating on posterior probabilities is motivated by the intention of combining both the NLC and the 1-NN method further on. Although the estimation of these probabilities is rather crude for the 1-NN method, it still allows for an improvement of the combined rules. In all cases, the combination by the mean or product operation gives much better results than each individual 1-NN rule. The larger both the training and representation sets are, the more evident the gain in accuracy.
When a classifier ensemble consists of three NLCs and three 1-NN rules trained on D_B, D_MH and D_H, the product combiner is still somewhat better than the mean combiner for smaller training sets; however, they behave similarly for larger training sets. The overall results are nearly the same as presented for the product combiner in Fig. 10.2; therefore, we judge that no new plots are needed. In summary, the mean and product combining rules perform significantly better than the individual 1-NN and NLC constructed on dissimilarity representations. In general, the dissimilarity representations tend to be independent and, therefore, the product rule based on the NLCs is expected to give better results than the mean rule [Tax, 2001]. Consequently, the product combiner is preferred. For the 1-NN rule, the posterior probabilities are very rough estimates from distances to the nearest neighbor and do not depend on the dimension of the problem. Therefore, both combiners perform about the same.

10.2.4 Conclusions
Combining a number of dissimilarity representations may be of interest when there is no clear preference for a particular one. It can be beneficial when dissimilarity representations emphasize different data characteristics. This is illustrated by a two-class recognition problem between the NIST digits 3 and 8 for three dissimilarity representations: Hamming D_H, modified-Hausdorff D_MH and blurred Euclidean D_B. We have analyzed two possibilities of combining such information, either by combining classifiers or by combining the representations themselves. In the first approach, individual classifiers are found for each representation separately and then they are combined into one rule. Our experiments show that the product combining rule works well, especially for larger representation sets (with respect to the training size). This might be explained by the not very high correlations between dissimilarity spaces (especially for smaller representation sets), hence possible independence between the NLCs constructed there. Adding the 1-NN rules to the classifier ensemble improves the mean combiner somewhat, but not the product combiner. In the second approach, dissimilarity representations are combined into a new one on which a single NLC is constructed. They are scaled so that their mean values become equal and then averaged out, resulting in the representation D_avr. The NLC on D_avr significantly outperforms the individual NLCs. As a reference, the extended representation D_ext is also
considered, Eq. (10.2). The NLC on such a representation reaches a similar performance as on D_avr, but for larger training sets. In general, we conclude that for this problem the product combiner of three NLCs is recommended for small training sets, while the single NLC trained on D_avr is suggested for larger training sets.
10.3 Classifier projection space
In this section some standard classifier ensembles designed in feature spaces are considered. The base classifiers are used to build a conceptual dissimilarity representation describing classifier pairwise diversities. Such a representation is used for the construction of a classifier projection space (CPS), which serves as a tool for investigating the classifier diversity. The CPS is derived based on an (approximate) embedding of a matrix of classifier diversities, which allows for a visual analysis of the differences between base classifiers. It can be used to find arguments for the selection of particular combining rules. The rationale behind this is explained below. When a classification problem is too complex to be solved by training a single (advanced) classifier, the problem may be divided into subproblems. These can be solved one at a time by training simpler base classifiers on subsets or variations of the problem. In the next stage, these base classifiers are combined. Many strategies are possible for creating subproblems as well as for constructing combiners [Lam, 2000]. Base classifiers are expected to be different since they should deal with different subproblems or operate on different variations of the original problem. It is not useful to store and use sets of classifiers that perform almost identically. If they differ somewhat, as a result of estimation errors, averaging their outputs may be worthwhile. If they differ considerably, e.g. by approaching the problem in independent ways, the product of their estimated posterior probabilities may be a good rule [Kittler et al., 1998]. Having significantly different base classifiers in a collection is important since this gives rise to essentially different solutions. The concept of diversity is, thereby, crucial [Kuncheva and Whitaker, 2003]. There are various ways to describe the diversity, usually producing a single number attributed to the whole collection of base classifiers. Here, we will use it differently. What we are looking for is a method of combining base classifiers that is not sensitive to their defects resulting from the way their collection is constituted. We want to use the fact that we deal with classifiers and not with arbitrary functions of the original features. To achieve that, we propose to study the collection of classifier pairwise differences: an n x n conceptual dissimilarity matrix D, computed before combining the classifiers into an output combiner.
Table 10.5  C_i versus C_j: the counters.

                 C_i class 1    C_i class 2
  C_j class 1       a_ij           c_ij
  C_j class 2       b_ij           d_ij
The dissimilarity value may be based on one of the diversity measures [Kuncheva and Whitaker, 2003], like the disagreement [Ho, 1998]. Such a matrix D can be embedded into a space R^m.
10.3.1 Construction and the use of CPS
Let us assume n classifiers trained on a training set. The CPS will be constructed based on the evaluation (test) set. For each pair of classifiers, their diversity value is determined by using an evaluation set. This gives an n x n symmetric diversity matrix D. To take into account the original characteristics of the base classifier outputs, a suitable diversity measure should be chosen to establish the basic difference between classifiers. Studying the relations between classifiers in the CPS allows us to gain a better understanding than by using the mean diversity only. The latter might be irrelevant e.g. for an ensemble consisting of both similar and diverse classifiers, where their contributions might average out. The joint output of two classifiers, C_i and C_j, can be related by counting the number of occurrences of correct (1) or wrong (0) classification. Then the counters used for binary features as described in Sec. 5.1 can be adopted appropriately, such that a_ij is the number of correct classifications for both C_i and C_j, etc.
Figure 10.4 Two-dimensional CPS for the MFEAT digit data: Fourier features (left) and morphological features (right). Points correspond to the classifiers; numbers refer to their accuracy. The 'perfect' classifier (true labels) is marked as TRUE. Remember that the axes cannot be interpreted themselves.
This requires the knowledge of correct labels, which might not be available, e.g. for a test set. This can be avoided when the classifier assignments are compared. Many known (dis)similarity measures can be used; examples are given in Sec. 5.1; see also [Cox and Cox, 1995; Kuncheva and Whitaker, 2003]. Here, we will consider a simple diversity measure, the disagreement [Ho, 1998], which for two classifiers C_i and C_j and a two-class problem is defined as

D(C_i, C_j) = (b_ij + c_ij) / (a_ij + b_ij + c_ij + d_ij),        (10.4)

where the counters used are defined in Table 10.5. This distance is the simple matching distance, Table 5.2. Given the complete diversity matrix D, reflecting the relations between classifiers, the CPS is found by an approximate (non-)linear projection, a variant of multidimensional scaling, Sec. 3.6.2. Some examples of the use of the CPS are presented below.
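A compact sketch of the CPS construction just described, pairwise disagreements between classifier assignments followed by an MDS projection, is given below. It is an illustration only: scikit-learn's metric MDS stands in for the scaling variants of Sec. 3.6.2, and the input format (a matrix of crisp class assignments with one row per classifier) is an assumed convention.

```python
import numpy as np
from sklearn.manifold import MDS

def classifier_projection_space(assignments, n_components=2, random_state=0):
    """Build a CPS from crisp classifier outputs.

    assignments: array of shape (n_classifiers, n_objects) with the class
    labels assigned by each (base or combined) classifier to the evaluation
    set. The pairwise disagreement, Eq. (10.4), is the fraction of objects
    on which two classifiers assign different classes."""
    n = assignments.shape[0]
    diversity = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = np.mean(assignments[i] != assignments[j])
            diversity[i, j] = diversity[j, i] = d
    # Approximate embedding of the diversity matrix into a low-dimensional space.
    mds = MDS(n_components=n_components, dissimilarity='precomputed',
              random_state=random_state)
    return mds.fit_transform(diversity)
```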
Fixed combiners. To present a two-dimensional CPS, the ten-class MFEAT digit data set [MFEAT] is considered; see also Appendix E.2. For our presentation, Fourier (74D) and morphological (6D) feature sets are chosen with a training set consisting of 50 randomly chosen objects per class. The following classifiers are considered: the nearest (scaled) mean classifier, the NM(S)C, normal density based linear (the NLC), uncorrelated quadratic (the NUC) and quadratic (the NQC) classifiers, the 1-NN and k-NN rules, the Parzen classifier, the linear or quadratic support vector clas-
sifier, the SVM-1 or the SVM-2, a decision tree, DT, and a feed-forward neural network with 20 or 50 hidden units, the ANN20 or the ANN50. For each feature set, the disagreement matrix between all classifiers and the two combiners, the mean (MEANC) and the product (PRODC) rules, is derived from Eq. (10.4); see also Table 10.6. This is done for a test set of 150 objects per class. The diversity matrix then served for the construction of a two-dimensional CPS by the MDS procedure, described in Sec. 3.6.2. Such examples of the CPS can be seen in Fig. 10.4. Remember that the points correspond to classifiers. The Euclidean distances between them approximate the original pairwise disagreement values. The hypothetical perfect classifier, i.e. given by the original labels, marked as TRUE, is also projected. The numbers in the plots indicate the accuracy reached on a test set. In both cases, we can observe that the mean combiner is better than the product combiner. The latter apparently deteriorates with respect to some, although diverse, but very badly performing classifiers. The mean rule seems to reflect the averaged variability of the most compact cloud (of classifier points). Note also that diversity might not always be correlated with accuracy. See, for instance, the right plot in Fig. 10.4, where the NMSC is more similar (less diverse) to the hypothetical classifier than ANN20, although the accuracy of the latter is higher.
Bagging, boosting and the random subspace method. Many combining techniques can be used to improve the performance of weak classifiers. Examples are bagging [Breiman, 1996a], boosting [Freund and Schapire, 1996] or the random subspace method (RSM) [Ho, 1998; Skurichina, 2001]. They modify the training set by sampling the training objects (bagging), by weighting them (boosting) or by sampling data features (the RSM). Next, they build classifiers on these modified training sets and combine them into a final decision. Bagging is useful for linear classifiers constructed when the training size is about the data dimension. Boosting is effective for classifiers of low complexity built on large training sets [Skurichina, 2001]. The RSM is beneficial for small training sets of a relatively large dimension, or for data with redundant features (where the discrimination power is spread over many features) [Skurichina, 2001]. To study the relations within these ensembles, the 34-dimensional, two-class ionosphere data [Hettich et al., 1998] is considered; see also Appendix E.2. The NMC is used for constructing the ensembles of 50 classifiers. The training is done on the sets T_1 and T_2 consisting of randomly chosen N_1 = 100 and N_2 = 17 objects per class, respectively.
(Three panels: ionosphere data with bagging, boosting and the RSM; marked points include NMC, NMC-bag, bag-wmaj, bag-maj, bag-mean, bag-prod, bag-minimax, NMC-boost, boost-wmaj, boost-maj, boost-mean, boost-prod, RSM-maj, RSM-mean, RSM-prod, RSM-minimax, RSM-dtempl and the perfect classifier TRUE.)
Figure 10.5 Two-dimensional CPS for the ionosphere data trained with T_1. The numbers correspond to the order in which classifiers are created. To maintain the clarity of presentation, only some classifiers are marked. Note the scale differences.
This is done to observe a different behavior of base classifiers. The following combining rules are used: majority voting (maj), weighted majority voting (wmaj), mean, product (prod), minimum (min), maximum (max), decision templates (dtempl) and naive Bayes (NB). The test set consists of 151 objects. For each of the mentioned ensembles, the disagreement matrix between the base classifiers and the combiners is derived, which serves further for obtaining the CPS (by classical scaling). The hypothetical, perfect classifier, representing true labels (marked as TRUE), has been added as well; see Fig. 10.5. To understand better the relation between the diversity and accuracy of the classifiers, while maintaining the clarity of presentation, other plots have been made; see Fig. 10.6. They show a one-dimensional CPS (representing the relative difference in diversity) versus the classifier accuracy.
(Six panels: accuracy versus the one-dimensional CPS for bagging, boosting and the RSM, trained with T_1 (left column) and T_2 (right column); horizontal axes: difference in diversity; the perfect classifier is marked TRUE.)
Figure 10.6 Accuracy vs. one-dimensional CPS for the ionosphere data. The numbers correspond to the order in which classifiers are created. To maintain the clarity of presentation, only some classifiers are marked. Note the scale differences.
So, the differences between classifiers in the horizontal and vertical directions correspond to the change in diversity and accuracy, respectively. The following conclusions can be made from the analysis of Figs. 10.5 and 10.6. First of all, in the CPS, the classifiers obtained by bagging and the RSM are grouped around the single (original) NMC, creating mostly a compact cloud. The variability relations between the bagged and RSM classifiers might be very small. On the contrary, the boosted classifiers do not form a single cloud. In terms of both diversity and accuracy, from the set of 50 classifiers, they are reduced to 9-14 different ones (depending on the training set). A group of 5-8 poor classifiers is then separated from the others, as well as from the bagged and RSM classifiers. Secondly, for a small training set T_2, Fig. 10.6, right, the RSM and bagging create classifiers that behave similarly in variability, since the classifier clouds in the one-dimensional CPS are of the same spread. For a larger training set T_1, Fig. 10.6, left, the diversity for the RSM classifiers is larger than for the bagging case. Thirdly, the classifiers in all ensembles, even in boosting, seem to be constructed in a random order with respect to the diversity and accuracy. Concerning the combiners studied here, the minimum rule (equivalent to the maximum rule for a two-class problem) achieves, in most cases, the highest accuracy. It is even better than the weighted majority, used for the boosting construction. For a small sample size problem, Fig. 10.6, right plot, most of the combining rules for bagging and the RSM are alike, both in diversity and accuracy. A much larger variability is observed for boosting; a collection of both diverse classifiers and diverse combiners is obtained here. Finally, a striking observation is that nearly all classifiers, as well as their combiners, are placed in the CPS at one side (i.e. not around) of the perfect classifier (this was less apparent for the MFEAT digit data; compare to Fig. 10.4).

Image categorization problem. In the problem of image database retrieval, images can be represented by single feature vectors or by clouds of points. Usually, given a query image Q, represented by a vector, images in the database are ranked according to their similarity to Q. This similarity is measured e.g. by the normalized inner product. A cloud of points offers a more flexible representation, but it may suffer from the overlap between cloud representations, even for very distinct images. Recently, a novel approach has been investigated for describing clouds of points based on the support vector data description (SVDD) [Tax and Duin, 2004], which is a boundary descriptor (an OCC) in a feature space describing the domain of such a cloud.
Figure 10.7 Spatial representations: the image projection space of D_I (left) and the SVDD classifier projection space of D_occ (right). See text for details. Different marks refer to different classes.
For each image I_j in the database, represented as a set of points in a feature space, an SVDD C^j_SVDD is trained. The query image Q is represented as a set of points in the same space. The fraction r_j(Q) of the query points rejected by C^j_SVDD is a measure of the dissimilarity between the query image and the descriptor of the image I_j. A low value is expected to indicate that I_j is similar to Q. The retrieval is based on computing the fractions r_j(Q) for all images in the database, ranking them and returning the images corresponding to the lowest ranks. We have found that a single SVDD may suffer if the clouds of points of different images are highly overlapping (this happens if the features derived from the images do not discriminate the classes well). However, combining the SVDD descriptors improves the retrieval precision; see [Lai et al., 2002] for details. In our experiment, performed on a database of texture images, 23 different images are given. Each original image is cut into 16 non-overlapping pieces of 128 x 128 pixels; see also Appendix E.2. These correspond to a single class. Such pieces are mostly homogeneous and represent one type of texture. The images are, one by one, considered as queries, and the 16 best ranked images are taken into account. The retrieval precision is computed using all N = 368 images. The details are in [Lai et al., 2002]. Each image is represented as a combined profile. This means that the image I_j is represented as p(I_j) = [r_1(I_j) r_2(I_j) ... r_N(I_j)], which is in fact a conceptual dissimilarity representation, such that r_k(I_j) expresses a dissimilarity between an image (a set of points) and a model (a boundary descrip-
tion of an image). In the standard approach, to retrieve the images most similar to Q, one will find the smallest r_k(Q). In our approach, a dissimilarity between the profile of the query Q, p(Q) = [r_1(Q) r_2(Q) ... r_N(Q)], and the profiles p(I_j) of other images is considered. For instance, a Euclidean distance can be chosen. This approach is novel as it combines the image profiles of image one-class classifiers into a conceptual dissimilarity representation. The retrieval is then based on ranking the distances d(p(Q), p(I_j)) and finding the images of the smallest ranks. In order to see all relations between the images, a distance matrix D_I, consisting of the Euclidean distances between the image profiles d(p(I_i), p(I_j)), can be computed. The resulting spatial representation of D_I then becomes an image projection space; see Fig. 10.7, left plot. On the other hand, we can build the CPS, now based on the differences between the SVDD classifiers. To do that, one needs an SVDD-profile, which for C^j_SVDD, the boundary description of image I_j, is given as p_occ(C^j_SVDD) = [r_j(I_1) r_j(I_2) ... r_j(I_N)]. Then, for instance, a Euclidean distance matrix D_occ, consisting of the distances between the SVDD profiles d(p_occ(C^i_SVDD), p_occ(C^j_SVDD)), can be found. A spatial representation of D_occ is a (one-class) classifier projection space; see Fig. 10.7, right plot. Remember that in this case, the classifiers correspond directly to the images. Comparing the two graphs, we see that the image space maintains in this case a better separation, which was confirmed by our good retrieval precision [Lai et al., 2002]. A more profound study on combining image representations for image classification and retrieval can be found in [Lai et al., 2004].
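A sketch of the two distance matrices described above, assuming the rejection fractions r_j(I_k) have already been computed and stored in a matrix r with r[j, k] = r_j(I_k); the function and variable names are illustrative, not taken from the original experiments.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def projection_distance_matrices(r):
    """r[j, k] = fraction of points of image I_k rejected by the SVDD of I_j.

    The image profile of I_k is the k-th column of r, the SVDD profile of the
    descriptor C^j_SVDD is the j-th row. Euclidean distances between profiles
    give D_I (image projection space) and D_occ (classifier projection space)."""
    d_image = squareform(pdist(r.T, metric='euclidean'))  # between columns
    d_occ = squareform(pdist(r, metric='euclidean'))      # between rows
    return d_image, d_occ

def retrieve(d_image, query_index, n_best=16):
    """Rank database images by the distance of their profiles to the query's."""
    order = np.argsort(d_image[query_index])
    return order[order != query_index][:n_best]
```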
10.4 Summary
Two different combining approaches to learning from dissimilarity representations were investigated for novelty detection problems and classification problems. When dissimilarity representations differ in character, it can be beneficial to combine either the individual classifiers constructed on each single one of them or the representations themselves, creating a new representation. In our experiments, we showed that when distinct representations are combined into one, we may obtain, as a result, a representation possessing a better discriminative power. This not only improves the classifier, but is also of interest computationally. Fixed combiners, such as majority voting, can also be advantageous, especially in the case of one-class classifiers.
Additionally, a new way of representing classifiers is proposed. The classifier projection space (CPS), based on an (approximate) embedding of the diversities between the classifiers, makes it possible to study the differences between them. This may increase our understanding of the recognition problem at hand and, thereby, offers an analytic tool based on which we can decide on the architecture of an entire combining system. The notion of the CPS extends further to spatial representations of conceptual dissimilarities (dissimilarities between classifiers, or between objects and models), which can be useful for gaining insight into an image retrieval problem, for instance. Conceptual dissimilarity representations resulting from combined one-class classifiers or weak models can be useful for image retrieval [Lai et al., 2002, 2004].
Chapter 11
Representation review and recommendations

What is a representation? It is not what it specifically seems to be, it is generally more.
ANONYMOUS
The purpose of object representation for pattern recognition is to describe objects such that they can be related to each other and compared to enhance their commonalities. This chapter will briefly review the dissimilarity representation from a wider perspective than before, by focusing on some essential aspects and providing additional relations. We will discuss how object relations are modeled by dissimilarities and summarize the three approaches to generalization. Next, we will provide some practical recommendations for the use of dissimilarity representations.
11.1 Representation review
To represent an object by the use of dissimilarities, a number of possible reference sets may be considered:

- All objects in the training set. This set may be too large for two reasons. The calculation and handling of all distances may become computationally prohibitive. Another issue is the accuracy problem. If the training set and the representation set have identical sizes, then the dissimilarity matrix tends to be singular. In such cases, additional transformations are needed to obtain a generalization.
- A selected subset of prototypes found in the training set. In this book, several ways of doing this were discussed. An intriguing point is that a random selection is not bad.
- A selected subset of prototypes selected from a learning set and different from the training set.
- Some typical, possibly idealized objects given by an expert. This is the way humans often learn: from examples given by a teacher. The dissimilarity representation makes it possible to exploit this ability of the teacher, and moreover, it may integrate it with an additional set of examples found by automatic means.
- Ideal examples constructed from the training set. It is not necessary that the objects in the representation set are real world objects. They may be virtually constructed from real measurements by smoothing or by interpolation, as long as the dissimilarities can still be computed. Also this process, as the one mentioned above, can be guided by a human expert. This is again a possibility to integrate available expert knowledge with new observations.
- Dissimilarities between objects and classes (instead of constructed class prototypes). This is the conceptual representation. The concept of a class is one step more abstract than virtually constructed prototypes and two steps more abstract than idealized constructed real world examples. There is, however, an essential difference. The dissimilarity between an object and the concept of a class has to be computed through the concept learning procedure, i.e. a classifier or a class model based on a training set. The dissimilarity of other, new objects to a class concept should then be defined on the relation between these objects and the classifier. If a number of such classifiers are given, the generalization over this conceptual dissimilarity representation is similar to the problem of combining classifiers.
We can conclude that expert knowledge can be incorporated in the process of constructing a dissimilarity measure as well as in the choice of the representation set. Especially this last possibility is essentially different from traditional feature-based representations, in which the expert knowledge is almost exclusively used for the definition of the feature set. The training set is traditionally selected at random. We think, however, that selective data sets should be studied for this representation as well.

11.1.1 Three generalization ways
Our basic representation is a dissimilarity matrix D(R, R) computed between pairs of objects from a representation set R. Three generalization frameworks were discussed in Chapter 4:

1. By a direct use of the dissimilarities D(x, R) between a new object x and the representation set R. These dissimilarities are applied in a nearest neighbor based classification scheme. The matrix D(R, R) itself is either
not used at all, or only by some editing-and-condensing scheme for the selection of a condensed set. Its main purpose is to speed up the classification procedure by reducing R without losing accuracy. Sometimes, however, some accuracy is lost; sometimes a slightly better accuracy is obtained as, due to the reduction, class overlap is removed.

2. By constructing dissimilarity spaces. Effectively, the distances are used as features (a minimal sketch follows this list). Their meaning as dissimilarities is, however, mostly lost. Given metric distances, a dissimilarity space can be seen as a result of a specific embedding to a max-norm space. In general, metric distances can be projected to a dissimilarity space by a contraction, hence a continuous mapping. However, this approach may be good for very wild (heavily non-metric) dissimilarity measures. If the objects x and y are alike and the dissimilarity value d(x, y) is small, then for other objects z, the dissimilarities d(x, z) and d(y, z) might not be similar if the measure d is non-metric. However, if the dissimilarities of x and y to the set R are inspected, one can expect that, although the individual values will differ, the vectors D(x, R) and D(y, R) are correlated in their entirety. If so, then the representations D(x, R) and D(y, R) are close in a dissimilarity space. The notion that larger numbers indicate a smaller resemblance of objects is in practice forgotten, unless special classifiers are used for this space. What is gained, however, with respect to the above mentioned direct use of dissimilarities, is that hereby a vector space is constructed by using the representation set R. This allows one to define arbitrary classifiers. This space may be filled with an additional training set T by which the classifiers gain in accuracy, without enlarging the computational complexity during classification.

3. By Euclidean embedding. The distances in an embedded space between the objects of the representation set are equal, or approximately equal, to their original distances. The distances of a new object, projected to the embedded space, to the representation set may deviate from the true distances. The difference is minimized by the projection procedure and may be used for building an augmented space which has one dimension more. The mutual distances between newly projected objects may deviate from the true distances. However, this is not to be expected for large representation sets, larger than the intrinsic dimension. An extension is an embedding into a pseudo-Euclidean space, which will result from an embedding of any non-Euclidean dissimilarities.
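The sketch referred to in item 2: the columns of D(T, R) are treated as features and a linear classifier is trained in the resulting dissimilarity space. Scikit-learn's LinearDiscriminantAnalysis is used here only as a convenient stand-in for a normal density based linear classifier; the function and variable names are illustrative.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def dissimilarity_space_classifier(d_train, y_train):
    """Train a linear classifier in the dissimilarity space D(T, R).

    d_train: array of shape (N, n) with dissimilarities of the N training
    objects to the n representation objects; y_train: the class labels."""
    clf = LinearDiscriminantAnalysis()
    clf.fit(d_train, y_train)
    return clf

# A new object x is classified from its dissimilarities to R only:
# label = clf.predict(d_new.reshape(1, -1))
```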
It is not yet fully understood how an embedding should be determined for a conceptual representation, as we do not have a unique way to compute dissimilarities between concepts (classes) that are consistent with the dissimilarities between objects and classes. In one of our studies we experimented in this direction [Pekalska et al., 2002a]. What do we lose and what do we gain by these three approaches? If we consider a dissimilarity space based on squared dissimilarities, there is a linear transformation between the complete pseudo-Euclidean space found by a linear embedding and such a dissimilarity space. Consequently, if D*2(p_i, R) is the dissimilarity space representation of a particular object p_i and if x_i is its representation in the augmented embedded space, then the two posterior probabilities for a particular class ω are equal: P(ω|D*2(p_i, R)) = P(ω|x_i). A question is to what extent this holds for their estimates (based on a Parzen density estimation or a mixture of Gaussians). These posterior probability estimates could also be derived by applying the k-NN estimator directly to the dissimilarities D(p_i, R), resulting in an estimate P_k(ω|p_i). For Euclidean dissimilarities the nearest neighbors in the embedded space are the same as those found directly from the dissimilarity matrix, so the density estimates are about the same (not exactly, as they are found in different ways using a finite set of objects): P_E(ω|D*2(p_i, R)) = P_E(ω|x_i) ≈ P_k(ω|p_i).
The consistency of the class posterior probabilities between the three representations holds just for finite representation sets and in the case of Euclidean distances. If a larger training set is projected into the dissimilarity space or into the embedded space, then better density estimates, and consequently more accurate posterior probabilities, can be estimated. These do not have to be equal anymore (unless the representation set is sufficiently larger than the intrinsic dimension), as different objects might be mapped onto the same point in the dissimilarity space but onto different points in the embedding space. The direct use of the dissimilarity matrix, however, will certainly yield different results, as it cannot profit from a larger training set. In the case of non-Euclidean dissimilarities, the situation becomes more complicated. Embedding will result in a pseudo-Euclidean (PE) space. Density estimates, treating this space as a vector space (or the associated Euclidean space), will result in different class posterior probability estimates than those based on the usage of distances. Consequently, P_E(ω|x_i) ≠ P_k(ω|p_i). The relation to the dissimilarity space (of squared dissimilarities), however,
still exists for the representation set, as there is a unique one-to-one mapping with the embedded pseudo-Euclidean space. Again, estimates may differ and the relation is lost for larger training sets projected into the spaces. For the moment, this is just an intuitive discussion, as the reasoning should be better grounded and needs a proper formalization. We hope to work on this point in the future.

11.1.2 Representation formation
We will now recapitulate how we may arrive at dissimilarity representations for real world objects. Some examples will be given, as well as possible dissimilarity measures and the consequences for the characteristics of the embedding space that is found. We will distinguish between definite and indefinite representations. A dissimilarity representation is definite if d(p_1, p_2) = 0 iff p_1 = p_2. This implies that the objects p_1 and p_2 are identical if and only if their dissimilarity is zero. Objects are uniquely labeled in many real world problems. Related definite dissimilarity representations will result in non-overlapping classes and a zero-error classifier, see Sec. 9.6.1. Note that a definite representation is not the same as a definite dissimilarity measure. A clear example is a feature vector space. Usually, the feature-based description reduces objects to feature vectors from which they cannot be reconstructed anymore. If the number of features is limited, then different objects may have the same feature values. So, the feature representation of objects is indefinite. However, the frequently used Euclidean distance between feature vectors is definite: the distance between vector representations of objects is zero if and only if the vectors are identical. See also Table 11.2, in which this discussion is summarized. This example makes clear what can be gained from the study of representations. If we wish to have very accurate pattern recognition systems, we need to start from a definite representation. Any definite distance measure applied to the real world object will be so. Such measures may asymptotically enable a zero-error classification. For some measures, this capability may become apparent for a small number of objects. It has mainly to be determined from the background knowledge of the problem and the study of the most promising dissimilarity measures. Based on the training set, the best measure can be selected, or optimized for some parameters. In Table 11.2, this has been summarized as the pixel representation, or the mean square error (MSE) between objects, assuming that the sampling density is sufficiently high to reconstruct the objects.
Table 11.1  The dissimilarity matrix between the four objects, as described in the text.

            floor   table   cup   plate
  floor       0       0      1      1
  table       0       0      0      0
  cup         1       0      0      1
  plate       1       0      1      0
Figure 11.1 Pseudo-Euclidean representation example based on the single linkage dissimilarity measure. (right) Some real world objects. Dissimilarities are given in Table 11.1. (left) The 2-D Pseudo-Euclidean representation using the ‘floor’, the ‘table’ and the ‘cup’ for representation. Afterwards the ‘plate’ has been projected into this space. The two drawn diagonal lines represent the set of objects having a zero dissimilarity to the ‘table’.
As discussed in Chapter 5, many of the dissimilarity measures used in pattern recognition are non-Euclidean. It has been observed many times that human perception and human judgement also essentially differ from the Euclidean characteristics. For instance, the mean square difference between an image and its reconstructed version (after decoding or noise filtering) appears to be a bad criterion for an error, in agreement with human judgement [de Ridder et al., 2003a]. Some of the practically used non-Euclidean measures are still metric. However, they will be embedded in a pseudo-Euclidean space, but the representation will be definite if applied to the original objects without a reduction; see Secs. 2.7, 3.5 and 4.5. Moreover, the representation objects will be described by positive distances. In theory, new examples may be projected everywhere, but if the representation objects are chosen appropriately to represent the dissimilarity information well, they will be projected in neighborhoods of the representation objects. Only in pathological cases may they have a negative distance to a representation object.
Table 11.2  Summary of various cases of representation and pseudo-Euclidean embedding of dissimilarities: feature representation, pixel representation, metric dissimilarities, non-metric dissimilarities, and similarity and dissimilarity contributions.
There are also examples of non-metric dissimilarity measures. A well-known example is the modified-Hausdorff distance or the normalized edit distance. Another example can be given as follows. Imagine a table on the floor and a cup on this table. A cup on this table has a zero distance to the table as it touches the table. The table also has a zero distance to the floor it is standing on. The distance between the cup and the floor, however, is non-zero. In Table 11.1 the dissimilarity matrix is given. Let the distance between the cup and the floor be 1. In Fig. 11.1 the pseudo-Euclidean space is shown representing these three objects. It is clear that this representation is indefinite as different objects have a zero distance. However, note that the associated Euclidean space, which treats all distance contributions as positive, is definite, as it has only zero distances of objects to themselves. Next, we also put a plate on the table at a distance 1 to the cup. Its projection is shown in the figure as well. Note that in this projected space, due to the projection error, a negative distance to the table occurs. The squared projection error equals 1. The squared projected distances of the plate to the table, the floor and the cup are -1, 0 and 0, respectively. This example makes clear what the added value of the pseudo-Euclidean space can be: it may contain some structural information about the configuration of the represented objects. All dissimilarities are non-negative in this example. Therefore, after embedding, the representation objects still have non-negative squared distances. After the projection of new examples, the squared distances between them and the representation objects may become negative, due to a projection error. It is possible, however, to construct a dissimilarity measure for which the squared value can be either positive or negative. This may happen if the measure itself is constructed from positive and negative contributions. The positive ones contribute to the dissimilarity, thereby indicating that they measure differences. The negative ones will reduce the
dissimilarity and may, therefore, be caused by similarities: object properties that indicate their resemblance or that they belong to the same class. If the similarities surpass the dissimilarities, the overall result becomes negative as the objects are more similar than dissimilar. This is reflected in the representation as negative squared distances are obtained between objects in the representation set. In the table, this is indicated by pseudo-Euclidean 2. We have no examples of such measures, but we can imagine that they might be useful in areas like psychology and sociology. Here, the important point is that the negative directions in the pseudo-Euclidean space indicate a contribution of similarity. It may not be explicitly visible in the dissimilarity measure, but the overall picture of the representation indicates that such contributions exist. Some ideas in this direction are also presented in [Laub and Muller, 2004].
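The pseudo-Euclidean embedding used in the example above can be sketched as follows: an illustrative classical-scaling computation that keeps the negative eigenvalues, applied to the dissimilarities of Table 11.1 for the floor, the table and the cup.

```python
import numpy as np

def pseudo_euclidean_embedding(d):
    """Embed a symmetric dissimilarity matrix d by classical scaling,
    keeping both positive and negative eigenvalues of the Gram matrix."""
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    b = -0.5 * j @ (d ** 2) @ j                # (possibly indefinite) Gram matrix
    evals, evecs = np.linalg.eigh(b)
    keep = np.abs(evals) > 1e-10
    evals, evecs = evals[keep], evecs[:, keep]
    x = evecs * np.sqrt(np.abs(evals))         # embedded coordinates
    signature = np.sign(evals)                 # +1: Euclidean part, -1: pseudo part
    return x, signature

# Dissimilarities between floor, table and cup (see Table 11.1):
d = np.array([[0., 0., 1.],
              [0., 0., 0.],
              [1., 0., 0.]])
coords, signature = pseudo_euclidean_embedding(d)
```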
11.1.3 Generalization capabilities
It is directly clear that if non-identical representations, obviously of non-identical objects, may have a zero distance, one of the dissimilarity or embedded spaces should be preferred over the direct use of dissimilarities. The difference between such objects may not be observed from the nearest neighbor relations. This is, however, an extreme example of what may happen in any case: far away objects contribute to the representations by the dissimilarity space and by embedding, but their influence cannot be observed from the nearest neighbor relations in the dissimilarity matrix. As observed before in Sec. 11.1.1, there exists a one-to-one mapping between the dissimilarity space (of squared dissimilarities) and an (augmented) embedded space. It entirely depends on the application and the choice of the classifier which of the two performs better. The two representation spaces are thereby equivalent. They have to be preferred over the direct usage of the dissimilarity matrix for two reasons. First, this usage has to be based on nearest neighbor relations and thereby neglects information from more remote objects. Second, the direct use of dissimilarities cannot incorporate additional training objects (without extra computational costs). However, as a nearest neighbor approach emphasizes the local behavior of the classes, for some problems this still may be the optimal approach. This happens when the representation set is so large that local information is sufficient and the contributions of remote objects are only disturbing. In the case of definite dissimilarity representations there is no class overlap
and a zero classification error can be reached by the nearest neighbor rule for a sufficiently large training set. Thereby, we conclude that the usage of the two representation spaces is of particular importance in the case of training sets of a limited size and especially if a small representation set is required for computational reasons. This is, in the end, the interest of pattern recognition: how to learn from a small set of examples such that it can be applied in a feasible way.
11.2 Practical considerations
In all experiments performed on various dissimilarity data sets, the conclusion is that both dissimilarity and embedded spaces defined on dissimilarity representations D(T, R) offer a good compromise between learning accuracy (precision) and computational effort, often better than the nearest-neighbor methods directly applied to the dissimilarities. The best approach is problem-dependent. On the other hand, if the representation set is huge and the dissimilarity measure used is meaningful for the problem, the nearest neighbor rule would ultimately be the best. This, however, would require huge storage and computation costs. In this section, we will give recommendations and formulate a few useful suggestions for the use of dissimilarity representations. This is mostly to share the most important observations. As we could not study the whole plethora of existing learning approaches, only some of our experiments are reported in Chapters 6-10. The recommendations below are more general. First of all, one needs to understand the problem and the data. One may start from observations of the distribution of the dissimilarities in the form of a histogram and by deriving simple statistics such as the mean, standard deviation, modes, kurtosis, skewness, etc. Visualization techniques, as described in Chapter 6, should be used to get further insight into dissimilarity data. Low-dimensional spatial maps of dissimilarities offer a way to inspect the relations between the objects. Classical scaling results and the PCA-dissimilarity space should be studied first. Later on, the Sammon mapping S_0 and Isomap can be used. To analyze a hierarchical organization of objects, an ultrametric dissimilarity tree can be constructed by single-linkage clustering or a minimum spanning tree. Additionally, we recommend to study intensity images of the dissimilarity relations to analyze the discrimination properties between classes (if examples are labeled), the identification of outliers or the exis-
tence of possible clusters. If the data are unlabeled, they can be permuted such that the potential clusters may be revealed, e.g. based on their growing average dissimilarity (computed to all other objects) or the VAT criterion, as described in Sec. 7.1.2. Before moving to the task-specific suggestions, we will first give some general recommendations:

(1) It is important to bound the dissimilarity values. This is necessary to bound the data domain and to avoid numerical problems. If they are very large, e.g. the average dissimilarity is larger than 100, they should be linearly scaled to a reasonably small interval such as [0, 1] or [0, 10] (a sketch of such rescaling is given after this list), or divided by sqrt(N), where N is the number of (training) objects; see Proposition 4.2. Alternatively, a nonlinear scaling can be applied, such as a sigmoidal function.

(2) If objects contain an identifiable structure or organization, a structural approach used to derive the dissimilarity representation will be beneficial.

(3) If the classes (or expected clusters) have different spreads (e.g. one is compact, while the others are widely spread), the analysis should rely on D*2(T, T) instead of on D(T, T). The squared dissimilarities emphasize the class differences even more.

(4) If the dissimilarities take only a limited number of different values (the measure is not continuous), the embedding approach is preferred.

(5) If different measurement sources are available, e.g. by using different excitation wavelengths to measure various sets of response spectra, they should be used. It is beneficial to study a combination of a number of dissimilarity representations, if the corresponding measures differ in their character or properties. From the computational point of view, it is better to derive one combined representation.

(6) The dimension of an embedded space should be determined based on the number of dominant eigenvalues in the pseudo-Euclidean embedding of D(T, T). The number of dominant eigenvalues is the point where the eigenvalue curve (a curve interpolating the eigenvalues) changes its steepness (or convexity). The first rapid change suggests the major breaking point, after which the subsequent eigenvalues describe dimensions of lesser importance.

(7) The dimension of a PCA-dissimilarity space should be determined either based on the retained variance, such as 90%, or on a fixed number of dimensions, such as approximately 0.1|T|.
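The rescaling sketch referred to in recommendation (1); the function names and the default bound are illustrative choices, and the scaling by the average training value is the one also used for the experiments in Sec. 10.2.

```python
import numpy as np

def rescale_to_interval(d, upper=10.0):
    """Linearly scale a dissimilarity matrix so that its values lie in [0, upper]."""
    return upper * d / d.max()

def rescale_by_training_mean(d_train, d_test):
    """Divide training and test dissimilarities by the average training value,
    so that different representations express similar values."""
    mean_d = d_train.mean()
    return d_train / mean_d, d_test / mean_d
```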
Representation review and recommendations
11.2.1
495
Clustering
Remember that clustering is a subjective task, since the data can be partitioned differently depending on what is taken into account. Moreover, one cannot discover structures, which are not encoded in the dissimilarities. For instance, if the dissimilarity measure is poor, compact clusters will iiot be found. Therefore, one may need to transform the given dissimilarity representation to better emphasize a particular structure. If connectedness is an important issue, one needs to re-compute the dissiniilarities by deriving them in a path-connected way. This cari be practically realized by finding a neighborhood graph, connecting those objects which belong to their knearest neighbor neighborhoods. New distances are derived as the shortest path distances in such a weighted graph and they are used for the search of clusters. Also sigmoidal transformations can be applied to emphasize the local information. Let us hypothesize K clusters in the dissimilarity data D ( T . T ) . To label the objects, the following approaches can be applied:
(1) A specific hierarchical clustering method. For instance, single linkage algorithm looks for elongated, chain-type clusters, while complete linkage looks for compact clusters. A complete dendrogram can be analyzed arid cut at a specified level to determine K clusters. (2) The K-centers algorithm. ( 3 ) Algorithms in embedded arid PCA-dissimilarity spaces. If T is large, D ( T , R ) cari be used instead, where R is a representation set chosen by the k-centers method, e.g. k M 0.21T/. Recall that the enibeddirig is defined on D ( R , R ) and the remaining examples D(T\R,R) are then projected to E . NLC-clustering or NQC-clustering (or other classifier-clustering procedure) call be used in embedded and PCA-dissimilarity spaces. Alternatively, a mixture of probabilistic PCA based on the EM algorithm, can be used. as described in Appendix D.4.3. These algorithms should be repeated A l times with different initializations. The best psrtitioning should be chosen e.g. by the goodness-of-clustering measure JGOC, Eq. (7.2). Alternatively, rnaximuni log-likelihood can be used for probabilistic models. It is important to inspect various clustering results by visual judgment:
(1) Low-dimensional spatial maps obtained by classical scaling or the Samnion map.
The dissimilarity representation for pattern recognition
496
(2) Intensity images of D ( T , T ) as suggested in See. 7.1.2.
A criterion regarding cluster separability and cluster compactness should be specified, e.g. by the use of Eq. (7.2) to judge the clustering results and select. the best one. If the number of clusters is unknown, one needs to evaluate the results for a various number of clusters. In probabilistic approaches to clustering, likelihood-ratio or penalized likelihood measures, such as Bayesian information criterion, Appendix D.3, can be used. For hierarchical approaches, a criterion suggested in [Fred and Leitao, 20031 can be exploited. If vectorial representations are considered, the gap statistics can be used; see Sec. 7.1.1. 11.2.2
One-class classification
Suggestions for approaching one-class classification problems:
(1) If the original measurements, such as images or spectra, are noisy. an adcquate non-metric dissimilarity measure can be considered, as it tends to suppress large differences. Such an example is [,-distance. with p < 1. If a metric distance is required, use the p-th power of the !,-distance; it is metric by Corollary 3.2. (2) It is useful to analyze the distribution of dissimilarities in the target class. If the tail of large dissimilarities is long (thc skewness is highly positive), an element-wise nondecreasing concave transformation should be applied, such as f,(d) = d P , p < 1, or fq,gm(d)= 2/(l exp{+}) - 1. The transformation parameters p or s should be chosen by using a validation set or by some stability criterion for one-class classifiers [Tax and Muller, 20041. Such transformations do not affect neighborhood-based OCCs, so there is no need of applying them there. (3) If the target set is srnall or the computational aspect is not important, a neighborhood-based descriptor such as the k-nearest neighbor data description is recommended. If outlier examples are available, a linear programming data description (LPDD or LPDD-11) in a dissimilarity space will likely be useful. Alternatively, if the distance representation is (nearly) Euclidean (all eigenvalues in the linear embedding are nonnegative), a support vector data description on the positive definite Gaussian kernel K = exp{ $} can be considered. (4) For poor dissimilarity representations, a weak classifier, such as a generalized nearest mean data description will tend to work well.
+
Representation review and recommendations
497
(5) To emphasize different properties of the problem, it is useful to study a few dzfferent dissimilarity measures and derive the corresponding representations. If the computation cost is important, such representations should be combined into one by a weight,ed average and a single oneclass classifier, such as the LPDD should be trained. or, alternatively. the majority voting rule can be used to combine a number of LPDD outputs trained on single representations. Otherwise, one may study variants of the nearest-neighbor one-class classifiers on the combined representations. (6) If the computational cost is less important, traditional density-based or reconstruction-based one-class classifiers are suggested in a PCAdissimilarity space (with 90% of retained variance, for instance). Alternatively, k-centers method can be used to select a representation set to build a dissimilarity space. If the target class has some evident clusters (which may be visually judged in the PCA-dissimilarity space), a Gaussian mixture model or a mixture of probabilistic PCA models is recommended.
11.2.3
Classification
Before any classification experiment, it is important to learn about possible outliers, modality of the classes and their spread (e.g. their average within-class and between-class dissimilarities). Outliers should be determined either by using OCC methods or by detecting objects with very large dissimilarities to other objects. Removing outliers is more important for the embedding approach than for the dissimilarity space approach. General suggestions are: 0
0
0
If t,he number of dominant eigenvalues in the linear embedding is large and the eigenvalue curve does not flatten reasonably fast (which means that the possible intrinsic dimension is high), the k-NN methods and the dissimilarity space approach should be preferred. If the classes have different spreads (so the average and maximum withinclass dissimilarities are very different), D*2(T.T ) is recommended instead. If the chosen dissimilarity measure is based on sums of differences, built from a number of components of similar variance, normal density-based linear or quadratic classifiers are expected to perform well in dissimilarity spaces.
498
0
0
0
0
0
The dissimilarity representation f o r pattern recognition
If the number of classes is not too large, a sparse LP machine, Eq. (4.241, slioiild be always investigated. If the coniputational cost is less important;, a support vector machine can be constructed, either on the complete or reduce dissimilarity representation D ( T ,T ) or D ( T ,R ) , respectively. R should be selected in a suitable way. e.g. by the k-centers algorithm or sparse linear programming. If clusters can be distinguished within the classes, as e.g. visually inspected from spatial representations of the dissimilarity data, a probabilistic model is worth investigating. For instance, each class can be represented by a Gaussian mixture model or by a mixture of probabilistic PCA models in a dissimilarity space or in an embedded space. An object is assigned to the class according to its maximum a posterior probability, as judged by the models. If the classes have complex shapes and possibly complex decision boundaries, radial basis neural network in a dissimilarity space can bc studied. If a number of different dissimilarity measures, emphasizing different characteristics in the initial data (images, spectra, graphs, etc) can be designed for the problem, their (appropriatcly scalcd) wcighted average corribination will likely be useful for representation.
The choice of suitable parameters of the classifiers has to be done carefiilly. In our experiments, we often applied some reasonable heuristics, we used an additional validation set or performed a crossvalidation to select the paramors. Note, however, that for dissimilarity representations, an m-fold cross-validation should be employed on a squarc representation D ( T , T ) . T is randomly split into m folds, T ( l )T('); , . . . . T ( T n such ) that in the i-th fold, D ( T ( - " )T , ( - z ) ) ,wherc T ( - i ) = {T(1),, , , , T('L-1) T(i+l) T ( m ) } is , used for the designed of a classifier (including additional scaling and prototype selection), which is t,est,c:rI on D ( T ( " ) 7>7 - 2 ) ) . Usually, one considers a list of potential values for the parameter of interest. To select a suitable value, the classification error or other criterion (e.g. based on log-likelihood for probabilistic niodels j is determined in a rn-fold cross-validation', and the one is chosen which yields thc smallest error or the optimum criterion value. Note, however, that if the same data is used for the parameter estimation and training the classifier. the results will be biased. Ideally, ari additional validation set should be i m d for the parameter (model) selection. 'As the result of cross-validation depend on the random split, we think that repeating this procedure a number of times and deriving the average cross-validation error is more reliable.
Representation review and recommendations
499
As an example, we will sketch a possible classification experiment. Assume a single (given, optimized or combined) dissimilarity represcntatiori D ( L , L ) ,where L is a learning set. To find the best pattern recognition approach, perform 11.1 times, for instance M = 50, a 90%-10% hold-out experiment2, i.e. split randomly all objects from L into the training set T and the test set T,, such that T consist of 90% of t,he examples and the remaining 10% are assigned to Tt,. Consider the dissimilarity space and neighborhood-based approaches. hi each step: (1) Det,ermine the following representation sets as subsets of T:
(a) RLP by applying sparse linear programming (LP) to D ( T ,T). Use formulation Eq. (4.23) with y = 1 or, alternatively, forniulation Eq. (4.24) with p being a (rough) estimate of the generalization error (e.g. estimated as the leave-one-out 1-NN error). The latter forrriislation should bc more useful, when the classification problem is difficult. (b) REC by using a 1-NN editiri~-condcIisiiigalgorithm [Dcvijver and Kittler, 19821 on D ( T ,T ) . (c) R k - L p by using the k-centers procedure to preselect a set R, of K objects, c.g. consisting of 20% of the training cxamples, and then applying the sparse L P formulation (4.23) to D ( T ,R*). If you suspect that the dissimilarity data are very noisy, use editing to remove such objects. Let Tedit is the edited set. Select the representation objects out of Tedit by applying suitable methods on D(T:Tcciit). (2) Train NLC, NQC, PCA-subspace classifier, a Gaussian mixture model classifier (use regularized versions if needed) and the standard nonsparse L P machine (4.22) in the dissimilarity spaces D ( T . R L P ) ~ D ( T . REG)arid D ( T ,R k - ~ p ) . If the computational cost is not an issue, train a polynomial or SVM. Use an additional cross-validatioii loop or some heuristics (especially if you have little data) to determine tlie necessary parameters. Find the test classification errors. (3) Make use of the same representations sets as above to compute the 1NN error on the test set, as well as the 1-NN and the k-NN error using all training objects (optimize k on the training set by tlie leave-one-out procedure). 2Alternatively, consider a 5- or 10-fold cross-validation experiment repeat,c:d e.g. 20 times.
500
The dassamalarity representation for pattern recognation
(4) If T is small or has a moderate size, or the computational aspect is not important, train the NLC, the NQC and the standard non-sparse LP machine (4.22) in the PCA-dissimilarity space (perform the PCA on D ( T ,T) and select the dimension corresponding e.g. to 95% of the preserved variance). Compute the test errors. (5) Additionally, if D ( T , T ) is (nearly) Euclidean, build an SVM on the **2 positive definite Gaussian kernel K = exp{ +}. Select CT in a crossvalidation scheme. In the embedding space approach, in each step: (1) Determine the dimension m of the approximate linear embedding of D ( T ,T ) as the nuniber of significant eigenvalues. (2) Find the following representation sets of m + l objects:
(a) R, by applying the k-centers algorithm to D ( T ,7'). (b) RAPEby selecting prototypes which yield the smallest average approxiniation error. (c) Rp by selecting pivot object as in the FastMap technique. Use the same procedures as above to select 2m+1 objects. This leads to six representation sets in total, three sets for m+l objects and three sets for 2 m f l objects. ( 3 ) For each selected representation set R above, use D ( R ,R ) to find the m-dimensional embedded space 1.Project the remaining objects T\R to this space and train a linear or quadratic classifier there. Project Tt, to 1 and test the classifiers. (4) Determine also the embedding in m-dimensional space based on the complete data D ( T ,T ) . Train the NLC, NQC, PCA-subspace classifier arid a Gaussian mixture model classifier (use regularized versions if needed) there. (5) Train an indefinite SVM. Sec. 4.5, on the (non-degenerate) indefinite kernel K derived from the dissimilarity representation D ( T . T) either -D*2 as K = -L(l2 lnl l T ) D * 2 ( 1 or as K = exp{7}. In the latter case, you have to take care for a selection of a suitable CT.
illT)
Perform the same experiment as above on a sigmoidal transformation of the dissimilarities fsigm(D*2(T, T ) ) . Apply some heuristics to find the parameter s of fsigm, e.g. as 0.5d,,, d, or 2d,,, where d,, is the average dissimilarity on the complete training set. Find the average classification error (or weighted classification error if costs are included) of all
Representation review and recommendations
501
approaches. Choose a decision rule and a representation set a,s a trade-off between performance and computational effort for the evaluation of new objects. Additionally, if very small representation sets (of few objects) need t o be selected, then make use of the forward feature selection method with the criterion based on the classification error.
This page intentionally left blank
Chapter 12
Conclusions and open problems
W e shall not cease from exploration And the end of all our’ exploring Will be t o arrive whesre we sta.sted And knww the place for the first time. .‘FOUR QUARTETS: LITTLE GIDDING” , T.S.
The notion of proximity is furidarrierital in learning from
ELIOT
set of exaniples. Depending on the function it serves, a relative proximity or a co~iceptual proximity can be distingiiished. The former describes a relation between pairs of objects, while the latter relates objects (or concepts) to a concept, such as a Gaussian model of a class. Ohjects are often bound together by relative proximity (quantifying their degree of comnionality) to form a class. This is the necessary condition on which the cornpactiiess hypothesis relies. justifying the use of a learning algorithm. In a learning phase, a concept of a class is modeled. Any decision concernirig tlie assign nient of an object to a class is grounded in tlie conceptid proximity. This is the basic principle in pattern recognition. Pattern analysis usually starts from ineasurernents describing a set of objects. Such rneasurcrnents are further preprocessed to derive a suit>ahle description. This is a representation that can be built based on two distinctive principles: statistical or structural. Both make use of some kind of basic characteristics. In the statistical framework, these are thc features. i.e. object attributes encoded as numerical variables. They are assiinied t,o be discriminative for the object classes. A set, of features constitutes a feature vector space; where each object is represented as a point. Additional structures such as inner product norm and Euclidean distance are usually imposed to enrich this vector space. Learning is then inherently connectcd to the mathematical methods that can be used in this space. Although any flexible discrimination function can be designed, it will at most discover what can be inferred from the statistics of a set of features. The striictural organization that an object possesses, such as connectivity of shape elements, is not incorporated in the representation. ~
503
A
504
The dissimilarity representation for pattern. recognition
In the structural approach, the basic descriptors are primitives, i.e. structural elements, such as strokes, corners or stems of words, encoded as syntactic units for the construction of objects. This approach is advisable for problems with objects which contain an inherent, identifiable, structure or organization e.g. shapes, spectra, images or texts'. There is some imderlying factor in the objects, such as order, time, hierarchy or functional relationships (as between the words in sentences) that describes the inter-relationships between the morphological primitives. In the structural approach, it is assumed that there exists sufficient and suitably formulated problem knowledge, often developed and encoded with the assistance of an expert, such that a structural description of objects and classes can be constructed. Learning then relies on defining syntactic grammars or a way of comparing objects, usually in a matching process. In principle, specific criteria are used for that purpose, so the whole process is domain-specific. In summary, the strength of the structural approach lies in encoding domain knowledge and relationships within an object, capturing its internal structural organization. The strength of the statistical approach lies in a well-developed mathematical theory of vector spaces. Since these approaches are complementary, their combination should compensate for their drawbacks, while conserving their advantages. A number of attempts in this direction have been made. For example, one can associate the statistical information with structural elements to resolve some ambiguities [Fu, 19821. Other possibilities include the construction of classifiers in both frameworks arid combining their decisions. Such strategies are, however, hybrid. Looking at the properties of both frameworks, the unification should be reached at the representation level. In a chain of events, first a description based on structural information is derived, which is then encoded to obtain a numerical representation, which can be used in statistical learning. A natural candidate is a proximity representation, developed by us. This is a relative representation, in which each object is described by a set of proximities to so-called representation objects. A conceptual proximity representation can also be constructed which measures proximity of objects to classes or the decision boundaries induced by classifiers. Our basic observation is that proximity representations bridge the gap not only between the statistical and structural approaches to learning, but also information-theoretic approaches. The latter is true thanks to the 'Currently, the majority of learning tasks is concerned with this type of data. So, there is a need for designing good learning strategies, possibly incorporating both statistical and structural approaches, as they are complementary.
Conclusions and open problems
505
recent development of a universal distance measure based on algorithmic complexity [Li et al., 2003; V i t h y i , 20051. This is a measure constructed within the minimum description length principle, in which learning is related t o data compression. Consequently, proximity representa.t,ionbecome very general. Objects can be compared in a variety of domains by using various approaches or their combinations: statistical, probabilistic, structural and information-theoretic. To limit the scope of the study, proximity is modeled as a dissimilarity, to focus on the class and object differences. This is not an essential rcstriction. Since siniilarity and dissimilarity are intimately connected, many issues discussed here can be applied to similarities after suitable adaptations. The main goal was to provide a foundation and to develop (statistical) learning methodologies for dissimilarity representations. The statistical learning framework is naturally chosen as the one which offers good generalization capabilities for a further development of structure-aware dissimilarity measures. The proposed dissimilarity representation is a dissimilarity matrix D ( T :R ) , where R is a set of representation objects, also called prototypes. and T is a set of training objects. The dissimilarity measure does not need to be a metric, but not any measure is acceptable. It should be meaningful to the problem and fulfill at least the compactness hypothesis, stating that similar objects are close in their representations.
12.1
Summary and contributions
To develop learning methodologies, dissimilarity representations have to be interpreted in appropriate frameworks. Since dissimilarities express the relative differences between pairs of objects, while learning algorithnis optimize a kind of an error for the chosen numerical model, one will deal with numerical representations of the problems. The numbers have, therefore, a particular meaning within the frame of specified assumptions and models. Spaces with different characteristics lead to different interpretations of the dissimilarity data, and, as a result, to different learning algorithrns. Chapter 2 briefly introduces topological, (indefinite) inner product, norm and metric spaces. Although most of the material presented there is not new, Krein spaces are not usually treated in the standard works. Our major contribution is to present the relations between the spaces arid the development of the Krein space, later discussed in the form of a pseudo-
T h e dissimilarity representation f o r p a t t e r n recognition
506
Euclidean space of a finite ernbedding. The introduction of these spaces prepares the way for a mathematical framework for handling arbitrary dissimilarity data. Metric dissimilarities have advantageous properties, since many nunierical methods operate in metric spaces, or more specifically in Euclidean spaces. In Chapter 3 , dissimilarities are further characterized with respect to Euclidean arid metric propcrtics. Further on: a linear pseudo-Euclidean ernbcddiiig is studied, as well as nonlinear multidimensional scaling. This prepares the ground for one of the learning approaches defined in Chapter 4. Tlircc? rriairi frameworks have been introduced for learning on dissimilarity representations. They rely on the following interpretation of dissimilarities: (1) as rrlwtions between the olr?jects based on dissimilarity-ball neighborhoods , (2) iii an enibeddcd space, where the original dissimilarities are preserved, found by a linear pseudo-Euclidean embedding, (3) in a dissimilarity space, where each diniension is a dissimilarity to a particiilar object. These three approaches are discussed in Chapter 4, wherc the learning stratcgics are introduced. A natural question that arises now is how these st,rategies differ from the standard learning techniques in feature spaces. If one relies on (Euclidean) dist,arices in a fcat,ure-based represeiitation, the methods applied on such distances refer to a topological space. The difference lies in thc accompanying feature space arid the metric distances. The iise of embedded and dissimilarity spaccs is novel. However, it rnight be seen as a generalization frairiework of the support vector inachines (SVM). An SVnl can be seen as a linear classifier in a high-dimensional space defined hy the (conditionally) positive definite kernel. In our approach, a linear classifier in the dissirriilarity space can be interpreted as a quadratic (or linear) classificr iri a high-dimensional Kreiii space. Since one deals with finite sainples, such a KreYn space siniplifies to a finitc-dirriensiorial pseudoEuclickan embedded space. Basically, the SVM is a mathematically elega,nt, but specific procedure in our framework. Basic dissimilarity measures and a brief overview of measures used in practical applications have been discussed in Chapter 5. Clinptcrs 6 10 constitute the experimental part of this thesis, in which dissiiriilarity representations are practically analyzed. A systematic approach is presented to such an analysis, hence the niost basic questions concerning t,hc data understanding are handled first. ~
Conclusions and open problems
507
Chapter 6 investigates a number of well-known visualization techniques and their usefulness for dissimilarity data. The conclusion is that mult,idimensional scaling techniques and Isomap provide useful insights into the relations in the data. Chapter 7 focuses further on methods that help in data exploration. Three main issues are investigated concerning both structure and c o m plexity in the dissimilarity representation: clustering techniques, intrinsic dimension and sampling. A number of clustering methods in thc three interpretation frameworks is presented. Preliminary results of the clustering in dissimilarity spaces are promising. Additionally, a statistical estimate of the intrinsic dimension from a Euclidean distancc representation of a, hypcr-spherical Gaussian sample is derived. Finally, a number of criteria are proposed and examined that can be used in quantifying whether a representation set contains a sufficient number of objects to describe a. class. The most useful criteria are the ones based on the number of dominant cigenvalues either in PCA-dissimilarity spaces or in pseudo-Euclidean embedding, skewness and mean relative rank. Chapter 8 moves on t o the construction of one-class classifiers (OCCs) on dissimilarity representations. Currently existing OCCs are built eithcr on features in traditional feature spaces or or1 Euclidean distances derived there. Two new OCCs, one in embedded space and one in dissimilarity space: arc proposed and successfully applied to a few practical problems. Non-metric dissimilarity measures seem to work well for noisy data in such domain description problems. Usually, only metric measures are used. Chapter 9 is concerned with classification issues. Dissimilarity measures with different properties (Euclidean, non-Euclidean metric and non-metric) are analyzed for this purpose. Experiments demonstrate that simple linear or quadratic classifiers constructed in dissimilarity or embedded spaces may significantly outperform the k-NN rule for smaller representation sets, irrespective of whether the dissimilarity is metric or not. We also investigated ways of transforming the dissimilarity measure to make it (more) Euclidean (hence more met,ric) for the purpose of discrimination. We have found that the imposed Euclidean behavior does not guarantee a better performance. It is more important t,hat the measure describes cornpact classes than its strict Euclidean or metric properties. Various prototype selection criteria are proposed and studied for. botjh ernbedded and dissimilarity spaces, indicating that systematic procedures (niaking use of the label information) are beneficial, cspecially for a sniall number of prototypes. For very small representation sets a supervised selection based on the cross-validation error of a classifier or a forward featiire
508
T h e dissimilarity representation for pattern recognrtron
selection method also based on the classification error are the best. In general, for all representation set sizes, the k-centers clustering finds good prototypes, especially for multi-modal data. In dissimilarity spaces, the rcpresentatiori set selected by a sparse linear prograniniing gives a good discrimination. The drawback is the lack of control over the number of selected prototypes. That is why the k-centers selection, followed by the sparse LP may offer a bctter result. In embedded spaces, except for the kcenters procedure, alternatively, the prototypes selected as the ones which yield the average approximation error can be chosen. Additionally, we have observed that for representation sets consisting of more than 20% of the training objects, a raridoni selection is useful. Combining information originating from different sources or combining individual learning strategies can be effective for designing a good pattern recognition system. Such issues are discussed in Chapter 10. Combining is a natural way of integrating the statistical and structural representations into one framework. Some ways are proposed of combining dissiniilarity represeritatioris into a new one on which a single final classifier can be trained. In our experiments on two-cla,ss and one-class classification problems, we found t,ha.t dissimilarit,y representations conibincd by either a (weighted) average or product have a larger discriminative power than any single one. Classifiers built on such combined representations outperformed the best classifier (of the same type) constructed on single representations. This is especially useful if the final classifier works in a reduced dissimilarity space, as offered by the linear programming data descripliori (LPDD) for one-class classification tasks. Additionally, we have observed that classifiers, first trained on single representations and then combined, work well. Especially, the product rule combiner seems to be good for small representation sets in two-class classification problems, while majority voting may be advantageous for oneclass classifiers.
12.2
Extensions of dissimilarity representations
Dissimilarity representations are a finite nunierical representations. If they are used as originally given without any larger context, a limited reasoning can only be applied. The tools of statistical learning are available in vector spaces, therefore, dissimilarity representations are interpreted in this way. All the approaches choose a space with favorable properties, in which dissiiiiilarity relations are either interpreted or expressed. A dissimilarity
Conclusions and open problems
509
representation is then 'embedded' in a (pre)topological space, in a dissimilarity space and in a pseudo-Euclidean (or Krein ) space. These are particular choices and many more can be studied. If the measure is metric, then a dissimilarity space results from an isometric embedding of D ( R ;R ) into a max-norm space; see Lemma 3.1. Our approaches do not explicitly make use of this fact. They are based on the intuition that two objects 2 1 and x2 are similar if their vectors of dissiniilarities, d ( z 1 , R ) and cl(z2, R ) are similar, or in other words, positively correlated. This leads to a vector space D(., R ) equipped with the traditional inner product, where a large value of the inncr product between d(z1, R ) and d ( x 2 ,R ) describes a high similarity between x1 arid 5 2 . Dissimilarity space is only an example of a possible class of vector spaces that can be considered for learning. These are vector spaces, which result from a data-dependent mapping & : X R k . To introduce them, assiinie that R = { p l , p ~ ,... , p n > is a representation set. Let P ( T .R),S(T, R) and D ( T ,R ) denote a proximity, similarity and dissimilarity representation, respectively. Examples of proximity-related spaces are given below: --f
Proximity space. Define 4(x) = [p(z,p~),p(x,p~), . . . i p ( ~ , p , L )wherc ], p ( z , p i ) is a proximity between x and the representation object pi. The dimension of this space is n. Bounded proximity space. Choose a prototype p k . Define tlie mapping d .> = [ P ( Z , P l ) - P ( Z , P k ) , ' . . , P ( Z , P k - I ) - P ' ( Z ? P k ) . P ( Z , P k + l )p ( x , p k ) , . . . , p ( z , p l L )-p(x,pk)]. The dimension of this spacc is ri-1. Proximity-difference space. Choose k distinct pairs from tlie objects of R , { p i , p f > , 1 = 1 , 2 , . . . , k . If one deals with a classification problem, choose the pair objects such that they come from different classes. Then
0
4 ( ~=)[ p ( z , p : ) - p ( z , p ? ) , P ( ~ , P ~ ) - P ( ~ , P. ;. .) , p ( x , p ~ ) - p ( x , p ~ The )]. features in the resulting vector space are defined as differences hetwecn proximities to particular objects. The dimension of this space is k . SimDissim space. Select two subsets of R, R1 = { p : , p i , . . . . p k , , } and RZ = { p : , p i , . . . , p i 2 >, either the same or different. Define ~!J(z)= [ s ( z , p i ) ., . . , s ( z , p k , ) ,. . . , d ( z , p : ) , . . . , d ( ~ , p $ ~ The ) ] . dimension of this space is kl k2. SimDissim-difference space. Define the data-depending mapping as d x ) = [ d ( z , p ~ ) - s ( z , p ~d )( x, , p z ) - d ( x , ~ z ) ., . . ,d(x,p,)-.~(z,p,,)]. The features in the resulting vector space are differences between dissimilarities and similarities to particular objects. The dimension of this space is n.
+
0
510
0
The dissimzlarity representation for p a t t e r n recognition
Extended proximity space. Assume m proximity representations, P ( l ) ( TB),P(')((T, , R ) ,. . . , P ( m ) ( TR). , Define the mapping q!(x) = [ P ( l ) ( w l ) ., ' . ,P(1)(x:P,),P(2)(2,P1), The dimension of this space is mn. Note that all p (i) may be constructed on different representation sets. Combined proximity space. Assume m proximity representations, P ( I ) ( T . R ) ,P ( 2 ) ( T , R ) , , P(")(T,R). Let f be a function of m variables, such as (weighted) average, product, maximum or some other function. f is meant to act as a combiner of the proximity values p(')(~,p,),p(~)(2,pi), . . . ,p(m)(z,p,)into one, more powerful proximity. For simplicity, we will denote it as flPm(z,pi). Such a function may use label information, if available. Then 4(x) = [ f l - m ( x ; p l ) , f l - , , L ( R : , 1 ) 2 ) , , flP7,(x,pn)lT. The dimension of this space is TI,. Sorric of such spaces have been explored in Chapter 10, where cornb i n d representations are discussed. Other inspiration with respect to kernels can be found in the work of [de Diego et al., 2004; Muiioz et al., 20031.
All the spaces mentioned above are examples of particular direct (Cartesian) product spaccs. They are assumed to be equipped with the traditional inner product structure. Given a nuniber of proximity representations, one may also consider a direct product of pseudo-Euclidean spaces; each resulting from an (approximate) embedding of particular proxiniity representations. Also a combined space consisting of embedded and dissimilarity spaces can be studied. In fact, the nuniber of possibilities is enormous, including all sort of nonlinear transformations. Another interesting possibility is to study approximate embeddings into norrned and Banach spaces, also via Lipschitz mappings; see Def. 3.3. Theoretical foundations in this direction for defining large-margin classifiers in metric spaces are laid down by von Luxburg et al. in [von Luxburg and Bonsquet, 2003, 20041. Other ideas can be found in [Bourgain, 1985; Johnson et al., 1987; MatouSek, 19901. However, many questions of practical importance are still to be answered.
12.3
Open questions
This book can serve as a foundation for continuing research into learning from dissimilarity representations. The aim is to renew the pattern recognition area by the integration of structural and various statistical approaches.
Conclusions and open problems
511
At the fundamental level, a few topics of interest are mentioned below.
(I) We think that neighborhood-based pretopological and topological spaces are important for a further development of pattern recognition. They allow one to use weaker type of relations between objects (without additional structures of an inner product or a norm), hence novel types of relational classifiers could be potentially constructed. These should be domain-based decision functions, in contrast to probability-based decision functions. Although they rnight not be able (at this time) to compete with the advanced techniques of inner product spaces, they might stimulate new ways of thinking. Moreover, as geometry and topology are closely related, and geometry can be discussed by the use of distances, a broader framework, which integrates all these concepts, should be searched for. ( 2 ) The possibility of zero-error dissimilarity-based classifiers has been introduced. Ultimately, it is related t o the compactness hypothesis and a true representation. They both put constraints on a dissimilarity measure which should be such that riot only similar objects similar are close in their representations, but also the other way around. This issue has to be studied more theoretically. For instance, for shapes in images, this would include a study on robustness of a measure against object position arid orientation, small perturbations and occlusions. (3) It appeared in our study that metric or Euclidean properties of a dissimilarity measure are less important than their discriminative properties. Although we developed some scientific intuition about non-metric and non-Euclidean behavior, it needs to be better characterized. Properties of the measures should be studied fundamentally in relation to topological, embedded and dissimilarity spaces, as well as in relation to the domain where measurements are collected. New types of measures could be developed, especially in the structural approach, and applied to a dissimilarity-based framework, without imposing metric constraints. Consider, for instance, a donlain of binary shapes in images. A good dissimilarity measure should be small for similar shapes and large for different shapes. Ideally, the measure should be developed such that it is invariant to rotation, shift and scaling and also to small abberations and changes in the images. One may therefore study topological properties of transforma.tions which have small effect on the resulting measure. The derived dissimilarity will, therefore, have particular
512
T h e dissimilarity representation, f o r pattern recognition
properties with respect to the given domain. If tlie shapes are then reduced to a (dissimilarity-related) vectorial description, an interesting question is how to relate the original characteristics of the measure t o tlie properties of the chosen vector space. (4) An understanding is needed of the topological relations between the three spaces: pretopological, embedded and dissirrlilarity spaces. Non-decreasing nonlinear transformations of the dissimilarity measure change the topological properties of embedded and dissimilarity spaces, while t,hey do not affect the dissimilasity-ball neighborhoods. Our results suggest that concave transformations, like sigmoidal ones, can be beneficial for discrimination, since they diminish the effect of possible outliers. ( 5 ) The design of morphological (structure-aware) dissimilarity measures, both general and specific for the problem at hand, is an open issue. This would require the definition of a suite of structure detectors, general enough for the data types such as images, time-signals, spectra etc. Thc intriguing question is not only how data type specific detectors should be found, but more importantly, how a measure can be learned from a given set of examples. Inspiration can be found in [Goldfarb, 1990; Goldfarb et al., 2000a,b; Goldfarb and Golubitsky, 20011. (6) In general, some foundation for learning from dissimilarities has been laid down, but much more should be done. Research effort should be devoted to the further development of the proposed framework, aiming at integration of both statistical and structural approaches.
On the niethodological level, topics of investigations include the following issues. (1) Thc use of dissimilarity neighborhoods is very popular in clustering, so
many algorithms have been developed so far. Preliminary results on the use of embedded and dissimilarity spaces give promising results. Theoretically well-founded methods can be developed. (2) A number of techniques have been studied for the selection of a representation set appropriate for learning in dissimilarity and embedded spaces. The methods should be investigated further in a number of' applications. The next step relies on designing new prototypes at the level of rneasiireinents. This means that new prototypes encompassing the information on a number of original objects are created and used for learning. This would mean that e.g. the information on a set of spectra, where each spectrum describes a particular case, could be cap-
Conclusions and open problems
513
tured by their most representative spectrum, which beconies a mcrnbcr of the representation set. One could expect that if domain-based ways are used to derive new prototypes, the resulting representatiorl set can be powerful. (3) As we assumed that the coniputation of dissimilarities is very costly. we mostly focused on linear and quadratic classifiers in ernbedded aiid dissimilarity spaces. They may suffer from the curse of dirnensionality [Jain and Chandrasekaran, 19871 if large representation sets create spaces of a high dimension. The use of decision trees, appropriately reformulated for dissimilarities, niight be an alternative in such cases. This is open for investigation. (4) In the area of combining, a priori knowledge, e.g. label information. could be incorporated in the combined dissimilarity representation and in the final classifier. A study in this direction can be found in [de Dicgo et al., 2004; Muiioz et al., 20031. It can also be advantageous to cornbine representations which are derived by eniployirig both statistical arid structural approaches. Another intriguing point of interest is to combine the three learning frameworks: to benefit from the strength of each of thc interpretation spaces. The k-NN rule is locally sensitive, while a linear (or nonlinear) classifier in the embedded or dissimilarity spaces is globally sensitivc, as it relies on all representation objects. How to combine such information is a point for research. (5) This book is mostly concerned with inductive learning principles. The next step is transductive learning [Vapnik, 19981, which may be considered in the context of combining local and global learning approaches to dissimilarities. Additionally, new research areas are open for study: learning from unlabeled data (partly related to clustering) arid act'ive learning (6) New applications, especially from structural pattern recognit,inn,should be considered. In conclusion, the use of proximity representations opens a new possibility for integrating statistical, structural and information-theoretic approaches to learning from a set of examples.
This page intentionally left blank
Appendix A
On convex and concave functions
The presentation of basic facts on convex and concave furictioris relies on Chapter 3 in [Boyd and Vandenberghe, 20031. Recall that a set X is convcx in a real vector space if NZ (1 - a ) y E X for all 2 , y E X and all <.IE [O, 11.
+
Definition A . l (Concave and convex functions) Let f : X + R be a fuiiction on X , where X is a convex subset of a vector space. Then .f is
0
+
+
+
+
convex if f ( a z (I - a ) y ) 5 a f ( z ) (1 - a ) f ( y ) aiid a € [0, 11. If the strong inequality holds then f concave if ~ ( Q Z (1 - n ) y ) 2 n f ( r ) (1 - cv)f(y) and N E [O. 11. If the strong inequality holds then ,f
holds for all II', 71 E X is strictly cor/,71cz. holds for. all :x7 EX is strictly conca11e:.
Let of(.) denotes a gradient vector, a vector of partial derivatives of a function f . Convex and concave function have sonie important, propcrtics. Positive and negative semidefiniteness is explained in Def. B .4. Theorem A . l A diflerentiable fLnction .f : X subset X of R",is
+
R defined on, a convex
convex ,iff (a) f ( y ) 2 f ( x ) V ~ ( Z ) ~-( x) W holds for all 2 , EX (b) the Hessian, o f f is a positive semide,finite ma,triz.
+
concave iff
+
f ( y ) 5 f ( z ) V f ( ~ ) ~- (3:)y h,olds for all z , ~ E X (b) the Hessian of f i s a negative semidefinite matrix.
(0,)
This means that a d{fferen.tiablef m c t i o n of one variable i s conwe:c (co'ncaue) on. X iff if i t s derivative i s monotonically non,-decreasing (nm-increasing) on. X . A twice differentiable function, of one variable is convex (comarue) on X iff i t s second derivative i.s non-negatiue (non-positaue) there. Theorem A.2 (Properties of convex and concave functions) [Boyd and Vandenberghe, 20031
A function f i s concaiie iff (-f ) i s consuex. 515
516
The dissimilarity representation for pattern recognition
A nonnegative weighted s u m f = CyZlaifi of convex {concave) functions { f i } is convex (concave). A nonnegative, non-zero weighted s u m f = Cy=laifi of strictly convex (concave) fumctions { f i } is Strictly convex (conmwe). Let f X X Y + R. I f f ( 2 ,y) is convex (concave) in x for each y E Y and w(y) 2 0 for each y E Y , th,en the function g ( x ) = w ( y ) f ( z ,y)dy is con.vex (concave) in x , provided that the integral exists. A n y locd minimum of a convex function is also a global minimum. Any local maximsum of a concave function is also a global m a x i m u m . .lensen's inequality'. Let f X + R be a convex function. T h e n
sy
\i=1
\
n
1
i=l
2 2 , . . . , x, E X and nonnegative a1, a2, . . . , an such that f is concave, then the above inequality is reversed. Fu,n,ct%ort.composition. Let h : Rk + EX and yi : R", Rk,i = 1:.. . , k f = h o g R + R such that f(x) = Define be twice differentiable. h(gl(Z),g2(2)-. . . > g k ( x ) ) .T h e n (a) f is convex if h is convex and non-decreasing in each argument and all gi are convex. (b) f is convex i'f h is convex anld non-increasing in each, argument and a11 gi are concave. (c) f is concave if h is concave and non-decreasing in each argument and all gi are concave. (d) f i s concave i f h, is concave and non-increasing in each argurnen,t and all gi are convex.
holds for any xl,
C:=,01%= 1. If
---f
Example A . l Examples of convex functions on their domains are:
'In general, the Jensen's inequality holds for probability spaces. Let p ( z ) 2 0 be a probability density function on R C X , i.e. Jc2p ( z ) d z = 1. Let f : X 4 R be a convex function. Then f ( J a p ( z ) z d z ) 5 J a p ( z ) f ( z ) d z .If f is convex, then the inequality is reversed.
Conclusions a n d o p e n problems
0 0
f ( x ) = log(C, ex$ on R". f(x) = log(C, e g ~ ( on ~ ) IW" if gz are convex.
Examples of concave functions on their domains are: 0
f ( z ) = J.P for p E ( O , 1 ] .
. f ( z )= l o d z ) . 0
0 0
0
f ( x ) = min{zl,za,. . . , z n } on RTL. f(x) = ( ~ , z f ) on h I W for ~ p E (0,1). Geometric mean: f ( x ) = .,)A on IWT. f(x) = gz(x)) on IPif g L are concave and positive. f(x) = C,log(lc,) on RI;. f ( x ) = C ,log(y,(x)) on R" if gz are concave and positive.
(n:=, +
(nI,"=,
517
This page intentionally left blank
Appendix B
Linear algebra in vector spaces
Basic facts of matrix algebra are recalled here to make it easier to introduce similar notions in pseudo-Euclidean spaces.
B.l
Some facts on matrices in a Euclidean space
Let r be a field, such as R or @. Recall that any nxm matrix A defines a linear transformation A : V + U , from a vector space V = to a vector space U = I?". If these vector spaces are equipped with the traditional inner product (which defines the norm), then the adjoint matrix A X describes a linear transformation between the algebraic dual' spaces A X : U * + V * ; see also Def. 2.59. Moreover, such finite-dimensional vectors spaces are self-dual, i.e. U* = U and V* = V . The adjoint of an operator is defined below.
rTrL
Definition B.2 (Adjoint) Suppose V = rm and U = r" are Euclidean spaces. Let A be a matrix of a linear transformation V + U . A X is the matrix of a linear transformation U* + V* defined as ( A z , y ) = (.,AX?/) for all II: E V and all y E U . The adjoint A X is equivalent to (1) A X = A T if r = R .
(2) AX = A t = A
T .
ifr=@.
Definition B.3 (Special matrices) Let A be a real or complex matrix of the size nxm,. 0 0
The range of A is the subspace of vectors y = A x for some x. The rank of A is the dimension of the range of A, corresponding to the number of linearly independent rows or columns of A. B is a left inverse of A if B A = I . B is a right inverse of A if A B = I . If B A = A B = I then B = A-' is the inverse of an n x n matrix A. 'In a normed space, the algebraic dual is equivalent t o continuous dual. 519
520
0
0
The dissimilarity representation for pattern recognition
The pseudo-inverse (or Moore-Penrose pseudo-inverse) of A is the unique matrix A- that satisfies: (a) A A - A = A (b) A-AA- = A (c) ( A A - ) X= AA(d) (A-A)' = A - A The Gram matrix2 of A is the matrix G = A XA. It is positive semidefinite and Hermitian (see below).
An nxn matrix A is 0 0 0
0 0 0
0 0 0 0
an identity matrix I if' A is a diagonal matrix with aii = 1 for all i . a permutation matrix if it results from a permutation of the columns of I . singular if it has no inverse, and non-singular, otherwise. A is singular iff det(A) = 0. symm,etric if A = AT for a real matrix A. Hermitian if A = At for A E P x n . ,nornzal if A X A = A A X . orthogo,nal if ATA = 1 for a real matrix A. unitary if AtA = I for a complex matrix A. an idempotent matrix if A X = A' = A. a projection matrix if it is symmetric (Hermitian) and idempotent.
The trace of an n x n real or complex matrix A is the sum of its diagonal elements, t r (A) = C&,aii. Definition B.4 (Definiteness) Let A E Rnx" be a symmetric matrix, A = AT. A is 0
0
0 0 0
positive definite (pd) if x T A x > 0 holds for any nonzero XEIW~~. positive semidefinite (psd) if x T A x 2 0 holds for any nonzero x. conditionally positive (semi)definite (cpd) if x T A x 2 0 and xT1 = 0 holds for any nonzero x . negative definite (nd) if x T A x < 0 holds for any nonzero x. negative semidefinite (nsd) if x T A x 5 0 holds for any nonzero x. conditionally negative (semi)definite (cnd) if x T A x 5 0 and xT1 = 0 holds for any nonzero x.
The same definitions hold for a Hermitian matrix A = A t , provided that the transpose operation is replaced by the conjugate transpose. 'Remember that in this book, objects are represented as row vectors in the matrix
X. As a result, the Gram matrix is expressed as G = X X x , when learning aspects are discussed.
Conclusions and open problems
521
In general, positive (semi)definiteness and negative (se1ni)definiteness can be applied to any matrix [Mathworld]. A necessary and sufficient condition for a real or complex matrix to be pd is that the symmetric or Hermitian part, ;(A A X ) is pd. In this chapter we will always refer to symmetric (Hermitian) matrices.
+
Definition B.5 (Eigenvalue, eigenvector) Let A be a real or complex matrix, A E P x n .An eigenvalue is a value X E r' such that there exists a non-zero vector X E P called , eigenvector, for which A x = Ax holds. Theorem B.3 (Eigendecomposition) Let Q be a matrix of eigenvectors of a given square matrix A and let A be a diagonal matrix w corresponding eigenvalues on the diagonal. If Q is a square matrix, then A yields an eigendecomposition as A = Q A Q - l . Furthermore, if A is symmetric (Hermitian), then the columns of Q are orthogonal vectors and A=QAQX.
Let A be a n nxn, Theorem B.4 (On positive-definite matrices) symm,etric or Hermitian matrix. The following assertions are equivalent: (1) A is pd (psd). (2) All eigenvalues o,f A are positive (nwnnegatiue). (3) All its principal minors, i.e. the determinants of all upper-le,ft square submatrices, are positive (nonnegative).
Moreover, A E rnxnis pd iff the bilinear form, (x,y ) =~xxAy defines an inner product f o r all x,y E I?*. Theorem B.5 (Properties of eigenvalues and eigenvectors)
If x is an eigenvector, then also ax is for a non-zero scalar 0. There,fore, eigenaectors are usually normalized to a unit length, x x x = 1. 0 The eigenvalues of a diagonal matrix are the diagonal elements. s Eigenvalues of a symmetric (Hermitian) matrix are all real. 0 A symmetric psd matrix of a rank r has r positive eigenvalues an,d ( n r ) zero eigenvalues. 0 Let A be an n x n real or complex matrix and let be the eigen,values. Then, t r (A) = Cy=lX i . Also, det(A) = Xi. 0 The non-zero eigenualues of A B are equal to the non-zero eigenvalues of BA. Hence, t r ( A B )= t r ( B A ) . 0 If A is symmetric, then the pairs of eigenvectors xi and x,i that correspond t o the eigenvulues A, and X j , i # j are orthogonal. I f X, = Xi, 0
~
n7=l
522
The disszmilarity representation for pattern recognition
then th,e corresponding eigenvectors need not be orthogonal, but they can always be chosen t o be orthogonal. Let A be a'ri Theorem B.6 (Properties of Hermitian matrices) n x n symmetric or Hermitian matrix. T h e properties below apply t o both Hermitian and symmetric matrices:
0 0 0
Hermitian, matrices are closed under a,ddition, multiplication by a scalar, raising t o an, integer porwer, and, if non,-singular, the insverse operation. Hermitian matrices are normal, i.e. A XA = A A X. A i s Hermitlan and psd i f f there exists a matrix B such, that A = B XB . Th,e eigenvalues a Hermitian matrix are real. A n y real matrix C has a unique decomposition C = A + B w h w e A-C+CT and C-CT are symmetric. Any complex matrix C has a unique decomposit%onC = A + i B where A = and B q are Hermitian and i is irnaginarg, i 2 = 1.
Theorem B.7 (Properties of unitary matrices) Geometrically, unitary (orthogonal) matrices are rotations and reflections. T h e properties given. below apply t o both unitary and orthogonal matrices. 0
0
0
0
Unitu.ry matrices are closed under multiplication, raising t o a n integer p0uie.r and the inverse operation. Unitary matrices are normal, i.e. A XA = AAX A i s unitary iff 1 lAxl1 = 11x1 1 f o r all x. Th,e eigenvalues Xi of a unitary matrix all fu@ll J X i ] = 1. I det(A)l = 1 for a unitary matrix A. A m,atrix i s unitary i f f its columns form an orthonormal basis.
Definition B.6 (Hadamard operations on matrices) Consider nxm matrices in W n X m . A Hadamard operation, denoted by *, is an element-wise operation on matrices. The Hadamard product (also called Schur product) is a matrix A * B = ( a i j b r 3 ) . The Hadamard unit matrix is a matrix E whose all entries are 1. A matrix is Hadaniard invertible if all entries are riori-zero, and A*(-1) = (a;') is the Hadaniard inverse of A. The k-fold Hadamard product of A with itself, A*k = ( a t j ) , k 2 0 is the Hadarnard power. So, A*' = I (the convention 0" = 1 is used). If A is Hadamard invertible, then the Hadamard power i s defined as above for negative k . Theorem B.8 (Schur) Let A, B be nxn real p d (psd) mutrices. T h e n A * B is also pd (psd). In particular, A*k i s pd (psd) for all non,-negative imk9er.s k:.
Conclusions a n d open problems
B.2
523
Some facts on matrices in a pseudo-Euclidean space
Suppose I = ( J R ( P > q ) ,(., . ) E ) is a pseudo-Euclidean space and the associated Euclidean space is 1 11 = (RP+q,(.. .)). The key point to understand matrix operations in a pseudo-Euclidean space is to remember that onc deals with a usual vector space, endowed with an inner product, though specific. As there is a linear relation between the inner product operations in E and IEl, given by ( 2 ,y ) ~ = (IC, Jpqg), where Jpq is the fundamental syrnmetry, one may use the associated Euclidean space Il to understand the operations in E . Operations on matrices are well known if both R” and Iw” are equipped with the traditional inner product. If A : E 3,then the adjoint operation is defined as A* :F+ €*. We well first recall the definition of’ an adjoint operator arid repeat some facts from Sec. 2.7. Other details on pseudoEuclidean and K r e h spaces can be found there. ---f
Definition B.7 (Adjoint) Assume E = PQ(P>q)and 3 = JR(”’>q’) are pseudo-Euclidean spaces. Let A be a matrix of a linear transformation E + F. A* is the matrix of a linear transformation .F 4 E* defined as (A z, y ) ~ = (x.1 4 * y ) ~ for all G E € and all y E F. The adjoint A* is equivalent to (1) A* = JPfq/ATJpq if 7’# E . (2) A* = JPqATJpqif F = €. ( 3 ) A* = ATJpq if 3 = IEl. (4) A* = AT if E and 3 are both Euclidean. Some definitions from the previous section do not change as they refer to general properties of matrices in a vector space.
Definition B.8 (Special matrices) Let A be an n x m real matrix 0
0
The rank of A is the dimension of the range of A , corresponding to the number of linearly independent rows or columns of A. B is a left inverse of A if B A = I . B is a right inverse of A if A B = I . If B A = A B = I then B = Apl is the inverse of an n x n matrix A. The J-pseudo-inverse of A is the unique matrix A;’ that satisfies: (a) A A i l A = A (b) A g l A A i l = A c l (c) ( A A i l ) * = A A i l (d) ( A i l A ) * = A i l A
524
0
The dissimilarity representation f o r p a t t e r n recognition
The Gram matrix3 of A is the matrix G = A*A.
An r i x n matrix A is 0 0 0
0
0
0
0 0
an identity matrix I if A is a diagonal matrix with aii = 1 for all i . a permutation matrix if it results from a permutation of the columns of I . singular if it has no inverse, and non-singular, otherwise. A is singular iff det(A) = 0. & s y m m e t r i c (J-self-adjoint) if A = A* for a real matrix A. If A : R(”>q)+ R(Piq), then A = Jp,ATJp,. 3 - n o r m a l if A*A = AA*. 3-orthogonal if A*A = I for a real matrix A. If A :R(p.q)4 R(p,q),then 3.,AT.&,A = I. an J-idempotent matrix if A* = A2 = A. a projection matrix if it is 3-symmetric and J-idempotent.
The truce of an nxn real matrix .4 is t r (A)= C,”=, aii.
Definition B.9 (Definiteness) Suppose E an nxn J-symmetric matrix, A* = A. A is 0
0
= R(pi4),p + q = rb. Let
A be
3-positive definite (3-pd) if x*A x > 0 holds for any nonzero x E E . This is equivalent to stating that ( J p q A )is positive definite in lEl. J-negative definite (J-nd) if x * A x < 0 holds for any nonzero x E E . This is equivalent to stating that (&,A) is negative definite in I€].
To understand the issues below, the reader is referred to the indefinite least square problem discussed in Sec. 2.84.
+
Proposition B.l Let E = R(”>q),p q = n be a pseudo-Euclidean space and let X be a n mxn, real matrix, representing rn ro’w vectors in n-dimensional space E . 0 0
T h e Gram matrix is deJned as G = X * X = XT&,X; see also footnote 3. Th,r pseudo-inverse o f X is X;’ = G-’XTJp, = (XT3.qX)~1XT&q. Let the vectors x1,x2,. . . , x, span a subspace of € such that X = [XIx2 . . . x,]. T h e n the projection matrix onto this subspace is defined as Px = XX;’ = x ( X T J p , , X ) - 1 X T 3 p q .
One can easily check that X;’ and Px fulfill the conditions of Def. B.8. Note that we represent objects as row vectors in this book. As a result, the YRemembcr that in this book, objects are represented as row vectors in the matrix X . As a result, the Gram matrix in a pseudo-Euclidean space is expressed as G = X X * , when learning aspects arc discussed.
C o n c l u s i o n s and o p e n problems
525
formulas above should be read by setting X = XT. For instance, the Gram matrix is XJpqXT. Eigenvectors and eigenvalues are defined identically as in Def. B.3. Proposition B.2 (Eigendecomposition in a pseudo-Euclidean space) Let AE be a J-self adjoint matrix in R(P>q). The eigen,decornposition of AE i s AE = QEAEQ;, where Q& is a 3-orthogonal matrix of eigenvectors of A€ and RE is a diagonal matrix of the corresponding eigenvalues.
Proof. Let AE be an n x n 3-self adjoint matrix in 1. Then, AE = A; holds by definition. The eigen-pair (qi,X i ) in E is defined by A&qi= Xiqi, i = 1 , 2 , . . . , n. Consider a pair of vectors qi and qj for i # j . By making use of'the facts that A& = A; and (qi,qj)E= qlqj = q:Jpqqj, one has: Xiqj = AEqj Xiqj = Azqj
X i qt qj Xiq:&,qj
= q,* A; qj T = qi Jpq
(.4&Sj1
X i q l J p q q j = XjqTJpqqj
Xi(q,,qj)€= xj(q,,qj)E
Since X i and X j are different, in general, (qi,qj)E= 0. This proves that the eigenvectors are orthogonal in E , hence they are form a basis in €. Assume that the eigenvectors are ordered such that the first of them describe the positive definite subspace IWP and the remaining q eigenvectors describe the negative definite subspace Rq. Then, I /qi1 ; > 0 for qi E Rp and 1 lq,j1 ; < 0 for qj E R4.Denote by Q the resulting matrix of orthogonal eigenvectors. Then, QTJpqQ = JPqE,where E = Jpqdiag (llqill;). Scale all qi eigenvectors by 2 -to make them orthonormal in E arid store them K again in Q . This is possible since scaling by a constant does riot influerice the eigen-equation. Note that Q is now 3-orthogonal by construction, as QTJpqQ = Jpq, see Def. 2.93. Consequently, AEQ = &A, where A is a diagonal matrix of the corresponding eigenvalues. So, AE = QAQ-l. 0 Since Q-l = Q* by Def. B.8, then AE = QhQ*. Remark B.l (Determination of the eigen-pairs) Let AE be an 12x71. J - s e l f adjoint matrix in E . By Def. 2.93: A = AEJ~, is a symmetric matrix in l€l. Let Q be 3-orthogonal. Since Q* = JpqQTJpq, then the eigen,decomposition AE = QhQ* can be written as = QA(JPqQTJpq)Jpq. This
526
T h e dissamilaraty representation f o r p a t t e r n recognation
is equivalent to A = QAJpqQT,which leads to A = QAIEiQT,as J p q J p q = I . Th,is is a traditional eigendecomposition of a symmetric matrix A in a Euclidean space. So, standard methods can be now applied to determine the matrix Q of eigenvectors of A and a diagonal matrix AIEl of eigenvalires of A . Based on the original formulation, A& = QA3pqQT3p,, it follows that such determined Q is a matrix of eigenvectors in E and A = AIEl&, is a diagonal matrix of eigenvalues in, E . Although Q has the same f o r m in E and in / E l , they are different operators, as the adjoint in E is Q* = JpqQTJpq, while in, IEl is QT. Proposition B.3 (Covariance matrix in a pseudo-Euclidean space) Assume a set of vectors { X I , X Z , . . . , x,} in a pseudo-Euclidean space E = R(P,q), stored in a matrix X . The estimated covariance matriz of X is defined as [Goldfarb, 19851:
where X = $ C:=l xi and C ( X ) is the estimated covariance matrix in the associated Euclidean space IEl. Remark B.2 Such a covariance matrix is not positive definite, contrary to th,e intuition one develops in Euclidean vector spaces. It is, however, J positive definite and defined an agreement with the definition of a covariance operator in vector spaces C ( X ) = c:="=,xi- X) (xi - X)", where " denotes the adjoint. By Def. B.7, the adjoint of x in E is x* = xTJpq,which leads to a pseudo-Euclidean covariance matrix as expressed b y Eq. (B.1). Many properties mentioned in the previous section can be redefined for eigenvalues, eigenvectors and 3-symmetric, and 3-orthogonal matrices by replacing the traditional adjoint " by the pseudo-Euclidean adjoint * and by using (., .)E instead of (.: .), They will riot be described here.
Appendix C
Measure and probability
We will present here basic concepts from probability t,heory. To learn more about measures and probability theory, one is referred to [Halmos. 1974; Billingsley, 1995; Chung, 2001; Ghahramani, 2000; Feller, 1968, 19711. Also, the Website h t t p : //www/probability .net can be consulted. Definition C.10 (a-algebra) Given a set R,a cr-algebra is a collection A of subsets of R satisfying the following conditions: 1. O E A . 2. A E A + A"EA, where A" = R \ A is the coniplemerit of A. 3. Any countable iinion of elements Ak € A is in A, i.e. A k €A.
u,"=,
Given any collection B of subsets of A, the a-algebra genera,t,ed by B is defined to be the smallest a-algebra in R containing B as an element.
Definition C . l l (Borel algebra) Borel algebra is a pair ( X ,A): where A is the minimal a-algebra on the topological space X containing open sets of X . An element of Borel algebra is a Borel set. A Borel algebra on X is denoted as B ( X ) . Definition C.12 (Measure, measure space) A mensurable space is an ordered pair ( n , A ) ,where f2 is a set and A is a collection of subsets of R,which is a o-algebra. A (countably additive) measure on ( 0 2 ? A is ) a fiinct,ion p : A + 'w$ such that p(0) = 0 and p is countably additive. Ak) = C k p ( A k ) for any sequence of disjoint sets A k E A. Tlie i.e. p, A, p ) is called a measure members of A are meas~erablesets. Tlie triple space.
(uk
(a,
Note that if A1 and A2 are measurable sets, then A1 C ~ ( A z hence ) , the measure ,u is a non-decreasing function.
A2
=+-p ( A 1 )
Example C.2 Example measures are: 0 The counting measure, which counts the number of elements in a set. 0 The Lebesgue measure; see Def. (2.14. 0 Probability measure; see Def. C.16. 527
5
T h e dissimilarity representation f o r p a t t e r n recognition
528
A measure is defined above to take nonnegat,ive values. It is, however, possible t o consider other measures, which are defined as countably additive set functions with values in the real or complex numbers [Halmos, 19741. One may also consider measures in Banach spaces or in indefinite inner product spaces. Definition C.13 (Measurable function) Let ( X ,B ( X ) )and (Y,B ( Y ) ) be two measurable spaces. Then f : X + Y is a measurable function if the preimage of every set of B ( Y ) is in B ( X ) ,i.e. f - l ( B ( Y ) ) iB ( X ) , where
f - l ( B ( Y ) ) = {f-'(E)I=B(Y)). Let A be a subset of R. let Definition C.14 (Lebesgue measure) L ( I ) be the length of an interval I C R, i.e. if I = ( a ,b ) , then L ( I ) = b - a. The outer measure of A is defined as p * ( A )=
inf U,A,>A
E.L(Al), 3
where the infinium is taken over all countable collections { A j } of sets from R that cover A. A is said to be Lebesgue measurable if, for any B C R, p*(B) = m * ( An B ) p*(A n BC), where BC = R \ B is the complement of B and p*(A) is the outer measure of A. If A is measurable, then the Lebesgue measure of A is p ( A ) = p * ( A ) . Lebesgue measure on R" is the n-fold product measure of Lebesgue measure on R.
+
Definition C.15 (Lebesgue integral) Let X be an interval in R". Let f : X R U { f c m } be a measurable function on a measure space ( X ,B ( X ) ,p ) . The integral of f , J ,; f dp, is defined such that: ---f
If f = Z(A) is the characteristic function of a set A E B ( X ) , then
,J
Z(A) dP = d A ) . If f is a simple function, i.e. a finite linear combination of characteristic functions, f = C kK= l ~ k Z ( A k ) where , A k E B ( X ) and a k E R,then .fxf dP = a k sx ZA! d p = Qkp(Ak). I f f is a nonnegative function (possibly reaching co at some points), then J, f dp = sup {Jx g d p : g is simple and g ( x ) 5 f ( x ) VZEX}. Let any measurable function f (possibly reaching the values 00 or -cm at some points), be decomposed as f = f + - f - where f + = max(f, 0) and f - = max(- f , 0). Then f dp = f + dp f - dp, provided that f t dp and f - dp are not both 00.
cf=i=,
sx
sx
cF==,
s'
sx
sx
If p is a Lebesgue measure, then the integral defined above is the Lebesgue integral.
Conclusions and open problems
529
Definition C.16 (Probability space, probability measure) A probability space is a measure space ( 0 ,d,p ) , where P : d + [0,1] is the probability measure defined on a a-algebra such that P ( 0 ) = 1. The set 0 is called the sample space, the elements of A are called the events and P ( A ) is the probability of the event A. A probability distribution is a probability measure. Definition C.17 (Probability axioms) Let (Q, A, P ) be a probability space. Probability measure is defined for sets A E B(R). It satisfies the following properties: 0
P(A)2 0. P ( 0 )= 1 (hence P(0) = 0).
0
For any countable sequence of pairwise disjoint events
0
P(uAk) = c
k
A1.Az.. . .
P(Ak).
Lemma C . l Let (0,A, P ) be a probability space. O n e has: 0
0
P ( A U B ) = P ( A )+ P(B)- P ( A n B ) . P(R\A) = 1- P ( A ) . P ( A n B ) = P ( A )P(B1A).
Definition C.18 (Conditional probability and independence) The conditional probability of an event A assuming that B has occurred is defined by ((2.2)
where P ( A f l B )is the jo,int probability. Two events A and B are independent if their probabilities satisfy
P ( An B ) = P ( A ) P ( B ) .
(C.3)
Theorem C.9 (Bayes) Let ( A k ) be a sequence of pairwise disjoint events which completely cover the sample space R. Let B be a n y event. For any Ak the Bayes’ rule states that
Remark C.3 In the m o s t simple case, the Bayes rule leads t o P(A, B) = ) ( A ) , where P ( A ,B ) denotes the j o i n t probP ( A ( BP ) (B) = P(B(AP ability P ( A n B ) . An extension t o three events gives P ( A ,B , C ) =
530
T h e dissimilarity representation f o r pattern recognition
P(AIL3,C )P ( B ,C ) = P(AIB,C )P(BIC)P ( C ) . Assuming n events which depend on somx eiient Y ; a chain rule is obtained: P ( X I > X 2 ,. .. , X T L=) P ( X , : X Z , .. . , X n / Y ) P ( Y ) = P ( X I I X 2 , . . . , X,!
Y ). . . P(X,IY)P(Y).
((3.5)
Definition C.19 (Random variable) A random variable is a measurable function from a probability space to some measurable space, usually to B(R). A random variable X is discrete if it attains only values from a finite or countable set U ; C U E u P ( X= u)= 1. A random variable X is co,ntinuous if its density function f x is absolutely continuous, which means that for any subset of real numbers A which is constructed from intervals , f x (z)dz. by a countable number of set operations, P ( X E A ) = J Definition C.20 Let X be a random variable. A cumulativc distribution function (cdf) FX : R 4 [O; l] is defined as F x ( z ) = P ( X 5 x). The probability that X lies in [a,b] is then P ( a 5 X 5 b ) = F ( b ) - F ( a ) . Probability density function (pdf) is a Lebesgue measurable function f x : fx(z)dp. such that P ( u 5 X 5 b ) =
s,”
R+
Definition C.21 Let X be a real-valued random variable and let f ~ ( x ) be a probability density function. Then 0
Expectation of X , usually denoted by p, is defined as
E [ X ]=
i,rT;
C , zfx(z),
if X is discrete,
z f x ( z ) d z , if X is continuous:
if the corresponding sum or integral exist. Recall that E[.]is a linear fimction, i.e. E [ a X b] = a E [ X ] b holds for any a , ~ E R .
+
+
Variance of X, usually denoted by a 2 , is defined as V [ X ]= E [ ( X E [ X ] ) 2 ]provided , that the expectations exist. Equivalently, it can be computed as V [ X ]= E [ X 2 ] E [ X I 2 . Recall that V [ a X b] = u 2 V [ X ] holds for any a: h E R .
+
~
0
0
k-th morrient of X is defined as E [ X k ]and k-th central moment of X is E [ ( X - E [ X ] ) ’ ] ,provided that the expectations exist. Skewness of X is defined as E [ ( X - E[Xll31
(V[X])+
Conclusions a n d o p e n problems
0
Kurtosis of X is defined as
531
~ [ ( ~x[ ~ 1 ) 3.~ 1 v[x1)2 -
-
Definition C.22 (Random vector) A (multivariate) rundom, vector X is a vector X = ( X I ,X z , . . . , X N ) such that X , are real-valued raiidom variables on the same probability space (n,A, P ) . Every random vector gives rise to a probability measure on R n , known as the j o i n t distribution, with c--algebra specified by the Bore1 algebra. The distributions of each of the component are marginal distributions. Example C . 3 A few important probability distributions: 0
Uniform distribution, X U ( a ,b). X is a uniform (continuous) random variable with parameters a and b if f x ( z )= for X E [ab ] , 0, otherwise. E[XI= and V [ X I = 12 N
&
9
0
0
0
0
’
X is a uriiforni (discrete) random variable with parameter N if fx(x)= (N2-1)’ arid V [ X ]= 7. N1 ,z = { 1 , 2 , . . . , N } . E [ X ]=
-
Normal distribution, X N ( p ,0’). X is a normal (Gaussian) random variable with the mean p and variance m2 # 0 if fx(n:)= 1 e x p {( z-- pFL ) 2 } . When p = 0 and 0’ = 1, X is a X U standard riorrnal random variable. E [ X ]= p and V [ X ]= 0 2 .
-
Binomial distribution, X B(n.p) X is a binomial randoni variable with parameter p if it is the iiurriber of successes in a Bernoulli trial. A Bernoulli trial is an experiment in which only two outcomes are possible: success, with probability p , and failure, with probability 1 - p . The probability of exactly k successes is given by P [ X = k] = (;)pyl - 4 7 1 4 . E [ X ]= n p and V [ X ]= npq.
X is a central chi-squared random variable with n degrees of freedom, (L)Z
x xi,if fx(z) = 6 z 5 - l exp{-+z},
n: > 0, wkwe gamma function r(t)= J, ~ ~ - l e - ~ dt > z, 0. E [ X ]= n, and V [ X ]= 2n. N
r represents the
cc
0
Multivariate normal distribution X arid a covariance matric c.
-
N ( p ,C) with the mean vector p
The dissimilarity representation for pattern recognition
532
Multivariate normal probability density function is defined as p ( x ) = 1 exp -i(X - p)TC-l(X - p ) (27r)s (det(C))i E [ X ]= p and V [ X ]= C.
{
Lemma C.2 Some properties of the normol distribution:
-
+
+
If X N ( p ,02)and a and b are reals, then a X b N ( a p b, ( a a ) 2 ) . If X N ( p x ,o-:) and Y N ( p y ,o$) are independent normal ran,dom variables, then their sum is normally distributed, ( X Y ) N ( p x py,& &) and their difference is normally distributed, ( X - Y ) 2 N ( P X- P Y , ox +~ $ 1 . If XI, Xz,. . . , X, are independent standard normal variables, then X; + X; . . . X,”has a xi squared distribution with n degrees of freedom. N
N
N
+
+
N
+ N
+ +
Let X 1 , X z , . . . be indeTheorem C.10 (Central limit theorem) penden,t random variables with probability distribution functions FI, Fz, . . ., such that E [ X k ]= p, < 00 and V [ X k ]= oz 0 . L e t S , = X l + X z + . . . +X , a n d s , = , / m = , / r ? + . . . + a ; . Then the normalized partial sums S,-E[S,,l converge in distribution to a random ,variable with normal distribution N(O,1)if the following Lindeberg condis7L
tion is satisfied:
Appendix D
Statistical sidelines
This appendix provides background on parameter estimation and some probabilistic models.
D.l
Likelihood and parameter estimation
This brief summary relies on [Ripley, 1996; Bishop, 1995; Bilmes, 19971. Suppose we are given a finite sample X = {XI, x2,.. . , XN} in a vector space Rn. We assume that these vectors are independent and identically distributed according to the distribution p . This density function p(xl8) is, however, governed by a set of parameters 8 E 0. For instance, p can be a Gaussian distribution and 8 denotes its mean and covariance matrix. Thanks to statistical independence, the joint probability distribution is given by N P ( X l , X 2 , .. . , X i v P ) =
nP(x,le) = L ( @ ( X ) .
P.1)
z= 1
C ( 8 l X ) is called the lzlcelrhood of 8, given the data. It i s a function of 8 for the observed (fixed) data X. The aim is to determine the value of 8. This is usually done by the maximum likelihood or maximum a posteriori estimator. Before introducing them, let us define the Fisher information. It nieasures how much information is available about the parameter 8. Definition D.23 (Fisher information) Given a statistical model (fx(xl8): 8 E O} with the log-likelihood function CL(8lx) = logfx(xl8), the score function is defined to be the gradient of C L , i.e. VCC = If 0 is a single parameter, then the Fisher information I ( 0 ) is the variancc of the score, i.e. :
%.
533
534
T h e dissimilarity representation f o r pattern recognition
as the expectation of the score is zero. If 8 is a vector, then Z(8) is the Fisher information matrix. Equivalently,
Theorem D . l l (Cramer-Rao inequality) 8 . T h e C r u m e r - R u o inequality states t h a t
Let
8
be arc estimator of
which for u single p a r a m e t e r 8 becomes
Given (p(xl8) : 8 E O}, the goal is to determine the value of 8. The muxrmum-lzke//hood(ML) estimator selects this 8 which rriaximizes the likelihood for the given data, or
8,,,
= argmaxL(81X).
8
0 6 )
Since a logarithniic fimctiori is monotonically increasing, one usually rnaximizes the log-likelihood instead, sirice it is analytically easier. This gives: N 8 ~ 1 ,= argmaxLL(8IX) =
8
argmax~log(p(x,/B)).
(D.7)
z=1
For sufficiently well-defined problems (where log-likelihood is differentiable and the maximum exists in the region of possible parameters 0 ) .the maximum likelihood will appear at the stationary point of loglikelihood. This way is commonly practiced to deterniinethe optimal 8. However, the optimal 8 , ~may lie on the boundary of 0 . In general, the maximum likelihood estimator may not be unique or may not even exist [Bishop, 19951. ML estimators are asymptotically unbiased and have a Gaussian distribution with covariance matrix equal to the Fisher information matrix. Another way to estimate the parameters 8 is the use of Bayesian paradigm, where 8 is assumed to be a random variable. Hence, a prior distribution p ( 8 ) can be considered. By making use of Bayes formula, the posterior distribution of p(8) is defined as follows
Conclusions and open problems
535
where p(x) = s b , p ( x ] ( 8 ) p ( 8do. ) A maximum, a posteriori estimator of 8 maximizes p ( 8 l X ) . Since p(x),the denominator of the posterior distribution does not depend on 8, it will play no role in the optimization. As a result, tlie MAP estimator of 8 is
sn,,w = argrnaxy(xI(8)p(o). 8
(D.9)
Using the logarithm instead will give: = argmaxLL(8IX)
8
+ log(p(8)).
(D.lO)
So, the ML estimator is a particular case of the MAP estimator, when the uniform prior distribution p(8) is used. The MAP estimator focuses on the modes of the density. According to Ripley, 'MAP estimators are most useful as a simple suiiiniary of a highly concentrated posterior distribution.' [Ripley, 19961. ML and MAP estimators could be computed analytically or by numerical optimization techniques such as the Newton-Raphson method [Prcss et al.: 19921 or (scaled) conjugate gradient method [Shewchuk, 1994; IbIder, 19931, when the first and second derivatives are evaluated analytically or numerically. This is often intractable or the equations become very hard to derive. An easier alternative is tlie expectation-maximization (EM) algorithrn. For the MAP estimators, an extended EM should be used. D.2
Expectation-maximization (EM) algorithm
EM algorithm is used for finding maximum likelihood estimates of the parameters c3 in probabilistic models, where the model depends on nnobserved latcnt (hidden) variables, denoted as y [Dempster et al., 19771. EM alternates between performing an expectation step, E-step, which computes the expected value of the latent variables, arid a maximization step, M-step. which computes the maximum likelihood estimates of the parameters given the data and setting the latent variables to their expectation. Suppose X is the observed data originating from some distribution. We assume that X is incomplete, as some unobserved hidden variables y exits. Z = ( X , J ' ) is the complete data. We also assume a joint relationship between the observed and missing values, given in the form of' their joint probability density:
536
T h e dissimalarity representation f o r pattern reco.qnition
The likelihood function, the complete likelihood, is defined as C ( 0 l Z ) = C(OlX, Y ) = p ( X , Y P ) ,
(D.12)
giving rise tot the following log-likelihood:
L C ( 0 l Z ) = log(p(X,YIO)).
(D.13)
Starting from some initial estimates Oo, the EM algorithm alternates between the E-step and M-step until convergence: 0
E-step: Find the expected value of the complete log-likelihood C C ( 0 l Z ) = l o g ( p ( X , Y / O ) )with respect to unknown data Y given the observed data X and the current estimates 0'. The new parameters 0 have to optimize:
Q(0, OZ)= E[log(p(X,Yl0)) I X , 0", where X and 0% are constant. The expectation above is
0
where p(ylX, 02)is the marginal distribution of the hidden variables, which is dependent on the observed data and the current parameters. M-step: Find the parameters which maximize the expectation computed above: @if1-
argmaxQ(0,O'). 0
This procedure is guaranteed to improve the log-likelihood at each iteration. Usually one starts from a random initialization. The whole process is repeated A4 times, say, M = 20, and the solution maximizing the likelihood among those M runs is chosen. Note that instead of maximizing p ( X , Y l O ) ,one may maximize the joint probability p ( X ,Y ,0 ) = p ( X , Y l O ) p ( O ) ,where a prior term is simply included. This formulation comes from MAP estimation, as described in the previous section. The procedure above stays the same, except for an added term of log(p(0)).
D.3
Model selection
In probabilistic setting, one considers models which can be thought of as a parameterized set of probability distributions of the form { P Q , ~E' O},
Conclusions and open problems
537
where increasing (decreasing) 0, increases the model complexity. Model selection relies on choosing a model {Po of a right complexity for the given task; see [Ripley, 19961 or [Raftery]. An example is the determination of the number of clusters in the k-means clustering, or a smoothing parameter in the radial basis function networks. Consider a set of possible models, M I , A d z , . . . , M K estimated for N data vectors X = { x I , x ~. .,. , X N ) . Let C k ( A I k ( X ) = p ( X ( M k ) denote the maximized likelihood under the k-th model. The likelihood ratio test is a statistical test comparing two models, a relatively more complex model to a simpler model. This can only be used to compare hierarchically nested models, which means that the more complex model differs from the simple model only by the addition of one or more parameters. Adding additional parameters will always result in a higher likelihood score, however, a t some point additional parameters will not yield significant improvement in fit of a model to the data. The test is the ratio of the likelihood scores of the two models:
-2A(X) = 2(log&
-
(D.14)
logLk+l),
which is asymptotically x ~ , , , ~distributed ~, with m = rnk+l-m.l; degrees of freedom. The value -2X is compared to the upper (1 - a ) percentile point of the x$ distribution with m degrees of freedom. The model Mk is rejected if -2X(X) > xf-,,,, where x:-,,, is the (1 - a ) quantile of this distribution. Other methods try to penalize the fits by the complexity of the model (which, in the statistical sense, is the number of free parameters; note that this is not true, in general, in the Vapnik’s sense; see footnote 4). The generalized information criterion (GIC) for the model h f k is given in the form of a penalized log-likelihood: GIC(Mk) = -2 l O g ( c k )
+ Q ( N ) +pk, m k
where m k is the number of independent parameters in b k depend on a particular criterion: 0
0
0
h f ~ and ,
(D.15)
a ( n ) , and
Akaike information criterion (AIC) [Akaike, 19731: a ( N ) = 2 , for all N , and ,& = 0. Bayesian information criterion (BIC) [Schwarz, 19781: a ( N ) = log(N), and p k = 0. Kashyap information criterion (KIC) [Kashyap, 19821: a ( N ) = log(N), and /3k = 0, where nBk is negative of matrix of second partial derivatives
538
The dissimilarity representation f o r pattern recognition
of the log likelihood with respect to the parameters, evaluated at their rnaxiniurn likelihood estimates. The expected value of B k is the Fisher information matrix, Def. D.23. When two models are compared. one chooses a model with the largest GIC value. AIC has a drawback that as the sample size N increases, the more corriplcx model starts to be preferred. AIC generally chooses a model with more parameters than the others. Since for n > 8, log n,> 2, SIC will choose a model no larger than that chosen by AIC for n > 8.
D.4
PCA and probabilistic models
This scctiori discusses some basic models that are usually applied to model thc data. These are Gaussian model, principal component analysis and its probabilistic version as well as their mixtures. Throughout this section a set of vectors X = { X I , x2, . . . , x ~ in } a vector space Rn is considered.
D.4.1
Gaussian model
Gaussian model assunies that data vectors x originate from a Gaussian distribution:
where p is the mean vector and C thc covariance matrix over all x E X . Different models can be constructed by constraining C: 0
0
0
0
C is a full covariance matrix, leading to an elliptic Gaussian model. C = diag(a,,), a diagonal matrix, leading to an elliptic Gaussian with major axes aligned with axes; only variance in each dimension is taken into account. C = a21,a diagonal matrix with equal values on the diagonal, leading to a spherical Gaussian. C = I , identity matrix; only the mean is uscd.
Often only models with the diagonal covariance matrix are used. The sample mean and covariance matrix are given be their maximum likelihood
Conclusions a n d o p e n problems
539
estimators:
(D.17)
The negative log-likelihood logp(xl{ C, p } ) , or normalized Mahalanobis distance, expresses the distance between a vector x and the estimated Gaussian model G, or how likely it is that x is generated by G: ~
n d ( x jG) = - log(27r) 2
+ 21 log(det(C))+ -21 (x -
-
X)TC-'(x
-
X).
(D.18)
This is the Mahalanobis distance normalized for the volume introduced by C . If C = I , then the squared Euclidean distance between x and p is obtained:
d(x,G) = jjx - X1I2.
(D.19)
If data cannot fill the space sufficiently well and a full covariance matrix is used, the estimated C will be poorly conditioned. That is, tlet(C) will be very small and log(det(C)) distance can still be used:
+ -m.
D ~ I ( xG) , = (X
However, the standard Mahalanobis
-
(D.20)
X ) T C p l ( ~- X),
If the data lie in a subspace, C may become singular. To circiiniverit this problem, C is usually regularized, e.g. such that Greg = (1 - A) C X I for a suitable X > 0. Another possibility is to use PCA to rnap the data to a
+
lower-dimensional space to retain a certain proportion of variance, say 90% and fit the model there. Assume a collection of vectors X = {XI,x2,.. . . XN} in a vector spacc R". We will describe the most popular linear technique of dimensionality reduction by feature extraction. This is principal component analysis [Hotelling, 19331. D.4.2
A Gaussian mixture model
Let X = {XI,x2, . . . , XN} be N independent, identically distributed vectors in R". In the Gaussian mixture model (MoG), x is assumed t,o arise
540
T h e dissimilarity representation for p a t t e r n recognztzon
independently from a mixture with density: K
f(xl@)= ~ ~ k
i u k ( x l r P k . w ) >
k=l
where O = {{nk,p k ,Ck}f=(=l}, 7r-k are the mixing coefficients, o < 7rk < 1, K k = 1,.. . , K such that Ck=l 7rj = 1. We assume that p k ( x l { p k , C k } ) = N(pk, C k ) . The incomplete log-likelihood becomes
ccAtoG(@~x) =Clog i=l
c,
)
.
C r I ; : , ~ k : ( ~ i ~ { p k , ~ k ) )
This is difficult to optimize, so the existence of hidden variables y = {yl, . . . , y ~ is} assumed to inform which component density generated each data vector. Hence yi E (1,. . . , K } , and yi = k if the i-th vector originatcs from the k-th mixture component, M k . The posterior density R k j = p(Mklxj), called also responsibility, is the probability that Mk generates xj. The function Q becomes then: N
K
3=1 k=l
as log(p(x,, ~ k ( @ ) =) log(p(x,(Aifk,O))+log(p(Mk(@))and 7r-k = p(M~,l@). Thanks t o the Bayes rule, in the E-step, the responsibility of the model h i f k for generating point xJ is found as
(D.22)
This gives the following updates in the M-step:
(D.23) N
C o n c l u s i o n s a n d open problems
541
D.4.3 P C A Principal component analysis (PCA) is one of the most popular linear techniques of dimensionality reduction [Hotelling, 1933: Manly, 1994; Duda . . ,XN} in RTL. PCA et al., 20011. Given a set of vectors X = {xl,x2.. finds a linear m-dimensional subspace, where the vectors are orthogonally projected to y = Q(x -
(D.24)
such that the retained variance is preserved as well as possible. The N x n matrix Q contains the PCA prqectron vect0r.s as its rows. The m projection vectors that maximize the variance of y, i.e. the prznczpl axes, are the eigenvectors q,, q,, . . . , q, of the sample covariance matrix N C= CL=l(x -Z)(x -%)T corresponding to the largest non-zero eigenvalues A,, X2, 5 . . . ,A These vectors are found by solving the set of equations
Cqi = X i q i ,
i = 1 , 2 , .. .,N .
(D.25)
and sorting the q, by the associated eigenvalues A,. Tlic vectors q, are known to be orthogonal. They are first made orthonormal, so that thc eigenvalues are proportional to the variance in the eigenvector directions. The proportion of variance retained by the PCA projection to m dimensions is described by the normalized sum of these m eigenvalues:
(D.26) This condition is also used to find the number of dimensions m required to retain at least a proportion r of the variance. Two other important properties of PCA are: 0
0
uncorrelated representation: the covariance of the projected data is diagonal, E[QQT]= A, where A is a diagonal matrix of eigenvalues. least squares reconstruction: PCA projection minimizes the squared reconstruction error. This means that if the projected vector q = Q(x- p ) is projected back into the original space as x = Q-Q(x - p ) , where Q- = QT(QQT)-' is a pseudo-inverse. Since Q is an orthogonal matrix, then Q- = QT,as QQT, then the squared reconstruction error ll(x - p ) - QTQ(x- p)1I2 is minimal. A PCA projection is the optimal projection in the least squares reconstruction sense.
T h e dissimilarity representation f o r p a t t e r n recognation
542
The distance of a vector x to a PCA subspace P specified by the paranieters { p .Q} is the reconstruction error: d 2 w ) D.4.4
II(X-
-
Q’Q(X
-
(D.27)
P)II’.
Probabilistic P C A
Probabilistic PCA (PPCA) is an extension of’ traditional PCA in the probabilistic setting. It was proposed by [Tipping and Bishop, 19991. In traditional PCA, the dimensions ‘outside’ the subspace are simply discarded. In PPCA however, these are assumed to contain independently and identically drawn Gaussian noise and are incorporated into the model. An n-dimensional observed variable x is believed to originate from an mdiineiisiorial latent variable q (rn 5 n ) as x = wq
+ p +E.
(D.28)
T/I/ is an iniknowri matrix nxm and E is a iid spherical Gaussian noise c N ( 0 ;0’1).The latent variables are assumed to have a standard normal
-
distribution (this is the Gaussian prior),
(D.29) and the conditional distribution of the obscrved variables is modeled by a Gaiissian:
Consequently, the distribution of x can be written as: P(X) = -
s
p(xlq)p(q)dx
(D.31)
1 n
(27r) 7
(det(Cw))-i exp
+
in which Cw = u21 WWT is the model covariance matrix. The loglikelihood of observing the entire data set X is N C P C A ( { C ;p } l X ) =
(D.32)
clogdxz) 2=1
=
N (nlog(27~) log(det(Cw)) t r (CG’C)) , 2
--
+
+
Conclusions and open problems
543
N
Cz=l(xi - p)(xi - p)T is the sample covariance matrix where C = of X. The log-likelihood is maximized when the columns of W span the principal axes of the data. Hence, W = QT(A - C J ' I )R~, where QT consist of column eigenvalues of C with the corresponding diagonal matrix A of eigenvaliies and R is any orthogonal matrix, which can be chosen to he the identity matrix. For this maxinium likelihood estimator of W , 0 ' is g,riven ' b n
(D.33) i=m,+I
which is the average variance in the discarded dimensions.
D.4.5
A mixture of probabilistic P C A
In a mixture model setting with K subspaces, the log-likelihood becomes N ~ n r o P c A ( { ~ , , ~ 3 l ~ K= l XC)l o g 2=
(D.34)
1
where 7rJ is the mixing weight, T~ 2 0. VJ and x7rJ= 1. Similarly as for a Gaussian mixture model. Appendix D.4.2, the responsibility of modcl PJ for generating point x, is found in the E-step as
(D.35) The maximum likelihood solution can be found by taking derivatives of C~~I~P with C Arespect t o p , W and o2 [Tipping and Bishop, 19991. This gives the update equations for the M-step, for i = 1,2 , . . . , k : %j
1 N
-
N
C
Rji
i=l
(D.36) N
after which Wj and crf can be found by applying standard PCA based on Cj = ?j, and Q,? = W j .
544
T h e dissamalaraty representation for p a t t e r n recognition
Tlic EM algorithm may become unstable when one of the models shrinks to only one point [Bishop, 19951, as (T -+0 and the log-likelihood goes to infinity. To circumvent this problem, one may regularize cr or re-initialize collapsed rnodels.
Appendix E
Data sets
All information is imperfect. We have to treat it with humility. JACOBBRONOWSKI
Data sets used in our study have different characteristics. They should be representative for a number of learning problems dealing with dissirnilarity representations. Some of the dissimilarity data matrices are visualized as intensity images, where each pixel intensity corresponds to a dissimilarity value between a pair of objects. The darker the pixel, the srrialler the dissimilarity. The black line on the diagonal corresponds to zeros. The usage of the data sets described here is summarized in Table E.l.
E.l
Artificial data sets
We will consider a number of artificial data sets describing two-class discrimination problems. Gaussian data refer to normally distributed classcs. Studying artificial data are useful, since we can control their parameters or properties, such as the initial dimension arid class overlap. Therefore, some insight can be gained while different dissimilarity measures are used for the represent ation.
Ringnorm. This is an implementation of Breiman's ringnorni example [Breiman, 1996b], taken from [DELVE]. The data consist of two classes in a 20-dimensional space. Each class is drawn from a multivariate normal distribution. The first class has a zero mean ml = 0 and the covariance matrix of C = 4 . I . The second class has the nican m2 = 2/sqrt(20) 1 and the identity covariance matrix. Breiman reports the theoretical expected misclassification rate of 1.3%. A Euclidean distance is used for the representation; see also Fig. E.l. This data set is used in See. 7.1.2 for the illustration of' clustering approaches. 545
546
The dzsszmzlaraty representatzon SOT pattern recognatzon Table E.l
Data Ringnorm Hypercube Banana Polygon Convex polygon Ionosphere Wine Ecoli MFEAT Pump vibration Cat-cortex Protein Ball-bearing Heart disease Diseased mucosa Geophysical spectra ProDom NIST digit NET-38 digit Zongker digit Pen-based digit, Newsgroups Texture
Data sets used in the book
Usage Clustering: Chapter 7 Visualization: Chapter 6 Illustration and visualization: Chapters 3 , 4 and 6 Classification: Chapter 9 Classification: Chapter 9 Combining: Chapter 10 Classification: Chapter 9 Classification: Chapter 9 Combining: Chapter 10 Visualization: Chapter 6 Clustering: Chapter 7 Clustering: Chapter 7 One-class classification: Chapter 8 One-class classification: Chapter 8 One-class classification and combining: Chapters 8 and 10 Classification: Chapter 9 Classification: Chapter 9 Exploration and classification: Chapters 7 and 9 Classification and combining: Chapters 9 and 10 Visualization and classification: Chapters 6 and 9 Classification: Chapter 9 Visualization and clustering: Chapters 6 and 7 Combining: Chapter 10
Hypercube data. This data set consists of 600 points generated according to a uniform distribution and equally confined in two hypercubes in a 100-dimensional space. The leftmost corner of both hypercubes is set to the origin. The edge lengths of the hypercubes are 0.5 and 1, correspondingly. This meaiis that the first hypercube contains the other one and. in fact, the sampling density in the small hypercube is larger than outside it. The Euclidean distance representation has been considered for these data, which will give an indication of a clear cluster corresponding t o the points of the small hypercube. Due to the coarse sampling of the points outside this hypercube, their distances become relatively larger. Moreover, in such a space, they tend to lie close to the boundary. Note that this is the wellknown effect of the curse-of-dimension. The volume of the small hypercube with respect to the large one is (O.5/1)lo0 zz 7.9. Not surprisingly, the points in the small hypercubc are close. while others are remote. In order to realize that the data points are uniform in both hypercubes. one
Conclusions and open problems
547
Figure E . l Euclidean dissimilarity representations for the ringnorm data (left,) and for the hypercube data (right).
would need, lo1'' sampled points, for instance. This is not feasible, so for any coarse sampling, we should perceive two clusters: one coriipact arid the other spread out. This fact can also be clearly observed while studying the corresponding dissimilarity matrix, see Fig. E.l. This data set is used in Chapter 6 for visualization. Banana data. This data set consists of two banana-shaped classes in a two-dimensional space. It, is mainly used for illustration purposes when a number of different dissimilarity measures is considered. See Fig. E.2 for an illustration.
Figure E.2
Banana data (left) and its Euclidean distance representation (right)
Polygon data. The data consist of two classes of polygons: convex quadrilaterals and irregular heptagons; randomly generated. See Fig. E.3 for some examples. The polygons are first scaled and then the inet>ric Hausdorff distances, Def. 5 . 2 , and non-metric modified Hausdorff distanccs, Dcf. 5 . 3 , are computed between their vertices. In total, 2000 objects per class are available. The intensity plots of the derived dissimilarity representations are presented in Fig. E.4. This data set is used in Chapter 9 for classification.
The dissimilarity representation for p a t t e r n recognition
548
Quadrilaterals
Figure E.3
V A V h D V P
Polygon data: examples of quadrilaterals convex and irregular heptagons.
Hausdorff
Modified-Hausdorff
Figure E.4 Dissimilarity representations for the polygon data.
Convex polygon data. The data consist of convex pentagons and heptagons. For the generation of a polygon, p vertices (5 for pentagons and 7 for heptagons) are first regularly positioned on the unit circle such that the Euclidean distances between two consecutive vertices are equal. Next, twodimensional noise is added to each vertex t o perturb the polygons. Similarly as for the polygon data above, the Hausdorff and modified-Hausdorff distance representations are considered. Some examples are shown in Fig. E.5. The data set is used in Chapter 9 for building zero-error classifiers. Pentagons
Heptagons
0 0 0 Q 0 0 0 0 Figure E.5
Convex polygon data: examples of pentagons and heptagons
Conclusions and open problems
E.2
549
Real-world data sets
Our goal is to show the usefulness of dissimilarity represerhtions for novelty detection and classification problems. To be representative, real data setzs will have various characteristics. There are examples, in which raw data are collected by a sensors and represented in a digitized form by spectra, shapes, or images. There are also cases, in which the original feature-based data are of mixed types or lie in a high-dimensional space.
Ionosphere data. This radar data, coming from UCI Repository [Hettich et al., 19981, was collected by a system of 16 high-frequency antennas with a total transmitted power of about 6.4 kW in Goose Bay in Labrador. The targets were free electrons in the ionosphere. Positive examples are those for which the evidence of the structure in the ionosphere was shown. Negative examples refer to the cases where nothing was returned, thus the signals went through the ionosphere. The received signals are preprocessed by using an autocorrelation function with the argumentasbeing the time of a pulse and the pulse number. For 17 pulse numbers present,, each instance in these data are described by two attributes per pulse number, corresponding to the complex values obtained from the complex electromagnetic signal. Hence, the data are described by 34 features. The positive class consists of 225 examples and the negative class posses 126 examples, yielding 351 examples, in total. This data set is used in Sec. 10.3 for the illustration of the classifier projection space being a spatial representation of classifier diversities in an ensemble of classifiers.
Wine data. The Wine data come from Machine Learning Repository [Hettich et al., 19981 and describe three types of wines described by 13 features. In each experiment, when the data are split into the training and test sets, the features are standardized as they have different ranges. A Euclidean distance is chosen for the representation. Ecoli data. The data come from Machine Learning Repository [Hettich et al., 19981 and describe eight protein localization sites. Since the iiuniber of examples in all these classes is not sufficient for a prototype selection study, three largest localization sites are selected as a sub-problern. These localization classes are: cytoplasm (143 examples), inner membrane without signal sequence (77 examples) and perisplasm (52 examples). Since the features are some type of scores between 0 and 1, they are not normalized. Five numerical attributes are taken into account to derive the and 10.8 distance representations, denoted as Ecoli-pl and Ecoli-pU8, respectively.
550
The dissimilarity representation f o r p a t t e r n recognition
Rcmenibcr that the 1, distance between two vectors xi and xj is computed d,(xi:x,f)= (CTxllziz - ~ j ~ l P ) ~and / p it is metric for p 2 1.
MFEAT data. This data set consists of sets of features derived for handwritten numerals ’0’-’9’ extracted from a collection of Dutch utility maps. 200 patterns per class have been digitized in binary images. The digits are represented by six feature sets, as used in [Jain et al.; 20001. Here two feature sets are used: Fourier describing 76 Fourier coefficients of the character shapes and m,orphological describing six morphological features; see [Hettich et al., 19981. This data set is used in Sec. 10.3for the illustration of the classifier projection space being a spatial representation of classifier diversities in an ensemble of classifiers. Pump vibration data. Pump vibration was measured with three acccleronieters mounted on a subniersible pump which operated in three states: normal, presence of imbalance and presence of bearing failure. Moreover, the bearing failure was measured at three different operating speeds. The data consist of 500 observations with 256 spectral features of the acceleration spectrum (see [Ligteringen et al.. 19971). It is known [Ypma et al.. 19971 that the data has a low intrinsic dimension arid that it probably lies in a nonlinear subspace of a 256-dimensional space. The city block distance representation has been considered for this set, as it can be observed in Fig. E.6. The data are used in Chapter 6 for visualization.
Figure E.6
City block dissimilarity representation for the pump data.
Cat-cortex data. The cat-cortex data set is provided as a 65 x 65 dissimilarity matrix describing the connection strengths between 65 cortical areas of a cat. It was collected by Scannell [Scannell et al., 19951 and used for classification in [Graepel et al., 1999b,a] and for clustering in [Denoeux and Masson. 20041. The data set is obtained from [Deiieux and et al., site]. The dissimilarity values are measured on the ordinal scale and take the following
Conclusions and open problems
551
values: 1 for a strong and dense connection. 2 for an intermediate connection, 3 for a weak connection and 4 for an absent or unreported connection [Graepel et d., 1999aI. Concerning the cortex functions, four regions can be distinguished: auditory (A), frontolimbic (F), somatosensory (S) and visual (V). The class cardinalities are 10, 19, 18 and 18, respectively. The above mentioned classes can be identified in Fig. E.7, left. One may also observe that the classes are not homogeneous arid that there is a confusion between the frontolimbic class and other classes. The dissimilarity data arc highly non-Euclidean. This set is used in Scc. 7.1.2 for the ilhistration of clustering approaches.
Figure E.7 Cat cortex dissimilarity data (left), Visible clusters of blackish rcctangles are presented in the following order: A , F, S, and V. Protein dissimilarity data (right). Visible globin clusters of blackish rectangles are presented in the following order: G, HA, HB, and M. See text for details.
Protein data. The protein data are provided as a 213 x 213 dissiniilarity matrix comparing the protein sequences based on the concept of an evolutionary distance. It was used for classification in [Graepel et ul., 1999a] and for clustering in [Deriocux arid Masson, 20041. The dat,a set is obt,ained from [Dencmx et al., site]. The proteins are originally assigned to four classes of globins: heterogeneous globin (G), hemoglobin-rv (HA), hemoglobin-0 (HB) and myoglobin (M). The class cardinalities are 30, 72, 72 arid 39, respectively. The above mentioned classes can be identified in Fig. E.7 (right), however the globin class is very weak. Not surprisingly, the hemoglobin classes are similar, while the myoglobin class is distinct. One may also observe that the classes are not homogerieous and that there is a confusion between the frontolimbic class and other classes. The dissimilarity data are nearly Euclidean. This set is used in Sec. 7.1.2 for the clustering approaches.
552
T h e dzssimilarity representation for p a t t e r n recognition
Ball-bearing data. Fault detection is an important problem in machine diagnostics. A detection of four types of fault in ball-bearing cages is considered, a data set [Fault data], as used in [Campbell and Bennett, 20001. Each data item consists of 2048 samples of acceleration taken with a Bruel and Kjaer vibration analyzer. After preprocessing with a discrete Fast Fourier Transform, each signal is characterized by 32 attributes. There are five categories: normal behavior, NB, corresponding to measurements made from new ball-bearings and four types of anomalies A1 -A4: the outer race completely broken ( A l ) ,broken cage with one loose element ( A z ) ,damaged cage with four loose elements (As) and a badly worn ball-bearing with no evident damage (A4); see Fig. E.8 for some examples. The data representation is based on Euclidean, city block and 10.8 distances together with their power and sigmoidal transformations. This data set is used in Chapter 8 for training one-class classifiers. Heart disease data. The data come from the UCI Machine Learning Repository [Hettich et al., 19981. The goal is to detect the presence of heart, disease in the patient. There are 303 cases, where 139 correspond to ill patients. This database contains 75 attributes, but all published experiments refer to using a subset of 13 of them, so we use them as well. The attributes are: age, sex ( l / O ) , chest pain type (1 - 4), resting blood pressure, serum cholesterol, fasting blood sugar > 120 mg/dl ( l / O ) , resting electrocardiograph results, maximum heart rate achieved, exercise induced angina (1/0), the slope of the peak exercise ST segment, ST depression induced by exercise relative to rest (1- 3), number of major vessels colored by fluoroscopy (0 - 3 ) and heart condition (3 - normal, 6 - fixed defect, 7 - reversible defect). Hence, the data consist of mixed types: continuous, dichotonious and categorical variables. There are also several missing values. Gower's dissimilarity, as defined in (5.5)) has been chosen for the representation. See also Fig. E.9. This data set is used in Sec. 8.3.3 in the one-class classification problem. Diseased mucosa in oral cavity. The data consist of the autofluorescence spectra acquired from healthy and diseased mucosa in the oral cavity; see [Skurichina and Duin, 2003; de Veld et al., 20031 for details. Autofluorescence spectra were collected from 97 volunteers with no clinically observable lesions of the oral mucosa and 137 patients having lesions in oral cavity. The measurements were taken at 11 different anatomical locations using seven different excitation wavelengths 350, 365, 385, 405, 420, 435 and 450 nni. We will, however, concentrate on the wavelength of
Conclusions and open problems
553
Training data
Testing data Test examples normal behavior
Test examples anomaly T I
Test examples anomaly T2
Figure E.8 Examples of the pre-processed acceleration samples (interpolated by lines) for the ball-bearing data.
Figure E.9
Gower's dissimilarity representation for the heart data.
365nm, since the corresponding spectra have the smallest riuniber of' out,liers. After preprocessing [de Veld et al., 2003]. each spectrum consists of
554
T h e dissimilarity representation for p a t t e r n recognition
Healthy patients
Emission wavelength (nm)
Diseased patients
-
Emission wavelength (nrn)
Figure E. 10 Examples of normalized autofluorescence spectra for healthy (left) and diseased (right) patients.
199 bins (pixels/wavelengths). In total, 857 spectra representing healthy tissue and 112 spectra representing diseased tissue were obtained. Two normalization techniques have been used here: identical area, i.e. the bins are scaled such that their sun1 is 100, or standard normal variate (SNV) transformation where each spectrum is standardized to have a zero mean arid a unit standard deviation; see Fig. E.10 for some examples. A nurriher of dissimilarity measures has been considered for normalized spectra. First, the city block distances between first order Gaussiansmoothed ( 0 = 3 samples) derivatives of the spectra are computed. The zero-crossings of the derivatives indicate the peaks and valleys of the spectra, so they are informative. Moreover, the distances between smoothed derivatives contain some information of the order of bins. In this way, the property of a continuity of a spectrum is somewhat taken into account. Next, a spherical geodesic distance, Def. 3.16, is also considered, also called a spectral angle mapper, since it is popular to measure the similarity between the spectra. The spectra (when properly scaled) can also be treated as histograms-like distributions, which allows us to compare them by divergelice measures; Sec. 5.2.2. This data set is used in Chapters 8 arid 10 for training and combining one-class classifiers. Geophysical spectra. The geophysical spectra data set describes two classes. Both classes are geologically heterogeneous, hence multi-modal. Each class is represented by 500 examples. The objects are described by large wavelength spectra, since (hyper-)spectra are popular in remote sensing [Landgrebe, 20031. Since the data are confidential we cannot provide more details. The spectra are first normalized to a unit area and then two dissimilarity representations are derived. The first one relies on the spectral
C o n c l u s i o n s and o p e n problems
555
angle mapper distance (SAM)[Landgrebe, 20031 defined for the spectra si ~ dSA&I(si, sj) = arccos IIs 6% IIs II (which is in fact a spherical disand s , as 3
,
2
3
2
tance; see Def. 3.16). The second dissimilarity is based on the lil distance between the Gaussian smoothed (with CJ = 2 bins) first order derivatives of the spectra [Paclik and Duin, 2003b,a]. Since by the use of the first derivative, the shape of the spectra is somewhat taken into account, we will refer to this measure as to the shape dissimilarity. Hence, the geophysical data are denoted as GeoSam and GeoShape, respectively. GeoSam
GeoShaae
Figure E. 11 Dissimilarity representations for the geophysical spectra
ProDom. ProDom is a comprehensive set of protein domain families [Corpet et al., 20001. A ProDom subset of 2604 protein domain sequences from the ProDorri set [Corpet et al., 20001 was selected by Roth [Roth et al., 20031. These are chosen based on a high similarity to at least one sequence contained in the first four folds of the SCOP database. The pairwise structural alignments are computed by Roth [Roth et al., 20031. Each SCOP sequence belongs to a group, as labeled by the experts [Murzin et ul., 19951. We use the same four-class problem in our investigations. Originally, a structural similarities sij are derived, from which the dissimilarities are derived as dij = ( s i i + s j j - 2 s i j ) i for i # j . D = ( d i j ) is slightly ~ I I Euclidean and slightly non-metric.
NIST digit data. This data set describes 2000 handwritten digits from the NIST database [Wilson and Garris, 19921, each represented by 128 x 128 binary images; see Fig. E.12 for some examples. Each digit class is represented by 200 examples. Two dissimilarity measures are considered here: Euclidean on the blurred images and modified-Hausdorff, Def. 5.3 on the digit contours. When needed, the images are blurred by the use of the Gaussian function with a standard deviation of 8 pixels. The motivation for such a preprocessing is to avoid sharp edges of the digits and, thereby,
-
556
The dissimilarity representation for pattern recognition
make the distances robust to small tilts or variable thickness. This set is used in Chapter 9 for the classification task.
Figure E.12
Figurc E.13
Examples of the NIST digits, resampled to 16 x 16 pixels
Euclidean
Hamming
Hausdorff
Modified-Hausdorff
Dissimilarity representations for the NIST-38 digit data.
NIST-38 digit data. Within the collection of the NIST digit, a two-class problem is also separately considered. represented by the digits '3' and '8'. Here. each digit class consists of 1000 examples. Four dissimilarity measures are considered: Hamming (Sec. 5 . 3 ) Euclidean on the blurred (Gaussiansmoothed) images, Hausdorff (Def. 5.2) and modified-Hausdorff (Def. 5 . 3 ) on the digit contours. This set is used in Chapter 9 for a simulation of a missing value problem and in Chapter 10 for combining strategies.
Zongker digit data. The data describe the NIST digits [Wilson and Garris, 1992], originally given as 128 x 128 binary images. Here, the similarity measure based on deformable template matching, as defined by Zongker and Jain [Jain and Zongker, 1997], is used. Let $S = (s_{ij})$ denote the similarities. Since the data are slightly asymmetric, the symmetric dissimilarities $D = (d_{ij})$ are computed as $d_{ij} = (s_{ii} + s_{jj} - s_{ij} - s_{ji})^{1/2}$ for $i \neq j$ and $d_{ii} = 0$. Note that the latter can be obtained in a traditional way as well, Theorem 3.18, second item, if the corresponding similarities $s_{ij}$ and $s_{ji}$ are first averaged out. Since the original $S$ and its averaged-out version $S_{avr}$ are not positive definite, $D$ is non-Euclidean. Moreover, $D$ is non-metric, since the triangle inequality does not hold. Since $s_{ij} \in [0,1]$, in some other cases we will also distinguish the dissimilarities derived from the averaged similarities as $D = (1 - S_{avr})^{1/2}$; these are also non-metric. To give an impression of the non-Euclidean aspect of both dissimilarities, an indication is the estimated ratio $|\lambda_{\min}|/\lambda_{\max} \in [0.31, 0.38]$, that is, in the pseudo-Euclidean embedding process, the ratio of the largest-in-magnitude negative eigenvalue to the largest positive one. The overall contribution of negative eigenvalues in terms of the generalized average variance, see Sec. 3.5.4, is about 35%. These numbers imply a significant 'deviation' from Euclidean behavior. This data set is used in Chapter 6 for visualization and in Chapter 9 for discrimination. Dr Douglas Zongker and prof. Anil Jain are acknowledged for providing the template-matching dissimilarities on the NIST digits.
Figure E.14 Non-metric dissimilarity representation for the Zongker data.
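For readers who wish to compute such indicators on their own data, the sketch below derives both quantities from a symmetric dissimilarity matrix via the double-centered Gram matrix used in the pseudo-Euclidean embedding. Note that the negative-eigenvalue contribution is computed here as the negative-eigenvalue mass relative to the total absolute eigenvalue mass, which is one way of reading the generalized average variance of Sec. 3.5.4; the exact formula used in the book may differ.

```python
import numpy as np

def neg_eig_indicators(D):
    """Indicators of non-Euclidean behavior of a symmetric dissimilarity
    matrix D, via the eigenvalues of the double-centered Gram matrix
    G = -0.5 * J * D^2 * J used in the pseudo-Euclidean embedding."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    G = -0.5 * J @ (D ** 2) @ J
    lam = np.linalg.eigvalsh(G)
    # |lambda_min| / lambda_max (zero if there is no negative eigenvalue)
    ratio = abs(min(lam.min(), 0.0)) / lam.max()
    contribution = np.abs(lam[lam < 0]).sum() / np.abs(lam).sum()
    return ratio, contribution
```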
Figure E.15 Examples of the pen-based handwritten digits.
Pen-based handwritten digit data. This data set comes from the UCI Machine Learning Repository [Hettich et al., 1998] and was created by Alpaydin and Alimoglu. They used a pressure-sensitive tablet with an
integrated LCD display and a cordless stylus. Samples handwritten by a number of subjects are described by the x and y coordinates within a 500 x 500 pixel box. Hence, each digit is represented as a sequence of points in a two-dimensional space. First, the data are resampled such that the distances between any consecutive pair of points equal some chosen $\Delta$. Then, from the transformed sequence $s = (x_1, y_1)\ldots(x_m, y_m)$, a string $z = z_1 \ldots z_{m-1}$ is derived such that $z_i$ is the vector pointing from $(x_i, y_i)$ to $(x_{i+1}, y_{i+1})$. Each digit is then represented by a string. The distance between the strings is an edit distance with fixed insertion and deletion costs, $c_{ins} = c_{del}$, and with some substitution cost $c_{sub}$. Two different substitution costs are considered: the angle between the vectors and the Euclidean distance between the vectors. Different definitions of $c_{sub}$ lead to different distance measures, hence different dissimilarity representations, called Pen-angle and Pen-dist, respectively; see also [Bunke et al., 2001, 2002]. Here, we only consider a part of the pen-digits data, consisting of 3488 digit examples originally assigned as the 'test' data on the UCI Repository Web page (actually, all but the first samples of each test class are used). The digits are unevenly represented, with the class cardinalities varying between 334 and 363. Some examples of the original pen-digits data can be seen in Fig. E.15. This data set is used in Chapter 9 for classification. We are grateful to prof. Horst Bunke and Simon Günter for providing the edit-distance data.
Figure E.16 Edit-distance representations for the pen-digit data (Pen-angle and Pen-dist).
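A minimal sketch of such an edit distance between two strings of 2D vectors is given below. It is illustrative only: the insertion/deletion cost defaults to an arbitrary value of 1.0, since the value used in the original experiments is not reproduced here, and the two substitution costs correspond to the Pen-angle and Pen-dist variants.

```python
import numpy as np

def edit_distance(za, zb, subst, c_indel=1.0):
    """Edit distance between two sequences of 2D vectors za and zb, with a
    fixed insertion/deletion cost and a user-supplied substitution cost."""
    n, m = len(za), len(zb)
    d = np.zeros((n + 1, m + 1))
    d[:, 0] = np.arange(n + 1) * c_indel
    d[0, :] = np.arange(m + 1) * c_indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i, j] = min(d[i - 1, j] + c_indel,            # deletion
                          d[i, j - 1] + c_indel,            # insertion
                          d[i - 1, j - 1] + subst(za[i - 1], zb[j - 1]))
    return d[n, m]

def c_angle(u, v):
    """Substitution cost: the angle between the two vectors (Pen-angle)."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def c_dist(u, v):
    """Substitution cost: the Euclidean distance between vectors (Pen-dist)."""
    return float(np.linalg.norm(np.asarray(u) - np.asarray(v)))
```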
Newsgroups data. This is a small subset of the 20Newsgroups data, as considered by Roweis [Newsgroups data: a subset]. The original data set is a collection of approximately 20000 messages, partitioned
(nearly) evenly across 20 different newsgroups. Each newsgroup corresponds to a different topic. Some of the newsgroups are very closely related to each other, while others differ substantially. The full list, partitioned according to subject matter, is: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, misc.forsale, talk.politics.misc, talk.politics.guns, talk.politics.mideast, talk.religion.misc, alt.atheism and soc.religion.christian. The small subset used here consists of all the 'comp.*', 'rec.*', 'sci.*' and 'talk.*' groups combined into four classes. Each message is then described by the occurrences of 100 words across 16242 postings. Hence, the messages are described by occurrence vectors in a 100-dimensional space. The non-metric correlation-based dissimilarity measures $D_{cor}$ and $D_{cor2}$, defined in Table 5.3, are used to construct the News-cor and News-cor2 dissimilarity representations, respectively. Since the occurrence vectors can be treated as describing the event only (a particular keyword has appeared or not), they might also be simplified to binary variables for which some measures can be defined. The Jaccard, Dice, simple matching and Hamming measures were investigated as well; see Table 5.2. However, since the keywords used are not representative, these measures were found to perform very poorly. Therefore, we excluded them from the analysis. This data set is used in Chapter 6 for visualization and in Chapter 7 for illustration of some clustering approaches.
Figure E.17 Dissimilarity representations for the newsgroup data (News-cor and News-cor2).
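As an illustration only, a common correlation-based dissimilarity between two occurrence vectors can be computed as below. The exact forms of $D_{cor}$ and $D_{cor2}$ follow Table 5.3 and are not reproduced here; the sketch uses (1 - r)/2 as one representative variant.

```python
import numpy as np

def corr_dissim(x, y):
    """A correlation-based dissimilarity between two occurrence vectors,
    here taken as (1 - r)/2 with r Pearson's correlation, so the value
    lies in [0, 1]."""
    r = np.corrcoef(x, y)[0, 1]
    return 0.5 * (1.0 - r)
```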
Texture data. These data are created from 23 large images obtained from the MIT Media Lab [Texture data] and used as an illustration for an image
database retrieval problem. Each original image is cut into 16 non-overlapping pieces of 128 x 128 pixels. These represent a single class. Therefore, our database consists of 23 classes and 368 images. These images are mostly homogeneous and represent one type of texture. Each image is described by the responses (in terms of magnitudes) of ten Gabor filters. They are chosen by a backward feature selection from a set of 48 Gabor filters defined by different smoothing, frequency and direction parameters; see also [Lai et al., 2002].
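As an illustration of this type of description (the actual ten filters and the backward selection are not reproduced here), Gabor magnitude responses can be computed along the following lines; the kernel construction and the parameter triples are our own simplification.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(frequency, theta, sigma, size=31):
    """A complex Gabor kernel with the given frequency, orientation theta
    and Gaussian envelope width sigma."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return envelope * np.exp(2j * np.pi * frequency * xr)

def gabor_magnitude_features(image, params):
    """Mean magnitude of the filter responses, one value per
    (frequency, theta, sigma) triple in params."""
    return np.array([np.abs(fftconvolve(image, gabor_kernel(f, t, s),
                                        mode='same')).mean()
                     for (f, t, s) in params])

# e.g. gabor_magnitude_features(img, [(0.1, 0.0, 4.0), (0.2, np.pi / 4, 4.0)])
```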
Bibliography
Agarwala, R., Bafna, V., Farach, M., Paterson, M., and Thorup, M. (1999). On the approximability of numerical taxonomy (fitting distances by tree metrics). SIAM Journal on Computing, 28(3), 1073-1085. Aha, D., Kibler, D., and Albert, M. (1991). Instance-based learning algorithms. Machine Learning, 6, 37-66. Akaike, H. (1973). Information theory and an extension of the maxinium likelihood principle. In International Symposium on Informatzon Theory, pages 267-281. Alpay, D., Dijksma, A., Rovnak, J., and de Snoo, H. (1997). Schur Punctions, Operator Colligations, and Reproducing Kernel Pontryagin Spaces. Birkhauser Verlag, Basel-Boston-Berlin. Anderberg, M. (1973). Cluster Analysis for Applications. Academic Press. New York, NY. Anderson, T. and Bahadur, R. (1962). Classification into two multivariate normal population with different covariance matrices. Annals of Mathematical Statistics, 33,420-431. Arkadiev, A. and Braverman, E. (1964). Teaching a computer pattern, recognition. Nauka. Atkeson, C., Moore, A., and Schaal, S. (1997). Locally weighted learning. A I Review, 11, 11-73. Avesani, P., Blanzieri, E., and Ricci, F. (1999). Advanced rnetrics for classdriven similarity search. In International Workshop on Database and Expert Sgstems Applications, pages 223-227, Italy. Ayad, H., Basir, O., and Karnel, M. (2004). A probabilistic niodel using information theoretic measures for cluster ensembles. In F. Roli, J. Kittler, and T. Windeatt, editors, Multiple Classifier Systems, LNCS, volume 3077, pages 144-153. Ball, K. (1990). Isometric embedding in l,-spaces. European Journal of Combinatorics, 11, 305-311. Banfield, J. and Raftery, A. (1993). Model-based gaussian and non-gaussian clustering. Biometrics, 49, 803 -821. 561
Barnett: V. and Lewis, T. (1994). Outliers in statistical data. New York: Wiley, 3rd edition. Barthdemy, J. and Guknoche, A. (1991). Trees and Proximity Representations. Chichester: Wiley. Bartkowiak, A. (2000). Identifying multivariate outliers by dynamic graphics as applied to some medical data. In M. Ahsanullah and F. F. Yildirim, editors, Applied Statistical Science I V , pages 29-36, New York. Nova Science Publishers. Bartkowiak, A. (2001). A semi-stochastic grand tour for identifying outliers and finding a clean subset. Biometrical Letters, 38(1),11-31. Bartkowiak, A. and Szustalewicz, A. (2000). Outliers finding and classifying which genuine and which spurious. Computational Statistics, 15(l ) , 3-12. Basri, R,.and Jacobs, D. (1997). Constancy and similarity. Computer Vision and Image Un,derstanding, 65(3), 447-449. Basri, R., Costa. L., Geiger, D., and Jacobs, D. (1996). Distance metric bet,wcen 3D models and 2D images for recognition and classification. IEEE Transactions on Pattern Analysis and Machine Intelliqen~e,18(4) 465-470. Basri, R,.> Costa, L., Geiger, D., and Jacobs, D. (1998). Determining the similarity of deformable shapes. Vision Research, 38, 2365-2385. Baulieu, F. (1989). A classification of presence/absence based dissimilarity coefficients. Algebra Universalis, 20, 351-367. Baulieu, F. (1997). Two variant axiom systems for presence/absence based dissiniilarity coefficients. Journml of Classzjkation, 14, 159 -170. Belkin, M. and Niyogi, P. (2002a). Laplacian eigerimaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, volume 14, pages 585-591. The MIT Press. Belkin, M. arid Niyogi, P. (2002b). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373 1396. Bellinan, R.. (1957). Dynamic Programming. Princeton University Press. Belongic, S. and Malik, J. (2000). Matching with shape context. In IEEE Workshop on Content-based Access of Image and Video Libraries, pages 20-26. Belongie, S., Malik, J., and Puzicha, J. (2002). Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(24), 509-522. Bennet>t,C., Gacs, P., Li, M., VitBnyi, P., and Zurek, W. (1998). Infor-
mation distance. IEEE Transactions on Information Theory, IT-44(4) 1407-1423. Bennett, K. and Mangasarian, 0. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1,23-24. Bennett, K. and Mangasarian, 0. (1999). Combining support, vector and mathematical programming methods for induction. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods, Support Vector Learning, pages 307- 326. MIT Press, Cambridge, MA. Berchtold, S., Ertl, B., Keim, D., Kriegel, H.-P., and Seidl, T. (1998). Fast nearest neighbor search in high-dimensional spaces. In Internationml Conference on, Data Engineering, Orlando, Florida. Berg, C., Christensen, J . , and Ressel, P. (1984). Harmonic Analysis on Semigroups. Springer-Verlag. Bertsekas, D. (1995). Nonlinear Programming. Athena Scientific, Belmont, %
MA. Bezdek, J. and Hathaway, R. (2002). VAT: A tool for visual assessment of (cluster) tendency. In International Joint Con,ference o n Neural Networks,pages 2225-2230, Piscataway, NJ. IEEE Press. Bezdek, J . and Pal, N. (1998). Some new indexes of cluster validity. IEEE Transactions on, Systems, Man and Cybernetics, 28(3), 301 -315. Bezdek, J . , Keller, J., Krishnapuram, R., and Pal, N. (1999). Fuzz9 Models and Algorithms ,for Pattern Recognition and Image Processing. Boston. Bialynicki-Birula, A. (1976). Algebra liniowa z geometriq. PWN, Warszawa. Billingsley, P. (1995). Probability and Measure. John Wiley & Sons: New York, 3rd edition. Bilmes, J. (1997). A gentle tutorial on the em algorithm and its application to parameter estimation for gaussian mixture and hidden rnarkov models. Technical Report ICSI-TR-97-021, Signal, Speech, and Language Interpretation Laboratory, University of Washington. Birkholc, A. (1986). Analiza matematyczna. Funkcje iuielu zni,iennych,. PWN, Warszawa. Bishop (1995). Neural Networks for Pattern Recognition. Oxford Universit'y Press. Bishop, C., Svenskn, M., and Williams, C. (1996). GTNI: a principled alternative to the self-organizing map. In C. Von der Illalsburg, C. Voii Seelen, J. Vorbrggen, and B. Sendhoff, editors, International Conjererm on Artificial Neural Networks, pages 165-1 70, Berlin. Springer-Verlag. Bishop, C., Svensh, M., and Williams, C. (1998). Developments of the
generative topographic mapping. Neurocomputing, 21, 203-224. Blumenthal, L. (1936). Remarks concerning the Euclidean four-point property. Ergebnisse eines Math. Koll., 7, 8-10. Blumenthal, L. (1953). Theory and Applications of Distance Geometry. Oxford University Press, Amen House, London. BognBr, J. (1974). Indefinite Inner Product Spaces. Springer-Verlag, Berlin Heidelberg New York. Bonsangue, M., van Breugel, F., and Rutten, J. (1998). Generalized metric spaces: Completion, topology and powerdomains via the Yoneda embedding. Theoretical Computer Science, 193, 1-51. Bookstein, A., Klein, S., and Raita, T. (2001). Fuzzy Hamming distance: A new dissimilarity measure. In A. Amir and G. Landau, editors, CPM 2001: LNCS 2089, pages 86-97. Borg, I. and Groenen, P. (1997). Modern Multidimensional Scaling. Springer-Verlag, New York. Borgefors, G. (1986). Distance transformation in digital images. Compure Vision, Graphics and Image Processing, 34, 344-371. Bourgain, J. (1985). On Lipschitz embedding of finite metric spaces in Hilbert space. Israel Journal of Mathematics, 5 2 , 46-52. Boyd, S. arid Vandenberghe, L. (2003). Convex Optimization. Cambridge University Press. Bradley, A. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 11451159. Bradley, P., Mangasarian, O., and Street, W. (1998). Feature selection via mathematical programming. INFORMS Journal on Computing, 10, 209-217. Breirnan, L. (1996a). Bagging predictors. Machine Learning, 24(2), 123140. Breirnan, L. (1996b). Bias, variance, and arcing classifiers. Technical Report 460, Statistic Department, University of California. Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classzfication and regression trees. Wadsworth & Brooks. Bretagnolle, J., Dacunha Castelle, D., and Krivine, .J. (1966). Lois stables et espaces 1”. Ann. Inst. Henri Poincare‘, I I ( 3 ) , 231-259. Bruske, .J. and Sommer, G. (1997). Topology representing network for int’rinsic dimensionality estimation. In International Conference o n Art(ficia,l Neural Networks, Springer LNCS 1327, pages 595-600. Bnhmann, J. and Hofmann, T. (1994). A maximurn entropy approach to
pairwise data clustering. In International Con,ference on Pattern Recognition, volume 11, pages 207-212, .Jerusalem, Israel. Buhmann, J. and Hofmann, T . (1995). Hierarchical pairwise data clustering by mean-field annealing. In International Conference on Artificial Neural Networks, pages 197-202. Bunke, H. and Sanfeliu, A,, editors (1990). Syntactic and Structural Pattern Recognition Theory and Applications. World Scientific. Bunke, H. and Shearer, K. (1997). On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters, 18(8),689-694. Bunke, H. and Shearer, K. (1998). A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 19(3-4), 255 -. 259. Bunke, H., Gunter, S., and Jiang, X. (2001). Towards bridging the gap between statistical and structural pattern recognition: Two new concepts in graph matching. In International Conference on Advances in Pattern Recognition: Springer LNCS 2013, pages 1-11. Bunke, H., Jiang, X., Abegglen, K., and Kmdel, A. (2002). On the weighted mean of a pair of strings. Pattern Analysis and Applications, 5(1),23-30. Burges, C. (1998). Geometry and invariance in kernel based methods. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods, Support Vector Learning. MIT Press. Campbell, C. and Bennett, K. (2000). A linear programming approach to novelty detection. In Advances in Neural Information Processing Systems, pages 395--401. Cayley, A. (1841). On the theorem in the geometry of position. Cambridge Mathema.tica1 Journal, 11, 267-271. Cech, E. (1966). Topological Spaces. Wiley, London. Cha; S. and Srihari, S. (2000). Distance between histograms of' angular measurements and its application to handwritten character similarity. In International Conference on Pattern Recognition,, volume 2. pages 21 24. Chabrillac, Y. and Crouzeix, J.-P. (1984). Definiteness and semidefiriiteness of quadratic forms revisited. Linear Algebra and its Application,s, 63(4), 283-292. Chan, T. and Goldfarb, L. (1992). Primitive pattern learning. Pattern Recognition, 25(8), 883-889. Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: a library for support vector machines. http: //www. csie .ntu.edu.tw/"cjlin/libsvm. Chaudl-iuri, B. and Rosenfeld, A. (1996). On a metric distance between
fuzzy sets. Pattern Recognition Letters, 17, 1157-1160. Chaudhuri, B. arid Rosenfeld, A. (1999). On a metric distance between fuzzy sets. Information Sciences, 118, 159-171. Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8), 790-799. Chepoi, V. and Fichet, B. (2000). 1,-approximation via subdominants. J. Mathem~aticalPsychology, 44, 600-616. Cherkassky, V. and Mulier, F. (1998). Learning ,from data: Concepts, Theo r y and Methods. John Wiley & Sons, Inc., New York, NY, USA. Cho, D. and Miller, D. (2002). A Low-complexity Multidimensional Scaling Method Based on Clustering. Concept paper. Chow, C. (1970). On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, IT-16(1), 41-46. Chung, K.-L. (2001). A Course in Probability Theory. Academic Press, New York, 3rd edition. Cilibrasi, R. and VitBnyi, P. (2004). Automatic meaning discovery using Google. http://xxx.lanl.gov/abs/cs.CL/0412098. Cilibrasi, R. and VitSnyi, P. (2005). Clustering by compression. IEEE Transactions on Information Theory, 41(4), 1523-1545. Cilibrasi, R,., VitBnyi, P., and de Wolf, R. (2004). Algorithmic clustering of music based on string compression. Computer Music Journal, 28(4), 49-67. Cohen, J. and Farach, M. (1997). Numerical taxonomy on data: Experimental results. Journal of Computational Biology, 4(4). Constantinescu, T. and Gheondea, A. (2001). Representations of Hermitian kernels by means of Krein spaces 11. invariant kernels. Comm.unications in, Mathematical Physics, 216, 409--430. Conway. B. (1990). A Course in Functional Analysis. Undergraduate Texts in Mathematics. Springer-Verlag, 2nd edition. Corpet, F., Servant, F., Gouzy, J., and Kahn, D. (2000). Prodom arid prodom-cg: tools for protein domain analysis and whole genome comparisons. Nucleid Acads Research, 28, 267-269. Costa, L. d. F. arid Cesar, R.M., J. (2001). Shape Analysis and Classification. CRC Press, Boca Raton. Courrieu, P. (2002). Straight monotonic embedding of data sets in Euclidean spaces. Neural Networks, 15, 1185-1196. Cover: T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theorg, 13(1);21-27. Cox. T. and Cox, NI. (1995). Multidimensional Scaling. Chapman & Hall, London.
Cox, T. and Cox, M. (2000). A General Weighted Two-way Dissimilarity Coefficient. Journal of Classification, 17, 101-121. Coxeter, H. (1998). Non-Euclidean Geometry. The Mathematical Association of America, Washington, DC, 6th edition. Cristianini, N. and Shawe-Taylor, J. (2000). Support Vector Machines and other kernel- based learning methods. Cambridge University Press, UK. Critchley, F. and Fichet, B. (1997). On (Super-)Spherical Distance Matrices and Two Results from Schoenberg. Linear Algebra and its Applications, 251, 145-165. Crouzeix, J. and Ferland, J. (1982). Criteria for quasiconvexity and pseudoconvexity: relations and comparisons. Mathematical Programming, 23(2), 193-205. CsiszAr, I. (1967). Information-type measures of divergence of probability distributions and indirect observations. Studia Scientiarium Mathematicarum Hungarica, 2, 299-318. Dasarthy, B., editor (1991). Nearest Neighbor ( N N ) Norms: N N Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, CA. Dasarthy, B. (1994). Minimal consistent set (MCS) identification for optimal nearest neighbor decision systems design. IEEE Transactions on Systems, Man, and Cybernetics, 24(3), 511-517. Day, M. (1944). Convergance, closure and neigbhoorhoods. Duke Mathematical Journal, 11, 181-199. de Carvalho, F. (1994). New Approaches in Classification and Data Analysis, chapter Proximity coefficients between Boolean symbolic objectzs, pages 387-394. Springer-Verlag. de Carvalho, F. (1998). Data Science, Classification and Related Methods, chapter Extension based proximities between constrained Boolean symbolic objects, pages 370-378. Springer-Verlag. de Diego, I., Moguerza, J., and Muiioz, A. (2004). Combining kernel information for support vector classification. In F. Roli, J. Kittler, and T. Windeatt, editors, Multiple Cla,ssifier Systems, LNCS, volume 3077, pages 102-111. de Ridder, D. and Duin, R. (1997). Sammon’s mapping using neural networks: a comparison. Pattern Recognition Letters, 18(11-13). de Ridder, D., Pqkalska, E., and Duin, R. (2002). The economics of classification: Error vs. complexity. In R. Kasturi, D. Laurendeau, and C. Suen, editors, International Conference on Pattern Recognition; volume 3 , pages 244-247, Quebec City, Canada.
de Ridder, D., Duin, R., Egmont-Petersen, M., van Vliet, L., and Verbeek, P. (2003a). Nonlinear image processing using artificial neural networks. Advances in Imaging and Electron Physics, 126, 351--450. de Ridder, D., Kouropteva, O., Okun, O., Pietikainen, M., and Duin, R. (2003b). Supervised locally linear embedding. In Joint International Conferences: ICANN/ICONIP, Lecture Notes in Computer Science, vol. 2714, pages 333-341. de Soete, G. (1984a). Additive tree representations of incomplete dissirnilarity data. Quality and Quantity, 18, 387-393. de Soete, G. (198413). A least squares algorithm for fitting an ultrametric tree t o a dissimilarity matrix. Pattern Recognition Letters, 2, 133-137. de Soete, G. ( 1 9 8 4 ~ ) .Ultrametric tree representations of incomplete dissimilarity data. Journal of Classification, 1, 235-242. de Soete, G. and Caroll, J. (1996). Clustering and Classification, chapter Tree and other Network Models for Representing Dissimilarity Data; pages 157- 198. London: World Scientific. de Veld, D., Skurichina, M., Witjes, M., and et.al. (2003). Autofluorescence characteristics of healthy oral miicosa at different anatomical sites. Lasers in, Surgery and Medicine, 23, 367-376. Debnath, L. and Mikusinski, P. (1990). Introduction to Halbert Spaces with Applications. Academic Press, San Diego. DELVE (Website). Data for evaluating learning in valid experiments. University of Toronto, Department of Computer Science. http://www. cs . toronto.edu/"delve/. Demartines, P.and Hkrault, J. (1997). Curvilinear component annalysis: A self-organizing neural network for nonlinear mapping of data sets. IEEE Transations on, Neural Networks, 8(I ) , 148-154. Denipster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the ern algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1 38. Deneux: T. and et al. (Website). Belief functions and pattern recognition: Matlab software. http ://www .hds .utc .fr/"tdenoeux/sof tware .htm. Denoeux, T.and Masson, M.-H. (2004). EVCLUS: Evidential clustering of proximity data. IEEE Transations on Systems, Man and Cybernetics, 34(1), 95-109. Devijver, P. and Kittler, J. (1982). Pattern recognition: A statistical upproach. Prentice/Hall, London. Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer-Verlag.
Deza, M. and Laurent, M. (1994). Applications of cut polyhedra. Journal of Computational and Applied Math<ema,tics,55(2), 217 - 247. Deza, M. M. and Laurent, M. (1997). Geometry o,f Cuts and metrics. Springer-Verlag. Domeniconi, C., Peng, J., and Gunopulos, D. (2002). Locally adaptive metric nearest-neighbor classification. IEEE Transactions o n Pattern, Analysis and Machine Intelligence, 24(9), 1281-1285. Domingos, P. (2000~~). A unified bias-variaace decomposition and its applications. In International Conference on Machine Learning, pages 231 238. Morgan Kaufmann. Domingos, P. (2000b). A unified bias-variance decomposition for zero-one and squared loss. In International Conference o n Artificial Intelligence, pages 564-569, Austin, Texas. AAAI Press. Domingos, P. and Pazzani, M. (1997). On the optiniality of the simple bayesiaii classifier under zero-one loss. Machine Leurning, 29(2-3), 103130. Donoho, D. and Grimes, C. (2003). Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. In Proceedings of the Natiosnal Academ,y of A r t s and Sciences, volume 100, pages 5591-5596. Dritschel, M. and Rovnyak, J. (1996). Operators on indefinite inner product spaces. Lectures on Operator Theory and its Applications, Fields Institute Monographs, pages 141-232. Dubuisson, M. and Jain, A. (1994). Modified Hausdorff distance for object matching. In International Conference on, Pattern Recognition, volunie 1, pages 566-568. Duch, W . (2000). Similarity based methods: a general framework for classification, approximation and association. Control and Cybernetics, 29(4) ~
937-968.
Duch, W., Naud, A., and Adamczak, R. (1998). A framework for similaritybased methods. In Polish Conference on Theory and Applications of Artificial Intelligence, pages 33-60, L6d’z. Duch, W., Adamczak, R., and Diercksen, G. (2000). Classification, association and pattern completion using neural similarity based methods. Applied Mathematics and Computer Science, 10(4), 101-120. Duda, R., Hart, P., and Stork, D. (2001). Pattern Classification. John Wiley & Sons, Inc., 2nd edition. Duin, R. (1999). Compactness and complexity of pattern recognition problems. In International Symposium o n Pattern Recognition ’In Memoriam Pierre Devijver’, pages 124-128, Royal Military Academy, Brussels.
Duin, R. (2002). The combining classifier: To train or not to train? In International Conference on Pattern Recognition, volume 11, pages 765770, Quebec City, Canada. Duin, R.. and Pqkalska, E. (2001). Complexity of dissimilarity based pattern classes. In Scandinavian Conference on Im,age Analysis, Bergen, Norway. Duin, R. and Pqkalska, E. (2002). Possibilities of zero-error recognition by dissimilarity representations. In J. Iriesta and L. Mico, editors, Pattern Recognition in Information Systems, Allicante, Spain. Dnin, R. and Pqkalska, E. (2005). Object representation, sarnple size and data set complexity. In T. Ho and M. Basu, editors, Data Complexity in Pattern Recognition, page to appear. Springer-Verlag. Duin, R. and Tax, D. (1998). Classifier conditional posterior probabilities. In Advances in Pattern Recognition, LNCS, volume 1451, pages 611-619, Sydney. Joint IAPR International Workshops on SSPR and SPR. Duin, R. and Verveer, P. (1995). An evaluation of intrinsic dimensionality estimators. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(l ) , 81-85. Duin, R., de Ridder, D., and Tax, D. (1997). Experiments with object based discriminant functions; a fea,tureless approach to pattern recognition. Pattern Recognition Letters, 18(11-13), 1159--1166. Duin? R,.? de R.idder, D., and Tax, D. (1998). Featureless pattern classification. Kybernetika, 34(4), 399-404. Duin, R., Pqkalska, E., and de Ridder, D. (1999). Relational Discriminant Analysis. Pattern Recognition Letters, 20(11-13), 1175-1181. Duin, R., R.oli, F., and de Ridder, D. (2002). A note on core research issues for statistical pattern recognition. Pattern Recognition Letters, 23(4), 493--499. Duin, R., Pqkalska, E., Paclik, P., and Tax, D. (2004a). The dissimilarity representation, a basis for domain based pattern recognition? In L. Goltlfarb, editor, Pattern representation and the future of pattern recognition, ICPR 2004 Workshop Proceedings, pages 43-56, Cambridge, United Kingdom. Duin. R., Jiiszczak, P., de Ridder, D., Paclik, P., Pqkalska, E., and Tax, D. (2004b). PR-Tools. Website. http : //prtools . org. Dunford, N. arid Schwarz, J. (1958). Linear operators. Part I: general theory. Interscience Publishers, Inc., New York. Edelman, S. (1999). Representation and Recognition in Vision. MIT Press, C ainbr idge. Edelrnari, S. arid Duvdevani-Bar, S. (1997). Similarity, connectionism, and
the problem of representation in vision. Neural Co,mputation, 9, 701-720. Edelman, S., Cutzu, S., and Duvdevani-Bar, S. (1996). Similarity to reference shapes as a basis for shape representation. Cognitive Science Conference. Edelman, S.; Cutzu, S., and Duvdevani-Bar, S. (1998). Representation is representation of similarities. Behavioral and Brain Sciences, 21, 449498. Effros, E. and Ruan, Z.-J. (2000). Operator Spaces. Clarendon Press, Oxford. Eiben, A.E., S. J. (2003). Introduction to Evolutionary Computing. Springer. Eiter, T. and Mannila, H. (1997). Distance measures for point sets and their computation. Acta Informataca, 34(2), 109-133. Esposito, F., Malerba, D., Tamma, V., Bock, H., and Lisi, F. (2000). Analysis of Symbolic Data, chapter Similarity and Dissimilarity. SpriiigerVerlag. Everitt,, B. and Rabe-Hesketh, S. (1997). The Analysis of Prorimity Data. Arnold, London. Everitt, B., Landau, S., and Leese, M. (2001). Cluster Armlysis. Arnold, London, 4th edition. Evgeniou, T., Pontil, M., and Poggio, T. (2000). Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1), 1-50. Faloutsos, C. and Lin, K.-I. (1995). FastMap: A fast algorithm for indexing, data-mining and visualization of tradditional and multimedia datasets. In A C M SIGMOD, International Conference on Management of Data, pages 163-174, California. Farach, M., Kannan, S., and Warnow, T. (1995). A robust model for finding optimal evolutionary trees. Algorithmica, 13, 155-179. Fault data (Website). h t t p : //www. sidanet . org. Feller, W. (1968). An Introduction to Probability Theory and Its Applications, volume 1. Wiley, New York, 3rd edition. Feller, W. (1971). An Introduction to Probability Theory and Its Applications, volume 2. Wiley, New York, 3rd edition. Fichtenholz, G. (1997). Rachunek rdiniczkowy i catkowy. Panstwowe Wydawnictwo Naukowe, Warszawa. Fiedler, M. (1998). Ultrametric sets in Euclidean point spaces. Journal of Linear Algebra, 3, 23-30. Fischer, B., Thomas Zoller, T., and Buhmann, J. (2001). Path based pair-
wise data clustering with application to texture segmentation. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 235-250.
Fish contours (Website). University of Surrey. http://www.ee.surrey.ac.uk/Personal/F.Mokhtarian/.
Fraley, C. and Raftery, A. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. Computer Journal, (41), 578-588. Fred, A. and Jain, A. (2002a). Data clustering using evidence accumulation. In R. Kasturi, D. Laurendeau, and C. Suen, editors, International Conference on Pattern Recognition, pages 276-280, Quebec City, Canada. Fred, A. and Jain, A. (2002b). Robust data clustering. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 442 -451, Madison - Wisconsin, USA. Fred, A. and Jain, A. (2003). Evidence accumulation clustering based on the k-means algorithm. In Structural, Syntactic, and Statistical Pattern Recognition, LNCS vol. 2396, volume 11, pages 128-133. Springer-Verlag. Fred, A. and Leitao, J. (2003). A new cluster isolation criterion based on dissimilarity increments. IEEE i'l-ansactions on Pattern Analysis and Machine Intelligence, 2 5 ( 8 ) , 944-958. Freeman, H. and Glass, J. (1961). On the encoding of arbitrary geometric configurations. IRE Transactions, EC-10(2), 260-268. Frdicot, C. and Emptoz, H. (1998). A pretopological approach for pattern classification with reject options. In A. Amin, D. Dori, P. Pudil, and H. Freeman, editors, Joint IAPR International Workshops on SSPR and SPR, LNCS, volume 1451, pages 707-715. Springer. Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proc. of the 13th International Conference, pages 148--156. Friedman, J . (1994). Flexible metric nearest neighbor classification. Technical Report 113, Stanford University Statistics Department. Fu, K. (1982). Syntactic Pattern Recognition and Applications. PreticeHall. Fujie, T. and Kojima, M. (1997). Semidefinite programming relaxation for nonconvex quadratic programs. Journal of Global Optimization, 10, 367-380. Fiikunaga, K. (1990). Introduction to Statistical Pattern Recognition. Academic Press. Gaal, S . (1964). Point-Set Topology. Academic Press, New York.
Garrett, P. (2003). Notes on functional analysis. http: //www .math.umn. edu/"garret t /m/ f u n / . Gascuel, 0. (1997). BIONJ: an improved version of the N J algorithm based on a simple model of sequence data. Molecular Biology and Evolution, 14, 685-695. Gascuel, 0 . (2000). Data model and classification by trees: The minimum variance reduction (MVR) method. Journal of Classification, 17, 67-99. Gastl, G. and Hammer, P. (1967). Extended topology. neighboorhoods and convergents. In Colloquium on Convexity 1965, pages 104-1 16, Coperihagen. Kobenhavns Univ. Matematiske Inst. Gavrila, D. (2000). Pedestrian detection from a moving vehicle. In European Conference on Computer Vision, Dublin, Ireland. Gavrila, D. and Philomin, V. (1999). Rcal-time object detection for smart vehicles. In I E E E International Confereme on Computer Vision, Kerkyra. Gdalyahu, Y. and Weinshall, D. (1999). Flexible syntactic matching of curves and its application to automatic hierarchical classification of silhouettes. IEEE Transactions on Pattern Analysis and Machine Iritelligence, 21(12). Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the biaslvariance dilemma. Neural Computation, 4, 1-58. Ghahramani, S. (2000). Fundmentals of Probability. Prctice Hall. Gibbs, A. and Su, F. (2002). On choosing and bounding probability rnetrics. International Statistical Review, 70(3), 419-435. Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455-1480. Gnilka, S. (1994). On extended topologies. i: Closure operators. Commentationes Mathernuticue, 34, 81 ~ ~ ~ 9 4 . Gnilka, S. (1995). On extended topologies. ii: Compactness, quasimetrizability, symmetry. Commentationes Mathematicae, 35,147- 162. Gnilka, S. (1997). On continuity in extended topologies. Annales Societatis Mathematicae Polonae. Seria I. Commentationes Mathematicue, 37, 99~108. Goldberg, D. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Kluwer Academic Publishers, Boston, MA. Goldfarb, L. (1984). A unified approach to pattern recognition. Pattern, Recognition, 17, 575-582. Goldfarb, L. (1985). A new approach to pattern recognition. In L. Kana1 and A. Rosenfeld, editors, Progress in Pattern Recognition, volume 2,
pages 241-402. Elsevier Science Publishers BV. Goldfarb, L. (1990). On the foundations of intelligent processes I. An evolving model for pattern recognition. Pattern Recognition, 23(6), 595616. Goldfarb, L. (1992). What is distance and why do we need the metric model for pattern learning? Pattern Recognition, 25(4), 431-438. Goldfarb, L. and Deshpande, S. (1997). What, is a symbolic measurement process'? In Systems, M a n and Cybernetics, volume 5, pages 4139-4145, Orlando, Florida. Goldfarb, L. and Golubitsky, 0 . (2001). What is a structural measurement process? Technical Report TR01-147, University of New Brunswick, Fredericton, Canada. Goldfarb, L., Abela, J., Bhavsar, V., and Kamat, V. (1992). Transformation systems are more economical and informative class descriptions than formal grammars. In International Conference o n Pattern Recognition, volume 11, pages 660-664, The Netherlands. Goldfarb, L., Abela, J., Bhavsar, V., and Kamat, V. (1995). Can a vector space based learning model discover inductive class generalization in a symbolic environment? Pattern Recognition Letters, 16(7), 719-726. Goldfarb, L., Golubitsky, 0.; and Korkin, D. (2000a). What is a structural representation? Technical Report TR00-137, University of New Brunswick, Fredericton, Canada. Goldfarb, L., Golubitsky, O., and Korkin, D. (2000b). What is a structural representation in chemistry? Towards a unified framework for CADD. Technical Report TR00-138, University of New Brunswick, Fredericton, Canada. Goldfarb, L., Gay, D., Golubitsky, O., and Korkin, D. (2004). What is a struct.ura1 representation? second version. Technical Report TR04- 165, University of New Brunswick, Fredericton, Canada. Goldstone, R. (1994). Similarity, interactive activation, and mapping. Journal of Experimental Psychology, 20, 3-28. Goldstone, R. (1998). Hanging together: A connectionist model of similarity. In J. Grainger and A. Jacobs, editors, Localized Connectionist Approaches t o H u m a n Cognition, pages 283-325. NJ: Lawrence Erlbaum Associates, Mahwah. Goldstone, R. (1999). Similarity. In R. Wilson and F. Keil, editors, MIT encyclopedia o,f the cognitiiue sciences, pages 763-765. MA: MIT Press, Cambridge. Gordon, A. (1996). Clustering and Classification, chapter Hierarchical Clas-
sification, pages 65-122. London: World Scientific. Gould, N. and Toint, P. (2002). Numerical methods for large-scale nonconvex quadratic programming. In A. Siddiqi and h4.Kotcvara, editors, Trends in Industrial and Applied Mathematics, pages 149-179. Kluwer Academic Publishers, Dordrecht, The Netherlands. Gowda, K. and Diday, E. (1991). Unsupervised learning through symbolic clustering. Pattern Recognition Letters, 12, 259-264. Gower, J. (1971). A general coefficient of similarity and some of it,s properties. Biometrics, 27,25-33. Gower, J. (1982). Euclidean distance geometry. Mathematical Scimtist, ( 7 ) ,1-14. Gower, J. (1986). Metric and Euclidean Properties of Dissimilarity Coeffcients. Journal of Classification, 3,5-48. Gower, J. and ROSS,G. (1969). Minimum spanning trees and single linkage cluster analysis. Applied Statistics, 18, 54-64. Graepel, T., Herbrich, R., Bollmann-Sdorra, P., and Obermayer, K. (1999a). Classification on pairwise proximity data. In Advances in Neural Information System Processing 11, pages 438-444. Graepel, T., Herbrich, R., Scholkopf, B., Smola, A., Bartlett, P., Mullcr, K.-R., Obermayer, K., and Williamson, R. (1999h). Classification on proximity data with LP-machines. In International Conference on Artificial Neural Networks, pages 304-309. Grenander, U. (1976). Pattern, synthesis: Lectures in pattern theory, oolume 1. Springer-Verlag. Grenander, U. (1978). Pattern analysis: Lectures in pattern theory, volume 2. Springer-Verlag. Grenander, U. (1981). Regular structures: Lectures in pattern theory, volume 3. Springer-Verlag. Greub, W. (1975). Linear Algebra. Springer-Verlag. Griffiths, A. and Bridge, D. (1997). Towards a theory of optimal similarity measures. In Workshop on Case-Based Reasoning, United Kingdom. Grothcr, P., Candela, G., and Blue, J. (1997). Fast implementatioris of nearest-neighbor classifiers. Pattern Recognition, 30(3). 459 -465. Grunwald, P. (2005). Advances in Minimum Description, Length,: Theory and Applications, chapter Minimum Description Length Tutorial. MIT Press. Gukrin-Dug&, A,, Teissier. P., Delso Gafaro, G., and Herault, J. (1999). Curvilinear component analysis for high-dimensional data representation: 11. examples of additional mapping constraints in specific applications.
In Conference o n Artificial and Natural Neural Networks, LNCS 1607, pages 635-644, Spain. Guo, P., Chen, C., and Lyu, M. (2002). Cluster number selection for a small set of samples using the bayesian ying-yang model. IEEE Transactions on, Neural Networks, 13(3), 757-763. Haasdonk, B. (2005). Feature space interpretation of svms with indefinite kernels. I E E E Transactio,ns on P d t e r n Analysis and Machine Intelligence. 25(5), 482-492. Hadlock, F. arid Hoffman, F. (1978). Manhattan trees. Utilitas Mathematica, 13. 55-67. Hagedoorn, M. arid Vcltkamp, R. (1999a). Reliable and efficient pattern matching using an affine invariant metric. International Journal of Computer Vision, 31(2-3), 103-115. Hagedoorn, M. and Veltkamp, R. (1999b). A robust affine invariant metric on boiiridary patterns. International Journal of Pattern Recognition and Articzfial Intelligence, 13(8), 1151-1164. Halkidi, M., Batistakis, Y., and Vazirgiannis, M. (2001). On clustering validation techniques. Intelligent Information Systems Journal, 17(2-3), 107- 145. Halmos, P. (1974). Measure Theory. Springer-Verlag, New York. Harnpel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1986). Robust Statistics the Approach Based on Influence Functions. John Wiley 8.z Sons, New York. Hand, D. (1997). Construction and Assesrnent of Clmsification Rules. John Wiley & Sons, Chester, England. Hanscn, J. and Heskes, T. (2000). General Bias/Variance decomposition with target independent variance of error functions derived from the exponential family of distributions. In Internmtional Conference on Pattern Recognition, volume 11, pages 207--210, Barcelona, Spain. Hart, P. (1968). The condensed nearest neighbor rule. IEEE Transactions on In,for.mation Theory, 14, 515-516. Hartigan, *J. (1975). Clustering Algorithms. Wiley, New York, NY. Hastie, T. and Stuetzle, W. (1989). Principal curves. Journal ofthe Amer%canStatistical Association,, 84,502-516. Hastie, T. and Tibshirani, R. (1996). Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6), 607-616. Hastie, T., Tihshirani, R,., and Friedman, J. (2001). The Elem,ents of Statistical Learning. Springer Verlag, New York Berlin Heidelberg.
Hathaway, R. and Bezdek, J. (2003). Visual cluster validity (VCV) for prototype generator clustering models. Pattern Recognition Letters, 24(9l o ) , 1563-1569. Heiser, W. (1991). A generalized majorization method for least squares multidimensional scaling of pseudodistances that may be negative. Psychonzetrica, 56, 7--27. Herault, J., Jausions-Picaud, C., and Gukrin-Duguit, A. (1999). Curvilinear component analysis for high dimensional data representation: I. theoretical aspects and practical use in the presence of noise. In Conference on Artaficial and Natural Neural Networks, LNCS 160'7, pages 625-634, Spain. Heskes, T. (1998). Bias/variance decompositions for likelihood-based estimators. Neural Computation, 10, 142551433, Hettich, S., Blake, C., and Merz, C. (1998). UCI repository of Machine Learning databases. http : //www . ics .uci .edu/"mlearn/ MLRepository.htm1. Hjort, P., Lisonkk, P., Markvorsen, S., and Thomassen, C. (1998). Finite metric spaces of strictly negative type. Linear Algebra and Its Applications, 270, 255-273. Ho, T. (1998). The random subspace method for constructing decision forests. IEEE Transactions o n Pattern Analysis arid Machine Intelligence, 20(8), 832-844. Hofmann, T. and Buhmann, J. (1997). Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), 1-14. Hofstadter, D. (1979). Godel, Escher, Bach - an Eternal Golden Braid. Basic Books. Hoppner, F.? Klawonn, F., Kruse, R., and Runkler, T. A. (1999). Fuzzy Cluster Analysis. Chichester, England. Horn, R. and Johnson, C. (1991). Topics in Matrix Analysis. Carribridge University, Oxford. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417-441. Hoyle, D. and Rattray, M. (2003a). Limiting form of the sample covariance matrix eigenspectrum in pca and kernel pca. In S. Thrun, L. Saul, arid B. Scholkopf, editors, Neural Information Processing Systerns confe'ience. Hoyle, D. and Rattray, M. (2003b). Pca learning for sparse high-dimensional data. Europlzysics Letters, 62(1), 117-123. Hoyle, D. and Rattray, M. (2004a). Principal component analysis eigenvalue
spectra from data with symmetry breaking structure. Physical Review, E(69). Hoyle, D. and Rattray, M. (2004b). A statistical mechanics analysis of gram matrix eigenvalue spectra. In J. Shawe-Taylor and Y . Singer, editors, Conference on Learning Theory, volume 3120 of Lecture Notes in Artificial Intelligence, pages 579-593. Huber, P. (1981). Robust Statistics. John Wiley & Sons, New York. Hughes, B. (2004). Trees and ultrametric spaces: a categorical equivalence. Advancesin, Mathematics, 189, 148-191. Hughes, N. and Lowe, D. (2003). Artefactual structure from least squares multidimensional scaling. In Advances in Neural Information Processing Systems (NIPS 2002). MIT Press. Huttenlocher, D.; Klanderman, G., arid William, J. (1993). Comparing images using the Hausdorff distances. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15, 850-863. Ichino, M. and Yaguchi, H. (1994). General Mirikowski metrics for mixed features type data analysis. IEEE Transaction on System, Man and Cybenletics, 24, 698 ~-708. Indyk, P. (2001). Algorithmic applications of low-distortion geometric embeddings. IKIAnnual Symposium on Foundations of Computer Science, pages 10-33, Las Vegas, Nevada. Iohvidov, I., Krein, M., and Langer, H. (1982). Introduction to the Spectral Theoyy of Operators in Spaces with an Indefinite Metric. AkademieVerlag, Berlin. Isomap (Website). Isomap. h t t p : //isornap. stanfo r d . edu/. Jacobs, D., Weinshall, D., and Gdalyahu, Y. (2000). Classifica,tion with Non-Metric Distances: Image Retrieval arid Class Representation. I E E E Transactions on Pattern Analysis and Machine Intelligence, 22(6), 583600. Jain, A. arid Zongker, D. (1997). Feature selection: Evaluation, application, and small sample performance. IEEE Transactions o n Pattern Analysis and Machine Intelligence, 19(2), 153-158. Jain, A,, Murthy, M., and Flynn, P. (1999). Data clustering: A review. ACM Computing Surveys, 31(3),264-323. Jain, A., Duin, R., and Mao, J. (2000). Statistical pattern recognition: A review. IEEE Trmsactions on Pattern Analysis and Machine Intellige71ce, 22(l ) ,4-37. Jain, A. K. and Chandrasekaran, B. (1987). Dimensionality and sample size considerations in pattern recognition practice. In P. R. Krishnaiah and
L. N. Kanal, editors, Handbook of Statistics, volume 2 , pages 835-855. North-Holland, Amsterdam. Jain, A. K. and Dubes, R. C. (1988). Algorithms f o r Clusterin,g Data. Prentice Hall, Englewood Cliffs, NJ. James, G. (2003). Variance and bias for general loss functions. Machine Learning, 51, 115-135. Japkowicz, N., Myers, C., and Gluck, M. (1995). A novelty detection approach t o classification. In International Joint Conference on Art.$cial Intelligence, pages 518-523. Jardine, N. and Sibson, R. (1971). Muthematical Taxonomy. Wiley, London. Jensen, F. (1996). An Introduction t o Bayesian Networks. Springer-Verlag, New York. Jiang, M., Tseng, S., arid Su, C. (2001). Two-phase clustering process for outliers detection. Pattern Recognition Letters, 22(6-7). 691-700. Johnson, W., Lindenstrauss, J., and Schechtman, G. (1987). On Lipschitz embedding of finite metric spaces in low dimensional normed spaces. In J . Lindenstrauss and V. Milman, editors, Geometrical Aspects of F u n c tional Analysis, Lecture Notes, 1267. Springer-Verlag. Juszczak, P. and Duin, R. (2003). Uncertainty sampling for one-class classifiers. In N. Chawla, N. Japkowicz, and A. Kolcz, editors, ICML Workshop: Learning with Imbalanced Data Sets II., pages 81--88. Karzanov, A. (1985). Metrics and undirected graphs. Math,ematical programming, 32, 183-198. Kashyap, R. (1982). Optimal choice of ar and ma parts in autoregressive moving average models. IEEE Transactions on Pattern Anulysis and Machine Intelligence, 4, 99-104. K&1, B. (2002). Intrinsic dimension estimation using packing mimbers. In Neural Information Processing Systems, Vancouver, Canada. Kelley, J . (1975). General Topology. Springer-Verlag, New York. Kelly, J. (1970). Hypermetric spaces. In L. Kelly, editor, The Geometry o,f Metric mid Lirnear Spaces, Lecture Notes in Mathematics, vohnrie 490> pages 17-31. Springer. Kernel Machines (Website). http : //www .kernel-machines .org/. Khalimsky, E. (1987). Topological structures in computer science. Journal of Applied Mathematics and Simulation, 1, 25-40. Khalimsky, E., Koppernian, R., and Meyer, P. (1990). Compnter graphics and connected topologies on finite ordered sets. Topology and its Applications, 36, 1-17.
Kim, J. and Warnow, T. (1999). Tutorial on phylogenetic tree estimation. In Intelligent Systems f o r Molecular Biology, Heidelberg, Germany. Kittler, J., Hatef, M., Duin, R., and Matas, J. (1998). On combining classifiers. I E E E Tmnsactions on Pattern Analysis and M a c h i m In,telligence, 20(3), 226-239. Klein, E . and Thompson, A. (1984). Theory of Correspondences. John Wiley & Sons, New York. Kohonen, T. (2000). Self-organizing maps. Springer-Verlag, 3rd edition. Kong, T., Kopperman, R., and Meyer, P. (1991). A topological approach to digital topology. American Mathematical Monthly, 98, 901-917. Kong, T., Kopperman, R., and Meyer, P. (1992). Special issue on digital topology. Topology and its Applicationq 46. Koppel, M. and Schler, J. (2004). Authorship verification as a one-class classification problem. In International Conference on Machine Learning. Korkin, D. and Goldfarb, L. (2002). Multiple genome rearrangement: a general approach via the evolutionary genome graph. Bioinforrnatics, 18, 303-311. Kotbe, G. (1969). Topological vector spaces I. Springer-Verlag, Berlin, Heidelberg, New York. Krauthgamer, R., Linial, N., and Magen, A. (2004). Metric embeddings beyond one-dimensional distortion. Discrete and Com,putation,al Geometry, 31, 339-356. Kreyszig, E. (1978). Introductory Functional Ananlysis with Applications. John Wiley & Sons, New York. Kreyszig, E. (1991). Diflerential Geometry. Dover, New York. Kruskal, J. (1964). Multidimensional scaling by optimizing goodness of fit to a nonrnetric hypothesis. Psychometrika, 29, 1-27. Kruskal, J . (1977). Multidimensional scaling and other methods for discovering structure. In Statistical methods for digital computers, pages 296-339. John Wiley & Sons, New York. Kriiskal, J. and Wish, M. (1978). Multidimensional scaling. Sage Publications, Newbury Park, CA. Krysicki, W., Bartos, J., Dyczka, W., Krdikowska, K., and Wasilewski, M. (1995). Rach.unek pramdopodobien'stwa a statystyka matematyczna 'w zadu,rriu,ch, cz~$C I i II. Paristwowe Wydawnictwo Naukowe, Warszawa. Kuncheva; L. (2004). Comb,ining Pattern Classifiers. Methods and Algorithms. Wiley. Kundieva, L. arid Whitaker, C. (2002). Using diversity with three variants of boosting: aggressive. conservative and inverse. In J. Kittler and
F. Roli, editors, Multiple Class@er Systems, LNCS, volume 2364, pages 81-90. Springer-Verlag. Kuncheva, L. and Whitaker, C. (2003). Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2), 181 - 207. Kuncheva, L., Skurichina, M., and Duin, R. (2002). An experimental study on diversity for bagging and boosting with linear classifiers. Information Fusion, 3(2), 245-258. Kurcyusz, S. (1982). Matematyczne podstawy Ceorii optymalizacji. PWN, Warszawa. Lai, C., Tax, D., Pckalska, E., Duin, R., and Paclik, P. (2002). On combining one-class classifiers for image database retrieval. In J . Kittler and F. Roli, editors, Multaple Classifier Systems, LNCS, volume 2364, pages 212-221. Springer-Verlag. Lai, C., Tax, D., Duin, R., Pqkalska, E., and Paclik, P. (2004). A study on combining image representations for image classification and retrieval. International Journal of Pattern Recognition and Artificial Intelligence, 18(5), 867--890. Lam, L. (2000). Classifier combinations: implementation and theoretical issues. In J. Kittler and F. Roli, editors, Multiple Classifier Systems, LNCS, volume 1857, pages 78-86. Lance, G. and Williams, W. (1967). A general theory of classificatory sorting strategies. I. hierarchical systems. Computer Journal, (9), 373380. Landgrebe, D. (2003). Signal theory methods in multispectral remote sensing. John Wiley & Sons. Lang, S. (2004). Linear algebra. Springer, 3rd edition. Laub, J. and Miiller, K.-R. (2004). Feature discovery in non-metric pairwise data. Journal of Machine Learning Research, pages 801-818. Law. M. (Website). Manifold learning techniques - a collection of papers. http://www.cse.msu.edu/”lawhiu/manifold/. Lebourgeois, F. and Emptoz, H. (1996). Pretopological approach for supervised learning. In International Conference o’n Pattern Recognition, pages 256-260, Los Alamitos, CA. IAPR, IEEE Computer Society Press. Lee, J., Lendasse, A,,Donckers, N., and Verleysen, M. (2000). A robust nonlinear projection method. In European Symposium on, Artificial N ~ u ral Networks, pages 13-20, Bruges, Belgium. Lee; J., Lendasse, A., and Verleysen, M. (2002). Curvilinear distance analysis versus isornap. In European, Symposium on Artificial Neural Netmorks,
pages 185-192, Bruges, Belgium. Lee, M. (Website). Similarity judgements.
http : //www .psychology. adelaide.edu.au/members/staff/michaellee.html. Lemin, A. (1985). Isometric embeddings of isosceles (non-archirnedean) spaces in Euclidean spaces. Soviet Math. Dokl., 32(3), 740-744. Leon, S. (1998). Linear Algebra with Applications. Prentice Hall, 6th edition. Levenshtein, V. (1966). Binary codes capable of correcting delations, insertions and reversals. Soviet Phys. Dokl., 6, 707-710. Levina, E. and Bickel, P. (2001). The earth mover's distance is the Mallows distance: Some insights from statistics. In International Conference on Computer Vision, Vancouver, Canada. Li, M. and VitBnyi, P. (1997). An Introduction t o Kolmogorov Complexity and Its Applications. Springer-Verlag, New York, 2nd edition. Li, M., Cben, X., Li, X., Ma, B., and VitBnyi, P. (2003). The sirriilarity metric. In A C M - S I A M Symposium o n Discrete Algorithms, pages 863872, Baltimore, Maryland, USA. Ligteringen, R., Duin, R., Frietman, E., and Ypma, A. (1997). Machine diagnostics by neural networks, experimental setup. In P. J. H.E. Bal, H. Corporaal and J. Tonino, editors, Annual Conference of the Advanced School !or Computing and Imaging, pages 185-190, The Netherlands. Lin, D. (1998). An information-theoretic definition of similarity. In Interizational Conference on Machine Learning, pages 296--304. Morgan Kaufmann, San Francisco, CA. Linial, N. (2002). Finite metric spaces - combinatorics, geometry and algorithms. In International Congress of Mathematicians, pages 573-586, Beijing, China. Liu, T.-L. and Geiger, D. (1999). Approximate tree matching and shape similarity. In International Confereme o n Computer Vision, pages 456G 462, Greece. LLE (Website). Locally linear embedding. http ://www. cs . toronto. edu/ "roweis/lle/. Lowe, D.(1995). Similarity metric learning for a variable-kernel classifier. Neural Computation, 7(l ) ,72-85. MacKay, D. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press. MacQiieen, J . (1967). Some methods for classification and analysis of multivariate observations. In 5th Berkeley Symposiurn o n Mathematical Statistics and Probability, pages 281-297, Berkeley, CA.
Malerba, D., Esposito, F., Gioviale, V., and V. Tamma, V. (2001). Comparing dissimilarity measures in symbolic data analysis. In Joint Conferences on ’New Techniques and Technologies f o r Statistcs ’ and )Exchange of Technology and Know-hoTw ’, pages 473-481. Malone, S. and Trosset, M. (2000). A study of the stationary configurations of the sstress criterion for metric multidimensional scaling. Technical Report 00-06, Department of Computational and Applied Mathematics, Rice University. Malone, S., Tarazaga, P., and Trosset, M. (2002). Better initial configurations for metric multidimensional scaling. Computational Statistics and Data Analysis, 41, 143-156. Mangasarian, 0. (1999). Arbitrary-norm separating plane. Operations Research Letters, 24(1-2), 15-23. Manly, B. (1994). Multivariate Statistical Methods. Chapman & Hall, Englewood Cliffs, New Jersey. Manning, C. and Schutze, H. (1999). Foundations o,f Statistical Natural Language Processing. MIT Press, Cambridge, MA. Mao, J. and Jain, A. (1995). Artificial neuraI networks for feature extraction and multivariate data projection. IEEE Transaction,s on Neurul Networks, 6, 296-317. Martinez, T. and Schulten, K. (1994). Topology representing networks. Neural Networks, 7, 507-523. Marzal, A. and Vidal, E. (1993). Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1 5 ( 9 ) , 926-932. Mathworld (Website). h t t p : //mathworld.wolfram. corn. MatouSek, J. (1990). Bi-Lipschitz embeddings into low dimensional Euclidean spaces. Commentationes Mathematicae Universitatis Carolinae, 31(3), 589 600. MatouSek, J. (2002). Lectures on Discrete Geometry. GTM Series. Springer. McLachlan, G. and Basford, K. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York. MCS00 (2000). Multiple Classifier Systems, LNCS. MCS02 (2002). Multiple Classifier Systems, LNCS. Menger, K. (1931). New foundation of Euclidean geometry. American Journal of’ Mathematics, 53,721-745. Metropolis, N. and Ulam, S. (1949). The monte carlo method. Journal oj the American Statistical Association, 44, 335-341. MFEAT (Website). Multiple features database at UCI repository of Ma-
584
Bibliography
chine Learning databases. http: //www .i c s .u c i .edu/$\sim$mlearn/ MLRepository.htm1. Michalewicz, Z. (1999). Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag. Mic6, L. and Oncina, J. (1998). Corriparison of fast nearest neighbour classifiers for handwritten character recognition. Pattern Recognition Letters, 19(3-4), 351-356. Mic6, L., Oncina, J., and Carrasco, R. (1996). A fast branch & bound nearest neighbour classifier in metric spaces. Pattern Recognition Letters, 17(7), 731-739. Mika, S., Ratsch, G., Weston, J., Scholkopf, A. nad Smola, A., and Muller, K.-R. (2003). Learning discriminative and invariant nonlinear features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5), 623--628. MGler, M. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6, 525-533. Moreno-Seco, F., Mi&, L., and Oncina, J. (2003). A modification of the LAESA algorithm for approximated k-nn clasification. Pattern Recognition Letters, 24( 1-3). 47-53. Mori, G., Belongie, S., and Malik, H. (2001). Shape contexts enable efficient retrieval of similar shapes. In Computer Vision and Pattern Recognition, volume 1, pages 723-730. Mottl, V., Dvoenko, S., Seredin, O., Kulikowski, C., and Muchnik, I. (2001a). Featureless pattern recognition in an imaginary Hilbert space and its application to protein fold classification. In International Workshop on Machine Learning and Data Mining in P a t t e m Recognition, pages 322-336, Leipzig. Mottl, V., Dvoenko, S., Seredin, O., Kulikowski, C., and Muchnik, I. (2001b). Featureless regularized recognition of protein fold classes in a hilbert space of pairwise alignment scores as inner products of amino acid sequences. Pattern. Recogn>itionan,d Image Analysis, Advances in Math,ematical Theory and Applications, 11(3), 597-615. Moya, M., Koch, M., and Hostetler, L. (1993). One-class classifier networks for target recognition applications. In World Congress o n Neural Netwosrks, pages 797-801, Portland, OR. International Neural Network Society. Muiioz, A., de Diego, I., and Moguerza, J. (2003). Support vector machine classifiers for asyrnrnet,ric proximit,ies. In In,tern,ntion,al Conference on Artificial Neural Networks, pages 217--224, Istanbul, Turkey.
Bibliography
585
Munkres, J. (2000). Topology. Prentice-Hall, Englewood Cliffs, New Jersey, 2nd edition. Murzin, A., Brenner, S., Hubbard, T., and Chothia, C. (1995). Scop: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247, 536-540. Nadler, M. and Smith, E. (1993). Pattern recognition engineering. John Willey & Sons Inc., New York. Navarro, G. (2001). A guided tour to approximate string matching. A C M computing surveys, 33(1),31-88. Newsgroups data: a subset (Website). http: //www. c s .toronto.edu/ "roweis/data. html.
Ng, A . , Jordan, M., and Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, volume 14. The MIT Press. Noble, B. and Daniel, J. (1988). Applied linear algebra. Prentice-Hall, Englewood Cliffs, New Jersey. Ong, C., Mary, X.and Canu, S., and A.J., S. (2004). Learning with nonpositive kernels. In International Conference on Machine Learning, pages 639-646, Brisbane, Australia. Paclik, P. and Duin, R. (2003a). Classifying spectral data using relational representation. In Spectral Imaging Workshop, Graz, Austria. Paclik, P. and Duin, R. (2003b). Dissimilarity-based classification of spectra: computational issues. Real Time Imaging, 9(4), 237-244. Paredes, R. and Vidal, E. (2000). A class-dependent weighted dissimilarity measure for nearest neighbor classification problems. Pattern Recognition Letters, 21(12), 1027-1036. Parra, L., Deco, G., and Miesbach, S. (1996). Statistical independence and novelty detection with information preserving nonlinear maps. Neural Computation, 8, 260-269. Parzen, E. (1962). On estimation of a probability function and mode. Annals o f Mathematical Statistics, 33(3), 1065-1076. Paulsen, V. (2002). Completely Bounded Maps und Operator Algebras. Cambridge Studies in Advanced Mathematics, 78. Cambridge University Press, Cambridge. Pekalska, E. (2002). Dealing with the data flood. Mining data, text and multimedia, chapter Introduction to Multidimensional Scaling. STT/Beweton, The Hague, The Netherlands. Pqkalska, E. and Duin, R. (2000). Classifiers for dissimilarity-based pattern recognition. In International Conference on Pattern Recognition,
586
Bibliography
volume 2, pa.ges 12-16, Barcelona, Spain. Pekalska, E. and Duin, R.. (2001a). Automatic pattern recognition by similarity representations. Electronic Letters, 37(3), 159-160. Pekalska, E. and Duin, R. (2001b). On combining dissimilarity representations. In J. Kittler and F. Roli, editors, Multiple Classifier Svstems, LNCS, volume 2096, pages 359-368. Springer Verlag. Pekalska, E. and Duin, R. (2002a). Dissimilarity representations allow for building good classifiers. Pattern Recogn,ition Letters, 23(8) 943-956. Pekalska, E. and Duin, R. (2002b). Prototype selection for finding efficient representations of dissimilarity data. In R. Kasturi, D. Laurendeau, and C. Suen, editors, International Conference on Pattern Recognition, volume 3 , pages 37-40, Quebec City, Canada. Pekalska, E. and Duin, R. ( 2 0 0 2 ~ ) Spatial . representation of dissimilarity data via lower-complexity linear and nonlinear mappings. In T. Caelli, A. Amin, R. Duin, M. Kamel, and D. de Ridder, editors, International Workshop on SPR + SSPR, LNCS, volume 2396, pages 470-478. Springer-Verlag. Pekalska, E., Duin, R., Kraaijveld, M., and de Ridder, D. (1998a). An overview of IClultidirnerisioIial Scaling techniques with application to Shell data. Technical Report TN-97-036-1, Pattern R.ecognition Group, Delft University of Technology, The Netherlands. Pekalska, E., Duin, R., Kraaijveld, M., and de Ridder, D. (1998b). Multidimensional Scaling: Applications to Shell data. Technical Report TN-97036-3, Pattern Recognition Group, Delft University of Technology, The Netherlands. Pekalska, E., Duin, R., Kraaijveld, M., and de Ridder, D. (1998~).Multidimensional Scaling: Theoretical Aspects. Technical Report TN-97-036-2, Pattern Recognition Group, Delft University of Technology, The Netherlands. Pekalska, E., de Ridder, D., Duin, R., and Kraaijveld, M. (1999). A new method of generalizing Sammon mapping with application to algorithm speed-up. In Conference of the Advanced School for Cornputing and Imaging, pages 221-228, Heijen, The Netherlands. Pqkalska, E., Skurichina, M., and Duin, R. (2002a). A Discussion on the Classifier Projection Space for Classifier Combining. In J. Kittler and F. R.oli, editors, Multiple Classifier Systems, LNCS, volume 2364, pages 137-148. Springer Verlag. Pekalska, E., Paclik, P., and Duin, R. (2002b). A Generalized Kernel Approach to Dissimilarity Based Classification. Journal of Machine Learni n g Research, 2 ( 2 ), 175-2 11.
Bibliography
587
Pekalska, E., Tax, D., and Duin, R. (2003). One-class LP classifier for dissimilarity representations. In S. T. S. Becker and K. Obermayer, editors, Advances i n Neural Information Processing Systems 15; pages 761-768. MIT Press, Cambridge, MA. Pekalska, E., Skurichina, M., and Duin, R. (2004a). Combining Dissimilarity Representations in One-class Classifier Problems. In F. Roli, J. Kittler, and T. Windeatt, editors, Multiple Classifier Systems, LNCS, volume 3077, pages 122-133. Springer Verlag. Pqkalska, E., Duin, R., Gunter, S., and Bunke, H. (2004b). On not making dissimilarities Euclidean. In T. Caelli, A. Amin, R. Duin, M. Kamel, and D. de Ridder, editors, Joint I A P R International Workshops o'n SSPR and SPR, LNCS. pages 1145-1154. Springer-Verlag. Pqkalska, E., H a d , A., Lai, C., and Duin, R. (2005a). Pairwise selection of features and prototypes. In International Conference on, Computer Recognition Systems, pages 271-278, Rydzyna, Polanad. Pekalska, E., Duin, R., arid Paclik, P. (2005b). Prototype selection for dissimilarity-based classifiers. Pattern Recogn,ition, to appear. Persoon, E. and Fu, K . 3 . (1974). Shape Discrimination Using Fourier Descriptors. In 2nd International Conference on Pattern Recognitiosn, pages 126- 130, Copehagen, Denmark. Pettis, K., Bailey, T., Jain, A., and Dubes, R. (1979). An intrinsic dirnensionality estimator from near-neighbor information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(1),25-37. Pisier, G. (2003). Introduction to Operator Space Theory. London Mathematical Society Lecture Note Series, 294. Cambridge University Press, Cambridge. Press, W., Teukolsky, S., Vetterling, W., and Flannery, B. (1992). N7merzeal Reczpes in C. Cambridge University Press, Cambridge. Pryce, J . (1973). Basic methods of linear functionml analysis. Hiitchinson University Library, London. Puzicha, J., Hofmann, T., and Buhmann, J. (1997). Non-parametric siniilarity measures for unsupervised texture segmentation arid image rctrieval. In IEEE International Conference o n Computer Vision and Pattern Recognition, pages 267-272, San Juan. Puzicha, J., Rubner, Y . , Tomasi, C., and Buhmann, J. (1999a). Empirical evaluation of dissimilarity measures for color and texture. In I E E E International Conference o n Computer Vision, pages 1165-1 173. Puzicl-la, J., Hofmann, T., and Buhmann, J. (199913).A theory of proximity based clustering: Structure detection by optimization. Pattern, Reco,9n.ition, 33(4), 617-634.
588
Bibliography
Pyatkov, S. (2002). Operator Theory. Nonclassical problems. VSP, Uteclit, Boston, Koln, Tokyo. Raftery, A. (Website). Bayesian model selection papers. http: //www. stat. washington.edu/raftery/Research/Bayes/bayes-papers.html%. Ramasubramanian, V.and Paliwal, K. (2000). Fast nearest-neighbor search algorithms based on approximation-elimination search. Pattern Recognition, 33(96), 1497-151. Raudys, S. and Duin, R. (1998). On expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix. Pattern Recognition Letters, 19(5-6), 385-392. Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge University Press. Cambridge. Rish, I. (2001). An empirical study of the naive Bayes classifier. In Workshop on ”Empirical Methods in AI” accompanying In,tern,ational Joint Conference on Artijicial Intelligence. Robert, C. (2001). The Bayesian Choice. Springer-Verlag, New York, 2nd edition. R.obinson’s notes (Website). http : //www .math.nwu. edu/-clark/310/ 2001/metric.pdf. Rodrigues, M. (2001). Invariants for Pattern Recognition and Classification. Series in Machine Perception and Artificial Intelligence, 42. World Scientific Publishing Co. Roth, V., Laub, J., Buhmann, J., and Muller, K.-R. (2003). Going metric: Denoisirig pairwise data. In Advances in Neural Information Processing Systems, pages 841-856. MIT Press. Rousseeuw, P. and Leroy, A. (1987). Robust Regression and Outlier Detection. John Wiley & Sons, New York. Rousseeiiw, P. and van Zomeren, B. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 8 5 , 633639. Rovnyak, J. (1999). Methods of Krein space operator theory, toeplitz lectures given at tel aviv. Roweis, S. and Saul, L. (2000). Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290, 2323-2326. Roweis, S., Saul, L., and Hinton, G. (2002). Global coordination of local linear models. Advances in Neural Information Processing Systems, 14. Rubinstein, Y. and Hastie, T. (1997). Discriminative vs informative learning. In Knowledge Discovery and Data Mining, pages 49-53.
Bibliography
589
Rubner, Y. (1999). Texture Metrics. Ph.D. thesis, Stanford University. Rubner, Y., Tomasi, C., and Guibas, L. (1998a). The earth mover’s distance as a metric for image retrieval. Technical Report STAN-CS-TN-98-86, Department of Computer Science, Stanford University. Rubner, Y., Tomasi, C., and Guibas, L. (199813). A metric for distributions with applications to image databases. In IEBE International Conference on Computer Vision, pages 59-66, Bombay, India. Rudin, W. (1986). Real and Complex Analysis. IVkGraw-Hill, New York, 3rd edition. Rudin, W. (1991). Functional analysis. McGraw Hill, New York, 2nd edition. Sadovnichij, V. (1991). Theory of Operators. Consultants Bureau, New York and London. Saitou, N. and Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4,406-425. Sammon Jr., J. (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18, 401-409. Sanchez, J., Pla, F., and Ferri, F. (1997). Prototype selection for the nearest neighbor rule through proximity graphs. Pattern Recopition Letters, 18, 507-513. Sanchez, J., Pla, F., and Ferri, F. (1998). Improving the k-ncn classification rule through heuristic modifications. Pattern Recognition Letters, 19, 1165-1 170. Santini, S. and Jain, R. (1996). Gabor space and the development of preattentive similarity. In International Conference on Pattern Recogn,ition, Vienna, Austria. Santini, S. and Jain, R. (1997). Image databases are not databases with images. In A. D. Bimbo, editor, International Conference on Image Analysis and Processing, Florence, Italy. Santini, S. and Jain, R. (1999). Similarity measures. IEEE Transactions on, Pattern Analpsis and Machine Intelligence, 21(9), 871-883. Saul, L. (2003). Think globally, fit locally: Unsupervised learning of lowdimensional manifolds. Journal of Machine Learning Research, 4,119--
155. Scannell, J., Blakemore, C., and Young, M. (1995). Analysis of connectivity in the cat cerebral cortex. Journal of Neuroscience, 15, 1463-1483. Schaback, R. (1999). Native hilbert spaces for radial basis furictions i.
590
Bibliography
In M. Buhman, D. Mache, M. Felten, and M. M.W., editors, International Series of Numerical Mathematics, volume 132, pages 255-282. Birkhiiuser-Verlag. Schaback, R. (2000). A unified theory of radial basis functions (native hilbert spaces for radial basis functions ii) ,. Journal of Computational and Applied Mathematics, 121(1-2), 165-177. Schaback, R. and Wendland, H. (2001). Approximation by positive definite kernels. In M. Buhman and D. Mache, editors, Advanved Problems in Constructive Approximation, International Series in Numerical Mathematics, volume 142, pages 203--221. Birkhauser-Verlag. Schenker, A., Last, M., Bunke, H., and Kandel, A. (2003). Comparison of distance measures for graph-based clustering of documents. In Graph Based Representations in Pattern Recognition, LNCS, volume 2726, pages 202-213. Springer. Schoenberg, I. (1935). Remarks to maurice fritchet’s article ’sur la definition axiomatique... d ‘une classe d’espace distancies vectoriellement applicable siir l’espace de hilbert. Annals of Mathematics, 36(3), 724-732. Schoenberg, I. (1937). On certain metric spaces arising from Euclidean spaces by a change of metric and their imbedding in hilbert space. ilnnals of Mathematics, 38,787-797. Schoenberg, I. (1938a). Metric spaces and completely monotone functions. Annals of Mathematics, 39, 811-841. Schoenberg, I. (193813).Metric spaces and positive definite functions. Transactions on American Mathematical Society, 44,522-536. Scholkopf, B. (1997). Support vector learning. Ph.D. thesis, Verlag, Munich. Scholkopf, B. (2000). The kernel trick for distances. In Advances in Neural Information Processing Systems, Vancouver, British Columbia, Canada. Scholkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press, Cambridge. Schiilkopf, B., Sung, K., Burges, C., Girosi, F., Niyogi, P., Poggio, T., and Vapnik, V. (1997a). Comparing support vector machines with gaussian kcrnels to radial basis function classifiers. IEEE Transations on Signal Processing. Scholkopf, B., Smola, A,, and Miiller, K.-R. (199713). Kernel principal component analysis. In International Conference on Artificial Neural Networks. Schijlkopf, B., Smola, A . , and Miiller, K.-R. (1998a). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, lO(5). Scholkopf, B., Simard, P., Smola, A,, and Vapnik, V. (199813). Prior knowl-
Bibliography
591
edge in support vector kernels. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processings Systems, volume 10, pages 640-646, Cambridge,MA. MIT Press. Scholkopf, B., Mika, S., Burges, C., Knirsch, P., Muller, K.-R., Ratsch, G., and Smola, A. (1999a). Input space vs. feature space in kernel-based methods. IEEE Transations on Neural Networks. Scholkopf, B., Smola, A,, and Muller, K.-R. (1999b). Kernel principal coniponent analysis. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods, Support Vector Learning, pages 327-352. MIT Press, Cambridge, MA. Scholkopf, B., Smola, A., Williamson, R., and Bartlett, P. (2000a). New support vector algorithms. Neural Cornputation, 12, 1207-1245. Scholkopf, B., R..C., W., Smola, A . , Shawe-Taylor, J . , and Platt, J. (2000b). Support vector method for novelty detection. In Advances in Neural
Information Processing Systems. Scholkopf, B., Platt, J., Smola, A., and Williamson, R. (2001). Estirnatirig the support of a high-dimensional distribution. Neural Computation, 13, 1443-1471. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464. Scott, D. (2004). Outlier detection and clustering by partial mixture rnodeling. In Computational Statistics, Prague, Czech R.epublic. Sebastian, T. and Kimia, B. (2001). Curves vs skeletons in object recognition. In International Conference on Image Processin,g, Thessaloniki, Greece. Sebastian, T. and Kimia, B. (2003). Curves vs skeletons in object recognition. Signal Processing, t o appear. Sebastian, T., Klein, P., and Kimia, B. (2001). Recognition of shapes by editing shock graphs. In International Conjerence on Computer Vision, pages 755-762. Sebastian, T., Klein, P., and Kimia, B. (2002). Shock-based indexing into large shape databases. I n European Conference on Computer Vision, volume 3 , pages 731-746. Sebastian. T., Klein, P., and Kimia, B. (2003). On aligning curvcs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(I), 116125. Sharvit, D., Chan, J., Tek, H., and Kimia, B. (1998). Symmetry-based indexing of image databases. Journal of Visual Communication and Image Representation, 9(4), 366-380.
592
Bibliography
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge University Press, UK. Shewchuk, J. (1994). An introduction to the conjugate gradient method without the agonizing pain. Technical report, chool of Computer Science, Carnegie Mellon University, Pittsburgh, PA. Sierpinski, W. (1952). General Topology. University of' Toronto Press, Toronto. SIGIR (Website). Sigir, special interest group on information retrieval. http://www.sigir.org/. Simard, P., Le Cun, Y., and Denker, J. (1993). Efficient pattern recognition using a new transformation distance. In Advances in Neural Information Processing Systems, pages 50-58, Canada. Simard, P.; Le Cun, Y., Denker, J., and Victorri, B. (1998). Transformation Invariance in Pattern Recognition - Tangent Distance and Tangent Propagation, volume 1524, pages 239-274. Springer, Heidelberg. Skurichina, M. (2001). Stabilizing Weak Classifiers. Ph.D. thesis, Delft University of Technology, Delft, The Netherlands. Skurichina, M. and Duin, R. (2003). Combining different normalizations in lesion diagnostics. In 0. Kaynak, E. Alpaydin, E. Oja, and L. Xu, editors, Artificial Neural Networks and Information Processing, Supplementary Proceedings ICANN/ICONIP, pages 227-230, Istanbul, Turkey. Srnola, A., Friess, T., and Scholkopf, B. (1999). Semiparametric support vector and linear programming machines. In M. Kearns, S. Solla, and D. Cohn, editors, Advances in Neural Information Processings Systems 11, pages 585-591, Cambridge,MA. MIT Press. Sneath, P. arid Sokal, R. (1973). Numerical Taxonomy. W.H. Freeman, San Francisko, California. Stadler, B. and Stadler, P. (2001a). Basic properties of filter convergence spaces. Technical report, Institute for Theoretical Chemistry and Structural Biology, University of Vienna, Austria. Stadler, B. and Stadler, P. (2001b). Higher separation axioms in generalized closure spaces. Annales Societatis Mathematicae Polonae. Series I. Commen,tationes Mathematicae, submitted. Stadler, B,. and Stadler, P. (2002). Basic properties of closure spaces. Technical report, Institute for Theoretical Chemistry and Structural Biology, University of Vienna, Austria. Stadler, B., Stadler, P., Wagner, G., and Fontana, W. (2001). The topology of the possible: Formal spaces underlying patterns of evolutionary change. Journal of Theoretical Biology, 213(2), 241-274.
Bibliography
593
Stadler, B., Stadler, P., Shpak, M., and Wagner, G. (2002). Recombination spaces, metrics, and pretopologies. Zeitschrifl fur Physikalische Chemie, 216, 217-234. Stanfill, C . and Waltz, D. (1986). Toward memory-based reasoning. Communications of ACM, 29, 1213-1228. Stephen, G. (1998). String Searching Algorithms. World Scientific Publishing Company, 2nd edition. Straws, R. (Website). http: //www. biol. t t u . edu/Strauss/Matlab/ matlab.htm. Strehl, A. (2002). Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining. Ph.D. thesis, University of Texas a.t Austin, USA. Strehl, A. and Ghosh, J. (2002a). Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal on Mach.in,e Learning Research, 3,583-617. Strehl, A. and Ghosh, J. (2002b). Cluster ensembles - a knowledge reuse framework for combining partitionings. In Conference on Artificial Intelligence, pages 93-98, Edmonton. Strehl, A., Ghosh, J., and Mooney, R. (2000). Impact of similarity measures on web-page clustering. In National Conference on Artificial Intelligence: Workshop of Artificial Intelligence for Web Search, pages 58-64, Austin, Texas, USA. AAAI. Struik, D. (1988). Lectures on Classical Diflerential Geometry. Dover, New York. Taneja, I. (1989). On generalized information measures and their applications. In P. Hawkes, editor, Advances in Electronics and Electron Physics, volume 76, pages 327-413. Taneja, I. (1995). New developments in generalized information measures. In P. Hawkes, editor, Advances in Imaging and Electron Physics, volume 91, pages 37-135. Taneja, I. (Websit,e). Generalized information measures and their applications - online book. http://rntm.ufsc.br/"taneja/book/book.html. Tarassenko, L.,Hayton, P., and Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In International I E E Conference on Artificial Neural Networks, volume 409, pages 442-447. Tax, D. (2001). One-class classification. Concept-learning in the absence of counter-examples. Ph.D. thesis, Delft University of Technology, The Net herlands. Tax, D. (2003). DD-Tools, a Matlab toolbox for data description, outlier
594
Bibliogmphy
and novelty detection. Tax, D. and Duin, R. (1999). Support vector domain description. Pattern Reco,qnition Letters, 20( 11-13), 1191-1 199. Tax, D. and Duin, R. (2001). Combining one-class classifiers. In J. Kittler and F. Roli, editors, Multiple Classifier Systems, LNCS, volume 2096, pages 299-308. Springer Verlag. Tax, D. and Duin, R. (2002). Uniform object generation for optimizing one-class classifiers. Journal for Machine Learning Research, 2(2), 155173. Tax, D. and Duin, R. (2004). Support vector data description. Machine Learning, 54( l ) ,45-56. Tax, D. and Miiller, K.-R. (2004). A consistency-based model selection for one-class classification. In Internmtional Conference on Pattern Recognition, volume 2, pages 363 366, Cambridge, UK. Tch, Y. and Roweis, S. (2003). Automatic alignment of local representas in Neural Information Processing Systems, 15.
Teneiibauni, J., de Silva, V., and Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319-2323. Terra, E. and Clarkc, C. (2003). Frequency estimates for statistical word similarity measures. In Human Language Technologg and North American, Ch,apter of Assocadion, of Computational Linguistics Conference, pages 244-25 1. Texture data (Website). ftp://whitechapel .media.mit . edu/pub/ VisTex/.
Thayananthan, A., Stenger, B., Torr, P., and Cipolla, R. (2003). Shape context and chamfer matching in cluttered scenes. In IEEE Conference o n Computer Vision and Pattern Recognition, pages 127-133, Wisconsin. Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the nuniher of clusters in a data set via the gap statistic. Journ,al of the Royal Statisticul Society: Series B (Statistical Methodology), 63(2), 411-423. Tipping, M. (2000). The relevance vector machine. In Aduances in Neural Information Processing Systems, San Mateo, CA. Tipping, M. and Bishop, C. (1999). Mixtures of probabilistic principal cornporient analysers. Neural Computation, 11(2), 443-482. Topchy, A., Miriaei, B., Jain, A., and Punch, W. (2004). Adaptive clustering ensembles. In R. Kasturi, D. Laurendeau, and C. Suen, editors, International Conference on Pattern Recogn,ition, Cambridge,United Kingdom. Torgerson, W. (1967). Theory and Methods of Scaling. John Wiley & Sons.
Bibliography
595
Torsello, A. and Hancock, E. (2003). Computing approximate tree edit distance using relaxation labeling. Pattern Recognition Letters, 24(8), 1089 ~-1097. Trosset, M. and Mathar, R. (2000). Optimal dilations for metric rnultidimensional scaling. In Statistical Com,puting Section, American Statistical Association. Tukey, J. (1960). A survey of sampling from contaminated distributions. In I. Olkin, S. Ghurye, W. Hoeffding, W. Madow. and H. Mann, editors, Contribution,s to Probability and Statistics. Essuys in Honm- of Harold Hotelling, pages 448-485. Stanford University Press, Stanford, CA. Tversky, A. (1977). Features of similarity. Psychological Reiiiew, 84(4) 327-352. van dcr Heiden, F., Duin, R., de Ridder, D., and Tax, D. (2004). Classi,ficution, Parameter Estimation, State Estim,ation: An, Enyineerinq Approach Using MatLab. Wiley, New York. Vapnik, V. (1995). Th,e Nature o j Statistical Learning Theory. Springer Verlag. Vapnik, V. (1998). Statistical Learning Theory. John Wiley & Sons, Inc. Vasconcelos, N. and Kunt, M. (2000). Content-based retrieval from irnage databases: Current solutions and future directions. In In,terrLation,al Conference on, Image Processing, Thessaloniki, Greece. Vasconcelos, N. and Lippnian, A. (2000). A unifying view of image sirnilarity. In In,ternationml Co,nference o n Pattern Recognition, Barcelona, Spain. Veltkamp, R.. (2001). Shape matching: Similarity measures and algorithms. Technical Report UU-CS-2001-03, Utrecht University, the Ncthcrlands. Veltkamp, R. arid Hagedoorn, M. (1999). State-of-the-art in shape inatching. Technical Report UU-CS-1999-27, Utrecht University, the Netherlands. Verma: R. (1991). A Metric Approach to Isolated Word Recog ter’s thesis, Departrncnt of Computer Science, University o Vidal, E., Marzal, A., arid P., A. (1995). Fast computation of normalized edit distances. IEEE Transaction.s on Pattern, Anmlysis arid Machine Intelligence, 17(9), 899 902. VitAnyi, P. (2005). Universal similarity. In I E E E I T S O C Irifor Theory Workshop on Coding and Complezity, Rotorua, New Zealand. von Luxburg, U. and Bousquet, 0. (2003). Distance-based classifica,t,ion with Lipschitz functions. In B. Scholkopf and M. Warrnuth? editors, Annual Conference on Gomp~utationalLeurning Theory, pages 314-328, ~
-
596
Bibliography
Berlin - Heidelberg, Germany. Springer Verlag. von Luxbmg, U. and Bousquet, 0. (2004). Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5 , 669695. Wagner, R. and Fisher, M. (1974). The string-to-string correction problem. Journal of the Association for Computing Machinery, 21(1);168-173. Wahba, G. (1999). Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods, Support Vector Learning, pages 69---88.MIT Press, Cambridge, MA. Watanabe, S. (1974). Pattern Recognition, Human and Mechanical. Academic Press, New York. Webb, A. (1995). Multidimensional scaling by iterative rnajorization using radial basis functions. Pattern Recognition, 28(5), 753-759. Webb, A. (1997). Radial basis functions for exploratory data analysis: An iterative majorisation approach for Minkowski distances based on multidimensional scaling. Journal of Classification, 14(2), 249-268. Webster dictionary (Website). h t t p : //www .m-w.com. Wells; J. and Williams, L. (1975). Embeddings and Extensions in Analysis. Springer-Verlag, Berlin. Werman, M. and Weinshall, D. (1995). Similarity and affine invariant distance between 2D point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8). Wharton, C., Holyoak, K., Downing, P.E., Lange, T., and Wickens, T. (1992). The story with reminding: Memory retrieval is influenced by analogical similarity. In Annual Conference of the Co,qnitive Science Society, pages 588-593, Blomington. Wilks, S. (1962). Mathematical Statistics. John Wiley 8~ Sons, Inc., New York - London. Willard, S. (1970). General Topology. Addison-Wesley Publishing Company. Wilson, C. and Garris, M. (1992). Handprinted character database 3. Technical report, National Institute of Standards and Technology. Wilson, D. and Martinez, T . (1997). Improved heterogeneous distance functions. Journal of Artzficial Intelligence Research, 6, 1-34. Wilson, R. and Martinez, T. (2000). Reduction techniques for instancebased learning algorithms. Machine Learning, 38(3), 257-286. Wood, J. (1996). Invariant pattern-recognition: A review. Pattern Recognition, 29( l ) ,1-17.
Bibliography
597
Younes, L. (1998). Computable elastic distances between shapes. SIAM Journal on Applied Mathematics, 5 8 ( 2 ) ,565-586. Younes, L. (1999). Optimal matching between shapes via elastic deformations. Image and Vision Computing, 17(5-6), 381-389. Young, G. and Householder, A. (1938). Discussion of a set, of points in terms of their mutual distances. Psychometrika, 3,19-22. Ypma, A. and Duin, R. (1998). Support objects for domain approximation. In International Conference on Artificial Neural Networks, pages 719724, Skovde, Sweden. Springer, Berlin. Ypma, A., Ligteringen, R., Frietman, E., and Duin, R. (1997). Recognition of bearing failures using wavelets and neural networks. In British Symposium on Applications of Time-Frequency and Time-Scale Methods, pages 69-72, University of Warwick, Coventry, UK. Zahn, C. and Roskies, R. (1972). Fourier Descriptors for Plane Closed Curves. IEEE Transactions, C-21(3), 269-281. Zavrel, J. (1997). An empirical re-examination of weighted voting for knn. In Belgian-Dutch Conference on Machine Learning, pages 139-148, Tilburg, The Netherlands. Springer, Berlin. Zha, H. and Zhang, Z. (2003). Isometric embedding and continuum isomap. In International Conference on Machine Learning, Washington, DC. Zhu, S. and Yuille, A. (1996). Forms: A flexible object recognition and modeling system. International Journal on Computer Vision, 20(3), 187212.
Index
AUC measure, 339, 456 average linkage, 295
L?, 58, 91 L E , 91
C(Ω), 58 backward triangle inequality, 47 bagging, 478 Banach space, 65 basis dual, 59 Hilbert space, 68, 79 local, 33 pretopological space, 33 pseudo-Euclidean space, 74, 75 vector space, 57 Bayes error, 156 Bayes rule, 152 Bayesian inference, 162 Bessel inequality, 68 bias, 148-151, 157 bias-variance dilemma, 157 bijection, 26 bilinear form, 60, 61 map, 60 binary feature, 216 binary relation, 26 binomial distribution, 531 Blumenthal, 100 boosting, 478 Borel algebra, 527 boundary descriptor, see domain descriptor bounded functional, 65 bounded linear operator, 65 bounded subset, 50
.qfl), 57 J, 78 J-idempotent matrix, 524 J-isometric, 80 J-negative definite matrix, 524 J-normal matrix, 524 J-orthogonal, 76 J-orthogonal matrix, 524 J-positive definite matrix, 524 J-pseudo-inverse matrix, 523 J-symmetric, 76 J-symmetric matrix, 524 J-unitary, 80 J_pq, 74 M(Ω), 58 ℓ1-embeddable, 131 ℓp-distance, 47, 91 ℓ2-distance representation, 210, 404, 413 ℓ_p^m, 91 σ-algebra, 527 k-centers, 298 k-means clustering, 291 k-th central moment, 530 k-th moment, 530 additive distance tree, 97, 282-285 additive tree, 97 adjoint, 68, 80 of a matrix, 519, 523 algebraic dual space, see dual space antispace, 77
Cartesian product, 44 599
600
categorical feature, 216 Cauchy sequence, 49 Krein space, 79 Cauchy-Bunyakovski-Schwarz, see Schwarz inequality centering matrix, 108 centroid linkage, 296 chain code, 246 chain rule, 530 chi-squared distribution, 531 class of objects, 149 classical scaling, 118, 263, 264 classification, 153 dissimilarity space, 180-196 generalized topological space, 175-1 79 pseudo-Euclidean space, 196-205 classifier, 153, 185-196 decision tree) 195 Fisher linear discriminant, 188 Fisher linear discriminant in a pseudo-Euclidean space, 201 generalized nearest mean in a pseudo-Euclidean space, 198 linear programming machine, 194-1 95 logistic, 189 naive Bayes, 189 nearest mean, 188 nearest mean in a pseudo-Euclidean space, 198 nearest neighbor rule, 177, 195 k-NN rule, 177, 384 edited and condensed, 179 weighted, 178 normal density based, 187 linear, 187 quadratic, 187 regularized, 187- 188 Parzen, 195 pseudo-Fisher linear discriminant, 188
Index
quadratic in a pseudo-Euclidean space, 202 relevance vector machine, 193 support vector machine, 190-193, 202-205 weighted nearest mean, 188 classifier projection space, 475 classifier-clustering, 292 cluster ensembles, 294 cluster validity, 292, 302 clustering, 290-295 by neighborhood relations, 295 hierarchical, 290 partitioning, 291 codomain of a function, 25 combined representation, 457-458 combining classifiers, 458-459 combining strategies, 466, 476 compact space, 43 compactness hypothesis, 165, 169, 320, 428 complete linkage, 295 completeness, 49 complexity algorithmic, 163 model, 157, 160 of a classifier, 160, 161, 165 of a function, 161 composition of mappings, 26 concave function, 515 concave transformation, 415 concept, 148 condensing, 179, 397, 400 conditionally negative definite function, 69 matrix, 108, 520 conditionally positive definite function, 69 matrix, 520 connected graph, 96 continuity dissimilarity, 53 generalized metric space, 55 neighborhood spaces, 41 contraction. 92
Index
convergence K r e h space, 79 metric space, 49 neighborhood space, 40 quasimetric space, 53 convex function, 515 set, 62 Courrieu, 101 cover, 42 Cox and Cox’s coefficient, 223 Cramer-Rao inequality, 534 cumulative distribution function, 530 curse of dimensionality, 157 Curvilinear Component Analysis, 274 cut semimetric, 94 decision function, see classifier dendrogram, 290 dense subset, 43, 49 density estimation, 154 density linkage, 296 Dice distance, 219 difference metric, 224 dimension of a vector space, 57 direct product, 102 metric spaces, 102 directed set, 27 disagreement between classifiers, 476 discriminative classifier, 189 disparity, 137 dissimilarity, 2 concave transformation, 347, 415 representation, 166-172 space, 173 dissimilarity measure affinity, 234 between sequences, 234-237 between sets, 238-241 categorical data, 218 dichotomous data, 217 divergence, 229-233 Gower’s, 222 ordinal data, 218 quantitative data, 220 symbolic data, 221
601
Tversky model, 226 dissimilarity representation, 5, 10 distance !,-distance, 47, 91 91 d f , 47 d z , 47 city block, 47 Dice, 219 discrete, 47 Euclidean, 47 geodesic, 130, 271 Hamming, 234 fuzzy Hamming, 235 Hausdorff, 238-241 fuzzy Hausdorff, 241 Jaccard, 219, 440 Levenshtein, 235 max-norm, 47 Minkowski, 47 modified-Hausdorff, 241 simple matching, 219, 440 spherical, 130 Yule, 219, 440 distribution binomial, 531 chi-squared, 531 normal, 531 uniform, 531 divergence X2-divergence, 232 tn-distance, 233 Bhattacharyya coefficient, 232 Chernoff coefficient, 233 Hellinger coefficient, 232 information radius, 231 Kullback-Leibler, 230 variation distance, 233 diversity measure, 453, 474-476 domain, 148 domain descriptor, 335 domain of a function, 25 dual basis, 59 dual space, 59, 62 Banach, 65 normed space, 65
!T,
602
editing, 179, 397, 400 eigendecomposition, 521 eigenvalue, 521 eigenvector, 521 EM algorithm, 292, 535 EM-clustering, 292 embedding, 90-95 !,-embedding, 91 Curvilinear Component Analysis, 274 distorted, 95 Euclidean, 118-120 Isomap, 271 isometric, 91 locally linear, 269 pseudo-Euclidean, 122-1 29 spherical, 131 empirical risk minimization, 156 equivalence class, 26 equivalence relation, 26 error empirical error, 152 true error, 152 Euclidean behavior, 107-111 distance, 47, 116 distance matrix, 107-109 embedding, 118-120 evaluation functional, 61 evidential clustering, 301 expectation, 530 expectation-maximization, see EM algorithm FastMap, 133, 261, 263, 264 feature, 2, 6, 9, 150, 151 types, 216 field, 56 filter, 40 filter basis, 40 Fisher information, 533 four-point property, 96 function, see mapping fundamental decomposition, 73 projection, 78
Index
symmetry, 78 gap statistics, 293 generalized centroid linkage, 296 generalized closure, 35 generalized interior, 35 generalized metric space, 51 generalized topology, 32-44 generalized Ward linkage, 297 Goldfarb, 164 Gower, 105, 112 Gower’s coefficient, 222 Gram matrix, 105, 109, 118, 520 Gram operator, 70 Krein space, 81 graph, 96 connected, 96 internal nodes, 96 leaves, 96 minimum path distance, 96 group, 56 HI-norm, 78 H-scalar product, 78 Hadamard operation, 522 Hadamard product, 522 Hamel basis, 57 Hausdorff distance, 238 space, 40 Heine-Bore1 theorem, 50 Hermitian kernel, 69, 85 Hermitian matrix, 520 Hermitian symmetry, 66, 72, 77 Hilbert space, 67 homeomorphism, 45 homomorphism, 58 hypermetric inequality, 96 hypothesis function, 152 hypothesis space, 152, 153 idempotent matrix, 520 identity matrix, 520, 524 image, 26 image of a linear map, 59 indefinite inner product, 72, 117
Index
indefinite least-square problem, 83, 84 independent events, 529 information- theoretic similarity measure, 237 informative classifier, 189 injection, 26 inner product, 66, 72, 116 internal nodes, 96 int,rinsic dimension, 309 for a Gaussian sample, 312-314 invariance, 242 invariant measure, 104, 242 inverse function, 25 inverse matrix, 519, 523 Isomap, 271 isometry metric spaces, 91 pseudo-Euclidean spaces, 76 isotropic vector, 72 Jaccard distance, 219, 440, 442 Jensen's inequality, 516 joint distribution, 531 kernel, 13 generalized kernel, 208 of a linear map, 58 reproducing kernel, 85, 86 kernel PCA, 272 Krein space, 77 kurtosis, 531 label, 153 learning principle Bayesian inference, 162 inductive, 154 minimum description length, 163 Occam razor, 160 regularization, 161 structural risk minimization, 160 transductive, 155 least square scaling, 137 leaves, 96 Lebesgue integral, 528 Lebesgue measure, 528 left linear map, 61
603
likelihood, 162, 533 limit point, 39 linear combination, 57 independence, 57 map, 58 space, see vector space linear programming machine, 194 linearity, 66, 72, 77 Lipschitz continuous, 92 locally corripact space, 62 locally convex space, 62 locally linear embedding, 269 log-likelihood, 534 loss function, 152 stress, 137-141 true loss, 152 mapping, 25 marginal distribution, 531 matrix 3-idempotent, 524 3-negative definite, 524 3-normal, 524 3-orthogonal, 524 3-positive definite, 524 3-symmetric, 524 conditionally negative definite, 520 conditionally positive definite, 520 Hermitian, 520 idempotent, 520 identity, 520, 524 inverse, 519, 523 negative definite, 520 negative semidefinite, 520 normal, 520 orthogonal, 520 permutation, 520, 524 positive definite, 520 positive semidefinite, 520 projection, 520, 524 pseudo-inverse, 520 pseudoinverse, 523 singular, 520, 524 symmetric, 520 unitary, 520
604
maxiniuni a posteriori estimation, 535 maximum likelihood estimation, 534 measurable function, 528 measurable set, 527 measurable space, 527 measure, 527 measure space, 527 measurement, 149 Mercer theorem, 71 metric, 46 definiteness, 46 reflexivit,y, 46 symmetry, 46 triangle inequality, 46 metric space, 46 Cauchy sequence, 49 completeness, 49 convergence, 49 direct product, 102 natural topology, 48 minimum description length, 163 minimum spanning tree, 283 Minkowski distance, 47 missing value problem, 439 mixture of Gaussians, 342, 539 mode-seeking, 299 model complexity, 157, 160 model selection, 160, 161 MoG, see mixture of Gaussians Moore-Aronszajn theorem, 71 Moore-Penrose pseudo-inverse, 520 multidimensional scaling, 135-144, 257, 261 classical scaling, 118 implementation, 141, 267-268 linear examples, 261-266 missing values, 267 nonlinear examples, 261-266 reduction of complexity, 143 multivariate normal distribution. 531 negative definite function, 69 operator Krein space, 81 negative definite matrix, 520
Index
negative semidefinite matrix, 520 negative type, 97 neighborhood basis, 33 closed set, 35 generalized topology, 33 of a set, 35 open set, 35 system, 33, 36 neutral vector, 72 nominal feature, 217 non-degenerate bilinear form, 60 inner product, 66 norm, 63 non-Euclidean dissimilarity correction, 111, 120-122, 428-430 non-target example, 333 norm, 63 operator, 65 normal distribution, 531 normal matrix, 520 normal space, 41 normed space, 63 Occam razor principle, 160 operator norm, 65 ordinal feature, 216 orthogonal, 67, 72 complement, 68, 72 expansions, 68 Krein space, 79 orthogonal matrix, 520 orthonormal basis, 68, 74, 79 outlier example, 333 parallelogram law, 66 parametric method, 156 Parseval formula, 68 partially ordered set, 26 Parzen density, 343 path distance, 96 path in a graph, 96 path metric, 98 pattern, 1 pattern recognition, 2, 5
Index
statistical, 6, 9 structurad, 6 pattern recognition system, 7 PCA, 119, 541 PCA-dissimilarity space, 300 Pearson correlation, 469 penalized risk, 161 permutation matrix, 520, 524 Plancherel inequality, 68 polarization identity, 66 Pontryagin space, 77 positive definite function, 69 matrix, 520 operator Krein space, 81 positive semidefinite matrix, 108, 520 preimage, 26 premetric, 51 primitive, 6 principal component analysis, 541 probabilistic, 542 probabilistic PCA, 542 mixture, 543 probability conditional, 529 cumulative distribution function, 530 density function, 530 distribution. 529 independence, 529 joint, 529 measure, 529 space, 529 Procrustes analysis, 143 product space, 55 project ion Krein space, 81, 82 Hilbert space, 68 projection matrix, 520, 524 prototype, 168, 383 prototype selection criteria Average Projection Error, 419 EdiCon, 400 FeatSel, 399
605
KCentres, 399, 418 KCentres-LP, 400 Largest Approximation Error, 420 LinProg, 400 MaxProj, 419 Modeseek, 399 NLC-err, 420 Pivots, 420 Random, 398, 418 RandomC, 398 proximity, 10, 12 proximity representation, 168 conceptual, 167 learned, 171-1 72 relative, 167 pseudo-Euclidean distance, 117 embedding, 122-129 generalized variance, 124 space, 73 pseudo-inverse matrix, 520 quantitative feature, 216 quasimetric, 52 random subspace method, 478 random variable, 530 k-th central moment, 530 k-th moment, 530 continuous, 530 discrete, 530 expectation, 530 kurtosis, 531 skewness, 530 variance, 530 random vector, 531 range of a function, 25 raw stress see stress, 137 reconstruction error, 343 regression, 153 regular space, 41 regularization, 161, 187 regularization principle, 161 representation proximity, 168
606
representation set, 168 Riesz representation theorem, 69 right linear map, 61 risk empirical risk, 152 penalized, 161 true risk, 152 robust measure, 242 robust statistics, 334 ROC curve, 339 Sammon mapping, 138-144, 261, 263, 264 Sammon stress, 139, 141 Sammon-Isomap, 272 sampling criteria compactness, 325 correlation, 324 intrinsic embedded dimension, 324 mean relative rank, 321 PCA dimension, 322 skewness, 321 Schoenherg, 93, 131. Schur product, 522 Schwarz inequality, 67, 79 second dual space, 61 self-adjoint, 68 semimetric, 52 seminorm, 63 serninormed space, 63 sesquilinearity, 66, 72, 77 set closed, 35, 38 generalized closure, 35 generalized interior, 35 neighborhood system, 33 open, 35, 38 partial order, 26 total order, 27 similarity matrix, 111 simple matching, 442 simple matching distance, 219, 440 single linkage, 295 singular matrix, 520, 524 skewness, 530
Index
SOM, 274 space Krein , 77 Banach, 65 continuous dual, 62 dissimilarity, 173 dual, 59 generalized metric, 46 generalized topological, 32 Hausdorff, 40 Hilbert, 67 hollow, 51 indefinite inner product, 72 inner product, 66 linear, 57 metric, 46 normal, 41 normed, 63 Pontryagin, 77 pre-Hilbert, 67 premetric, 51 pseudo-Euclidean, 73 quasimetric, 52 regular, 41 reproducing kernel, 69-71, 85--87 Krein space, 86 Hilbert space, 69 Pontryagin space, 86 second dual, 61 semimetric, 52 seminormed, 63 topological, 32 compact, 43 ultrametric, 51 vector space, 56-62 sparse average selection, 300 spatial represent at ion, 132-1 44 Spearman rank correlation, 469 spectral clustering, 301 statistical learning, 151-163 stress, 136 least square scaling, 137, 140 raw stress, 137 Sammon mapping, 139, 141 strong triangle inequality, 51 structural risk minimization, 160
subspace, 57 Krein regular, 81 degenerate, 72 isotropic, 72 orthogonal, 68, 72 positive, 81 uniformly positive, 81 support vector machine, 190, 193 surjection, 26 SVM, 190 symbolic feature, 217 symmetric matrix, 520 target, 333 topological vector space, 62 topological product space, 44 topology, 32-44 metric space, 48 totally bounded subset, 50 totally ordered set, 27 training set, 151-153 transformation (semi)metric, 99 concave, 99, 347, 415 tree additive, 97 additive distance, 97, 282 minimum spanning, 283 model, 95, 281 path metric, 97 ultrametric, 97 ultrametric distance, 98, 283 triangle inequality, 46 true representation hypothesis, 165, 320 Tversky model, 226 ultrametric distance tree, 98, 282-285 ultrametric inequality, 51, 96 ultrametric space, 51 ultrametric tree, 97 uniform distribution, 531 unitary matrix, 520 variance, 530
VC dimension, 157, 160, 165 vector isotropic, 72 neutral, 72 orthogonal, 68, 72 vector representation, 109 vector space, 57 visual cluster validity, 302 Ward's linkage, 296 weak classifier, 458, 478 well-ordered set, 27
Yule distance, 219, 440, 442 zero-error classifier, 444