where p is a classical context-free production rule and a is an element of L. Intuitively the meaning is "apply rule p when you know a". • Γ is a closure operator on a lattice L whose elements we call worlds. We call a model for the grammar G a world ω ∈ L such that Γ(ω) = ω. Also, we say that the information a in G is complete w.r.t. a world ω provided that Γ(a) = ω. Definition 1 separates, within pattern recognition, parsing and pattern detection (on which a good amount of research has been done) from the problem of representation of knowledge.
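As a small illustration of this closure-operator reading (our sketch, not part of the original formulation), the Python fragment below iterates a monotone operator Γ on the powerset lattice of a finite set of facts until it reaches a fixpoint; the rules and facts are invented for the example.

```python
# Hypothetical illustration: a closure operator on the powerset lattice of facts.
# A world is a set of facts; gamma applies grammar-like rules "if you know a, add b"
# once, and closure iterates it. A model is a fixpoint: gamma(world) == world.

RULES = [({"dodec(c)", "dodec(d)"}, "same_shape(c,d)"),   # invented rule
         ({"small(c)"}, "not_large(c)")]                   # invented rule

def gamma(world):
    """One application of every rule whose premises are already known."""
    out = set(world)
    for premises, conclusion in RULES:
        if premises <= out:
            out.add(conclusion)
    return out

def closure(world):
    """Iterate gamma up to its least fixpoint (the closure of the world)."""
    current = set(world)
    while True:
        nxt = gamma(current)
        if nxt == current:          # gamma(w) == w: a model
            return current
        current = nxt

print(closure({"dodec(c)", "dodec(d)", "small(c)"}))
```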
Figure 4. A fragment of a proof in Hyperproof at Stanford University (panels labelled Proof, Example and Solution, with steps marked Given, Assume, Observe and Exhaustive, establishing SameShape(c, d) from Dodec(c) and Dodec(d)).
Hyperproof (depicted in Figure 4) is an example of an information system in which the pattern rules are mixed with the (partial) knowledge of the world. This is achieved by using first order logic as a unifying layer. In this example we establish that c and d have the same shape. The other approach to pattern recognition in GIS is data mining. Data mining techniques are based on a cluster of discriminating functions (F_i), i ∈ [1, n]. An element x, represented as a vector of features, is recognized as belonging to class i provided that F_i(x) > F_j(x) for any j ≠ i. In general each class is characterized by a training set T_i composed of a collection of prototypical elements. From the cognitive point of view this approach is rather poor, since pattern recognition is only a low-level process. As a matter of fact, for us humans it is not difficult to formulate new concepts from simpler components. For example, a city is an aggregate of houses. This is
rather difficult to model with techniques of data mining, because an aggregation of prototypes is not always the prototype of an aggregate. Conceptual spaces originated from the works of Gardenfors (cfr. [5]). A conceptual space is a geometrical model of concept formation. According to Gardenfors' original idea, a conceptual space is a tuple C = <D_i, ≤_i, d_i> where
• D_i is a set of feature values called a domain;
• ≤_i is a complete partial order over D_i;
• d_i is a distance function over D_i.
Additionally, ≤_i satisfies a set of "closure" axioms. Among the many versions of connectedness we chose the one of Aiello and van Benthem ([1]), which introduces the notion of relative nearness.
i) (∀x)(∀y)(∀z)(∀u) (N(x,y,z) ∧ N(x,z,u) → N(x,y,u)) (transitivity)
ii) (∀x)(∀y) ¬N(x,y,y) (irreflexivity)
iii) (∀x)(∀y)(∀z)(∀u) (N(x,y,z) → (N(x,y,u) ∨ N(x,u,z))) (almost-connectedness)
Relative nearness can introduce the Euclidean relation of betweenness: Btw(x,y,z) ⟺ (∀x') ¬(N(x,x',y) ∧ N(z,x',y)).
The least integer n such that (x(1), ..., x(n)) = (y(1), ..., y(n)) ⇒ x = y is called the dimension of the conceptual space. A sequence (x_i) of feature values is called a point of C. A typical conceptual space is the CMY/RGB model of representation of colours. Geometry lets conceptual spaces inherit some interesting closure properties that data mining denies. For example, the composition of two concepts C and C' is the product C ⊗ C'. Given a family of manifolds (M_i), a conceptual space is the convex hull of the product

C = c(∏_i M_i).
Voronoi's tessellation. To avoid that the partition cells are univocally determined by the prototypes, a slight variant of the Voronoi tessellation is considered in [6]. Such a model is limited because Euclidean geometry, made of lines and points, is not a good model for grasping the adaptability of concepts. We are now proposing a model of conceptual space based on NURBS. A NURBS surface is a combination of rational polynomials according to the formula

S(u, v) = Σ_{i=0}^{n} Σ_{j=0}^{m} N_{i,p}(u) N_{j,q}(v) Pw_{i,j}

where the N_{i,p}(u), N_{j,q}(v) are a set of polynomials called basis functions and the Pw_{i,j} form a matrix of weighted control points. The formula above is a NURBS in the projective space P(R^3). It can be generalized to P(R^N):

C_1(u_1) = Σ_{i(1)=0}^{n(1)} N_{i(1),p(1)}(u_1) Pw_{i(1),i(2),...,i(N)}
C_2(u_2) = Σ_{i(2)=0}^{n(2)} N_{i(2),p(2)}(u_2) Pw_{i(1),i(2),...,i(N)}
...
C_N(u_N) = Σ_{i(N)=0}^{n(N)} N_{i(N),p(N)}(u_N) Pw_{i(1),i(2),...,i(N)}

where

Pw_{i(1),i(2),...,i(N)} = ( w_{i(1),...,i(N)} · x_{i(1),...,i(N)} , w_{i(1),...,i(N)} ).

The Pw_{i(1),i(2),...,i(N)} define an N×N matrix of control points that, in our case, correspond to the set of prototypes. To each point there corresponds a weight in the matrix (w_{i(1),...,i(N)}). We call conceptual space the convex hull generated by the products of the NURBS.
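To make the NURBS machinery concrete, the following Python sketch (ours, not from the paper) evaluates a single NURBS curve C(u) from weighted control points, i.e. prototypes, using the Cox-de Boor recursion for the basis functions N_{i,p}; the degree, knot vector and numeric values are illustrative assumptions.

```python
# Hypothetical sketch: evaluate a NURBS curve from weighted control points
# (prototypes). Degree, knots and data are illustrative, not from the paper.

def basis(i, p, u, knots):
    """Cox-de Boor recursion for the B-spline basis function N_{i,p}(u)."""
    if p == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = right = 0.0
    den1 = knots[i + p] - knots[i]
    den2 = knots[i + p + 1] - knots[i + 1]
    if den1 > 0:
        left = (u - knots[i]) / den1 * basis(i, p - 1, u, knots)
    if den2 > 0:
        right = (knots[i + p + 1] - u) / den2 * basis(i + 1, p - 1, u, knots)
    return left + right

def nurbs_point(u, ctrl, weights, knots, p=2):
    """C(u) = sum_i N_{i,p}(u) * (w_i * x_i, w_i), projected back from homogeneous form."""
    num = den = 0.0
    for i, (x, w) in enumerate(zip(ctrl, weights)):
        n = basis(i, p, u, knots)
        num += n * w * x
        den += n * w
    return num / den if den else 0.0

ctrl = [0.0, 0.3, 0.7, 1.0]          # prototype values on one feature axis
weights = [1.0, 2.0, 2.0, 1.0]       # their weights
knots = [0, 0, 0, 0.5, 1, 1, 1]      # clamped knot vector for degree 2
print([round(nurbs_point(u, ctrl, weights, knots), 3) for u in (0.1, 0.5, 0.9)])
```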
Acknowledgments We are indebted to Ken Kahn, the author of Pictorial Janus, for keeping us in touch with the state of the art of visual languages, and to professor Giuliano Antoniol of the research centre of excellence RCOST for providing us with some relevant insights on geographical information systems.
References 1. M. Aiello M. and J. Benthem, A Modal Walk through Space vol 12: 34 Journal of Applied Non-Classical Logics Hermes.(2002). 2. J.Barwise and J. Etchemendy Hyperproof, Cambridge University Press , (2002). 3. R.Cordeschi The discovery of the artificial Kluwer Academics, (2001). 4. C.Crimi, A.Guercio G.Nota, G. Pacini, G. Tortora, and M. Tucci,. Relation Grammars and their Application to Multi-dimensional Languages. Journal of Visual Languages and Computing 2(4): 333-346.(1991). 5. P. Gardenfors Conceptual Spaces: the Geometry of Thought. MIT Press, Cambridge, MA, (2000). 6. P. Gardenfors "A geometric model of concept formation", Information Modelling and Knowledge Bases III, ed. by S. Ohsuga et al., IOS Press, Amsterdam, 1-16, (1992). 7. G. Gerla Fuzzy Logic, Mathematical Tools for Approximate Reasoning. Kluwer Academics, (2000). 8. R. Kremer Constraint Graphs: A Concept Map Meta-Language (PhD Dissertation -University of Calgary , (1993). 9. G. Yossi, Taylor J. and Kent S. "Projections in Venn-Euler Diagrams", IEEE International Symposium on Visual Languages., (2000). 10. R.Wille, "Concept lattices and conceptual knowledge systems". Computers and mathematics with applications, 23, 493-522.
A SIMPLE FUZZY EXTENSION TO THE SEARCH OF DOCUMENTS ON THE WEB
LUIGI DI LASCIO, ENRICO FISCHETTI, ANTONIO GISOLFI, ANIELLO NAPPI and ANTONIO SANTANGELO
Dipartimento di Informatica e Matematica, Universita di Salerno, via Ponte don Melillo, 84084 Fisciano, Italia. E-mail: {dilascio, ef, gisolfi, anappi, asantangelo}@unisa.it
The growth of documents on the web can show all its positive sides only if, in the search operation, a user can easily find what he needs (structure, format and contents). A solution is to give semantic meanings to web documents, trying to interpret the users' wishes, possibly expressed through a language well known to the user. In this paper we introduce a linguistic variant of standard meta-data types. Using type 2 fuzzy sets, we associate a linguistic value with each piece of meta-information, then represent a user profile and introduce a matching system to select the documents most compatible with the profile. This method can be considered as an extension of the Boolean search on the web, where the results can be classified and ordered according to users' needs.
1. Introduction
The amount of information on the Web is growing at an impressive speed and involves every knowledge sector; this huge repository of information contains documents extremely different in terms of structure, format and contents. If this aspect is very positive for the completeness of knowledge, it may create big problems when data are to be retrieved. The current Web structure, in fact, makes it impossible to adapt the search to users' needs and obliges them to long, unproductive sessions of document selection [3]. The solution is to develop a semantically structured web, even diversified in its concepts [2, 13], as suggested by the most important standardization organizations such as W3C and IEEE [10, 11, 12]. Their common objective is to add formal semantics through the use of a standard metadata structure realized with the XML language, identified as the most valid instrument for this goal.
This paper presents an approach to the above mentioned problem with two relevant additional features: the use of the natural language to characterize a document and the definition of a correspondence between the user profile and the document he/she is looking for. Our model is a simple fuzzy extension to the Boolean search model. Through the use of type 2 fuzzy sets and a linguistic variable (lv, for short), we can give much more expressivity to the documents' metadata, which become featured by linguistic terms that reflect the imprecision of a characterization. With these fuzzy metadata, we can express user preferences, and we can select and organize the results of a search on the basis of their compatibility with the user profile. This paper is organized as follows: in section 2 we give the formal definition of document (par. 2.1) and user profile (par. 2.2), then we specify the use of the attributes (par. 2.3) and present a brief description of a linguistic approximation algorithm and a similarity index for the attributes (par. 2.4). Then we present in section 3 the selection method of documents, and finally a case study is discussed (sec. 4). 2.
Attribute Strings and User profiles representation
As said before, our approach can be viewed as a sort of extension to the traditional search of documents on the web, featured by the use of a matching between the meta-data, assigned to each page, and the profile of a user. In order to fully achieve our aim of a true semantic search, we use a linguistic variable, a set of linguistic attributes and terms on it and a set of corresponding triangular fuzzy numbers. In this way all the features of a page can be linguistically expressed by the author of the page, and the end user can fill in easily his linguistic profile. Through a suitable choice of preferences applied to the linguistic variable Interest, it's possible both to assign a string describing adequately the contents of a web page and to fully represent the user profile. The author of a web page, in fact, can simply use a string of representation to linguistically express the contents, the level of widening involved, the degree of satisfaction of the page with respect to some user preferences. So a web page can be viewed as a couple (w, s2), i.e. a page and his contents representation. On the other side, the end user of the information can express his interests to the search engine by answering with linguistic terms to a short questionnaire, obtaining a string A(w) that represent his profile.
2.1. Formal definition of a document
We define a generic Web Document as a couple (w, s2), where w represents the contents and s2 is a type-2 fuzzy set [4, 7]. Our idea is to use these fuzzy sets in order to linguistically describe the features of the documents. So, the basis of this new meta-data attribution is the use of linguistic variables [14] to give more expressivity and to add more information to the contents. Let us consider the following elements:
P = {p1, ..., pn}, a finite crisp set of features, as for example contents area, addressed people, length, form, difficulty and so on;
Π(P) = {sp1, ..., spk}, a set of classical partitions on P;
W: a set of documents on the web;
Tr = {α1, ..., αm}, a set of totally ordered triangular fuzzy numbers on [0, 1];
Vi: the linguistic variable "Interest";
T(Vi) = {λ1, ..., λm}, a set of linguistic terms on Vi;
Vc: the linguistic variable "Compatibility";
T(Vc) = {γ1, ..., γn}, a set of linguistic terms on Vc;
M: a semantic rule that assigns to each linguistic term t ∈ T(Vi) ∪ T(Vc) its meaning, i.e., M: T(Vi) ∪ T(Vc) → Tr.
Using all these elements, we can give the following meta-data definition.
Definition: Given (w, s2), its meta-data is

s2(w) = Σi αi / spi.

The fuzzy linguistic description of s2(w) is given by

FLD[s2(w)] = Σi λi / spi,   where M(λi) = αi.

Now we can express our web document as the couple (w, FLD[s2(w)]) or, for the sake of simplicity, as (w, FLD[w]). It represents a general definition of our meta-data type, which does not set any limitation or restriction on the usable features, as happens with XML; however, in this paper, we use a set P organized in classes that form a subset of the IEEE LOM basic metadata structure [10]: General / Technical / Educational / Annotations / Classification.
Example 1: Given the document w3 = English_Medieval_Poetry.pdf, the author can classify it using the following linguistic elements:

Table 1. A possible choice of linguistic terms and triangular fuzzy numbers associated with the linguistic variable Interest.
Very Interested (vi)          [0.8, 1, 1]
Interested (i)                [0.6, 0.8, 1]
Fairly Interested (fi)        [0.4, 0.6, 0.8]
Sufficiently Interested (si)  [0.2, 0.4, 0.6]
Little Interested (li)        [0.0, 0.2, 0.4]
Not Interested (ni)           [0.0, 0.0, 0.2]
We have FLD(w3) = vi / {Language, Poetry, Researcher, Medium length } + i / { English , History, University Student, Theoretical presentation } + fi / { Reflector presentation, Political, Short} + li / { Scholar, Long} + ni /{Scientific}.
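A minimal sketch (ours) of how the Table 1 mapping M and the description FLD(w3) of Example 1 could be held in code; the dictionary layout is an illustrative assumption, not part of the paper.

```python
# Triangular fuzzy numbers for the lv "Interest" (Table 1).
INTEREST = {
    "vi": (0.8, 1.0, 1.0),   # Very Interested
    "i":  (0.6, 0.8, 1.0),   # Interested
    "fi": (0.4, 0.6, 0.8),   # Fairly Interested
    "si": (0.2, 0.4, 0.6),   # Sufficiently Interested
    "li": (0.0, 0.2, 0.4),   # Little Interested
    "ni": (0.0, 0.0, 0.2),   # Not Interested
}

# FLD(w3) from Example 1: each linguistic term labels a class of features.
fld_w3 = {
    "vi": {"Language", "Poetry", "Researcher", "Medium length"},
    "i":  {"English", "History", "University Student", "Theoretical presentation"},
    "fi": {"Reflector presentation", "Political", "Short"},
    "li": {"Scholar", "Long"},
    "ni": {"Scientific"},
}

def meaning(term):
    """The semantic rule M: a linguistic term -> its triangular fuzzy number."""
    return INTEREST[term]

print(meaning("vi"), fld_w3["vi"])
```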
2.2. The User Profile
When the final user carries out a search on the web, he/she is interested in a document that contains some information but that also complies with his/her wishes in terms of features and presentation. However, each user owns a particular grade of preference on the characteristics of a document, and he/she knows perfectly how to linguistically express them. So, during this search operation, the user profile can be represented as a set of features he/she is looking for in a document whose contents are explicitly given. The result is that we can use a user representation that is similar to the meta-data associated with the document, defined on F ⊆ P = {f1, f2, ..., fk}, a finite crisp set of features.
Definition: The User Profile is a couple Up = (Ui, Re), where Ui = Σi αi / sfi, written linguistically as LUi = Σi λi / sfi with M(λi) = αi, and Re is a linguistic term of the lv Compatibility that represents the degree of suitability of the selected documents according to the user. The user chooses his/her preferences on the documents, expressed in Ui, and he/she gives a sort of tolerance limit of suitability on the proposed results, represented by Re. The meanings of the values of Re are fixed as in the following Table 2.

Table 2. Linguistic terms and triangular fuzzy numbers of the lv Compatibility.
High (h)        [0.75, 1, 1]
Good (g)        [0.5, 0.75, 1]
Medium (m)      [0.25, 0.5, 0.75]
Sufficient (s)  [0.0, 0.25, 0.5]
Low (l)         [0.0, 0.0, 0.25]
It is worth noting that the above given set of triangular fuzzy numbers is a fuzzy partition of [0, 1]. 2.3. Using the attributes The problem, now, is that the web author and the final user have to agree on the universe of features, in order to make the two representations comparable. In
this first realization, we assume that all documents share the same universe P. With this assumption, we can consider four cases: i) P = F, i.e. we have a perfect correspondence; ii) P ⊃ F: in this case we can think that the user is not interested in some attributes, and so we can make the compatibility comparison on the basis of the features present in F; iii) P ⊂ F: this means that the author of the documents has left out some features; in this case, with our coherence assumption of representation among all documents, we can use the features in P; iv) P ≠ F and P ∩ F ≠ ∅: comparisons are made taking into account the common features. 2.4. Linguistic Approximation and Similarity Index Our method provides for the use of an algorithm of linguistic approximation: it allows mapping triangular fuzzy numbers onto linguistic expressions referring to a singled out linguistic variable. Well known approximation algorithms can be found in [1]; we use the notation ApprLing_k in order to refer to a generic approximation algorithm. In the case study presented in section 4 we use the linguistic approximation algorithm presented in [5, 6, 9]. It introduces k intermediate labels, obtained with linguistic modifiers on the n original terms, and it provides the generation of [(n − 1)·k + 1] − n overall labels, then associated with the triangles. For example, with k = 3, the algorithm introduces the following new linguistic terms for each couple of consecutive labels λi and λi+1 with λi < λi+1: "More than λi", "Very λi" and "Almost λi+1". Example 2: Let us consider the lv Interest and the mappings between triangular fuzzy numbers and linguistic terms of Example 1; suppose we choose k = 3 and apply the approximation algorithm to the number [0.74, 0.90, 1]. Its central value (0.90) is included between m_i and m_vi, which are the central values of the triangular fuzzy numbers corresponding to "Interested" and "Very Interested", respectively. Then d = m_vi − m_i = 0.2 is calculated; since 0.90 ∈ [m_i + (7/10)·0.2, m_i + (9/10)·0.2], the algorithm gives the linguistic modification associated with this case: [0.74, 0.90, 1] ≈ "Almost Very Interested". In our method, we shall use a similarity index δ(A, B) [5, 6, 8] between two metadata strings A and B,
where P: All_Terms → N, P(λi) = i, ∀i ∈ {1, ..., k·m + 1}, associates with a term λi its position in an increasing order of all the terms, both basic and those generated by ApprLing_k; λiA and λiB are the linguistic labels associated with the element ai in A and B, respectively; n = |Universe of the discourse|; m = #basic labels + #labels generated by ApprLing_k; nc = min(a, b), where a and b are the numbers of nonempty subsets in A and B, respectively. It can be easily shown that δ(A, B) ∈ [0, 1], δ(A, A) = 1 and δ(A, B) = δ(B, A).
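The fragment below (ours) shows the ingredients of the index on the case-study data of Section 4: the position map P over the basic Interest labels and the sum of position differences for Ui against w2. Only basic labels occur in that example, and the normalization by nc and by the constant 6·5 is copied from the worked computation in Section 4, so it is an assumption about the general formula given in [5, 6, 8].

```python
# Basic Interest labels in increasing order; in the case study only basic labels
# occur, so P reduces to their 1-based positions. With ApprLing_k the list would
# also contain the k modified labels inserted between each consecutive pair.
ORDER = ["ni", "li", "si", "fi", "i", "vi"]
P = {label: idx + 1 for idx, label in enumerate(ORDER)}

def label_of(string, feature):
    """Linguistic label attached to a feature in a meta-data string."""
    for lab, feats in string.items():
        if feature in feats:
            return lab
    return None

# User profile Ui and document w2 of Section 4, restricted to P ∩ F.
u_i = {"vi": {"p3", "p4"}, "i": {"p8"}, "si": {"p7", "p9"}, "li": {"p5"}}
w_2 = {"vi": {"p3", "p8"}, "fi": {"p4"}, "si": {"p5", "p7", "p9"}}

features = ["p3", "p4", "p5", "p7", "p8", "p9"]
raw = sum(abs(P[label_of(u_i, f)] - P[label_of(w_2, f)]) for f in features)
n_c = min(len(u_i), len(w_2))          # nonempty classes in each string
delta = 1 - (raw / n_c) / (6 * 5)      # normalization as in the Section 4 example
print(raw, round(delta, 4))            # 4, 0.9556
```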
3. Selection and ordering of most relevant results
As said before, our aim is to add a second refinement to the result set given by a Boolean search (R), by linguistically comparing the meta-data of the documents and the active user profile. So we define a subset SFS (Search Filtered Results) of R.
Step 1) The matching algorithm we present, FMatch(wj, FLD(wj), Up), has as input the user profile Up defined by the end user and a document wj in W = {(wj, FLD(wj))} (the set of documents found); the algorithm then calculates LDifj, a linguistic label (in function of the lv Compatibility, using the same linguistic terms chosen for Re) associated with the document: this information allows us to linguistically cluster the documents found. The algorithm in pseudocode is as follows:

FMatch(wj, FLD(wj), Up) {
  For each document wj in R
    For each pi ∈ F ∩ P in the string FLD(wj)
      If the label (λi)FLD(wj) ≠ (λi)Up then
        Difj = Difj + [ (αi)FLD(wj) ⊓ (αi)Up ],  with (αi) = M((λi))
    T_Difj = Difj / |F ∩ P|   /* average extended to triangular numbers */
    LDifj = ApprLing(T_Difj, CompatibilityLinguisticTerms)
}
The algorithm uses a set of temporary variables Difj (initialized to zero) and the variables T_Difj and LDifj, which express (numerically and linguistically) an average compatibility between the preferences of the user (declared in the User Profile) and the features of the document wj. So if the value of T_Difj is numerically small, this means that the document features and contents are very near to what the user is looking for, and it will be approximated with a high compatibility linguistic term.
The following operation ⊓ : Tr × Tr → Tr is used in order to obtain an assessment of the inequality between two triangular numbers. Given [a1, b1, c1] and [a2, b2, c2], the number [a', b', c'] = [a1, b1, c1] ⊓ [a2, b2, c2] is obtained as follows:

Table 3. Definition of the operation ⊓.
b' = |b1 − b2|
a' = b' − ⌈b'⌉ · (|b1 − a1| + |b2 − a2|) / 2
c' = b' + ⌈1 − b'⌉ · (|b1 − c1| + |b2 − c2|) / 2
Example 3: Using the lv Interest and the linguistic terms/triangular fuzzy numbers of Example 1, it is possible to see that the operation on two consecutive numbers always gives the same result, [0.6, 0.8, 1] ⊓ [0.4, 0.6, 0.8] = [0, 0.2, 0.4], so we get the same triangle whenever ⊓ is applied between a number and its adjacent one; the value increases when ⊓ is applied to more distant labels, and it is maximum between the two border numbers: [0.8, 1, 1] ⊓ [0, 0, 0.2] = [0.9, 1, 1]. Now, using these results, we can filter the documents found by means of the search, presenting all documents that satisfy the following relation: if LDifj ≥ Re then the document wj is introduced in SFS. Note that the result documents can be simply presented on the basis of the calculated linguistic compatibility LDifj.
Step 2) Then we make a refinement on the obtained clustering through the calculation of the similarity index on the documents that belong to the same cluster: this index is calculated between a document and the user profile in order to associate a similarity value (between 0 and 1) with which it is possible to order the documents linguistically grouped in SFS.
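A short Python sketch (ours) of the operation ⊓ of Table 3, checked against the two results of Example 3; the ceiling factors follow the reconstruction given above.

```python
from math import ceil

def tri_diff(t1, t2):
    """[a1,b1,c1] ⊓ [a2,b2,c2] as in Table 3: a triangular assessment of inequality."""
    a1, b1, c1 = t1
    a2, b2, c2 = t2
    b = abs(b1 - b2)
    a = b - ceil(b) * (abs(b1 - a1) + abs(b2 - a2)) / 2
    c = b + ceil(1 - b) * (abs(b1 - c1) + abs(b2 - c2)) / 2
    return (round(a, 4), round(b, 4), round(c, 4))

# Example 3: two adjacent labels of the lv Interest ...
print(tri_diff((0.6, 0.8, 1.0), (0.4, 0.6, 0.8)))   # -> (0.0, 0.2, 0.4)
# ... and the two border numbers:
print(tri_diff((0.8, 1.0, 1.0), (0.0, 0.0, 0.2)))   # -> (0.9, 1.0, 1.0)
```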
4. Case study
We illustrate with a simple example how the system works. Let us define: p1: literary contents, p2: poetry, p3: scientific contents, p4: AI concepts, p5: logic concepts, p6: physics contents, p7: formality, p8: technical language, p9: student, p10: researcher, p11: long. P = {p1, p2, p3, p4, p5, p6, p7, p8, p9}, F = {p3, p4, p5, p7, p8, p9, p10, p11} and so P ∩ F = {p3, p4, p5, p7, p8, p9}. Then we consider the choice of triangular fuzzy numbers and linguistic terms of Table 1 and Table 2 for the lv Interest and the lv Compatibility, respectively. As said, we show only the features present in P ∩ F for both documents and profile. Suppose the user selects the following profile: Ui = vi/{p3, p4} + i/{p8} + si/{p7, p9} + li/{p5}, while Re = Sufficient. Let us consider the following singled out documents:

Table 4. An example of documents and their semantic information.
w1   vi/{p4, p9} + i/{p5, p7} + li/{p3, p8}
w2   vi/{p3, p8} + fi/{p4} + si/{p5, p7, p9}
w3   vi/{p5, p7, p9} + ni/{p3, p4, p8}
w4   vi/{p3, p4, p8} + fi/{p9} + si/{p5, p7}
w5   vi/{p4} + i/{p3, p8} + fi/{p7} + li/{p5, p9}
On these documents, our classification algorithm is applied as follows: T_Dif1 = ([0.6, 0.8, 0.9] + [0.0, 0.2, 0.3] + [0.6, 0.8, 0.9] + [0.2, 0.4, 0.6] + [0.4, 0.6, 0.8] + [0.4, 0.6, 0.8]) / 6 = [0.366, 0.566, 0.7] and so, using our algorithm ApprLing with k = 3 (briefly described in par. 2.4), applied on the linguistic terms of the lv Compatibility, we have LDif1 = "Medium". In the same way: LDif2 = "Very Good", LDif3 = "Almost Sufficient", LDif4 = "Very Good", LDif5 = "Very Good". So the set of "compatible" documents is SFS = {"Very Good"/{w2, w4, w5}, "Medium"/{w1}}, whereas the document w3 is excluded because its compatibility level is Almost Sufficient, less than the chosen Re. We can now calculate the similarity indexes: δ(Ui, w2) = 1 − ((0+2+1+0+1+0)/3)/(6·5) = 0.9556. In the same way, we have δ(Ui, w4) = 0.9667 and δ(Ui, w5) = 0.9750. Now, for the sake of completeness, we calculate δ(Ui, w1) = 0.8334. Then we can order the documents in the cluster labelled "Very Good" as follows: w5, w4, w2; hence w5 is the document nearest to the user needs. Finally we obtain the ordered SFS:

Table 5. The final result of the method: ordered SFS.
Document   Compatibility   Similarity
w5         Very Good       0.9750
w4         Very Good       0.9667
w2         Very Good       0.9556
w1         Medium          0.8334
5. Concluding remarks
In this paper we have illustrated a fuzzy-based methodology for organizing the results of document searches on the web. Our methodology, through type 2 fuzzy sets, introduces linguistic terms to enrich the documents' metadata and to represent a user profile. Then an algorithm for matching the user profile against the documents' metadata and for clustering and ordering the results as a function of user needs is presented. Both the meta-data representation and the selection
algorithm illustrated in this paper present several aspects deserving further investigation:
• A possible extension of the methodology concerns the introduction of a weighting function. In this way the final user could associate higher weights with features he/she considers more important for his/her interests;
• We could introduce more linguistic variables to give more expressivity to the document representations and to deal with the complexity of the user profile;
• It is also possible to tackle the problem of coherence between the attributes used for the documents' meta-data and those for the user profile, by introducing special labels that represent no information or not compatible in order to complete the matching;
• In some situations, it could be useful to use the rejected results of the search; the user, in fact, could also be interested in something different from or even opposite to his profile in order to get general information on a context;
• Another possible extension regards the introduction of a grouped clustering, in which the selection is made not on the single attributes, but on main sets of them (such as contents, form and so on, or general, technical, educational, annotations, classification as in [2, 10, 11, 12]).
References 1. P. P. Bonissone, 2001, Fuzzy Sets and Expert Systems in Computer Engineering. On-line Course ECSE 6710. http: //www. rpi. edu/~bonisp/fuzzy-course/2000/course00. html. 2. G. Casella, L. Di Lascio, A. Gisolfi, 2003. Una procedura per la rappresentazione della conoscenza in un ipertesto mediante insiemi fuzzy di tipo 2. AttiAICA2003, Trento, Italy, pp. 53 - 60. 3. N. Dessi, B. Pes, 2003, Learning Objects e Semantic Web. AM AICA 2003, Trento, Italy, pp. 61 - 66. 4. L. Di Lascio, A. Gisolfi, P. Ciamillo, 200?, A new approach to Soft Computing. Elsevier (submitted). 5. L. Di Lascio, E. Fischetti, A. Gisolfi, V. Loia and A. Nappi. Linguistic resources and fuzzy algebraic computing in adaptive hypermedia systems, 2004, in E. Damiani, L. Jain, (Eds.), Soft Computing And Software Engineering, Springer Verlag, Berlin.
6. L. Di Lascio, A. Gisolfi and G. Rosa, 2002. A commutative 1-monoid for classifications with fuzzy attributes. Int. J. Of Approximate Reasoning, 26, pp. 1 - 46. 7. L. Di Lascio, E. Fischetti, A. Gisolfi, 2001. An Algebraic Tool for Classification in Fuzzy Environments, in A. Di Nola, G. Gerla (Eds.), Advances in Soft Computing. Phisica-Verlag, Berlin, pp. 129 - 156. 8. L. Di Lascio, E. Fischetti and A. Gisolfi, 1999. A fuzzy-based approach to stereotype selection in hypermedia. User Modelling and User-Adapted Interaction, 9: pp 285 - 320. 9. Gisolfi and G. Nunez, 1993. An algebraic approximation to the classification with fuzzy attributes. International Journal of Intelligent Systems, 9, pp. 75-95. 10. IEEE 1484.12.1-2002, 2002. Draft Standard for Learning Object Metadata, http://www.ieee.org. 11. IMS Learning Resource Meta-Data Information Model Version 1.2.1 Final Specification, 2001, http://www.imsglobal.org/metadata. 12. World Wide Web Consortium (W3C), 2001, Semantic Web, http://www.w3c.org. 13. Z. Yao, B. Wang, 2000. Using section-semantic relation structures to enhance the performance of Web search. Database and Expert Systems Applications. Proceedings. 14. Zadeh L. A., 1970. The Concept of a Linguistic Variable and its Application to Approximate Reasoning-I, II, III. Information Sciences 1 8 II 8 - III 9, pp 199-249; pp 301-357; pp 43-80.
DEVELOPING A SYSTEM FOR THE RETRIEVAL OF MELODIES FROM WEB REPOSITORIES
RICCARDO DISTASI, LUCA PAOLINO and GIUSEPPE SCANNIELLO
Dipartimento di Matematica e Informatica (DMI), Universita di Salerno, Italy.
Email: {ricdis, lpaolino, gscanniello}@unisa.it
This paper presents a system called WebMelodyFinder for content-based retrieval of melodies from repositories on the world wide web. The search is based on a least squares fit. The system considers only the (exact) melodic shape rather than the actual notes, and it is therefore invariant to transposition. Using this system, the melody, automatically extracted from a MIDI file or manually entered by a knowledgeable operator, can be used as the main search key to locate the best matching melodies. Other applications might include musicological archives, where other difficult-to-search information is stored (e.g., scores or audio recordings).
1. Introduction It would be nice to be able to search for a specific tune by providing the melody to a system, by humming or by playing it into a computer by means of a MIDI-enabled instrument. In most cases, however, the only form of search actually available to the end user is based on metadata (performer, composer, genre, title, etc.) rather than on the actual content, although several systems for music matching and retrieval exist. Most of the existing systems are based on a symbolic representation and perform some form of string matching, often adopting the 'edit distance' as a metric (the number of editing steps necessary to obtain a string from another). This choice makes it possible to manipulate the musical objects in useful ways 9,10,8 , but it makes it harder to account for transposition or melodic variations (staccato or tenuto articulations, etc.) A brief summary of the concepts relevant to MIR (Music Information Retrieval) is sketched in [6], while many string matching based techniques are described in depth in [5]. On the other hand, so-called query-by-humming systems perform an analysis of the melody as sung by the operator in order to extract the
information needed for the search 3 . This is an interesting approach indeed, but the process is usually prone to significant errors at several stages: the operator might not be a trained singer, pitch or timing recognition could be problematic, and so on. As a first step, then, it would probably be better to use some different kind of data entry, so that it is possible to assume that the system's input is really what the operator wanted it to be. Among the desirable characteristic of a content based music retrieval system, there are invariance to transposition (i.e., the music should be recognized no matter in which key it is played) and invariance or robustness to tempo change (faster or slower). The proposed system, named WebMelodyFinder, performs melodic matching and retrieval based on the actual content, represented in numerical, rather than symbolic, form. Section 2 explains how melodies are represented and how the searching for the best match is performed, while Section 3 discusses issues related to the ongoing implementation of WebMelodyFinder. Finally, Section 4 draws some conclusions.
2. The Underlying Technique In order to search a melody repository for the best-matching element, the key and the candidates must all be represented in a suitable form. With WebMelodyFinder, the melody is represented as a sequence of integers - one integer for each 'tick' of time. The value of the integer associated with a specific tick reflects the chromatic pitch of the note that sounds during that tick, with middle C (C4) equal to 60, as in the MIDI standard 7 . Thus, the B below middle C (B3) is 59, while C # 4 is 61. The ticks are a musical, rather than absolute, unit of time, in the sense that the duration of a tick is expressed as a fraction of a quarter note, rather than in milliseconds or multiples thereof. The number of ticks per quarter note is called temporal resolution. Any given tick contains exactly one integer (i.e., one note). Therefore, the representation is strictly monodic. This might be considered as a limitation, but for melodic searching it is better to have only the relevant data in the index keys, rather than having to wade through information which is useless for the task at hand. Furthermore, the representation makes no provision for pauses: each note is 'held' until the next one chimes in. This, too, is a design choice, since as long as the following note starts right on time, the exact articulation length of a given note in a melody can be significantly altered without altering the perceived melodic shape, which is
what this technique aims at capturing. Typical resolution values in actual midi files are 48, 96, 192, 240, 384 and 480. Desirable resolution values are multiples of 3 and some power of 2, so that triplets (ternary time divisions of a note), as well as the usual binary divisions, can be represented without roundoff. The same goes with WebMelodyFinder, which generaly adopts a resolution of 24 in order to limit the search time—the length of keys and melodies is proportional to the resolution. A detailed picture of the representation is depicted in Figs. 1, 2 and 3. For simplicity, these illustrations were prepared using the somewhat atypical resolution of 100 ticks per quarter note. As can be seen, melodic shapes are markedly different from melody to melody and characterize the melody fully, in the sense of being informationally equivalent to traditional notation. In fact, with a little training it is possible to recognize a familiar melody visually. The idea of a mathematical curve plotting pitch vs. time is not new—see for instance Goldstein4 for a treatment of this concept oriented towards musical analysis. The idea of using the information in the melodic curve in order to perform a search is a small step further. Perhaps it would have been possible to include at least part of the harmonic aspect, for instance by considering the harmonic function of selected melody notes, but this would introduce a layer of ambiguities that can only be risolved at another, higher, level. For instance, even simple questions such as "Is this chord an Fmaj6 or a Dmin7?" require the intervention of a human expert, while extracting melody information is a much easier task— e.g., by picking the relevant track from a midi file, or even by playing a midi keyboard to reproduce the melody which is sought. Furthermore, melody reharmonization is a frequent practice in many styles of music, and this would add a further level of ambiguity, stacking difficulties over difficulties. In conclusion, given the goal (namely, melodic matching), the representation adopted by WebMelodyFinder is a reasonably simple and effective choice. 2.1. Melody
Matching
If u = (u0, ..., un−1) and v = (v0, ..., vn−1) are two melodies of length n, let us define

d(u, v) = min_c { Σ_{i=0}^{n−1} (ui − vi − c)² },     (1)
Figure 1. O sole mio
that is, the minimum Euclidean distance achievable by a suitable transposition interval c, expressed as a signed number of chromatic steps. Determining the optimal transposition value c* that yields the minimum in Eq. (1) only requires solving a least squares problem:

c* = (1/n) Σ_{i=0}^{n−1} (ui − vi).     (2)
In other words, c* is simply the difference between the average values of u and v. Now suppose we have a 'key' melody x = (x0, ..., xn−1) to be searched for, and a repository R of melodies that x must be matched against. Let y = (y0, ..., ym−1) ∈ R be one of such melodies. For 0 ≤ j ≤ k < m, the shorthand y[j, k] = (yj, yj+1, ..., yk) will be used to denote a subsequence of y. In the following discussion, it will be assumed that the key melody x is not longer than the stored melodies (that is, n ≤ m).
Figure 2. Over the Rainbow
For each melody y = (y0, ..., ym−1) ∈ R, we are looking for

d*(x, y) = min_{0 ≤ δ ≤ m−n} { d(x, y[δ, δ+n−1]) },     (3)

that is, the distance from x to the closest subsequence of y. The results are sorted by increasing distance before being presented to the user.
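A compact Python sketch (ours, not the actual WebMelodyFinder code) of Eqs. (1)-(3): the optimal transposition c* is the difference of the means, and the key is slid over every length-n subsequence of the candidate. Melodies are the tick-by-tick pitch sequences of Section 2; the toy data below are invented.

```python
def distance(u, v):
    """Eqs. (1)-(2): squared error after applying the optimal transposition c*."""
    n = len(u)
    c_star = sum(ui - vi for ui, vi in zip(u, v)) / n                # Eq. (2)
    return sum((ui - vi - c_star) ** 2 for ui, vi in zip(u, v))      # Eq. (1)

def best_match(key, melody):
    """Eq. (3): distance from the key to the closest subsequence of the melody."""
    n, m = len(key), len(melody)
    assert n <= m, "the key is assumed not longer than the stored melody"
    return min(distance(key, melody[d:d + n]) for d in range(m - n + 1))

# Toy example: the same phrase transposed up a tone is matched with distance 0.
key = [60, 62, 64, 62, 60, 60]                  # C D E D C C, one value per tick
stored = [55, 57, 62, 64, 66, 64, 62, 62, 60]   # contains the phrase two semitones up
print(best_match(key, stored))                  # 0.0
```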
2.2. System Structure
WebMelodyFinder consists of two modules: the indexing module and the searching module, as depicted in Fig. 5. Both modules are implemented as Java applications. At the moment, the system is under experimentation. In the experimental page, the user is presented with a JSP page containing a form for uploading a key file for searching. The search itself will be performed server-side by the searching module. Additionally, it is possible for the user to download a copy of the indexing module so that a key can be generated from a MIDI file. In the future, there is going to be an applet that will allow the user to use a MIDI
Figure 3. Donna Lee
keyboard or any other MIDI-enabled instrument (in theory, even a MIDI microphone) in order to play the key without having to use an existing MIDI file. For those users without a MIDI instrument, even the computer's own QWERTY keyboard can be used as a rough musical input device. 3. Discussion The number of operations necessary for a search performed by WebMelodyFinder can be quantified as follows. In the discussion, for the sake of brevity the term 'additions' will be used instead of 'additions/subtractions.' (1) The error in (3) is calculated for each of the m − n + 1 shifted positions of the pattern x over the candidate y. This entails computing the optimal transposition c* every time. The first time, c* is computed by the full sum in (2), which requires 2n additions, but for all subsequent shifts 0 < δ ≤ m − n, it is updated by subtracting one element (namely, y_{δ−1}) and adding another (namely, y_{δ+n−1}).
Figure 4. Assorted melodic curves (A Foggy Day, Freedom Jazz Dance, Guantanamera, I Fall In Love Too Easily, O sole mio, Tico-tico).
Therefore, the total of operations for this step is 2m additions (the division by n is not really performed until necessary). (2) Having obtained the optimal transposition c*, it can now be used for computing the distance as given in (1). This requires 2 subtractions and 1 multiplication for each of the n elements, that is, 2n subtractions and n multiplications for each of the m − n + 1 values of δ. The total operations for this step are therefore 2mn − 2n² + 2 additions and mn − n² + 1 multiplications (the square root is not really computed, as distances can be compared while still squared). Putting together the operations for Steps 1 and 2, we have 2(n+1)(m−n+1) additions and n(m−n) + 1 multiplications. In other words, the operations necessary for one match are asymptotically proportional to n(m−n). 4. Conclusions and Future Work This paper has presented a system called WebMelodyFinder. The system can be used for the content-based retrieval of melodies in a transposition-invariant way. What would be most useful at the moment is some extensive experimental work aimed at assessing the strong points and the possible weaknesses
Figure 5. The structure of WebMelodyFinder (block diagram: MIDI melodies and searching keys are transformed into a melody representation and a searching-key representation, which feed the Melody Finder module that produces the result report).
of the system. In this way, it would be possible to investigate the behaviour of the system regarding missing notes, short notes, slightly altered phrases and other kinds of variation between musical objects. Surfing the web, it is often easy to find different versions of the same tune, and it would be interesting to find out how easily can one version be retrieved while searching for another. This would be a very realistic use case for WebMelodyFinder. Such a series of experiments is currently under way. As for improvements in the underlying search engine, it could be made invariant or robust with respect to time stretching, so that melody snippets can be recognized as identical, or at least close enough, even if metrically different. The time stretching factor 2 is remarkably desirable, since in many cases the same melody can be written in different rhytmic units, differing by a factor of 2, with no perceptual or conceptual difference (i.e., 3/4 vs. 3/8, or 2/4 vs. 2/2). Additionally, explicit constraints could be added to make sure c* is an integer. At present, a non-integer value of c* signals non-exact matching, but such condition can actually also be inferred from a nonzero resulting distance. Finally, in order to speed up the search in large databases, some kind of mark-and-sweep search might be employed, based on a tree scheme similar
to, e.g., Tiger Tree Hashing 2, with the hash function replaced by the average note value, or perhaps by some transposition-invariant quantity obtained from the sequence of melodic intervals in semitones.
References 1. David Bainbridge, Rodger J. McNab and Lloyd A. Smith, "Melody based tune retrieval over the World Wide Web." Last version: December 1997. h t t p : / / w w w . c s . w a i k a t o . a c . n z / ~ n z d l / p u b l i c a t i o n s / 1 9 9 8 / Bainbridge-McNab-Smith-Melody.pdf 2. Justin Chapweske, "Tree Hash EXchange format (THEX)." Last version: March 2003. h t t p : / / o p e n - c o n t e n t . n e t / s p e c s / d r a f t - j c h a p w e s k e thex-02.html 3. Asif Ghias, Jonathan Logan, David Chamberlin, Brian C. Smith, "Query By Humming — Musical Information Retrieval in an Audio Database." In Proc. ACM Multimedia '95, 5-9 Nov. 1995, San Francisco, CA 4. Gil Goldstein, The Jazz Composer's Companion. Advance Music, 1993. 5. Kjell Lemstrom, "String Matching Techniques for Music Retrieval." Ph.D. Thesis, Series of Publications A, Report A-2000-04, University of Helsinki, Nov. 2000. ISSN: 1238-8645. ISBN:951-45-9573-4 6. Kjell Lemstrom, "In Search of a Lost Melody—Computer Assisted Music: Identification and Retrieval." Finnish Music Quarterly magazine, March/ April 2000. 7. The MIDI Manufacturers Association (MMA), Complete MIDI 1.0 Detailed Specification, 2001. URL for ordering: http://www.midi.org/ 8. Lloyd Smith and Richard Medina, "Discovering Themes by Exact Pattern Matching." In Proc. 2nd Annual International Symposium on Music Information Retrieval (ISMIR 2001), University of Indiana, Bloomington, Indiana, October 15-17, 2001. 9. Alexandra Uitdenbogerd and Justin Zobel, "Manipulation of Music for Melody Matching." In Proc. 6th ACM Int'l Conf. on Multimedia, pp. 235240, Bristol, UK, 1998. ISBN:0-201-30990-4. 10. Alexandra Uitdenbogerd and Justin Zobel, "Melodic Matching Techniques for Large Music Databases." In Proc. 7th ACM Int'l Conf. on Multimedia, pp. 57-66, Orlando, FL, 1999. ISBN:l-58113-151-8.
FAST FACE RECOGNITION USING FRACTAL RANGE/DOMAIN CLASSIFICATION
DANIEL RICCIO
Dipartimento di Matematica e Informatica, Universita di Salerno, 84084 Fisciano (SA), Italy
[email protected]
In this paper we introduce a new method, namely FFR (Fast Face Recognition Using Fractal Range/Domain Classification), for the face recognition problem. FFR is based on the IFS (Iterated Function Systems) theory, also used for still image compression and indexing, but not yet widely experimented with in the biometric field. It characterizes in a fast way the similarities between faces, associating with each range extracted from the eyes, nose or mouth regions the topological map of its best fitting domains. FFR is fast and robust to meaningful variations of expression and to small changes of illumination and pose, as demonstrated by the experimental results.
1. Introduction Automatic Face Recognition (AFR) is a complicated object recognition problem due to the variability of face expressions, face position and lighting changes. Several methods have been proposed in order to solve this problem, but the available recognition methods are very far from human capacity, in terms of precision and time spent. All these methodologies can be classified into: • Image based: analyze image as an array of pixels with shades of gray. ICA (Independent Component Analysis) 1, Neural Networks 7, Eigenfaces 7. • Feature based: analyze anthropomorphic face features, its geometry. Elastic Graph Matching 7. • Combined: extract areas of features, and on these areas apply image based algorithms. Fractals 6. Fractal based techniques usually support lossy coding where a given input image I, is partitioned into a set R of disjointed square regions named
ranges. From the same image I, another set D of overlapping regions called domains is extracted. Generally, we classify ranges and domains by means of feature vectors in order to cut down the cost of the linear search on the set of domains. For a range r ∈ R, only the domains d ∈ D having a close feature vector have to be compared. Recently IFS demonstrated its effectiveness also in image indexing 2, because of some desirable properties such as brightness, color and contrast invariance, just to cite some of them. In this context a new fast technique for face recognition, namely FFR (Fast Face Recognition Using Fractal Range/Domain Classification), based on IFS is introduced in the following. The core of FFR is the MC-DRDC algorithm 5, a recent coding technique that works by means of the following two phases:
• In the first phase we compare all domains with the preset block d̄, computing the approximation error according to (1) and then storing it in a KD-Tree:

e_{d,d̄} = inf_{α,β} { d̄ − (α·d + β) }.     (1)
• In the second phase, for each range a comparison with d̄ is done, but the computed feature vector is now used to search for the best fitting domain in the KD-Tree.
FFR, using only the first phase of the DRDC method, provides for each range falling into the most significant regions (eyes, nose and mouth) a linear vector representing the topological map of its best fitting domains. The rest of the paper is structured as follows: Section 2 describes how FFR works and how we compute the feature vector for a given image. Section 3 analyzes the FFR complexity in terms of memory and time. In Section 4 we present some experimental results and, at last, in Section 5 we draw conclusions. 2. The Method In general, the use of fractals in image indexing has the advantage of working with compressed images. However we do not apply both phases described in Section 1, because we only need the domain codebook, which is created during the indexing phase. Furthermore we do not consider the entire face image, but only a narrow area comprising eyes, nose and mouth. This is done by applying a mask M to the internal face region. The mask consists of a set of 104 16 × 16 pixel ranges. Before applying the mask M,
we have to normalize both the dimension and the position of the face. That can be done automatically, with a face detector and then localizing the eyes, or manually, selecting the face region directly. Note that while we only index the ranges falling in the mask area defined by M, the domains for the indexing phase are extracted from the whole face image. We classify a generic face image I as follows. We apply the mask M to the face image, obtaining the range set R as the set of all ranges falling in the mask region defined by M. For each range belonging to R, we carry out the above mentioned indexing process. In particular we select a range r from R and we extract from the codebook the first n best fitting domains, each of them described by its spatial coordinates x and y. Notice that, as said in 5, the domains are quadruple sized with respect to the ranges and have to be shrunk before the approximation process, so the original image is down sampled to a quarter of the original. For this reason the spatial coordinates x and y of each domain belong to the range [0, L/2 − 1] instead of [0, L − 1], with L representing the dimension of the face image I. The domains extracted from the codebook represent a spatial map for the range r. We represent this map with an L/2 × L/2 square sparse matrix S_r, which we call the score matrix. The score matrix is built as follows:
S_r(x, y) = 1 if (x, y) ∈ D_r, and 0 otherwise,

where D_r represents the set of the domains extracted for r.
In order to make the algorithm more robust with respect to horizontal and vertical shifts, for each range r in the set R we also consider the set ℛ_r containing r and its eight shifted versions (one position up, down, left, right and along the four diagonals).
At last we extend the definition of S_r to

S(x, y) = Σ_{r' ∈ ℛ_r} S_{r'}(x, y),   where 0 ≤ S(x, y) ≤ 9.
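The fragment below (ours) sketches how the extended score matrix S could be accumulated from the codebook answers; the KD-Tree query itself is only stubbed, and the data are invented.

```python
import numpy as np

def score_matrix(best_domains, size):
    """Accumulate S(x, y): +1 for every best-fitting domain of r and of each of
    its neighbours, so every location ends up with a value between 0 and 9.

    best_domains: list (r plus up to 8 neighbours) of lists of (x, y) domain
                  coordinates returned by the codebook lookup.
    size:         side of the down-sampled image (L/2).
    """
    S = np.zeros((size, size), dtype=np.int8)
    for neighbour_hits in best_domains:      # r, r shifted up, down, left, ...
        for x, y in set(neighbour_hits):     # one vote per location per range
            S[x, y] += 1
    return S

# Stub: coordinates that a KD-Tree query on the domain codebook might return.
hits_for_r_and_neighbours = [
    [(10, 12), (11, 12), (30, 40)],          # range r
    [(10, 12), (12, 13)],                    # r shifted by one position
    [(10, 12)],                              # another neighbour
]
S = score_matrix(hits_for_r_and_neighbours, size=64)
print(S[10, 12], S.max())                    # 3 3
```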
Because the face information is concentrated in the central part of the image and, in general, it has an elliptical shape, the corner parts of the score matrix do not contain useful information. For this reason, in order to characterize the data contained in the score matrix S, we partition it into a fixed number of circular bands, each of them divided into a fixed number of
sectors. Owing to the topology of the partition we find it useful to index the domains with polar coordinates instead of Cartesian ones, so x = ρ cos(θ), y = ρ sin(θ), where

ρ = √(x² + y²),
θ = arctan(y/x) if x > 0,
θ = arctan(y/x) + π if x < 0,
θ = π/2 if x = 0 and y > 0,
θ = 3π/2 if x = 0 and y < 0.
In this way, each sector i is identified by the extreme radiuses (ρ_i, ρ_{i−1}) and angles (θ_i, θ_{i−1}). In particular, this partitioning has to exhibit the following property: each sector stores the same amount of information. In other words, this means that all sectors in the matrix S have the same area. We now show how to choose each θ_i and ρ_i in order to obtain that. We fix θ_i = (2π/m)·i, where m is the number of sectors we want in each band. Now the area contained in a sector is

A_i = (π/m)(ρ_i² − ρ_{i−1}²),   where ρ_0 = 0 and ρ_i > ρ_{i−1} ∀i.

We want that A_{i+1} = A_i, i = 0, 1, ..., n − 1, where n is the number of bands we want in the score matrix. Then

(π/m)(ρ_{i+1}² − ρ_i²) = (π/m)(ρ_i² − ρ_{i−1}²),
ρ_{i+1} = √(2ρ_i² − ρ_{i−1}²);

iterating on i we obtain ρ_i = √i · ρ_1. Knowing that ρ_n = √n · ρ_1 and ρ_n = L/2, we obtain ρ_1 = L/(2√n). As said above, only the ranges belonging to the mask region defined by M are indexed. We compute the score matrix S for the first range r1 in M and partition it. For each sector in S, having ρ and θ as polar coordinates, we calculate the number S_{ρθ} of domains that fall in it, taking this value as its representative. Then, starting from the central position, we linearize S linking all the S_{ρθ} values with a spiral visit: first the inner band, then the second inner band and so on, obtaining a linear vector V_{r1}. In the same way we compute the vector V_{r2} and so on. We define the global
feature vector as the concatenation V = V_{r1} · V_{r2} · ... · V_{r|M|}.
As n and m grow, the size of the vector V, given by |M| × m × n, can become significant. In this case we can apply a dimensionality reduction technique, such as the DCT, keeping only the first k greatest coefficients, with k fixed.
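A sketch (ours) of the equal-area polar partition and of the band-by-band read-out described above; treating the outer radius as half the score-matrix side is our convention for the example, and the sample points are invented.

```python
import math

def sector_counts(points, side, n_bands, m_sectors):
    """Count how many (x, y) locations fall in each equal-area sector.

    Band radii follow rho_i = sqrt(i) * rho_1 with rho_n = side/2, so every
    (band, sector) cell covers the same area; the counts are read out from the
    innermost band outward, which is the spiral linearization of the text.
    """
    rho_1 = (side / 2.0) / math.sqrt(n_bands)
    counts = [[0] * m_sectors for _ in range(n_bands)]
    cx = cy = side / 2.0                    # centre of the square score matrix
    for x, y in points:
        dx, dy = x - cx, y - cy
        rho = math.hypot(dx, dy)
        theta = math.atan2(dy, dx) % (2 * math.pi)
        band = min(int((rho / rho_1) ** 2), n_bands - 1)   # rho_i = sqrt(i)*rho_1
        sector = min(int(theta / (2 * math.pi / m_sectors)), m_sectors - 1)
        counts[band][sector] += 1
    return [c for band in counts for c in band]            # linear vector V_r

print(sector_counts([(20, 20), (40, 45), (10, 50)], side=64, n_bands=4, m_sectors=8))
```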
3. Memory and Time Complexity Analysis The first interesting way to assess the effectiveness of the FFR algorithm is a time and memory complexity analysis. From the time point of view, the algorithm is very fast, because its time complexity is well upper bounded. In effect the most expensive operation during the coding process of a fractal encoder is the best range-domain matching association. The FFR strategy avoids this problem by considering only the indexing phase. In this way we only have to extract the domains from the input face image, compute the feature vectors and build the KD-Tree structure. Let N be the face image dimension; the time to calculate the feature vectors is a constant O(1), while domain extraction requires O(N) and KD-Tree building O(N log N), for a total time complexity of O(N log N). In the previous section we have explained how the final feature vector V is calculated; now we want to draw an upper bound for the size of the key depending on the chosen numbers of bands n and of sectors per band m. We compute the number of bits we need in order to represent a key (feature vector) as a function of m and n. Because the sectors in the score matrix S have the same area, they contain the same number of bits, which we can calculate as

Area_i = (π/m)(ρ_i² − ρ_{i−1}²) = Area_{i−1};

knowing that ρ_i = √i · ρ_1 and ρ_1 = L/(2√n), we obtain

A = (π/m) ρ_1² = πL²/(4mn).
Furthermore we also consider eight shifts for each range, which implies a possible increment of no more than nine for each location of the score matrix. This means that we have to multiply the previously calculated area by nine, obtaining

A = (9πL²/4) · 1/(mn) = c/(mn),   with c = 9πL²/4.
Figure 1. Graphical representation of the function F.
F=\M\mnlog(—)
(2) \mnJ
\M\ is because we have to represent all the ranges in the mask M. We can study the shape of the function F obtaining the graph in Fig. 1. Another interesting case study is the relation between the key dimension and the recognition rate variation, that we bring in Fig. 2. We have found that the recognition rate, represented as Cumulative Match Score defined in 4 grows until key size reaches a dimension of about 5 kb, and then it reduces when the key size continues to grow.
47
10
Figure 2.
15
20
25
30
35
40
46
50
Graphical representation of the efficiency of the F F R algorithm.
4. Experimental results As pointed out in Section 1, the FFR algorithm offers good performances and several advantages, beyond to being one of the few of fractal based face recognition algorithms. We have tested it on FERET face database, comparing it with several others techniques. We have performed several tests on FERET face database according to the protocol described in 4 and constructing the galleries and probe sets in the same way as described here. We applied an automatic face detector, in order to extract the face region from the original face image in the FERET database. We compared the fully automatic FFR performances with semiautomatic and fully automatic techniques reported in 4. From the Fig. 3 we observe that only MSU, USD, MIT 96 and UMD 97 start better than FFR, while MIT 95 and Baseline Correletion overcome it only in the brief line among 5 and 10. However FFR overcomes all the others, in the last.
Figure 3.
Comparative results on F E R E T .
48
We also compared FFR with a novel Gabor-based kernel Principal Component Analysis (PCA) method proposed in 3, which integrates the Gabor wavelet representation of face images and the kernel PCA method for face recognition. Gabor wavelets first derive desirable facial features characterized by spatial frequency, spatial locality, and orientation selectivity to cope with the variations due to illumination and facial expression changes. The kernel PCA method is then extended to include fractional power polynomial models for enhanced face recognition performance. The FERET data subset used for comparison between FFR and the Liu's algorithm contains 600 frontal face images of 200 subjects. For each subject two images have been used for training and one for testing. Fig. 4 show that FFR achieves better performances with respect to the Gabor based method when a few number of features are used.
Figure 4.
Comparison with a Gabor based metod on F E R E T .
5. Conclusions A new fractal face recognition methodology has been presented. Particularly, it is based on range/domain relations, in the sense that the best fitting domains represent a spatial characterization for a range. Such characteristics produce a features spatial localization, testing results on FERET face database show that the proposed method is robust to expression variations and presence/absence of transparent glasses, providing high values of Cumulative Match Score. Furthermore the several tests we have carried out shown also that FFR is a very fast method according to the time complexity analysis done in Section 3
49
References 1. Bartlett Marian Stewart, Movellan Javier R. and Sejnowski Terrence J. "Face Recognition by Independent Component Analysis" , in IEEE Transactions on Neural networks, vol. 13, no. 6, pp. 1450-1464, November 2002. 2. Distasi R., Nappi M., Tucci M. "FIRE: Fractal Indexing with Robust Extensions for Image Databases" , in IEEE Transactions on Image Processing, vol. 12, Issue: 3, pp. 373-384, March 2003. 3. Chengjun Liu. "Gabor-Based Kernel PCA with Fractional Power Polynomial Models for Face Recognition" , in IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, No. 5, May 2004. 4. Phillips J. P., Moon H., Rizvi A. S. and Rauss P. J. "The FERET Evaluation Methodology for Face-Recognition Algorithms" , in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1090-1104, October 2000. 5. Riccio D., Nappi M. "Defering range/domain comparisons in fractal image compression" , in Proceedings 12th International Conference on Image Analysis and Processing, vol. 1, pp. 412-417, September 2003. 6. Tan T., Yan H. "Face recognition by fractal transformations " , in Proceedings., 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 6, no. 6, pp. 3537-3540, March 1999. 7. Zhang Jun, Yan Yong and Lades Martin "Face Recognition: Eigenface, Elastic Matching, and Neural Nets" in Proceedings of the IEEE, vol. 85, no. 9, pp. 1423-1435, September 1997.
A METHOD FOR 3D FACE RECOGNITION BASED ON MESH NORMALS
STEFANO RICCIARDI, GABRIELE SABATINO Dipartimento di Matematica ed Informatica Universita di Salerno 84084 Fisciano (SA), Italy E-mail: {sricciardi, gsabatino}@unisa.it
This paper presents a method to check identity by 3D face recognition, comparing the 3D facial model in input to a reference database of previously acquired people. This task is performed representing geometrical features in the input mesh by a bidimensional matrix, the normal map, which stores local curvature data, the normals to each face of the mesh. The search for a match is then efficiently executed in a 2D space without sacrificing the geometric/physiognomic details available in the original 3D models. The proposed method, besides offering the typical advantages of 3D based face recognition approach, such as the invariance to translation, scaling, rotation, and illumination of the acquired face, shares some of the advantages common to 2D based methods due to the way it represents the tridimensional facial models. We present the results of our method on a 3d database of human faces, featuring different races, sex, ages, and expressions.
1. Introduction
Face recognition is a well known research field which has recently been gaining even more interest from the scientific community, as this technology could be relevant in many security-related applications such as surveillance in critical areas. The most widespread techniques are based on 2D image processing of photos or video footage and, besides the advantage of a much simpler data acquisition, they perform well when scene lighting, camera position and orientation are under control, but recognition becomes much more difficult and unreliable when face position and orientation are not known or in case of strong variations in facial expression. Today, tridimensional scanning techniques for face and full body are becoming more widespread and reliable and, even if they still present some operational limits, it is now possible to create detailed 3D facial databases
for biometrical purposes. This technology is pushing the development of recognition methods that fully exploit the potential of 3D face representation. Indeed, if the acquisition is performed correctly, the resulting surface of a 3D face model is more independent of lighting conditions, and even its position and orientation in 3D space with respect to a given reference system are much less problematic than in a 2D space [1]. This paper presents a 3D face recognition method based on the normal map [2], a bidimensional matrix representing local curvature data of a 3D polygonal model, aimed at biometric applications. The proposed algorithm is invariant to translation and scaling issues arising during the face acquisition and allows performing a fast and accurate one-to-one/one-to-many face comparison working in a 2D space while still retaining the original 3D model information. This paper is organized as follows. In Section 2 related works are presented. In Section 3 the proposed methodology is presented in detail. In Section 4 the results of the proposed methodology are presented and discussed. The paper concludes in Section 5, showing directions for future research.
2. Related works
The term "3D face recognition" usually refers to a recognition methodology operating on tridimensional datasets representing face (or head) shape as range data or polygonal meshes. The earliest research on 3D face recognition was conducted over a decade ago, as reported by Bowyer et al. [3] in their recent survey on this topic, and many different approaches have been developed over time to address this challenging task. In most methods, before the recognition phase begins, a face normalization is performed to free the input data from orientation or scaling issues due to the acquisition process. To this aim some authors segment a range image based on principal curvature to find a plane of bilateral symmetry through the face [4] or, more frequently, a set of feature points is found (nose tip, eye contours, etc.) and then used as a guide to standardize face pose [5] through a 3D transformation. In other cases a semiautomatic, interactive approach is adopted instead, manually selecting three or more feature points to calculate the orientation of the face in 3D space. The comparison between 3D facial representations can be performed according to different techniques and strategies. Indeed, a first classification of 3D face recognition algorithms can be done based on their ability to work
on neutral faces, i.e. faces showing a standard "relaxed" expression, or to cope with shape variations due to random facial expressions. To the first category belong methods based on feature extraction that describe both curvature and metric size properties of the face, represented as a point in feature space, and therefore measure the distance to other points (faces) [6], or extensions to range images of 2D face recognition techniques based on eigenfaces [7] or Hausdorff distance matching [8]. Other authors compare faces through a spherical correlation of their Extended Gaussian Images [9], through Principal Component Analysis (PCA) [10, 11], or even measure the distance between 3D face surfaces by the Iterative Closest Point (ICP) method [12]. To increase the recognition rate in the case of expression variations, Bronstein et al. [13] apply an isometric transformation approach to 3D face analysis based on canonical images, while other authors combine 3D and 2D similarity scores obtained by comparing 3D and 2D profiles [14], or extract a feature vector combining Gabor filter responses in 2D and point signatures in 3D [15].
3. The proposed method
The basic idea of the proposed method is to represent the tridimensional face surface by a bidimensional matrix, the normal map, and then to search for a match by measuring the distance between this map and the gallery maps in the reference database. More precisely, we want to represent each normal vector of a 3D face mesh by a pixel color, using the r, g, b pixel components to store the three vector components. To this aim we project the 3D geometry onto a 2D space through spherical mapping, thus obtaining a representation of the original face geometry which retains the spatial relationships between facial features. The comparison between two faces is therefore performed by calculating the difference image between the corresponding normal maps. The whole recognition process is summarized in the schematic view of Fig. 1 and discussed in depth in subsections 3.1 to 3.5.
3.1. Face acquisition
One of the goals of the presented method is the ability to work on a polygonal face model regardless of its position, rotation or scaling within an absolute reference coordinate system during the acquisition process.
Figure 1. Recognition process scheme.
While the mesh alignment task is explained in the next subsection, there are some general requisites the mesh has to meet to be registered in the reference database or to allow a valid comparison. Indeed, the mesh should include a set of basic facial features such as both eyes, nose, mouth, forehead and chin. As the hair surface is not reliable for face recognition, it can and should not be included in the mesh. There are no strict polygonal resolution requirements for the face model and, while a good level of detail and a symmetrical mesh topology are preferable, they are not necessary for the system to work. Any face scanning system, such as laser scanners, structured light scanners and even image-feature-based mesh warping, can produce valid 3D data for the presented method. As most 3D scanning systems produce range data, and therefore 3D polygonal meshes, that are almost unaffected by lighting conditions, this peculiarity turns out to be a clear advantage for use in face recognition.
3.2. Face alignment
Due to posing issues during the acquisition, the face mesh could be translated in any direction with respect to the coordinate system origin. Additionally it could be arbitrarily scaled and rotated as well. As the only data relevant to the proposed recognition method are the normals to each polygon in the face model, which measure the local curvature of the surface, we first calculate the mesh centroid and then, subtracting its coordinates (cx, cy, cz) from each mesh vertex, we align the centroid and the whole mesh to the axis origin. The scale factor of the face mesh is simply not an issue, as we normalize the normal vector length before further processing. To allow a coherent assignment of mapping coordinates, as detailed in subsection 3.3, we need to know the orientation of the face mesh in 3D space and eventually re-align it. Normalization of the face mesh orientation in 3D space is achieved through a semi-automatic procedure. First, three fiducial points with known spatial relationships are selected interactively on the face surface, then a rigid body transformation of the vertex coordinates is performed to re-align the mesh. Valid fiducial points are the inner eyes (left and right) and the nose tip, which are easily found on any face mesh and, due to their peculiar local curvature, could even be located automatically in a future version of this method. In subsection 3.5 we apply an averaging mask to the pixel comparison which increases the robustness of the method to residual misalignment of the mesh.
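As a rough illustration of the alignment step just described, the following sketch (not the authors' code; array names are illustrative) centres a mesh on its centroid and computes unit face normals, whose normalisation removes any scale factor:

```python
import numpy as np

def center_mesh(vertices: np.ndarray) -> np.ndarray:
    """Translate a mesh so that its centroid coincides with the origin.

    vertices: (N, 3) array of x, y, z coordinates.
    """
    centroid = vertices.mean(axis=0)        # (cx, cy, cz)
    return vertices - centroid              # subtract the centroid from every vertex

def face_normals(vertices: np.ndarray, faces: np.ndarray) -> np.ndarray:
    """Unit normal of every triangle; normalisation makes the result scale-invariant."""
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    n = np.cross(v1 - v0, v2 - v0)          # one (unnormalised) normal per triangle
    return n / np.linalg.norm(n, axis=1, keepdims=True)
```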
3.3. Mesh mapping
The shape information of a facial surface is "encoded" in its local curvature features, which may be viewed as a digital signature of the face. In a polygonal mesh this information is given by the polygon normals. As we want to represent the normal vectors by a color image while retaining the spatial relationships in the 3D mesh, we first need to project each vertex onto a 2D space, a task referred to as mapping. As shown in Fig. 2, we used a spherical projection (adapted to the mesh dimensions), because it fits the actual 3D shape of the face mesh better. More formally, given an arbitrary mesh M, we want to associate to each mesh vertex v_i with coordinates (x_i, y_i, z_i) in R^3 the ordered pair (u_i, v_i) in (U, V), with 0 <= u_i, v_i <= 1. For each vertex v_i of mesh M, the mapping is given by:
$$u_i = \phi\!\left(0.5 - \frac{1}{2\pi}\arctan\frac{x_i}{z_i}\right), \qquad v_i = \phi\!\left(0.5 - \frac{1}{\pi}\arctan\frac{y_i}{\sqrt{x_i^2 + z_i^2}}\right)$$

where diamY and diamX, the diameters of mesh M along the Y and X axes respectively, are used to adapt the projection to the mesh dimensions, and where φ(c) returns the fractional part of the value c. The face geometry of a generic face mesh, shown in Fig. 3-a, projected onto the (U, V) domain is shown in Fig. 3-b.

Figure 2. Spherical mapping from 3D coordinates to 2D coordinates.
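A sketch of the spherical (u, v) mapping as reconstructed above; since the exact normalisation used by the authors is not fully recoverable from the text, the constants and the use of the fractional part are assumptions:

```python
import numpy as np

def spherical_uv(vertices: np.ndarray) -> np.ndarray:
    """Map centred mesh vertices (N, 3) to (u, v) coordinates in [0, 1)."""
    x, y, z = vertices[:, 0], vertices[:, 1], vertices[:, 2]
    u = 0.5 - np.arctan2(x, z) / (2.0 * np.pi)              # longitude-like coordinate
    v = 0.5 - np.arctan2(y, np.sqrt(x**2 + z**2)) / np.pi   # latitude-like coordinate
    return np.stack([u % 1.0, v % 1.0], axis=1)             # keep only the fractional part
```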
3.4. Mesh sampling
Now it is possible to store the normal data representing face geometry and topology in a bidimensional matrix N with dimensions k x l. To this purpose we have to sample the mapped geometry, quantizing the length of the three components of each normal. So we can assign to each pixel in N the normal components [n_x, n_y, n_z] of the corresponding surface region, given by the mapping coordinates (u, v) of the polygons (i.e. the mapping coordinates of the triangle vertices if M is a tri-mesh); as the matrix N has discrete dimensions, the resulting sampling resolution for the mesh is 1/k for the u range and 1/l for the v range. The normal components [n_x, n_y, n_z] are stored in N as values belonging to the R, G, B color space. We refer to the resulting matrix N as the normal map of mesh M. A normal map with a standard color depth of 24 bits allows 8-bit quantization for each normal component; this precision proved to be adequate for the recognition process (Fig. 3-c).
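The sampling step can be sketched as follows: each triangle's unit normal is quantised to 8 bits per component and written at the pixel addressed by the (u, v) coordinates of the triangle (here, of its first vertex, a simplification of the rasterisation the paper implies; the [-1, 1] to [0, 255] encoding is also an assumption):

```python
import numpy as np

def build_normal_map(uv, tri_normals, faces, k=128, l=128):
    """Store unit normals in a k x l RGB image (the 'normal map').

    uv:          (N, 2) per-vertex mapping coordinates in [0, 1)
    tri_normals: (T, 3) unit normal of each triangle
    faces:       (T, 3) vertex indices of each triangle
    """
    normal_map = np.zeros((k, l, 3), dtype=np.uint8)
    # address each triangle by the mapping coords of its first vertex (simplified)
    cols = np.clip((uv[faces[:, 0], 0] * l).astype(int), 0, l - 1)
    rows = np.clip((uv[faces[:, 0], 1] * k).astype(int), 0, k - 1)
    # quantise components from [-1, 1] to the 8-bit colour range [0, 255]
    rgb = ((tri_normals + 1.0) * 0.5 * 255).astype(np.uint8)
    normal_map[rows, cols] = rgb
    return normal_map
```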
Figure 3. Normal map generation phases: (a) original mesh M; (b) mapping coordinates; (c) normal-map image. The normal map is 24-bit, with [n_x, n_y, n_z] visualized as [r, g, b].
3.5. Normal images comparison
When the sampling phase is completed, we can register the new face, i.e. its normal image, in the reference database, or perform a search through it to find a matching subject. The basic idea for comparing any two face meshes M_A and M_B (namely their normal images N_A and N_B) is to calculate the angle between each pair of pixel colors (in other words, the normals with corresponding mapping coordinates) and to store it in a new image D. Geometrically, the angle θ between two vectors v and w is given by θ = arccos(v · w), since v and w are normalized and therefore ||v|| ||w|| = 1. Now, as each pixel (x_{N_A}, y_{N_A}) in N_A has color components (r_{N_A}, g_{N_A}, b_{N_A}) and each pixel (x_{N_B}, y_{N_B}) in N_B has components (r_{N_B}, g_{N_B}, b_{N_B}), the angle between the normals represented by each pair of pixels with x_{N_A} = x_{N_B} and y_{N_A} = y_{N_B} is given by

$$\theta = \arccos\left(r_{N_A} \cdot r_{N_B} + g_{N_A} \cdot g_{N_B} + b_{N_A} \cdot b_{N_B}\right)$$

with the components suitably normalized from the color domain to the spatial domain, so that 0 <= r, g, b <= 1 for both maps. The angle is stored in a bidimensional m x n matrix D as a gray-scale value θ, with 0 <= θ <= π (see Fig. 4).
Figure 4. Example of comparison between two 128 x 128 normal maps: (a) normal map A; (b) normal map B; (c) the 128 x 128 difference map, built from (a) and (b).
To further reduce the effects of residual face misalignment introduced during the acquisition and sampling phases, we calculate the angle θ using a k x k (usually 3 x 3 or 5 x 5) mask of neighbouring pixels. For example, the values used to calculate the angle θ for pixel (i, j) come from the padded neighbourhood

$$\begin{pmatrix} \cdots & \mathbf{n}_{i-1,j-1} & \mathbf{n}_{i-1,j} & \mathbf{n}_{i-1,j+1} & \cdots \\ \cdots & \mathbf{n}_{i,j-1} & \mathbf{n}_{i,j} & \mathbf{n}_{i,j+1} & \cdots \\ \cdots & \mathbf{n}_{i+1,j-1} & \mathbf{n}_{i+1,j} & \mathbf{n}_{i+1,j+1} & \cdots \end{pmatrix}$$

Applying the formula

$$\bar{\mathbf{n}}_{i,j} = \frac{\displaystyle\sum_{r=-k}^{k}\sum_{s=-k}^{k} \mathbf{n}_{i+r,\,j+s}}{\displaystyle\left\|\sum_{r=-k}^{k}\sum_{s=-k}^{k} \mathbf{n}_{i+r,\,j+s}\right\|}$$
we obtain the normal vector of the surface region covered by the mask. By summing every gray level in D we obtain a histogram H(x) that represents the angular distances between meshes M_A and M_B, as shown in Fig. 5. On the X axis we represent the possible angles between each pair of compared normals (sorted from 0 to 180 degrees), while on the Y axis we represent the total number of occurrences found. This means that two similar faces will have a histogram H(x) with very high values at small angles, while for two distinct faces the differences will be more spread out.

Figure 5. Example of a histogram H representing the angular distances: (a) shows a typical histogram between two similar normal maps, while (b) shows one between two different normal maps.

At this point, a convolution with a Gaussian function G(x) (see Fig. 6) is used to weigh the angle differences between M_A and M_B:
$$H(x) \ast G(x) = \sum_{x=0}^{k} H(x)\,\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{x^{2}}{2\sigma^{2}}}$$
In this way, by varying σ and k it is possible to change the sensitivity of the recognition system.
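A compact sketch of the whole comparison stage (per-pixel angle, histogram, Gaussian weighting) under the reconstruction above; the values of σ and k and the decoding of the colour channels back to vector components are assumptions:

```python
import numpy as np

def compare_normal_maps(na: np.ndarray, nb: np.ndarray, sigma=4.5, k=50):
    """Return a similarity score between two normal maps (k x l x 3, uint8)."""
    # decode 8-bit colours back to unit vectors with components in [-1, 1]
    va = na.astype(float) / 255.0 * 2.0 - 1.0
    vb = nb.astype(float) / 255.0 * 2.0 - 1.0
    dot = np.clip((va * vb).sum(axis=2), -1.0, 1.0)
    theta = np.degrees(np.arccos(dot))               # difference map D, in degrees
    hist, _ = np.histogram(theta, bins=181, range=(0, 181))
    # weigh small angular differences more, using a Gaussian of width sigma
    x = np.arange(k)
    gauss = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return float((hist[:k] * gauss).sum())           # higher score = more similar faces
```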
Figure 6. Convolution between the histogram H(x) and the Gaussian function G(x).
4. Experiments and Discussion
We present the results of our method on a 3D database of human faces, featuring different races, sexes, ages, and expressions.
4.1. The reference database
Unfortunately, 3D scanning technology is still costly and sometimes not suited to work reliably on living, non static subjects, forcing many researchers to build their own dataset with different equipments and techniques. One of the aims in experiments conducted on the proposed method was to test its performance under controlled conditions, so we decided to build a face database based on surface mesh resulting from a feature guided warping of a standard face mesh. More precisely, every face model in the database has been created deforming a standard polygonal face mesh to closely fit a set of facial features extracted from front and side images of each individual to be enrolled in the system. The standard face mesh used in the dataset has about 7K triangular facets, and even if it is certainly possible to use mesh with higher level of detail (LOD), we found this resolution to be adequate to the recognition purpose. This is mainly due to the optimized tessellation which privileges key area such as eyes, nose and lips whereas a typical mesh produced by 3D scanner features almost evenly spaced vertices. In section 4.2. we compare results for different meshes of the same face with variable LOD. The full database includes 50 different individuals (30 males and 20 females, age ranging from 19 to 40) each featuring the following 10 different expressions: neutral, rage (moderate), fear, smile (closed), doubt, surprise (moderate), rage (extreme), closed eyes, surprise (extreme), disgust.
4.2. Experimental results
We have evaluated the recognition rate by performing a one-to-one comparison of a probe set of 3D models against a gallery set of 3D models with neutral expression. The recognition rate proved to be very good, reaching 100% using semi-automatic alignment. The results are generally better than those obtained by many 2D algorithms, but the lack of a standard 2D/3D reference face dataset makes the comparison difficult. More specifically, Figure 7 illustrates the precision/recall results of the recognition system.
Figure 7. Precision/Recall using a 128 x 128 normal map, Gaussian function G(x) (with σ = 4.5 and k = 50) and mask size 3.
5. Conclusions
We presented a method for tridimensional face recognition based on the normal image, a 2D array storing information about the local curvature of the face surface, aimed at biometric applications. The method proved to be simple, robust to posing and expression variations, relatively fast and with a high average recognition rate. Experimental results show that wavelet compression of the normal image could greatly reduce the size of the face descriptor without significantly affecting the recognition precision. As the normal image is a 2D mapping of mesh features, future research could integrate additional 2D color information (texture) acquired during the same enrolment session. Implementing a true multi-modal version of the basic algorithm which correlates the texture and the normal image could further enhance the discriminating power even for complex 3D recognition issues such as the presence of beard, moustache, eyeglasses, etc.

References
1. C. Beumier, M. Acheroy, Automatic Face Authentication from 3D surface, British Machine Vision Conference (BMVC-98), 1998.
2. X. Gu, S. Gortler, H. Hoppe, Geometry images, ACM SIGGRAPH 2002, pages 355-361.
3. K. W. Bowyer, K. Chang, P. A. Flynn, A Survey of 3D and Multi-Modal 3D+2D Face Recognition, ICPR 2004.
4. J. Y. Cartoux, J. T. LaPreste, and M. Richetin. Face authentication or recognition by profile extraction from range images. Proceedings of the Workshop on Interpretation of 3D Scenes, pages 194-199, November 1989.
5. T. Nagamine, T. Uemura, and I. Masuda. 3D facial image analysis for human identification. International Conference on Pattern Recognition (ICPR 1992), pages 324-327, 1992.
6. G. Gordon. Face recognition based on depth and curvature features. Computer Vision and Pattern Recognition (CVPR), pages 108-110, June 1992.
7. B. Achermann, X. Jiang, and H. Bunke. Face recognition using range images. International Conference on Virtual Systems and MultiMedia, pages 129-136, 1997.
8. B. Achermann and H. Bunke. Classifying range images of human faces with Hausdorff distance. 15th International Conference on Pattern Recognition, pages 809-813, September 2000.
9. H. T. Tanaka, M. Ikeda, and H. Chiaki. Curvature-based face surface recognition using spherical correlation: principal directions for curved object recognition. Third International Conference on Automated Face and Gesture Recognition, pages 372-377, 1998.
10. C. Hesher, A. Srivastava, and G. Erlebacher. A novel technique for face recognition using range images. Seventh Int'l Symposium on Signal Processing and Its Applications, 2003.
11. K. Chang, K. Bowyer, and P. Flynn. Face recognition using 2D and 3D facial data. 2003 Multimodal User Authentication Workshop, pages 25-32, December 2003.
12. G. Medioni and R. Waupotitsch. Face recognition and modeling in 3D. IEEE International Workshop on Analysis and Modeling of Faces and Gestures (AMFG 2003), pages 232-233, October 2003.
13. A. M. Bronstein, M. M. Bronstein, and R. Kimmel. Expression-invariant 3D face recognition. Audio- and Video-Based Person Authentication (AVBPA 2003), LNCS 2688, J. Kittler and M. S. Nixon, eds., pages 62-70, 2003.
14. C. Beumier and M. Acheroy. Face verification from 3D and grey level cues. Pattern Recognition Letters, 22:1321-1329, 2001.
15. Y. Wang, C. Chua, and Y. Ho. Facial feature detection and face recognition from 2D and 3D images. Pattern Recognition Letters, 23:1191-1202, 2002.
HIGH-D DATA VISUALIZATION METHODS VIA PROBABILISTIC PRINCIPAL SURFACES FOR DATA MINING APPLICATIONS
A. STAIANO, R. TAGLIAFERRI AND L. DE VINCO
Dipartimento di Matematica ed Informatica, Universita di Salerno, Via Ponte don Melillo, 84084, Fisciano (Sa), Italy
E-mail: {astaiano,robtag}@unisa.it
G. LONGO
Dipartimento di Scienze Fisiche, Universita Federico II di Napoli, Polo delle Scienze e della Tecnologia, Via Cintia 6, 80136 Napoli, Italy
E-mail: [email protected]
One of the central problems in pattern recognition is that of input data probability density function (pdf) estimation, i.e., the construction of a model of a probability distribution given a finite sample of data drawn from that distribution. Probabilistic Principal Surfaces (hereinafter PPS) is a nonlinear latent variable model providing a way to accomplish pdf estimation, and it possesses two attractive aspects useful for a wide range of data mining applications: (1) visualization of high dimensional data and (2) their classification. PPS generates a nonlinear manifold passing through the data points, defined in terms of a number of latent variables and of a nonlinear mapping from latent space to data space. Depending upon the dimensionality of the latent space (usually at most 3-dimensional) one has 1-D, 2-D or 3-D manifolds. Among the 3-D manifolds, PPS makes it possible to build a spherical manifold where the latent variables are uniformly arranged on a unit sphere. This particular form of the manifold provides a very effective tool to reduce the problems deriving from the curse of dimensionality when the data dimension increases. In this paper we concentrate on PPS used as a visualization tool, proposing a number of plot options and showing its effectiveness on two complex astronomical data sets.
1. Introduction Across a wide variety of fields, data are being collected and accumulated at a dramatic pace. There is an urgent need for a new generation of computational theories and tools to assist humans in extracting useful information
(knowledge) from the rapidly growing volumes of data. These theories and tools belong to the field of Knowledge Discovery in Databases (KDD). At an abstract level, the KDD field is concerned with the development of methods and techniques aimed at extracting meaning out of data. The full and effective scientific exploitation of these massive data sets will require the implementation of automatic tools capable of performing a large fraction of the data mining and data analysis work, posing considerable technical and, even deeper, methodological challenges, since traditional data analysis methods are inadequate to cope with this sudden increase in data volume and especially in data complexity (tens or hundreds of dimensions of the parameter space) [7]. Among the data mining methodologies, visualization plays a key role in developing good models for data, especially when the quantity of data is large. In this context PPS stands as a powerful tool for characterizing and visualizing high-D data. PPS [4, 5] is a nonlinear extension of principal components, in that each node on the PPS is the average of all data points that project near/onto it. From a theoretical standpoint, the PPS is a generalization of the Generative Topographic Mapping (GTM) [2], which can be seen as a parametric alternative to Self Organizing Maps (SOM) [6]. Some advantages of PPS include its parametric and flexible formulation for any geometry/topology in any dimension and guaranteed convergence (indeed the PPS training is accomplished through the Expectation-Maximization algorithm). A PPS is governed by its latent topology and, owing to the flexibility of the PPS, a variety of PPS topologies can be created, one of which is the 3D sphere. The sphere is finite and unbounded, with all nodes distributed at the edge, making it ideal for emulating the sparseness and peripheral property of high-D data. Furthermore, the sphere topology can be easily comprehended by humans and thereby used for visualizing high-D data. We shall go into the details of all these issues over the paper, which is organized as follows: in section 2 the PPS theoretical background is described, while section 3 shows the visualization possibilities offered by PPS, illustrating a number of visualization methods proposed by us. Finally, in section 4 two complex astronomical data sets (synthetic and real-world, respectively) are addressed, and in section 5 conclusions close the paper.
2. Theoretical background
2.1. Latent Variable Models
The goal of a latent variable model is to express the distribution p(t) of the variable t = (t_1, ..., t_D) in terms of a smaller number of latent variables x = (x_1, ..., x_Q), where Q < D. To this aim the joint distribution p(t, x) is decomposed into the product of the marginal distribution p(x) of the latent variables and the conditional distribution p(t|x) of the data variables given the latent variables. It is convenient to express the conditional distribution as a factorization over the data variables, so that the joint distribution becomes
$$p(\mathbf{t},\mathbf{x}) = p(\mathbf{x})\,p(\mathbf{t}\,|\,\mathbf{x}) = p(\mathbf{x})\prod_{d=1}^{D} p(t_d\,|\,\mathbf{x}). \qquad (1)$$
The conditional distribution p(t|x) is then expressed in terms of a mapping from latent variables to data variables, so that

$$\mathbf{t} = \mathbf{y}(\mathbf{x};\mathbf{w}) + \mathbf{u} \qquad (2)$$
where y(x; w) is a function of the latent variable x with parameters w, and u is an x-independent noise process. If the components of u are uncorrelated, the conditional distribution for t will factorize as in (1). Geometrically, the function y(x; w) defines a manifold in data space given by the image of the latent space. The definition of the latent variable model is completed by specifying the distribution p(u), the mapping y(x; w), and the marginal distribution p(x). The type of the mapping y(x; w) determines the specific latent variable model. The desired model for the distribution p(t) of the data is obtained by marginalizing over the latent variables
p(t) = y"p(t|x)p(x)dx.
(3)
This integration will, in general, be analytically intractable except for specific forms of the distributions p(t|x) and p(x).

2.2. Generative Topographic Mapping
The GTM defines a non-linear, parametric mapping y(x; W) from a Q-dimensional latent space (x in R^Q) to a D-dimensional data space (t in R^D), where normally Q < D. The mapping is defined to be continuous and differentiable. y(x; W) maps every point in the latent space to a point in the data space. Since the latent space is Q-dimensional, these points will
be confined to a Q-dimensional manifold non-linearly embedded in the D-dimensional data space. If we define a probability distribution over the latent space, p(x), this will induce a corresponding probability distribution in the data space. Strictly confined to the Q-dimensional manifold, this distribution would be singular, so it is convolved with an isotropic Gaussian noise distribution, given by

$$p(\mathbf{t}\,|\,\mathbf{x},\mathbf{W},\beta) = \left(\frac{\beta}{2\pi}\right)^{D/2}\exp\left\{-\frac{\beta}{2}\,\|\mathbf{t}-\mathbf{y}(\mathbf{x},\mathbf{W})\|^{2}\right\} \qquad (4)$$
where t is a point in the data space and β^{-1} denotes the noise variance. By integrating out the latent variable, we get the probability distribution in the data space expressed as a function of the parameters β and W,
$$p(\mathbf{t}\,|\,\mathbf{W},\beta) = \int p(\mathbf{t}\,|\,\mathbf{x},\mathbf{W},\beta)\,p(\mathbf{x})\,d\mathbf{x}. \qquad (5)$$
This integral is generally not analytically tractable. However, by choosing p(x) as a set of M equally weighted delta functions on a regular grid,

$$p(\mathbf{x}) = \frac{1}{M}\sum_{m=1}^{M}\delta(\mathbf{x}-\mathbf{x}_m), \qquad (6)$$

the integral in (5) turns into a sum,

$$p(\mathbf{t}\,|\,\mathbf{W},\beta) = \frac{1}{M}\sum_{m=1}^{M} p(\mathbf{t}\,|\,\mathbf{x}_m,\mathbf{W},\beta). \qquad (7)$$
Now we have a model where each delta function center maps into the center of a Gaussian which lies in the manifold embedded in the data space, as illustrated in Figure 1. Eq. (7) defines a constrained mixture of Gaussians, since the centers of the mixture components cannot move independently of each other, but all depend on the mapping y(x; W). Moreover, all components of the mixture share the same variance, and the mixing coefficients are all fixed to 1/M. Given a finite set of independent and identically distributed (i.i.d.) data points, {t_n}_{n=1}^N, we can write down the log-likelihood function for this model and maximize it by means of the EM algorithm with respect to the parameters of the mixture, namely W and β. The form of the mapping y(x; w) is defined as a generalized linear regression model

$$\mathbf{y}(\mathbf{x};\mathbf{w}) = \mathbf{W}\boldsymbol{\phi}(\mathbf{x}) \qquad (8)$$

where the elements of φ(x) consist of a set of fixed (typically Gaussian) basis functions evaluated at x, and W is the matrix of weight parameters.
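For concreteness, a minimal sketch of how the constrained mixture density (7) can be evaluated once W, β and the latent grid are fixed (all array names are illustrative, not the authors' code):

```python
import numpy as np

def gtm_density(t, W, phi, beta):
    """Evaluate p(t | W, beta) for one data point t under Eq. (7).

    t:    (D,) data point
    W:    (D, L) weight matrix
    phi:  (M, L) basis functions evaluated at the M latent grid nodes
    beta: inverse noise variance
    """
    D = t.shape[0]
    Y = phi @ W.T                                   # (M, D) Gaussian centres y(x_m; W)
    sq = ((t - Y) ** 2).sum(axis=1)                 # squared distances to each centre
    norm = (beta / (2 * np.pi)) ** (D / 2)          # isotropic Gaussian normalisation
    return norm * np.exp(-0.5 * beta * sq).mean()   # equal mixing weights 1/M
```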
Figure 1. In order to formulate a tractable non-linear latent variable model, we consider a prior distribution p(x) consisting of a superposition of delta functions, located at the nodes of a regular grid in latent space. Each node x_m is mapped to a corresponding point y(x_m; w) in data space, and forms the center of a corresponding Gaussian distribution.
2.3. Probabilistic Principal Surfaces
The PPS generalizes the GTM model, sharing the same formulation except for an oriented covariance structure for the nodes in R^D. This means that data points projecting near a principal surface node have a higher influence on that node than points projecting far away from it. This is illustrated in Figure 2.
Figure 2. Under a spherical Gaussian model of the GTM, points 1 and 2 have equal influences on the center node y(x) (a); PPS has an oriented covariance matrix, so point 1 is probabilistically closer to the center node y(x) than point 2 (b).
Therefore, each node y(x; w), x in {x_m}_{m=1}^{M}, has the oriented covariance

$$\boldsymbol{\Sigma}(\mathbf{x}) = \frac{\alpha}{\beta}\sum_{q=1}^{Q}\mathbf{e}_q(\mathbf{x})\mathbf{e}_q(\mathbf{x})^{\top} + \frac{D-\alpha Q}{\beta (D-Q)}\sum_{d=Q+1}^{D}\mathbf{e}_d(\mathbf{x})\mathbf{e}_d(\mathbf{x})^{\top}, \qquad 0 < \alpha < \frac{D}{Q}, \qquad (9)$$

where {e_q(x)}_{q=1}^{Q} is the set of orthonormal vectors tangential to the manifold at y(x; w) and {e_d(x)}_{d=Q+1}^{D} is the set of orthonormal vectors orthogonal to the manifold; the clamping factor α makes the covariance oriented orthogonal to the manifold for 0 < α < 1, spherical for α = 1, and oriented parallel to the manifold for 1 < α < D/Q.
The EM algorithm can be used to estimate the PPS parameters W and β, while the clamping factor α is fixed by the user and assumed to be constant during the EM iterations. If we choose a 3D latent space, a spherical manifold can be constructed using a PPS with nodes {x_m}_{m=1}^{M} arranged regularly on the surface of a sphere in the R^3 latent space, with the latent basis functions evenly distributed on the sphere at a lower density. After a PPS model is fitted to the data, the data themselves are projected into the latent space as points on a sphere (Figure 3).

Figure 3. (a) The spherical manifold in R^3 latent space; (b) the spherical manifold in R^D data space; (c) projection of the data points t onto the latent spherical manifold.
The latent manifold coordinates $\hat{\mathbf{x}}_n$ of each data point $\mathbf{t}_n$ are computed as

$$\hat{\mathbf{x}}_n = \langle \mathbf{x}\,|\,\mathbf{t}_n\rangle = \int \mathbf{x}\,p(\mathbf{x}\,|\,\mathbf{t}_n)\,d\mathbf{x} = \sum_{m=1}^{M} r_{mn}\,\mathbf{x}_m$$

where $r_{mn}$ are the latent variable responsibilities, defined as

$$r_{mn} = P(\mathbf{x}_m\,|\,\mathbf{t}_n) = \frac{p(\mathbf{t}_n\,|\,\mathbf{x}_m)P(\mathbf{x}_m)}{\sum_{m'=1}^{M} p(\mathbf{t}_n\,|\,\mathbf{x}_{m'})P(\mathbf{x}_{m'})} = \frac{p(\mathbf{t}_n\,|\,\mathbf{x}_m)}{\sum_{m'=1}^{M} p(\mathbf{t}_n\,|\,\mathbf{x}_{m'})}.$$
selecting
points on the
sphere
Having projected the data into the latent sphere, it is advisable for a data analyzer to localize the most interesting data points (obviously, this depends on the application at hand), for example the ones lying far away from more dense areas, or the ones lying in the overlapping regions between clusters, and to gain some information about them, by linking the data points on the sphere with their position in the data set which contains all the information about the typology of the data. Eventually, in the case of astronomical data set, for example, if the images corresponding to the data would be available to the user then he can visualize the object in the original image which corresponds to the data point selected into the sphere. These possibilities are fundamental for the astronomers who may be able to extract important meanings from the data and for all the data mining activities. Furthermore, the user is also allowed to select a latent variable and coloring all the points for which the latent variable is responsible (Figure 4).
70
Figure 4. Data points selection phase. The bold black circles represent the latent variables; the blue points represent the projected input data points. While selecting a latent variable, each projected point for which the variable is responsible is colored. Byselecting a data point the user is provided with information about it: coordinates and index corresponding to the position in the original catalog.
3.2. Visualizing sphere
the latent variable responsibilities
on the
The only projections of the data points into the sphere provide only partial information about the clusters inherently present in the data: if the points are strongly overlapped the data analyzer can not derive any information at all. A first insight on the number of agglomerate localized into the spherical latent manifold is provided by the mean of the responsibility for each latent variable. Furthermore, if we build a spherical manifold which is composed by a set of faces each one delimited by four vertices then we can color each face with colors varying in intensity on the base of the values of the responsibility associate to each vertex (and hence, to each latent variable). The overall result is that the sphere will contain regions more dense with respect to other and this information is easily visible and understandable. Obviously, what can happen is that a more dense area of the spherical manifold might contain more than one cluster, and this can be validated by further investigations. 3.3. A method
to visualize
clusters
on the
sphere
Once the user or a data analyzer has an overall idea of the number of clusters on the sphere, he can then exploit this information through the use of classical clustering techniques (such as hard or fuzzy fe-means1) to find out the prototypes of the clusters and the data therein contained. This task is accomplished by running the clustering algorithm on the projected data.
71
Afterwards, one may proceed by coloring each cluster with a given color (see Figure 5). The visualization options so far described have been integrated in a userfriendly graphical user interface which provides a unified tool for the training of the PPS model, and next, after the completion of the training phase, to accomplish all the functions for the visualization and the investigation of the given data set 7 .
Figure 5.
Clusters computed in the latent space by Ic-means.
4. E x p e r i m e n t s 4 . 1 . S y n t h e t i c Catalog
Visualizations
The catalog contains 20000 objects equally divided into two classes composed by stars and galaxies, respectively. Each object is described by eight features (parameters), namely the magnitudes in the corresponding eight optical filters. Figure 6 shows two different visualizations for the synthetic catalog, namely, 3-D PCA visualization and the spherical PPS projections. PPS projections onto the spherical latent manifold appear far and away more readable than PCA where all the data appear as a unique overlapped agglomerate (except a little isolated group). The figure also depicts the corresponding latent variable probability density function. By rotating the sphere with density, two high density regions are highlighted with other few lower density regions. 4.2. G O O D S Catalog
Visualizations
GOODS catalog is a star-galaxy catalog composed by 28405 objects. Each object is detected in 7 optical bands, namely U,BfV,R,IfJ}K bands. For
72
Zfito*b.usteiejs K:«J3P
:
e^g^"
"•
Figure 6. Prom top left to bottom: synthetic catalog 3D PCA projections, P P S projections on the sphere and probability density function on the sphere.
each band 3 different parameters (i.e., Kron radius. Flux and Magnitudes) are considered summing to a total number of 21 parameters. The catalog contains about 27000 galaxies and about 1400 stars. Moreover, there is a further peculiarity in the data contained in the catalog: the majority of the objects are "drop outs'*, i.e. they are objects not detectable in a given optical band. Among this type of objects there are groups which are not detectable in only one band, two bands, three bands and so on. The data set, therefore, contains four classes of objects, namely star, galaxy, star which are drop outs and galaxy which are drop outs (we do not care about the number of bands for which an object is a drop out). The GOODS catalog, is a very complex data set which exhibits four strongly overlapping classes. In fact, as it can be seen from Figure 7, the PCA visualization gives no interesting information at all, since it display only a single condensed group of data. In PCA, the class of dropped galaxies (whose objects are yellow colored), which contains the majority of objects (about 24000) is near totally hidden. The PPS projections, instead, show a large group consisting of the dropped galaxies and overlapping objects of the remaining objects
73 and a well bounded group of galaxies. Figure 7 also depicts the latent variable probability densities for galaxy and star objects, respectively. Note, especially, how different these densities appear* for each group of objects.
Figure 7. Prom top left to bottom right clockwise: GOODS 3D PGA projections, P P S projections on the sphere, galaxy density on the sphere and star density on the sphere.
5. Conclusions We discussed the potentiality of a particular non-linear latent variable model, namely the Probabilistic Principal Surfaces and highlighted the flexibility it exhibits in a number of activities fruitful for data mining applications focusing in particular on its visualization capabilities. Above all, the spherical PPS, which consists of a spherical latent manifold lying in a three dimensional latent space, is better suitable to high-D data since the sphere is able to capture the sparsity and periphery of data in large input spaces which are due to the curse of dimensionality. We proposed a number of visualization possibilities integrated in an user-friendly Graphical User Interface:
74 • Interactive selection of regions of sample points projected into the sphere for further analysis. This is particularly useful to profile groups of data. • Visualization of the latent variable responsibilities onto the sphere as a colored surface plot. It is specially, useful to localize more and less dense areas t o find out a first number of clusters present into the data, and t o highlight the regions where lies outliers. • A method to exploit the information gathered with the previous visualization options through a clustering algorithm to find out the clusters with the corresponding prototypes and d a t a points. T h e visualization tasks have been proved effective in a complex application domain: astronomical d a t a analysis. Astronomy is a very rich field for a computer scientist due to the presence of a very huge amount of data. Therefore, every day there is the need to resort to efficient methods which often are neural networks-based. T h e spherical P P S for visualization represents the first tool for astronomical d a t a mining which gives the possibility t o easily interact with t h e data. Although the study of the methods addressed in this paper is devoted to the astronomical applications, the system is general enough to be used in whatever data-rich field to extract meaningful information.
References 1. J.C. Bezdek, J. Keller, R. Krisnapuram, N.R. Pal, Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer Academic Publisher, (1999) 2. C. M. Bishop, M. Svensen, C.K.I. Williams, GTM: The Generative Topographic Mapping, Neural Computation, 10(1), (1998). 3. C M . Bishop, Latent variable models, In M. I. Jordan (Ed.), Learning in Graphical Models, MIT Press, (1999). 4. K. Chang, Nonlinear Dimensionality Reduction Using Probabilistic Principal Surfaces, PhD Thesis, Department of Electrical and Computer Engineering, The University of Texas at Austin, USA, (2000) 5. K. Chang, J. Ghosh, A unified Model for Probabilistic Principal Surfaces, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, NO. 1, (2001) 6. T. Kohonen, Self-Organizing Maps, Springer-Verlag, Berlin, (1995) 7. A. Staiano, Unsupervised Neural Networks for the Extraction of Scientific Information from Astronomical Data, PhD Thesis, Universia di Salerno, Italy, (2003).
75
A STUDY ON RECOVERING THE CLOUD-TOP HEIGHT FROM INFRA-RED VIDEO SEQUENCES
ANNA ANZALONE
IASF/CNR, Via Ugo La Malfa, 153, 90146 Palermo, Italy
E-mail: [email protected]
FRANCESCO ISGRO
Dipartimento di Informatica e Scienze dell'Informazione, Via Dodecaneso, 35, 16146 Genova, Italy
E-mail: [email protected]
DOMENICO TEGOLO
Dipartimento di Matematica ed Applicazioni, Via Archirafi, 34, 90123 Palermo, Italy
E-mail: [email protected]
In this paper we present some preliminary results on an optical-flow based technique aimed at recovering the cloud-top height from infra-red image sequences. The recovery of the cloud-top height from satellite infra-red images is an important topic in meteorological studies, and is traditionally based on the analysis of the temperature maps. In this work we explore the feasibility for this problem of a technique based on a robust multi-resolution optical-flow algorithm. The robustness is achieved adopting a Least Median of Squares paradigm. The algorithm has been tested on semi-synthetic data (i.e. real data that have been synthetically warped in order to have a reliable ground truth for the motion field), and on real short sequences (pairs of frames) coming from the ATSR.2 data set. Since we assumed the same geometry as for the ATSR.2 data, the cloud top height could be recovered from the motion field by means of the widely used Prata and Turner equation.
1. Introduction
The interest in monitoring cloud properties by means of observations from satellites is due to the large influence that clouds have on the earth/atmosphere energy balance. Cloud properties, such as optical depth, visible albedo, thickness and geometrical heights, are difficult to measure directly, while satellite instruments can make useful indirect measurements. Reliable estimation of cloud parameters, and in particular global cloud-top height retrieval, is one of the main goals of the production of measurements from space imagery for numerical weather forecasting, atmosphere studies and climate modelling. Focusing on cloud-top height, different approaches are exploited to evaluate this parameter from satellite data. The most widely used procedures are based on radiative transfer methods that rely on extra information derived from radiosonde measurements or objective analyses, whose uncertainties could yield height estimates with errors as large as 1-3 km, as claimed in [1]. Techniques like brightness temperature (infrared window method) [2, 3], for example, combine cloud infra-red emissivity estimates, external temperature profiles or lapse rate with the cloud-top temperature measured by the 11-12 μm IR channels of satellite instruments. For optically thin clouds the observed brightness temperature needs to be modified due to the semi-transparency of the cloud. In this case, in fact, the cloud temperature is affected by the ground/cloud and the atmosphere below. In very diffuse clouds like cirrus, the adjusted temperature may correspond more closely to the cloud centre than to the cloud top, the mass of the cloud being spread out over a depth of several kilometres even though the optical depth may only be 1 or 2. No correction is applied for optically thick clouds, because all the vertically propagating IR radiation comes from the cloud top or very close to it. Finally, the cloud brightness temperature is compared to a temperature profile of the atmosphere and the lowest altitude with the same temperature is assigned as the cloud height. The CO2 slicing method [4] uses the 15 μm thermal emission region, where CO2 has a moderate to strong absorption characteristic, and it is based on the ratio, for two nearby bands, of the differences between the radiation intensities measured in a pixel covered by a cloud and the intensities expected for a cloud-free pixel. Finally, the Oxygen A-band method [5] analyses the reflected solar radiation. Stereoscopy is a different approach that has the advantage of depending only on the geometry of the observations. In the past, stereo measurements
were achieved by means of data coming from different configurations of satellites, such as simultaneous geostationary satellite image pairs and also geostationary and polar orbiter image pairs [6, 7]. At present there are some single polar satellites carrying on board instruments with two views that are currently operative: ERS2/ATSR2, ENVISAT/AATSR, EOS Terra-ASTER. Besides the single polar orbiters, EOS Terra and Aqua carry an instrument (MISR) with nine views. The Multi-angle Imaging SpectroRadiometer (MISR) collects data from 4 spectral bands (visible and near-IR) and provides 9 different views of the same scene within a total temporal range of 7 min, with an inter-camera delay of 45-60 s. It provides operational and simultaneous retrieval of cloud-top heights and motion obtained from a stereoscopic approach [8]. Feature- and area-based stereo matchers are applied to suitable triplets of properly selected views to provide cloud-motion evaluation on 70.4 km grids with an accuracy of about 3 m/s and cloud-height values on 1.1 km grids (due to processing time constraints) with a height resolution of 562 m. The Along-Track Scanning Radiometer-2 (ATSR2) on board the ERS2 satellite (780 km altitude) collects data from 7 spectral bands (visible and IR) and views the surface along the direction of the orbit track at an incidence angle of 55 degrees as it flies toward the scene (forward view). Then, some 120 s later, ATSR records a second observation of the scene at an angle close to 0 degrees (nadir view) [9]. Some studies on ATSR data have led to the development of a multi-resolution feature-based matching procedure to retrieve cloud-top height maps with wind correction, and results are compared in [10] to those from MISR. In this paper we present a novel approach to the problem, which is computer-vision based. The main problem in reconstructing a scene from images is establishing correspondences between the images [11]. All the algorithms for recovering the cloud-top height from images that we have found in the literature approach the correspondence problem as a stereo matching problem. Here we exploit the fact that, under the assumption that images can be acquired at a relatively high frame rate, the apparent motion in the image plane cannot be very large. Therefore correspondences can be obtained by computing this apparent motion, which is known in the literature as optical flow [11]. Assuming that the imaging geometry is the same as for the ATSR2 data, the reconstruction is simply obtained by applying the Prata and Turner equation [12]. The paper is structured as follows. The next section briefly explains the
geometry of the imaging system. The algorithm is detailed in Section 3. Experiments are reported and discussed in Section 4. Section 5 is left to final remarks.
2. Geometry of the satellite viewing system
Two views of the same scene are acquired at different times. One view is taken by the sensor while the satellite flies towards the scene at time t1, and a second one is acquired at a time t2 such that most of the same scene is in the field of view of the sensor at the two different times. The angles χ_{t1} and χ_{t2} are the satellite zenith angles projected on the sub-satellite track plane [12]; due to the curvature of the Earth's surface these angles are different for each point of the image rows (across-track direction). Assuming that the effect of the wind is negligible, it is easy to see from a simple geometrical reasoning on Fig. 1 that the height h of the cloud can be recovered from the two angles χ_{t1} and χ_{t2} and the pixel shift u_y along the vertical direction, that is, the vertical component of the 2D motion vector on the image. By doing some simple algebra we obtain the formula

$$h = \frac{u_y}{\tan\chi_{t_1} - \tan\chi_{t_2}} \qquad (1)$$

where u_y is expressed as a ground distance.

Figure 1. The geometry we assume for the imaging system.
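A minimal sketch of Eq. (1); the conversion of the pixel shift to a ground distance through a pixel-size parameter is our assumption, not a value given in the paper:

```python
import numpy as np

def cloud_top_height(u_y_pixels, chi_t1, chi_t2, pixel_size_km=1.0):
    """Cloud-top height (km) from the along-track pixel shift and the two zenith angles.

    u_y_pixels: vertical pixel shift (scalar or array)
    chi_t1/2:   projected satellite zenith angles, in radians
    """
    parallax = np.asarray(u_y_pixels, dtype=float) * pixel_size_km  # ground distance
    return parallax / (np.tan(chi_t1) - np.tan(chi_t2))
```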
3. Description of the method
The scheme followed by our system prototype in order to retrieve the cloud-top height is depicted in Fig. 2. In a nutshell, the algorithm first establishes dense correspondences between two consecutive frames in the sequence, and then uses the disparity field, together with the angle information, for recovering the height of each cloud point. Points on the earth are filtered out of this computation by means of a precomputed binary cloud mask, where all the pixels are flagged either as earth points (i.e. images of points on the earth) or as cloud points (images of clouds).
Figure 2. Schematic representation of the algorithm.
The system is composed of the following main modules:
(1) optical flow estimation: estimates the 2D motion between two consecutive frames, say I_{t1} and I_{t2}, in the sequence; the motion field is computed both ways, i.e. from frame I_{t1} to frame I_{t2}, and from frame I_{t2} to frame I_{t1}.
(2) consistency check and interpolation: the two motion fields computed during the previous step are validated and merged into a single field; the optical flow for pixels whose motion vector is not considered accurate is computed by means of interpolation.
(3) height estimation: this module uses the motion field produced by the previous steps and computes the height of each cloud pixel, using Eqn. (1).
The rest of this section is dedicated to a more detailed description of the three modules of the system.
3.1. Optical flow estimation
The optical flow between the two input frames I_{t1} and I_{t2} is computed in order to establish the dense pixel correspondence map necessary for synthesising the warped frame. We implemented a non-iterative version of the Lucas-Kanade optical flow algorithm [13], where for each pixel p the motion vector u is obtained by solving the system

$$\begin{pmatrix} \nabla I(\mathbf{p}_1)^{\top} \\ \nabla I(\mathbf{p}_2)^{\top} \\ \vdots \\ \nabla I(\mathbf{p}_n)^{\top} \end{pmatrix}\mathbf{u} = -\begin{pmatrix} \partial_t I(\mathbf{p}_1) \\ \partial_t I(\mathbf{p}_2) \\ \vdots \\ \partial_t I(\mathbf{p}_n) \end{pmatrix}$$
where the pixels p_i are all the points in a neighbourhood of p. In order to avoid the contribution of corrupted pixels, a situation that is likely to happen with this kind of images, the optical flow is robustly computed: the system is solved using a robust statistical method based on the Least Median of Squares paradigm [14, 15]. If the motion vector u has a large magnitude, the optical flow algorithm cannot return accurate results, and in our case this can happen if the time interval between the two consecutive frames is large. A standard way of coping with large motion vectors is to use a multi-resolution approach [16]: the motion is computed for the coarsest images and this estimate is used for predicting the solution at the finer level. Our multi-resolution optical flow, similar to the one described in [17], works as follows:
(1) build a Gaussian pyramid of depth N for each of the input images;
(2) compute the optical flow u^N between the coarsest images I_{t1}^N and I_{t2}^N;
(3) scale and interpolate u^N to obtain a prediction u^{N-1} of the flow between I_{t1}^{N-1} and I_{t2}^{N-1};
(4) warp I_{t2}^{N-1} with the flow u^{N-1}, obtaining the image Ĩ_{t2}^{N-1};
(5) compute the optical flow between the images I_{t1}^{N-1} and Ĩ_{t2}^{N-1}; this gives the residual optical flow Δu^{N-1};
(6) set u^{N-1} = u^{N-1} + Δu^{N-1};
(7) iterate the process until the finest level of the pyramid is reached, taking the flow obtained at the finest level as the final estimate.
The estimation of the optical flow is speeded up by computing the motion vector only for the cloud pixels.
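A condensed sketch of a pyramidal Lucas-Kanade scheme similar to the one described above; plain least squares (instead of the Least Median of Squares used by the authors) and a naive 2x pyramid are our simplifications:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def lucas_kanade(I1, I2, win=4):
    """Dense LK flow: per pixel, solve [Ix Iy] u = -It over a (2*win+1)^2 window."""
    I1 = I1.astype(float); I2 = I2.astype(float)
    Iy, Ix = np.gradient(I1)
    It = I2 - I1
    flow = np.zeros(I1.shape + (2,))
    for r in range(win, I1.shape[0] - win):
        for c in range(win, I1.shape[1] - win):
            sl = (slice(r - win, r + win + 1), slice(c - win, c + win + 1))
            A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)
            b = -It[sl].ravel()
            flow[r, c], _, _, _ = np.linalg.lstsq(A, b, rcond=None)
    return flow                                     # flow[r, c] = (u_x, u_y)

def warp(I, flow):
    """Sample image I at p + u(p): warps the second frame back towards the first."""
    rows, cols = np.indices(I.shape)
    coords = [rows + flow[..., 1], cols + flow[..., 0]]
    return map_coordinates(I.astype(float), coords, order=1, mode='nearest')

def pyramidal_flow(I1, I2, levels=3, win=4):
    """Coarse-to-fine estimation following steps (1)-(7) above."""
    pyr1, pyr2 = [I1], [I2]
    for _ in range(levels - 1):                     # naive 2x downsampling
        pyr1.append(pyr1[-1][::2, ::2]); pyr2.append(pyr2[-1][::2, ::2])
    flow = np.zeros(pyr1[-1].shape + (2,))
    for lvl in range(levels - 1, -1, -1):
        I1l, I2l = pyr1[lvl], pyr2[lvl]
        if flow.shape[:2] != I1l.shape:             # scale and resize the prediction
            flow = 2.0 * np.stack(
                [np.kron(flow[..., k], np.ones((2, 2)))[:I1l.shape[0], :I1l.shape[1]]
                 for k in range(2)], axis=-1)
        residual = lucas_kanade(I1l, warp(I2l, flow), win)   # flow left to explain
        flow = flow + residual
    return flow
```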
3.2. Consistency check and interpolation
The two motion fields computed by the module described in the last section can contain corrupted vectors, which should be detected and removed. A straightforward and widely used method, especially in stereo matching [18], is the so-called consistency check. In our case the method takes the two motion fields u_{t1t2} and u_{t2t1}, and for each pixel p it produces a new optical flow u_{t1t2} via the following rule:

$$\mathbf{u}_{t_1 t_2}(\mathbf{p}) = \begin{cases} \mathbf{u}_{t_1 t_2}(\mathbf{p}) & \text{if } \left\|\mathbf{u}_{t_1 t_2}(\mathbf{p}) - \mathbf{u}_{t_2 t_1}\big(\mathbf{p} + \mathbf{u}_{t_1 t_2}(\mathbf{p})\big)\right\| < \Lambda \\ \infty & \text{otherwise} \end{cases}$$
where Λ is a threshold value that we set to 1, meaning that the error we can tolerate in the motion field is not larger than one pixel. The pixels that do not pass the consistency test (the ones for which the motion field is set to infinity) are assigned a motion vector obtained from the consistent vectors by an interpolation rule. At this stage of the work we use a simple nearest-point interpolation rule [19].
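A sketch of the consistency test exactly as written above (array layout and the handling of out-of-image look-ups are illustrative choices of ours):

```python
import numpy as np

def consistency_check(flow_12, flow_21, thresh=1.0):
    """Invalidate vectors not confirmed by the reverse flow field."""
    h, w = flow_12.shape[:2]
    rows, cols = np.indices((h, w))
    # follow the forward vector and look up the backward flow at the displaced pixel
    r2 = np.clip((rows + flow_12[..., 1]).round().astype(int), 0, h - 1)
    c2 = np.clip((cols + flow_12[..., 0]).round().astype(int), 0, w - 1)
    diff = np.linalg.norm(flow_12 - flow_21[r2, c2], axis=2)
    checked = flow_12.copy()
    checked[diff >= thresh] = np.inf        # flagged for later interpolation
    return checked
```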
3.3. Height estimation
The height estimation is very simple, as it is a straightforward application of Eqn. (1). The zenith angles are measured directly by the satellite, and we can assume that this measurement is accurate enough. Therefore the error in the recovered height depends only on the error in the image motion vector, and in particular on the vertical component of the optical flow, as only u_y appears in Eqn. (1).

4. Experimental assessment
As the only source of error in the reconstruction of the cloud-top height is the estimated motion field, we mainly focused the experimental assessment of our method on the optical flow estimation. Moreover, it must be noted that obtaining ground truth data is not an easy task, which we preferred to postpone at this preliminary stage of the work.
4.1. Experiments with synthetic motion
In order to evaluate the performance of the optical flow algorithm on infrared satellite images we produced synthetic sequences (with known motion field) starting from real satellite images taken from the ATSR2 data set. In practice we selected some images from the database, and from each one of
them we created a synthetic sequence by applying a chain of affine transformations to the original image. More formally, starting from a real image I we create a sequence I_i as I_i = H_i I, where H_i is the 3 x 3 matrix of an affine transformation and H_0 is the identity matrix. The images I_i are then corrupted by additive zero-mean Gaussian noise. Given a synthetic sequence, we computed the optical flow between pairs of consecutive frames. As measures of goodness for the estimation of the optical flow we used the following: (a) simple statistics of the error on the motion vector (mean error); (b) the peak signal-to-noise ratio between the original image and the reconstructed image. This last measure needs to be explained in more detail. Given the optical flow u_{i,i+1} between the frames I_i and I_{i+1}, it is possible to warp back I_{i+1} using the optical flow, obtaining an image Ĩ_i that, in the noiseless case and assuming a perfect optical flow, should be an exact copy of I_i. Therefore we can use as a measure the widely used Peak Signal-to-Noise Ratio (PSNR), defined as
$$\mathrm{PSNR}(I,\tilde{I}) = 20\log_{10}\frac{s}{\sqrt{\frac{1}{|P|}\sum_{\mathbf{p}}\big(I(\mathbf{p})-\tilde{I}(\mathbf{p})\big)^{2}}}$$

where s is the peak signal value. Here we present results on two synthetic sequences generated from the first frame in Fig. 5. For the first one the norm of the true motion vector is one pixel for each image point; for the second sequence the true motion is three pixels for each image point. For each sequence we show both error measures. The PSNR is shown for all three optical flows computed by our algorithm (the two one-way optical flows and the one obtained from the consistency and interpolation step). For both sequences we show the evolution of the error measures across the sequence. In Fig. 3 we show the results for the first sequence; the results for the second sequence are shown in Fig. 4. The experiments were run with a three-level pyramid, and the size of the neighbourhood for solving for the motion vector is 9 x 9. The results reported in the figures show that the error in recovering the motion field is, on average, always smaller than one pixel, which is a very comforting result. The graphs of the PSNR show that the values relative to the optical flow after the consistency test are, in general, between the values relative to the other two motion fields: this means that the optical
Figure 3. Results of the experiments on the first synthetic sequence. Left: mean error on the OF. Right: PSNR for the reconstructed images
Figure 4. Results of the experiments on the second synthetic sequence. Left: mean error on the OF. Right: PSNR for the reconstructed images
This means that the optical flow computed by our routine can be considered as a mean of the two motion fields that can be obtained directly from the images.
4.2. Experiments on ATSR2 data
We also ran some experiments on pairs of real images that we extracted from the ATSR2 data set. For these pairs we could run a complete recovery of the cloud-top height. However, we must admit that this kind of data is not very suitable for our algorithm, as the time interval between the two images is long (120 seconds), and therefore the motion vector can sometimes be very large. Moreover, the two images are originally acquired at different resolutions, and the smaller one is then rescaled to the size of the larger. This is a problem for stereo matching algorithms, but it is particularly serious for a differential approach like ours. In Fig. 5 we show a pair of images from the dataset, and in Fig. 6 the relative height map is shown. The darkest pixels represent points closest to the ground level.
Figure 5. Real pair from the ATSR2 data set
Figure 6. Left: cloudiness binary mask for the real pair in Fig. 5. Right: height map for the same real pair, brightness is proportional to the height
5. Conclusions
In this paper we presented some preliminary results of an algorithm for recovering the cloud-top height from infrared satellite images. The algorithm assumes that the images can be acquired at a relatively high frame rate (i.e., the norm of the motion field is no more than 3-4 pixels). The experiments
run on synthetic sequences returned promising results, as shown and discussed in Section 4. The results on real data are less exciting, but this is due more to the nature of the particular data set that we used, which makes it not very suitable for our algorithm. More work is planned. First, we need to evaluate the performance of the whole algorithm on real data, obtaining ground truth from other kinds of sensors. A comparison between the results of our algorithm and those obtained with already existing techniques also needs to be done. Moreover, we would like to explore the use of state-of-the-art stereo matching algorithms in this context.
Acknowledgments
The authors wish to thank M. C. Maccarone for useful discussions. The ATSR2 data are supplied by NERC and ESA. This work has been partially supported by the FIRB Project ASTARBAU01877R. This work has been partially carried out inside the EUSO project.
References
1. A. Horvath and R. Davies. Feasibility and error analysis of cloud motion wind extraction from near-simultaneous multiangle MISR measurements. Journal of Atmospheric and Oceanic Technology, 18:591-608, 2001.
2. S.J. Nieman, J. Schmetz, and W.P. Menzel. A comparison of several techniques to assign heights to cloud tracers. Journal of Applied Meteorology, 32:1559-1568, 1993.
3. P. Minnis et al. CERES cloud property retrievals from imagers on TRMM, TERRA and AQUA. In Proceedings of the VIII SPIE Conference on Remote Sensing of Clouds and the Atmosphere, pages 37-48, 2004.
4. M. King, Y. Kaufman, W.P. Menzel, and D. Tanre. Remote sensing of cloud, aerosol, and water vapor properties from the moderate resolution imaging spectrometer (MODIS). IEEE Transactions on Geoscience and Remote Sensing, 30(1), 1992.
5. D.M. O'Brien and R.M. Mitchell. Error estimates for the retrieval of cloud-top pressure using absorption in the A band of oxygen. Journal of Applied Meteorology, 31(10):1179-1192, 1992.
6. G. Campbell and K. Holmlund. Geometric cloud heights from Meteosat and AVHRR. In Proceedings of the Fifth International Winds Workshop, Australia, 2000.
7. H. Yi and P. Minnis. A proposed multiangle satellite dataset using GEO, LEO and Triana. In Proceedings of the AMS 11th Conference on Satellite Meteorology and Oceanography, Wisconsin, 2001.
8. J.P. Muller, A. Mandanayake, C. Moroney, R. Davies, D.J. Diner, and S. Paradise. MISR stereoscopic image matchers: techniques and results. IEEE Transactions on Geoscience and Remote Sensing, 40(7):1547-1559, 2002.
9. C. Mutlow. ATSR-1/2 user guide. Technical report, Rutherford Appleton Laboratory, 1999.
10. G. Seiz, E.P. Baltsavias, and A. Gruen. 3D cloud-products for weather prediction and climate modelling. Geographica Helvetica, 58:90-98, 2003.
11. E. Trucco and A. Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall, 1998.
12. A.J. Prata and P.J. Turner. Cloud-top height determination using ATSR data. Remote Sensing of Environment, 59:1-13, 1997.
13. B.D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 674-679, 1981.
14. P.J. Rousseeuw and A.M. Leroy. Robust Regression and Outlier Detection. Wiley Series in Probability and Mathematical Statistics. Wiley, New York, 1987.
15. P. Meer, D. Mintz, A. Rosenfeld, and D.Y. Kim. Robust regression methods for computer vision: a review. International Journal of Computer Vision, 6(1):59-70, 1991.
16. F. Isgro, E. Trucco, and L.Q. Xu. Towards teleconferencing by view synthesis and large-baseline stereo. In Proceedings of the IAPR International Conference on Image Analysis and Processing, pages 198-203, 2001.
17. R. Krishnamurthy, P. Moulin, and J.W. Woods. Optical flow techniques applied to video coding. In Proceedings of the IEEE International Conference on Image Processing, volume I, pages 570-573, 1995.
18. O. Faugeras, B. Hotz, H. Mathieu, T. Vieville, Z. Zhang, P. Fua, E. Theron, L. Moll, G. Berry, J. Vuillemin, P. Bertin, and C. Proy. Real time correlation-based stereo: algorithm, implementations and applications. Technical Report 2013, INRIA, Sophia-Antipolis Cedex, France, August 1993.
19. A. Sharaf and F. Marvasti. Motion compensation using spatial transformations with forward mapping. Signal Processing: Image Communication, 14:209-227, 1999.
POWERFUL TOOLS FOR DATA MINING: FRACTALS, POWER LAWS, SVD AND MORE
CHRISTOS FALOUTSOS
Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3891, USA

What patterns can we find in bursty web traffic? On the web graph itself? How about the distributions of galaxies in the sky, or the distribution of a company's customers in geographical space? How long should we expect a nearest-neighbor search to take, when there are 100 attributes per patient or customer record? The traditional assumptions (uniformity, independence, Poisson arrivals, Gaussian distributions) often fail miserably. Should we give up trying to find patterns in such settings? We present an extended summary of two lectures that focus on powerful but less known tools, namely on the Singular Value Decomposition (SVD) and on Fractals. SVD is a provably optimal method for dimensionality reduction and feature selection; it is the engine under the hood of breakthrough concepts like Latent Semantic Indexing (LSI), the Karhunen-Loeve transform and the Kleinberg algorithm for web-site importance ranking, to name a few. Fractals, self-similarity and power laws are extremely successful in describing real datasets (coastlines, river basins, stock prices, brain surfaces, web and disk traffic, to name a few). Although both tools are impressively general and useful, their introductory papers are typically not tailored towards a database audience, rendering them inaccessible. These lectures try to remedy exactly this situation. Specifically, we have two goals: (a) to introduce the most useful concepts from SVD and Fractals, emphasizing the intuition behind them and avoiding unnecessary mathematical intricacies, and (b) to illustrate the usefulness of SVD and fractals for a variety of database and data mining applications.
1. Power laws and fractals
Suppose we have the cities of the world on a 2-d map. How long will a nearest-neighbor search take? Even harder, suppose we have, say, 20-d feature vectors of images — again, how long will a nearest-neighbor search take? Papers claim that for high dimensionality, nearest-neighbor search will be as slow as sequential scan. We show that this is partially true: it is the high intrinsic dimensionality that creates problems. The intrinsic, or fractal, dimension of a cloud of points can be defined as the slope of a power law. Let nb(r) be the average number of neighbors of a point within distance r or less. Definition 1. If nb(r) ∝ r^D
(1)
for a range of scales r_min < r < r_max, then we say that the given cloud of points has intrinsic/fractal dimension D. The above definition encompasses all Euclidean objects, giving the expected Euclidean dimension. For example, 3-d points along a line will give an intrinsic/fractal dimension equal to 1. What makes the definition very useful and surprising is that several real datasets obey Eq. (1), but with a non-integer exponent D! Such datasets include coastlines (1.1-1.3); the surface of the mammalian brain (2.6-2.7); the periphery of clouds and rain patches; the cardiovascular system (3); the pulmonary system (2.9); river basins, and many more (Mandelbrot, '77), (Schroeder, '91). How is it ever possible to have a non-integer intrinsic dimensionality? The Sierpinski triangle is probably one of the most famous fractals: consider a triangle and let A, B, C be the middles of its sides; delete the triangle ABC, and repeat recursively for the resulting triangles. In the limit, the remaining set of points has infinite perimeter and zero area, and it is thus neither two-dimensional nor one-dimensional. It turns out that the intrinsic dimensionality is log(3)/log(2) = 1.58, which is between 1 and 2, as expected, but not an integer! Equation (1) is a power law, because it links two quantities through an exponent: y = x^D. Power laws obviously result in straight lines if we plot the (x, y) pairs in bi-logarithmic scales. Moreover, power laws are closely related to fractals, and they also appear extremely often in practice. Famous power laws include the Zipf law (Zipf, '49), the Pareto law of income distribution, the Lotka law of scientific publications, and the Gutenberg-Richter law of the distribution of earthquake energy (Bak, '96).
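As an illustration of Definition 1, the following sketch estimates the intrinsic/fractal dimension as the slope of log nb(r) versus log r; the brute-force pairwise distances, the choice of radii and the example dataset are assumptions made for this toy example (a production implementation would use the fast estimators mentioned in the lecture).

```python
import numpy as np

def fractal_dimension(points, radii):
    """Estimate the intrinsic/fractal dimension D of a cloud of points as the
    slope of log nb(r) versus log r (Definition 1), over a range of radii."""
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # nb(r): average number of neighbours within distance r, excluding the point itself.
    nb = [np.mean(np.sum(dists <= r, axis=1) - 1) for r in radii]
    slope, _intercept = np.polyfit(np.log(radii), np.log(nb), 1)
    return slope

# Sanity check: 3-d points along a line should give D close to 1.
line = np.outer(np.linspace(0.0, 1.0, 500), [1.0, 2.0, 3.0])
print(fractal_dimension(line, radii=np.linspace(0.05, 0.5, 8)))
```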
More recently, power law tails have been observed in the in- and out-degree distribution of web sites (Kumar et al., '99), autonomous systems in computer networks (Faloutsos et al., '99), and many more (Barabasi, '02). In the lecture we present details on how to use the fractal dimension for (a) estimation of the effort for range and nearest-neighbor queries, (b) dimensionality reduction, and (c) discovering patterns in one or two clouds of points. We also show how to estimate the fractal dimension quickly, and the relationship between the 80-20 "law" and the so-called multi-fractals. Finally, we show several more cases where power laws have been observed.
2. Singular Value Decomposition
The Singular Value Decomposition (SVD) of an m x n matrix is an extremely useful operation, that has been rediscovered multiple times, under the names of Karhunen-Loeve transform (Duda and Hart, '73), (Fukunaga, '90), Principal Component Analysis (PCA) (Jolliffe, '86), Latent Semantic Indexing (LSI) (Deerwester et al., '90), and Ratio Rules (Korn et al., '98). It can be applied in any setting where we have a matrix, like, for example:
• A cloud of N points in d dimensions, such as a set of images represented by their d-dimensional feature vectors (Faloutsos, '96)
• A graph of N nodes and E edges, which can be represented by its adjacency matrix
• A collection of N time series, each with n time ticks, like, for example, N stocks and their daily closing prices over the past year (Korn et al., '97)
• A collection of N documents, each using some of the available V vocabulary terms. This is the vector space model of information retrieval (Salton, '83)
• A collection of N market baskets, each having one or more of the m available products, in the typical setting for association rules (Agrawal et al., '93).
Let's start with the main theorem: Theorem 1. Given an m x n real matrix A we can express it as A = U x D x V^T
(2)
where U is a column-orthonormal m x r matrix, r is the rank of the matrix A, D is a diagonal r x r matrix and V is a column-orthonormal n x r matrix.
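A small numerical sketch of Theorem 1 and of the rank-k truncation used for dimensionality reduction; the random data matrix and the choice k = 2 are arbitrary assumptions made for the example.

```python
import numpy as np

# Toy data matrix: m rows (e.g. documents or customers), n columns (terms, products, ...).
A = np.random.default_rng(0).random((8, 5))

# Thin SVD: A = U @ diag(d) @ Vt, with U and Vt column-/row-orthonormal.
U, d, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k truncation: keep the k largest singular values. This is the RMSE-optimal
# dimensionality reduction mentioned in the list of applications below.
k = 2
A_k = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]
print("rank-%d reconstruction error: %.4f" % (k, np.linalg.norm(A - A_k)))

# k-dimensional coordinates ("concept space") for every row of A.
rows_in_concept_space = U[:, :k] * d[:k]
```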
Proof: See (Press et al., '92). A matrix is column-orthonormal if its columns are unit vectors, perpendicular to each other. The SVD of a matrix gives the solution to multiple problems, as we shall illustrate in the lectures:
• Dimensionality reduction: SVD is the optimal method to do dimensionality reduction, if our goal is to minimize the root mean squared error (RMSE) between the actual data points and the reduced-dimension approximations (Faloutsos et al., '94).
• Clustering: if the input matrix A is block-diagonal, SVD will find the appropriate ordering of rows and columns to highlight the block-diagonal property. Each block corresponds to a group of rows (and corresponding columns) that are very similar to each other.
• Web search engines: two of the most successful web-site ranking algorithms, PageRank (Brin and Page, '98) and HITS (Kleinberg, '98), use SVD to discover the most popular nodes in a directed graph.
3. Conclusions
We illustrate the intuition and potential usages behind two tools: (a) fractals (and the corresponding self-similarity, power laws and chaos) and (b) Singular Value Decomposition, for the analysis of matrices (clouds of points; graphs). References Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Database mining: a performance perspective. IEEE Trans, on Knowledge and Data Engineering, 5(6):914-925, 1993. Per Bak. How nature works : The science of self-organized criticality, September 1996. Albert-Laszlo Barabasi. Linked: The New Science of Networks, 1st edition. Perseus Publishing, May 2002. S. Brin and L. Page. The anatomy of a large-scale hypertextual (web) search engine. In Proc. 7th International World Wide Web Conference (WWW7)/Computer Networks, pages 107-117, 1998. Published as Proc. 7th International World Wide Web Conference (WWW7)/'Computer Networks, Vol. 30, Nos. 1-7. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, September 1990.
R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
Christos Faloutsos. Searching Multimedia Databases by Content. Kluwer Academic, 1996.
Christos Faloutsos, Ron Barber, Myron Flickner, J. Hafner, Wayne Niblack, Dragutin Petkovic, and William Equitz. Efficient and effective querying by image content. J. of Intelligent Information Systems, 3(3/4):231-262, July 1994.
Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. On power-law relationships of the internet topology. SIGCOMM, pages 251-262, Aug.-Sept. 1999.
Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition, 2nd edition. Academic Press, 1990.
I.T. Jolliffe. Principal Component Analysis. Springer Verlag, 1986.
Jon Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Also appears as IBM Research Report RJ 10076, May 1997.
Flip Korn, H.V. Jagadish, and Christos Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. ACM SIGMOD, pages 289-300, May 13-15, 1997.
Flip Korn, Alexandros Labrinidis, Yannis Kotidis, and Christos Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB, 1998.
S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. Extracting large-scale knowledge bases from the web. VLDB, pages 639-650, 1999.
B. Mandelbrot. Fractal Geometry of Nature. W.H. Freeman, New York, 1977.
William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C, 2nd edition. Cambridge University Press, 1992.
G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
Manfred Schroeder. Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. W. H. Freeman, New York, 1991.
G.K. Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley, Cambridge, MA, 1949.
AN UNSUPERVISED SHOT CLASSIFICATION SYSTEM FOR NEWS VIDEO STORY DETECTION
M. DE SANTO, Dipartimento di Ingegneria dell'Informazione e di Ingegneria Elettrica, Universita di Salerno - Via P.te Don Melillo, 1, I-84084, Fisciano (SA), Italy
G. PERCANNELLA, Dipartimento di Ingegneria dell'Informazione e di Ingegneria Elettrica, Universita di Salerno - Via P.te Don Melillo, 1, I-84084, Fisciano (SA), Italy
C. SANSONE, Dipartimento di Informatica e Sistemistica, Universita di Napoli "Federico II" - Via Claudio, 21, I-80125 Napoli, Italy
M. VENTO, Dipartimento di Ingegneria dell'Informazione e di Ingegneria Elettrica, Universita di Salerno - Via P.te Don Melillo, 1, I-84084, Fisciano (SA), Italy
Automatic classification of the shots extracted from a news video plays an important role in the context of news video segmentation, which is an essential step towards effective indexing of broadcasters' digital databases. In this paper, we propose a system for news video story detection based on unsupervised shot classification. Since it is always possible to find a case in which the assumptions of a single shot classification technique fail, in our system shot classification is performed by means of a multi-expert approach. In order to assess the performance of the proposed system, we built a database significantly larger than those typically used in the field. Experimental results demonstrate the effectiveness of the proposed approach both in terms of shot classification and of news story detection capability.
1. Introduction
In order to allow a faster and more appealing use of news video databases, indexing and retrieval are essential issues to be addressed. A first step towards an effective indexing is the segmentation of a news video into stories. It implies, at a first stage, the partition of the video into sequences of frames, called shots, obtained by detecting transitions that are typically associated to camera changes. Once the shots have been individuated by means of a shot change detection algorithm, they can be classified on the basis of their content. Two different classes are typically considered, an anchor shot and a news report shot class. Successively, the entire
news video is divided into stories; the latter are obtained by grouping each anchor shot with all the successive news report shots, until another anchor shot will occur. In this paper we address mainly the shot classification problem which is to be considered as a preliminary and fundamental step for the segmentation of news videos into stories. Some results on the story classification are also reported in the paper. In the literature, most of the approaches exploiting the video source information use a model matching strategy [1,2,3]. For each shot, a distinctive frame, called keyframe, is extracted. Then, the key-frame is matched against a set of pre-defined models of an anchor shot frame used for its classification. Obviously, all these approaches are strongly dependent on the model of the specific video program. This is a severe limitation, since it is generally difficult to define all the possible models needed for the different news videos we would consider. It is worth considering that nowadays the number and variety of the existing news programs in the world is so high to make it practically infeasible to build a good general model; moreover, even when limiting the attention to a single TV broadcaster, the use of digital production techniques make easy and desirable to change the news model repeatedly over the time. For these reasons it is likely to use techniques for shot classification neither using a model-based approach nor requiring a specific training phase, so avoiding a continuous updating of the video model to guarantee acceptable performance for a longer period. Other authors use a face detection approach to identify anchor shots [4]. However, face detection is generally unsuitable for practical application in the videos realm due to its time-consuming characteristics. Furthermore, a shot where a reporter (not an anchorman!) is present can be erroneously recognized as an anchorperson shot by the face detection module. A different approach based on the frame statistics is presented in [5], where the authors use an Hidden Markov model (HMM) to classify frames. The features used are the difference image between frames, the average frame color and also the audio signal. The HMM parameters are evaluated during a training phase by using the ground truth of a news video. Finally, some authors [6,7,8] propose methods that are substantially unsupervised and do not require the explicit definition of an anchor shot model. In particular, in [6] a graph-theoretical cluster analysis method is employed. As pointed out by the authors, this approach fails when identical news report shots appear in different stories of the same news program, or when an anchor shot model is present once in a program. In [7] shot classification is firstly performed on the basis of a statistical approach, without requiring any model of the anchor shot, and then refined by considering motion features. For the authors, it is reasonable to assume that in an anchor shot both the camera and the anchor are almost motionless.
95 In our opinion, this hypothesis is not completely acceptable. In [8] a template-based method is proposed. The template is found in a robust and unsupervised way and it does not depend on a particular threshold. However, the authors assume that different anchorperson models share the same background. This is not true for most news stations: because of different camera angles, different models can have different backgrounds. From the previous analysis it is evident that a definitive solution for the shot classification problem does not exist. Moreover, all the efforts for increasing performance of a single shot classification technique seem to be unjustified, since it is always possible to find a case in which the assumptions of a given technique fail. In this case, indeed, a multi-expert approach could be more effective. The underlying idea of a Multi-Expert System (hereinafter MES) is, in fact, to combine a set of rather simple classifiers (also called experts), in a system taking the classification decision on the basis of the classification results provided by each of the experts involved [9]. The rationale of this approach lies on the assumption that the performance obtained by suitably combining an ensemble of experts, complementary as regards their errors, is better than the performance of any single expert. Starting from these considerations, in this paper we propose a system for news video story detection using a Multi-Expert architecture for shot classification. We have considered only techniques that do not require the explicit definition of a specific model of the anchorperson shot. In such a way, it is possible to design a system whose performance are quite independent of the specific style of the news program. As regards the combining process, we have considered the most adequate combining rule for the problem at hand, among those that do not require a training procedure. Another important issue addressed in the paper is the experimental analysis, made on a database that is twice the biggest database reported up to now in the literature [6]. Namely, we used a news database consisting of about 9 hours with 444 anchor shots and 5,344 news report shots. From the experimentation it is evident that the proposed system performs better than each of the considered experts both in terms of shot classification and of news story detection capability. The organization of the paper is as follows: section 2 presents the architecture of the proposed system. In section 3 the database used is described, while in section 4 the tests carried out in order to assess the performance of the proposed system in terms of both shot classification and news story detection are reported. Finally, in section 5, the main conclusions are drawn.
2. System Architecture
The proposed system architecture is sketched in Fig. 1. The shot classification is performed by means of a parallel MES made up of three experts (whose choice will be justified later): each one receives in input the list of all the shot boundaries of a news video calculated by the shot change detection module together with the video itself and provides its own classification for each news video shot. Then, the combination module of the MES gives the final classification for each shot on the basis of the outputs of the three experts and of the chosen combining rule. Moreover, starting from the classification results of the MES, the system performs a news story detection by assuming that each story starts with an anchor shot and ends when another anchor shot occurs (or when the whole video ends). Note that, in this model, two successive anchor shots correspond to two different news stories. Let us now illustrate the criteria inspiring the choice of the experts to be included into the MES and the choice of the MES' combining rule. As cleared in the introduction, we do not consider experts based on an explicit definition of the anchor shot model [1-4] or needing a specific training phase, such as the one presented in [5]. In particular, the experts used in our system implement the shot classification algorithms proposed by Gao and Tang in [6], Bertini et al. in [7], and Hanjalic et al. in [8]. Hereinafter, for the sake of simplicity, we will refer to these three experts with the terms, GAO, BER and HAN, according to the first three letters of the first author's name. As regards the choice of the combining rule, it must be recalled that the proposed system should not require any specific training phase. Moreover, it must be considered that the selected experts provide as classification output only a "crisp" label indicating if the shot under classification is attributed to the anchor class or to the news-report class. Thus, we chose as combining rule the majority voting rule [9], that is the more appropriate one for combining experts providing crisp outputs without requiring a training phase, according to the taxonomy proposed in [10]. In the following we will briefly recall the rationale inspiring the three selected shot classification algorithms and how the chosen combining rule works. GAO Expert: in this case, video shots are classified by using an algorithm based on graph-theoretical cluster analysis. It automatically groups similar key-frames into clusters, on the basis of their color histograms. The key-frames composing a cluster are classified as potential anchorperson frames if the cluster size is greater or equal to two. Then, a spatial difference metric is used to refine the shot classification; in fact, in some situations, the key-frames in a cluster may have similar histograms but different content. If a cluster has an average spatial difference metric value higher than a suitable threshold, it is removed from the anchorperson frame list.
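The news-story detection rule described above (a story starts at an anchor shot and ends when the next anchor shot, or the end of the video, is reached) can be sketched as follows; the shot-list encoding is an assumption made for the example.

```python
def group_into_stories(shots):
    """shots: list of (start_frame, end_frame, label) tuples in temporal order,
    with label either "anchor" or "news-report". Returns (start, end) pairs,
    one per news story: each story opens at an anchor shot and is extended by
    the following news-report shots until the next anchor shot (or the end)."""
    stories, current = [], None
    for start, end, label in shots:
        if label == "anchor":
            if current is not None:
                stories.append(tuple(current))
            current = [start, end]
        elif current is not None:
            current[1] = end
    if current is not None:
        stories.append(tuple(current))
    return stories

# Example with the shot list sketched in Fig. 1:
# group_into_stories([(0, 123, "anchor"), (124, 345, "news-report"), ...])
```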
Fig. 1 - The proposed system architecture.
BER Expert: the shot classification is here performed on the basis of a statistical approach and of the motion features of the anchor shots. The assumption is that generally anchor shots, even of different length, are present throughout the video, so producing a set of candidate anchorperson shots. The candidates are successively re-analyzed by considering motion features: the use of these features is justified by the consideration that in an anchorperson shot both the camera and the anchorperson are almost motionless. In particular, a motion index for each candidate anchorperson shot is calculated and only those shots whose motion index does not exceed a suitable threshold are definitely classified as anchorperson shots. HAN expert, a template-based method is here proposed. It starts from the assumption that an anchorperson shot is the only type of video shot that has multiple matches of most of its visual content along the whole news video program. The template is found in a robust and unsupervised way and it does not depend on a particular threshold. In particular, the proposed anchorperson shot detection consists of two steps: an unsupervised procedure for finding the template of the anchorperson shots and its use to detect shots by applying an adaptive thresholding.
Majority Voting rule: the majority voting rule decides that a sample to be classified belongs to the class Q if and only if the majority of the experts votes for the class Q. If two or more classes obtain the same number of votes, the input sample is rejected. In our case, however, since there are three experts and two classes, no ties can occur, so the MES will always classify a shot.
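A minimal sketch of the majority voting combiner for three crisp experts; the label strings and data layout are assumptions made for the example.

```python
from collections import Counter

def majority_vote(votes):
    """Crisp majority voting over the experts' labels; with three experts and
    two classes a tie can never occur, so a label is always returned."""
    label, _count = Counter(votes).most_common(1)[0]
    return label

def classify_shots(expert_labels):
    """expert_labels[i] = (GAO label, BER label, HAN label) for shot i."""
    return [majority_vote(v) for v in expert_labels]

# classify_shots([("anchor", "anchor", "news-report")])  ->  ["anchor"]
```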
3. The News Video Database
Only a few efforts have been spent in the recent past for building video databases for benchmarking purposes; in [11] the authors present a database built for characterizing the performance of shot change detection algorithm. This database, however, is not adequate for our aims, since it is made up not only of news video but also of sport events and sitcom videos, and the duration of news videos is only 20 minutes. So, we decided to build-up a new database. The acquisition was performed by means of the digital satellite decoder emmeesse 6000pvr. This decoder has an internal hard disk which allowed us to record news video in the DVB MPEG-2 format. Then, the videos were transferred on a PC: in this way, the broadcasting quality has been preserved. We encoded the videos in the MPEG-1 format using the TMPGEnc encoder (ver. 2.01). This phase took about two times the length of each video with a Pentium IV 1.7 GHZ CPU. The parameters used to encode the videos have been selected taking into account the storage requirements while avoiding to reduce the performance of the algorithms obtainable with the original full-quality videos. To this aim, we selected four videos from our database to analyze the dependence of the algorithms' performance on both the frame size and the bit-rate. Finally, we built the ground truths of the shot boundaries and of the anchor/news report shots, respectively. Different news video editions of a single broadcaster have been considered, as well as news videos of different broadcasters. However, while in the first case the different editions are usually less than ten, in the latter the number of different models is significantly large. Consequently, we chose to test our system on a large database composed by all the different news videos captured from a single broadcaster, rather than perform tests with only few samples belonging to a large number of different broadcasters. Even if some archiving companies work with large quantities of videos from different sources, this approach fits the most realistic use case for the proposed system. A typical broadcaster, in fact, is interested in employing such a system to analyze all the editions of its news videos rather than the videos produced by other broadcasters (to understand this, simply think to the copyright-related implications of using such materials). Furthermore, let us consider that, although some broadcasters nowadays already have the edit list for their own
news videos, this is surely untrue for the old materials, which still need to be segmented into news stories for an effective indexing. Starting from these considerations, the database used in this paper is composed of more than thirty news videos extracted from the main Italian public TV network (namely, RAI 1). Special care has been taken to include in the database all the different editions of news videos from this TV network. Table 1 highlights that the size of our database is very large if compared to those used in the papers proposing the GAO, BER, and HAN experts.

Table 1 - Composition of the databases used in this paper and in [6], [7] and [8].
Paper | Total length (hh:mm:ss) | Number of videos | Number of broadcasters | Anchor / news-report shots
This  | 08:52:34 | 33 | 1 | 444 / 5344
[6]   | 05:05:17 | 14 | 2 | 253 / 3654
[7]   | 02:41:00 | 12 | 6 | 66 / 665
[8]   | 00:37:00 |  2 |   | 22 / --

4. Experimental Results
In this section, we describe the results of the tests aimed at assessing the performance of the single experts and of the proposed system, in terms of both shot classification and news story detection. However, before going into details it is worth briefly describing the methodology adopted for the tests. A preliminary tuning phase was required in order to determine the optimal values of the parameters needed to set up the three experts. We determined these values by means of an experimental evaluation, i.e. by maximizing a suitable figure of merit F (whose definition is given below) over a predefined set of videos. In this set we included the same videos already used for setting up the MPEG-1 coding parameters for the whole database. These videos were not included in any of the tests described in the remainder of the section. The performance has been measured in terms of Precision and Recall [11, 12]. Precision and Recall are related to the number of correct detections (cd), missed detections (m) and false alarms (f) by the following equations:

Precision = cd / (cd + f),   Recall = cd / (cd + m)   (1)

Then, the figure of merit F [12] combines Precision and Recall as reported in equation (2):

F = 2 * Precision * Recall / (Precision + Recall)   (2)
All the experts perform worse on our database: this is particularly true for the HAN expert. Moreover, the GAO expert performs better in terms of Precision, differently from what presented in the experimental phase described in [6]. Analogously, the BER expert exhibits on our database a Recall value higher than the Precision one, differently from what described in [7]. Such discordances are
101 likely due to the different size of the database used for the testing the expert in this paper. The results reported in Tab. 2 point out that, even if the BER expert outperforms the others in terms of Precision and F, GAO exhibits the best value of the Recall. On the other hand, the HAN expert performs sensibly worse than BER and GAO, but the difference between its values of Recall and Precision is less than those obtained by the other two experts. This highlights a good complementariness among the three experts' behavior, as required for a successful implementation of a MES. Let us now illustrate the shot classification performance of the proposed system. When dealing with the evaluation of multi-expert systems, it is useful to consider the performance of the so-called "oracle". The oracle is the theoretic MES that correctly classifies a shot as an anchor or as a news-report if at least one of the employed experts is able to provide the correct classification. It is evident that for a defined set of experts, the performance of the oracle is the upper bound of all the MES's obtainable from the same set of experts by using any combining rule. In order to obtain the performance of the oracle in terms of the percentage of correct classification (%cdorade), we considered the percentage of existing anchor shots which none of the three experts is able to detect on the whole database; then, we evaluated %cdomcie as the complement to 100. In our case, on the whole database we found that %cdorack is equal to 95.28. On the contrary, the performance of the oracle in terms of false alarms (%foracie) is given by the percentage of news-report shots that are simultaneously misclassified by all the three experts. In our case, on the whole database we found that %forade = 0. Hence a MES using the GAO, BER and HAN experts might perform ideally as regard the false alarms. It does not stand the same for the missed detection, which also in the ideal case would be 4.72%. In Tab. 4, the performance obtained by the oracle, the proposed system and by the best expert (BER) is reported. From this table it is possible to notice how the proposed system outperforms the best expert in terms of all the considered performance parameters. It is also worth noting that the proposed system is able to obtain ideal performance in terms of the Precision, with no false alarms on all the news report shots. Table 4 - The performance of the oracle, the proposed system and the best expert. %cd %f Recall Precision F 95.28 0.00 0.953 1.000 0.976 Oracle 1.000 0.930 Proposed system 86.94 0.00 0.869 81.61 0.10 0.816 0.987 0.892 BER
102
It is also very interesting to consider the data reported in Table 5. Here, the performance of the proposed architecture is expressed in terms of the relative improvement with respect to the performance of the oracle. Such improvement has been calculated as: parpT oposedSystem ~ParExpert
Mpar =
(3)
ParORACLE ~ ParExpert
being pare {%cd, %f, Recall, Precision, F} one of the defined parameters for evaluating the system performance. Table 5 - The relative improvement introduced by the proposed system with respect to the performance that can be obtained by using the oracle. Expert GAO BER HAN
RI'Acd
-245.3% 39.0% 74.7%
RI%r 100% 100% 100%
Rl/ttcaU
K* Precision
-245.3% 39.0% 74.7%
100% 100% 100%
RIF 51.7% 45.4% 85.7%
The results reported in table 5 show that the proposed architecture is able to achieve an improvement of F obtained by BER expert that is more than the 45% of the maximum possible improvement. This represents a valuable result if we consider that this expert is has already a good performance and that the performance of the oracle is very close to that of a perfect shot classification system. 4.2. News story detection Once shot classification has been performed, the whole news video can be segmented into news stories, by using the scheme described in Section 2. The latter, indeed, derives from a simplified model of news videos. For instance, it cannot identify a change of news within a single anchor shot sequence, situation that is sometimes present in our database. However, since it is impossible to overcome such a problem by using only visual information, it is acceptable to use this simplified model for evaluating the performance of the proposed approach, as stated also in [6]. In order to have a more quantitative estimation of the error introduced by this simplified model, we also calculated the percentage of stories that can be correctly detected by using it. Since such a percentage is over the 96%, it can be concluded that the proposed model is a very reasonable approximation of the real model. With this model in mind, it is then possible to evaluate the ability of news story detection of the proposed system and to compare it with those obtained by the single experts. Preliminarily, we give the definition of the performance indices for the news story detection. A news story is correct detected if all its shots are correctly classified and also the successive anchor shot is correctly detected. On the contrary,
103 when an error in the shot classification occurs, one or more stories can be missed or one or more sequences of shots can be erroneously grouped to generate new stories, depending on the error type. These considerations are important since they imply that the shot classification performance of a system is an upper bound for its news story detection capability. By taking into account all the possible situations deriving from errors at shot classification level, Table 6 reports the obtained performance by the proposed system, the oracle and the best two experts (BER and GAO) in terms of news-story detection. Table 6 - The news story detection performance of the oracle, the proposed system and the two best experts, on the whole database Oracle Proposed system BER GAO
Recall 0.910 0.787 0.714 0.801
Precision 0.954 0.909 0.865 0.733
F 0.932 0.844 0.782 0.766
Even if the overall performance decreases with respect to the shot classification case, our system still achieves the best performance in terms of both Precision and F, with a value of Recall slightly less than that obtained by GAO. It is worth noting that also in this case the proposed system is able to achieve an improvement of F approximately equal to 41% of the maximum possible improvement (i.e., the improvement obtainable with the oracle). 5.
Conclusions
In this paper, a multi-expert approach for the shot classification problem and the news video story detection has been proposed. We implemented three unsupervised algorithms (experts) for shot classification and combined them in a parallel Multi-Expert System. We performed our experiments by using a database significantly larger that the ones already presented in the literature; the obtained results demonstrated that the proposed system performs better than each single expert, with respect to different performance parameters. In particular, no false alarms of anchorperson shots was generated over 4,000 shots (and so, it exhibited an ideal Precision value), keeping low the number of missed anchorperson shots (that influences the Recall value). Also the performance in terms of news story detection was very significant. In order to further improve the Recall value obtained by our system, other information sources, such as the audio track of the news video, could be used. This will be subject of future investigations.
104 References 1.
2. 3. 4.
5.
6.
7. 8.
9. 10.
11.
12. 13.
B. Gunsel, A.M. Ferman, and A.M. Tekalp, "Video indexing through integration of syntactic and semantic features", in Proc. Workshop Applications of Computer Vision, Sarasota, FL, pp. 90-95, 1996. S.W. Smoliar, HJ. Zhang, S.Y. Tao, Y. Gong, "Automatic parsing and indexing of news video", Multimedia Systems, vol 2, no. 6, pp. 256-265, 1995. B. Furht, S.W. Smoliar, H. Zhang, Video and Image Processing in Multimedia Systems, Kluwer Publishers, Boston (MA), 1996. Y. Avrithis, N. Tsapatsoulis, and S. Kollias, "Broadcast news parsing using visual cues: A robust face detection approach", Proc. IEEE Int. Conf. on Multimedia and Expo, vol. 3, pp. 1469-1472, 2000. S. Eickeler, S. Muller, "Content-based video indexing of TV broadcast news using Hidden Markov Models", Proc. IEEE International Conference on Acoustic, Speech, and Signal Processing, pp. 2997-3000, 1999. X. Gao, X. Tang, "Unsupervised Video-Shot Segmentation and Model-Free Anchorperson Detection for News Video Story Parsing", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 12, No. 9, pp. 765-776, 2002. M. Bertini, A. Del Bimbo, P. Pala, "Content-based indexing and retrieval of TV News", Pattern Recognition Letters, vol. 22, pp. 503-516, 2001. A. Hanjalic, R.L. Lagendijk, J. Biemond, "Semi-Automatic News Analysis, Indexing, and Classification System Based on Topics Preselection", Proc. ofSPIE: Electronic Imaging: Storage and Retrieval of Image and Video Databases, San Jose (CA), 1999. J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, "On Combining Classifiers", IEEE Trans, on Pattern Analysis and Machine Intelligence, 20(3), (1998), 226-239. L.I. Kuncheva, J.C. Bezdek, R.P.W. Duin, "Decision templates for multiple classifier fusion: an experimental comparison", Pattern Recognition vol. 34, no. 2, pp. 299-314, 2001. U. Gargi, R. Kasturi, S.H. Strayer, "Performance Characterization of Video-ShotChange Detection Methods", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 10, No. 1, pp. 1-13, 2000. L. Chaisorn, T.-S. Chua, C.-H. Lee, "A Multi-Modal Approach to Story Segmentation forNews Video", World Wide Web, vol. 6, pp. 187-208, 2003. M. De Santo, G. Percannella, C. Sansone, M. Vento, "A Multi-Expert System for Shot Change Detection in MPEG Movies", International Journal of Pattern Recognition and Artificial Intelligence, 2004 (in press).
3D-TV - THE FUTURE OF VISUAL ENTERTAINMENT
M. MAGNOR
MPI Informatik, Stuhlsatzenhausweg 85, Saarbrucken, Germany
e-mail: [email protected]
Television is the most favorite pastime activity of the world. Remarkably, it has so far ignored the digital revolution; the way we watch television hasn't changed since its invention 75 years ago. But the time of passive TV consumption may be over soon: Advances in video acquisition technology, novel image analysis algorithms, and the pace of progress in computer graphics hardware together drive the development of a new type of visual entertainment medium. The scientific and technological obstacles towards realizing 3D-TV, the experience of interactively watching real-world dynamic scenes from arbitrary perspective, are currently being put out of the way by researchers all over the world.
1. Introduction According to a recent study 1 , the average US American citizen watches television 4 hours and 20 minutes every day. While watching TV is the most favorite leisure activity in the world, it is interesting to note that TV technology has shown remarkable resistance against any change. Television sets today still have the computational capacity of a light switch, while modern PC and graphics cards work away at giga-fiop rates to entertain the youth with the latest computer games. The idea of making more out of television is not new. Fifty years ago, Ralph Baer began to think about how to add interactivity to television. He invented the video game and developed the first game console, thus becoming the founding father of the electronic entertainment industry, a business segment whose worldwide economic impact has by now even surpassed the movie industry 2 . Driven by technological progress, economic competition and an ever-growing number of users, computer games have become more and more realistic over the years, while TV has remained the passive medium of its first days. In recent times, however, scientists from different fields have joined
forces to tear down the wall between interactive virtual worlds and the real world 3 ' 4 ' 5 . Their common goal is to give the consumer the freedom to watch natural, time-varying scenes from any arbitrary perspective: A soccer match can be watched from the referee's, goal keeper's or even the ball's point of view, a crime story might be experienced from the villain's or the victim's perspective, and while watching a movie, the viewer is seated in the director's chair. This paper intends to give an overview of current research in 3D-TV acquisition, coding, and display.
2. 3D-TV Content Creation To display an object from arbitrary perspective, its three-dimensional shape must be known. For static objects, 3D geometry can be acquired, e.g., by using commercial laser scanners. But how can the constantly changing shape of dynamic events be captured ? Currently, optically recording the scene from different viewpoints is the only financially and logistically feasible way of acquiring dynamic 3D geometry information, albeit implicitly6. The scene is captured with a handful of synchronized video cameras. To recover time-varying scene geometry from such multi-video footage, relative camera recording positions must be known with high accuracy 7 . The visual hull has been frequently used as geometry proxy for timecritical reconstruction applications. It can be computed efficiently and represents an approximate, conservative model of object geometry 8 . An object's visual hull is reconstructed by segmenting object outlines in different views and re-projecting these silhouettes as 3D cones back into the scene. The intersection volume of all silhouette cones encompasses the true geometry of the object. Today, the complete processing pipeline for on-line 3D-TV broadcast applications can be implemented based on the visual hull approach 9 . Unfortunately, attainable image quality is limited due to the visual hull's approximate nature. Allowing for off-line processing during geometry reconstruction, refined shape descriptions can be obtained by taking local photo-consistency into account. Space carving 10 and voxel coloring11 methods divide the volume of the scene into small elements (voxels). Consequently, each voxel is tested whether its projection into all unoccluded camera views corresponds to roughly the same color. If a voxel's color is different when viewed from different cameras, it is deleted. Iteratively, a photo-consistent hull of the object surface is "carved" out of the scene
volume.
2.1. Spacetime Isosurfaces
Both the visual hull approach as well as the space carving/voxel coloring methods have been developed with static scenes in mind. When applied to multi-video footage, these techniques reconstruct object geometry one time step after the other, making no use of the inherently continuous temporal evolution of any natural event. Viewed as an animated sequence, the resulting scene geometry potentially exhibits discontinuous jumps and jerky motion. A completely new class of reconstruction algorithms is needed to exploit temporal coherence in order to attain robust reconstruction results at excellent quality. When regarded in 4D spacetime, dynamic object surfaces represent smooth 3D hyper-surfaces. Given multi-video data, the goal is to find a smooth 3D hyper-surface that is photo-consistent with all images recorded from all cameras over the entire time span of the sequence. This approach can be elegantly formulated as a minimization problem whose solution is a minimal 3D hyper-surface in 4D spacetime. The weight function incorporates a measure of photo-consistency, while temporal smoothness is ensured because the sought-after minimal hyper-surface minimizes the integral of the weight function. The intersection of the minimal 3D hyper-surface with a 3D hyper-plane perpendicular to the temporal dimension then corresponds to the 2D object surface at a fixed point in time. The algorithmic problem remains how to actually find the minimal hyper-surface. Fortunately, it can be shown 12 that a /c-dimensional surface which minimizes a rather general type of functional is the solution of an Euler-Lagrange equation. In this form, the problem becomes amenable to numerical solution. A surface evolution approach, implemented based on level sets, allows one to find the minimal hyper-surface 13 . In comparison to conventional photo-consistency methods that do not take temporal coherence into account, this spacetime-isosurface reconstruction technique yields considerably better geometry results. In addition, scene regions which are temporarily not visible from any camera are automatically interpolated from previous and future time steps. 2.2. Model-based
Scene
Analysis
A different approach to dynamic geometry recovery can be pursued by exploiting a-priori knowledge about the scene's content 14 . Given a parame-
108 terized geometry model of the object in an scene, the model can be matched to the video images. For automatic and robust fitting of the model to the images, object silhouette information is used. As matching criterion, the overlapping area of the rendered model and the segmented object silhouettes is employed15. The overlap is efficiently computed by rendering the model for all camera viewpoints and performing an exclusive-or (XOR) operation between the rendered model and the segmented images. The task of finding the best model parameter values thus becomes an optimization problem that can be tackled, e.g., by Powell's optimization scheme. In an analysis-by-synthesis loop 16 , all model parameters are varied until the rendered model optimally matches the recorded object silhouettes. Making use of image silhouettes to compare model pose to object appearance has numerous advantages: • Silhouettes can be easily and robustly extracted, • they provide a large number of pixels, effectively over-determining the model parameter search, • silhouettes of the geometry model can be rendered very efficiently on modern graphics hardware, and • also the XOR operation can be performed on graphics hardware. Model-based analysis can additionally be parallelized to accelerate convergence17. In addition to silhouettes, texture information can be exploited to also capture small movements 18 . One major advantage of modelbased analysis is the comparatively low dimensionality of the parameter search space: only a few dozen degrees of freedom need to be optimized. In addition, constraints are easily enforced by making sure that during optimization, all parameter values stay within their physically plausible range. Finally, temporal coherence is maintained by allowing only a maximal change in magnitude for each parameter from one time step to the next.
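A toy sketch of the silhouette-overlap (XOR) matching criterion used in the analysis-by-synthesis loop; here the rendered and segmented silhouettes are plain boolean arrays and the optimizer is left abstract, whereas the paper computes both the rendering and the XOR on graphics hardware.

```python
import numpy as np

def xor_mismatch(rendered_masks, segmented_masks):
    """Sum of XOR pixels between the rendered model silhouettes and the segmented
    object silhouettes over all camera views; lower means a better pose fit."""
    return sum(int(np.logical_xor(r, s).sum())
               for r, s in zip(rendered_masks, segmented_masks))

# Inside an analysis-by-synthesis loop, a generic optimizer (e.g. Powell's method)
# would vary the model parameters to minimize this cost:
#   best = argmin over params of xor_mismatch(render(model, params, cameras), silhouettes)
```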
3. Compression Multi-video recordings constitute a huge amount of raw image data. By applying standard video compression techniques to each stream individually, the high degree of redundancy among the streams is not exploited. However, the 3D geometry model can be used to relate video images recorded from different viewpoints, offering the opportunity to exploit inter-video correlation for compression purposes. Model-based video coding schemes
109 have been investigated for single-stream video data, and MPEG4 1 9 provides suitable techniques to encode the animated geometry, e.g. by updating model parameter values using differential coding. For 3D-TV, however, multiple synchronized video streams depicting the same scene from different viewpoints must be encoded, calling for new coding algorithms to compress multi-video content. To efficiently encode the multi-video data using object geometry, the images may be regarded as object textures. In the texture domain, a point on the object surface has fixed coordinates, and its color (texture) varies only due to illumination changes and/or non-Lambertian reflectance characteristics. For model-based coding, a texture parameterization is first constructed for the geometry model 20 . Having transformed all multi-video frames to textures, the multi-view textures are then processed to de-correlate them with respect to temporal evolution as well as viewing direction 21 . Shapeadaptive 22 as well as multi-dimensional wavelet coding schemes20lend themselves to efficient, progressive compression of texture information. Temporarily invisible texture regions can be interpolated from previous and/or future textures, and generic texture information can be used to fill in regions that have not been recorded at all. This way, any object region can later be displayed without holes in the texture due to missing input image data. For spacetime-isosurface reconstruction, deriving one common texture parameterization for all time instants is not trivial since the reconstruction algorithm does not provide surface correspondences over time. Encoding the time-varying geometry is also more complex than in the case of modelbased analysis. Current research therefore focuses on additionally retrieving correspondence information during isosurface reconstruction.
4. Interactive Display
The third component of any 3D-TV system consists of the viewing hard- and software. In a 3D-TV set, one key role will be played by the graphics board: it enables displaying complex geometry objects made up of hundreds of thousands of polygons from arbitrary perspectives at interactive frame rates. The bottleneck of current, PC-based prototypes is the limited bandwidth between storage drive and main memory and, to a lesser extent, between main memory and the graphics card. Object geometry as well as texture must be updated on the graphics board at 25 frames per
second. While geometry animation data is negligible, the throughput capacity of today's PC bus systems is sufficient only for transferring low- to moderate-resolution texture information. Since any user naturally wants to exploit the freedom of 3D-TV to zoom into the scene and to observe object details from close up, object texture must be updated continuously at high resolution to offer the ultimate viewing experience. To overcome the bandwidth bottleneck, object texture must be stored in some form that is at the same time efficient to transfer and fast to render on the graphics board. In addition, rendered image quality shall not be degraded, preserving the realistic, natural impression of the original multi-video images. These requirements can be met by decomposing object texture into its constituents: local diffuse color and reflectance, and shadow effects. Since typically neither object color nor reflectance changes over time (the only exceptions being chameleons and sepiae), diffuse color texture and reflectance characteristics need to be transferred only once. Given this static texture description, the graphics board is capable of computing object appearance very efficiently for any illumination and viewing perspective. The remaining difference between rendered and recorded object appearance is due to small-scale, un-modeled geometry variations, e.g. clothing creases. Only these time-dependent texture variations need to be updated per frame, either as image information or, more elegantly, as dynamic geometry displacement maps on the object's surface. One beneficial side-effect of representing object texture in this form is the ability to vary object illumination: the object can be placed into arbitrary environments while retaining its natural appearance. To represent object texture in the above-described way, new analysis algorithms need to be developed. These must be capable of recovering reflectance characteristics as well as surface normal orientation from multi-video footage. While research along these lines has only just begun, first results are encouraging, and multi-video textures have already been robustly decomposed into diffuse and specular texture components.
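As a small illustration of why the static texture description suffices for relighting, the sketch below shades a diffuse albedo texture with per-texel surface normals under an arbitrary directional light. It assumes a plain Lambertian model and invented array layouts; the actual decomposition discussed above additionally recovers specular reflectance and handles shadows and displacement maps.

```python
import numpy as np


def shade_lambertian(albedo, normals, light_dir, light_color=(1.0, 1.0, 1.0)):
    """Relight an object from its static texture description.
    albedo:   (H, W, 3) diffuse colour texture, transferred once
    normals:  (H, W, 3) unit surface normals in a fixed texture layout
    light_dir: direction towards the light source (3-vector)
    Returns the (H, W, 3) shaded texture for this illumination."""
    l = np.asarray(light_dir, dtype=np.float32)
    l /= np.linalg.norm(l)
    ndotl = np.clip(normals @ l, 0.0, None)[..., None]   # (H, W, 1) cosine term
    return albedo * np.asarray(light_color, dtype=np.float32) * ndotl
```

On a real 3D-TV terminal this computation runs per pixel on the graphics board, so only the small time-dependent residuals (creases, shadows) need to be streamed per frame.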
5. Outlook
So will 3D-TV supersede conventional TV anytime soon? The honest answer is: probably not this year. The TV market exhibits enormous momentum and has already defied a number of previous attempts at technological advances, e.g. HDTV and digital broadcast. The new possibilities interactive 3D-TV offers to the user, however, are
too attractive to be ignored for long. In a few years, 3D-TV will start as a new application for conventional PCs (much like the RealPlayer some years ago), probably with later adaptation to game consoles, which are by then already hooked up to the TV set situated in the living room. The pace of progress will depend on the effort required to create attractive content. Nevertheless, from today's state of the art one can be optimistic that the scientific and technological challenges of 3D-TV will be surmountable: a brave, new, and interactive visual entertainment world lies ahead.

References
1. P. Lyman and H. Varian. How much information, 2003. University of California Berkeley, http://www.sims.berkeley.edu/research/projects/howmuch-info-2003/.
2. J. Gaudiosi. Games, movies tie the knot, December 2003. http://www.wired.com/news/games/0,2101,61358,00.html.
3. M. Op de Beeck and A. Redert. Three dimensional video for the home. Proc. EUROIMAGE International Conference on Augmented, Virtual Environments and Three-Dimensional Imaging (ICAV3D'01), Mykonos, Greece, pages 188-191, May 2001.
4. C. Fehn, P. Kauff, M. Op de Beeck, F. Ernst, W. Ijsselsteijn, M. Pollefeys, E. Ofek, L. Van Gool, and I. Sexton. An evolutionary and optimised approach on 3D-TV. Proc. International Broadcast Conference (IBC'02), pages 357-365, September 2002.
5. K. Klein, W. Cornelius, T. Wiebesiek, and J. Wingbermühle. Creating a "personalised, immersive sports tv experience" via 3D reconstruction of moving athletes. In Abramowitz and Witold, editors, Proc. Business Information Systems, 2002.
6. L. Ahrenberg, I. Ihrke, and M. Magnor. A mobile system for multi-video recording. In 1st European Conference on Visual Media Production (CVMP), pages 127-132. IEE, 2004.
7. I. Ihrke, L. Ahrenberg, and M. Magnor. External camera calibration for synchronized multi-video systems. Journal of WSCG, 12(1-3):537-544, January 2004.
8. A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE Trans. Pattern Analysis and Machine Intelligence, 16(2):150-162, February 1994.
9. M. Magnor and H.-P. Seidel. Capturing the shape of a dynamic world fast! Proc. International Conference on Shape Modelling and Applications (SMI'03), Seoul, South Korea, pages 3-9, May 2003.
10. K.N. Kutulakos and S.M. Seitz. A theory of shape by space carving. International Journal of Computer Vision, 38(3):199-218, 2000.
11. S.M. Seitz and C.R. Dyer. Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision, 35(2):151-173, 1999.
12. B. Goldluecke and M. Magnor. Weighted minimal hypersurfaces and their
applications in computer vision. Proc. European Conference on Computer Vision (ECCV'04), Prague, Czech Republic, 2:366-378, May 2004.
13. B. Goldluecke and M. Magnor. Space-time isosurface evolution for temporally coherent 3D reconstruction. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR'04), Washington, USA, June 2004. To appear.
14. J. Carranza, C. Theobalt, M. Magnor, and H.-P. Seidel. Free-viewpoint video of human actors. ACM Trans. Computer Graphics (Siggraph'03), 22(3):569-577, July 2003.
15. M. Magnor and C. Theobalt. Model-based analysis of multi-video data. Proc. IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI 2004), Lake Tahoe, USA, pages 41-45, March 2004.
16. R. Koch. Dynamic 3D scene analysis through synthesis feedback control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(6):346-351, July 1993.
17. C. Theobalt, J. Carranza, M. Magnor, and H.-P. Seidel. A parallel framework for silhouette-based human motion capture. Proc. Vision, Modeling, and Visualization (VMV-2003), Munich, Germany, pages 207-214, November 2003.
18. C. Theobalt, J. Carranza, M. Magnor, J. Lang, and H.-P. Seidel. Enhancing silhouette-based human motion capture with 3D motion fields. Proc. IEEE Pacific Graphics 2003, Canmore, Canada, pages 185-193, October 2003.
19. Motion Picture Experts Group (MPEG). N1666: SNHC systems verification model 4.0, April 1997.
20. M. Magnor, P. Ramanathan, and B. Girod. Multi-view coding for image-based rendering using 3-D scene geometry. IEEE Trans. Circuits and Systems for Video Technology, 13(11):1092-1106, November 2003.
21. G. Ziegler, H. Lensch, N. Ahmed, M. Magnor, and H.-P. Seidel. Multi-video compression in texture space. Proc. IEEE International Conference on Image Processing (ICIP'04), Singapore, September 2004. Accepted.
22. H. Danyali and A. Mertins. Fully scalable texture coding of arbitrarily shaped video objects. Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), pages 393-396, April 2003.
ENTROPY AS A FEATURE IN THE ANALYSIS AND CLASSIFICATION OF SIGNALS

ANDREA CASANOVA
Dipartimento di Scienze Mediche, Facoltà di Medicina, v. S. Giorgio 12, 09124, Cagliari, Italy

SERGIO VITULANO
Dipartimento di Scienze Mediche, Facoltà di Medicina, v. S. Giorgio 12, 09124, Cagliari, Italy

This paper introduces entropy as a feature for the analysis and classification of signals. A comparison with other well-known methods from the literature is carried out. The entropy method seems to work quite well and is less time consuming than the other methods.
1. Introduction
The development of new imaging methods for medical diagnosis has significantly widened the scope of the images available to physicians. The traditional X-ray photograph is now only one of the possibilities: other options include Computed Tomography (CT), Positron Emission Tomography (PET), Nuclear Magnetic Resonance (NMR), Thermography, Mammography and more. The volume of imaging data generated by an average-sized medical faculty is such that both retrieval and classification of the images become essential issues. Multimedia database management is one of the hottest research topics of the day. Indeed, multimedia computing systems are widely used, even among the general public, who quickly recognized their usefulness for everyday tasks. Homogeneous databases, such as those utilized by an application in a specific field (e.g. medicine), are characterized by very small differences among the objects. The most effective approaches to such data are based on object contour shapes and on the spatial relationships among them. Images in generic databases present a much wider range of variability and, therefore, they can usually be represented by coarser global features, such as texture or colour percentage. The problem is to bring about a mapping from the set of all possible images (image space) to the usually smaller set of all possible feature values (feature space).
The most relevant points related to this problem are the qualitative and quantitative choices of these features. In this paper we wish to verify the role of the signal entropy as a feature in the analysis and classification of texture problems. The paper is organized as follows: Section 2 describes how our method works and some of its properties; Section 3 shows the results obtained from experimentation; finally, Section 4 makes a few concluding remarks.

2. The Method
Before presenting the steps of the method proposed in this work, we would like to introduce two definitions.
Definition 1: we call "signal crest" the part of the signal contained between the absolute minimum and the absolute maximum of the signal itself.
Definition 2: we call "signal entropy" the ratio between the crest energy and the signal energy, expressed as follows:
$$ S \;=\; \frac{\sum_{t=1}^{D}\,(y_t - y_m)}{\sum_{t=1}^{D}\, y_t} \qquad (1) $$
where D is the signal domain, y_t is the amplitude of the signal at the t-th point, and y_m is the amplitude of the signal at the absolute minimum. The entropy S can assume values in the [0,1] interval. The value 0 occurs when the absolute minimum and the absolute maximum of the signal coincide (i.e. a flat signal); the value 1 appears when the signal is monotonically increasing (i.e. a triangular signal). Relation (1) can be written in a simpler way. As a matter of fact, if we define
$$ h' \;=\; \sum_{t=1}^{D}\,(h_t - h_m) \qquad (2) $$

and

$$ h \;=\; \sum_{t=1}^{D}\, h_t \qquad (3) $$
where D is the signal domain, h_t is the amplitude of the signal at the t-th point, and h_m is the amplitude of the signal corresponding to the absolute minimum, we can write (1) as

$$ S \;=\; \frac{h'}{h} \qquad (4) $$
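The following short sketch computes the crest entropy of Eq. (1)/(4) for a 1D signal stored as a NumPy array. It assumes non-negative amplitudes (e.g. image grey levels), so the denominator does not vanish for non-trivial signals; function and variable names are ours, not the authors'.

```python
import numpy as np


def crest_entropy(y):
    """Signal entropy of Eq. (1)/(4): ratio of the crest energy (amplitudes
    measured from the absolute minimum) to the total signal energy.
    A flat signal gives 0; values grow towards 1 as the crest dominates."""
    y = np.asarray(y, dtype=np.float64)
    y_min = y.min()
    crest_energy = np.sum(y - y_min)    # numerator of Eq. (1), i.e. h'
    total_energy = np.sum(y)            # denominator of Eq. (1), i.e. h
    return crest_energy / total_energy
```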
In a previous work we introduced a linear transformation, the "spiral", from a 2D signal to a 1D one. With this transformation, any picture, or part of it, may be transformed into a 1D signal without losing any information. The main steps of the proposed method are the following:
1) we select the region of interest r_i from each of the pictures belonging to the data set;
2) we transform the 2D signal of the region r_i into a 1D signal by applying the spiral linear transformation;
3) we compute the entropy value S for each of the signals y_i.
The days when computers were text-only systems are long gone. Nowadays even low-end personal computers are able to display, employ and process images, at least in the form of icons, and even in much more complex ways. While the typical user has several image files on his disk, the high-end user or the graphics specialist may have thousands. Furthermore, there are computer systems that are entirely devoted to the archival or treatment of images: many multimedia databases are made up in great part of images, and several applications in a wide range of specific fields rely on digital images for many of their functions (e.g. PACS systems in medicine). As the number of images available to a system increases, the need for an automatic image retrieval system of some kind becomes more stringent. We believe that, for a first screening, the entropy of the query signal should be used to select signals in a homogeneous data set. Given the signal y_i as a query signal, we use its entropy value as a similarity measure between y_i and all the signals belonging to the data set. This way we achieve a very strong reduction of the data set dimension: we only need to store the entropy value of each signal, instead of managing the whole picture data set or the corresponding 1D signals. In this work, in order to verify the entropy method when applied to real signals, we use breast mammographies as the data set, in which benign and malignant masses are also included.
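The sketch below illustrates these steps. The exact spiral ordering used by the authors is not specified here, so the code uses a generic clockwise inward spiral; any fixed, invertible scan preserves all pixel values, which is the property the method relies on. The entropy formula is inlined from Eq. (1).

```python
import numpy as np


def spiral_scan(region):
    """Unroll a 2D region of interest into a 1D signal along an inward
    clockwise spiral: repeatedly take the top row, then rotate the rest."""
    a = np.asarray(region)
    out = []
    while a.size:
        out.extend(a[0])                     # read the current top row
        a = a[1:]                            # drop it
        a = np.rot90(a) if a.size else a     # rotate the remainder by 90 deg
    return np.asarray(out)


def rank_by_entropy(query_region, dataset_entropies):
    """First-screening retrieval: only one stored number (the entropy) per
    data-set signal is compared against the entropy of the query region."""
    sig = spiral_scan(query_region).astype(np.float64)
    s_q = (sig - sig.min()).sum() / sig.sum()            # Eq. (1) on the query
    gaps = np.abs(np.asarray(dataset_entropies, dtype=float) - s_q)
    return np.argsort(gaps)                              # most similar first
```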
The mammographies taken into consideration belong to the DDSM (Digital Database for Screening Mammography, University of South Florida) database, a collection of about 10,000 mammographies digitised at 12 bits in a matrix of roughly 2000x4000 pixels. Our choice was due to the characteristics of these images and to their high degree of difficulty, even for a human expert. For each of the mammographies belonging to the DDSM database, the zone of the suspected mass is given. Mammographies with a malignant and a benign mass are shown in Figure 1 and Figure 2, respectively.
Figure 1: A case of a malignant mass in a mammography belonging to the DDSM database.
Figure 2: A benign mass in a mammography belonging to the DDSM database.
Let us introduce Figures 3 and 6 in order to better show the role and significance of the signal entropy. Figure 3 shows the crests of the signals obtained when the spiral method is applied to the pictures of Figure 1 and Figure 2, respectively.
Figure 3: (top) crest of the malignant mass; (bottom) crest of the benign mass.
In another previous work we introduced HER, a method for the analysis and classification of 1D signals. HER has been extensively tested on different kinds of signals (textures, contours, medical images and aerial photogrammetric images). The main features of HER are the following (an illustrative sketch is given after the list):
1) to weight, in a hierarchical way, each maximum in relation to its role with respect to the behaviour of the signal;
2) to order the extracted maxima as a function of the extraction order and of the distance of each maximum from the previous one;
3) to classify each signal as a function of the number of maxima extracted, the amount of energy used for the extraction, and the value of the energy associated with each maximum.
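The sketch below is not the authors' HER algorithm, whose details are published elsewhere; it is only one naive, illustrative reading of the three features listed above, with an invented stopping rule based on the fraction of crest energy accounted for.

```python
import numpy as np


def extract_maxima_hierarchically(y, energy_fraction=0.9):
    """Greedy toy version of hierarchical maximum extraction: repeatedly
    pick the largest remaining sample, record its energy, position and
    distance from the previously extracted maximum, and stop once a given
    fraction of the crest energy has been accounted for."""
    y = np.asarray(y, dtype=np.float64)
    remaining = y - y.min()                  # work on the crest
    total = remaining.sum()
    features, used, prev_pos = [], 0.0, None
    while total > 0 and used < energy_fraction * total:
        pos = int(np.argmax(remaining))
        val = remaining[pos]
        if val <= 0:
            break
        dist = 0 if prev_pos is None else abs(pos - prev_pos)
        features.append((val, pos, dist))    # (energy, position, distance)
        used += val
        remaining[pos] = 0.0                 # remove the extracted maximum
        prev_pos = pos
    return features      # classification uses len(features), used, energies
```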
3. Results
In order to test the role of entropy in analysis and classification problems, we have considered several different mammographies. The experimental results obtained when the "crest" method is applied to a sample of 207 mammographies are summarized in Table 1, which reports the results obtained using the entropy method proposed in this paper on 108 cases with malignant masses and 99 cases with benign masses.
Table 1: Results of the crest method applied to the masses.
              N. Cases   # Errors   Percentage
Malignant        108        22        79.63%
Benign            99        20        79.63%
In this paper we propose entropy as a measure of the benignancy or malignancy of the case under examination. The crucial point of the method, as with many other methods proposed in the literature, is determining the threshold value. We use the threshold to classify the case under examination, that is, to assign it either to the benign class or to the malignant one. Of course, whenever real cases are used, the sets of elements that we wish to separate are not disjoint. The whole problem can then be expressed as minimising the intersection of the sets instead of the union of the sets. In this paper we have used a database that provides, for every class contained in it, a training set. The graphic in Figure 4 summarizes the results obtained by applying the entropy method and HER to the training sets for the malignant and benign masses.
The curve concerning the malignant masses plots, in percentage, the cases whose entropy value is larger than the value of the interval under examination (e.g. 100% of the examined cases have entropy values larger than 1, and 0% of the cases have entropy larger than 11). The curve concerning the benign masses shows how many cases, in percentage, have entropy values smaller than the value of the interval under examination (e.g. no case has an entropy value smaller than 2, while the entropy interval 2-16 embraces 100% of the benign cases).
Figure 4: Results obtained on the training sets using the entropy method and HER (curves: Maligne, Benigne, MaligneHer, BenigneHer; entropy values from 0 to 19 on the abscissa).
The value of the threshold is given by the abscissa of the common point between the two curves (the value that maximises the number of cases in favour of the malignant class and minimises the number of cases against the benign one).
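The same crossing point can be located numerically; the following sketch rebuilds the two curves of Figure 4 from training-set entropy values and returns the abscissa where they meet. The grid resolution and function names are our own choices, and the paper itself simply reads the value off the graphic.

```python
import numpy as np


def entropy_threshold(malignant_S, benign_S, n_grid=1000):
    """Return the entropy value where the 'fraction of malignant cases above t'
    curve crosses the 'fraction of benign cases below t' curve."""
    malignant_S = np.asarray(malignant_S, dtype=float)
    benign_S = np.asarray(benign_S, dtype=float)
    lo = min(malignant_S.min(), benign_S.min())
    hi = max(malignant_S.max(), benign_S.max())
    grid = np.linspace(lo, hi, n_grid)
    frac_mal_above = np.array([(malignant_S > t).mean() for t in grid])
    frac_ben_below = np.array([(benign_S < t).mean() for t in grid])
    crossing = int(np.argmin(np.abs(frac_mal_above - frac_ben_below)))
    return grid[crossing]
```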
Figure 5: Results obtained using a high-pass filter.
Table 2: Results of the different methods applied to the masses.

              Crest    Fourier 400    HER
Malignant    79.63%       53%        95.37%
Benignant    79.63%       53%        95.37%
The cases contained in the intersection between the set of cases with malignant masses and the set of cases with benign masses require greater attention and a more accurate, or entirely different, analysis (e.g. biopsy). To study those cases we applied the HER method as a further step. Table 2 shows the results obtained. It also reports the results obtained by applying a high-pass filter whose cut-off lets through the 200 highest frequencies of the Fourier spectrum. The results obtained with the high-pass filter are shown in Figure 5. The experimental results introduced above show that for some mammographies there is a certain degree of uncertainty; we tested those same cases with the HER method.
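A minimal sketch of such a high-pass filter is given below: it zeroes all but the highest frequency bins of the real FFT and transforms back. The exact filter used in the comparison is not specified in detail, so the bin count and signal handling here are assumptions.

```python
import numpy as np


def highpass_keep_top_frequencies(y, n_keep=200):
    """Keep only the n_keep highest frequencies of the Fourier spectrum of a
    1D signal and zero out the rest, then return the filtered signal."""
    y = np.asarray(y, dtype=np.float64)
    Y = np.fft.rfft(y)                     # bins ordered from low to high freq
    cutoff = max(len(Y) - n_keep, 0)
    Y[:cutoff] = 0.0                       # remove the low-frequency part
    return np.fft.irfft(Y, n=len(y))
```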
From Table 2 we can observe that the application of the HER method does not change the percentage much for the malignant cases, but it achieves a good performance compared to the cases in which HER was not applied. Because of the strong interest in the detection of micro calcifications in mammographies, we decided to test the crest method on this kind of signal as well; the results are summarized in Table 3. The analysis of the signals of the micro calcifications (Figure 6) highlights that, for the malignant cases, the impulses are characterized by a small amount of energy (impulse area), a significant shape and a remarkable value of the entropy in the bottom of the signal when compared to the signals of the benign ones.
Table 3: Results of the crest method when applied to micro calcifications.
              N. Cases   # Errors   Percentage
Malignant        144        10         93%
Benignant        126         9         93%
Table 4: Results of the different methods applied to micro calcifications.

              Crest    Fourier 400    HER
Malignant      93%         51%       95.83%
Benignant      93%         51%       95.83%

4. Discussion
Equation (4) shows that the entropy is given by the ratio between h' and h which, roughly speaking, may be considered a measure of the signal "roughness" with respect to the signal energy. The method proposed in this work is guided by the observation that, when a malignant lesion appears, it provokes not only alterations in the cellular tissue but also a greater disorder. We believe that characteristics such as the frequencies of the signal, the shape of the calcification, or methods relying on more or less appropriate thresholds, are less suitable. The proposed entropy measure relates the disorder of the tissue sub-domain hit by a lesion (the energy of the crest) to the total energy of the signal concerning the same sub-domain. In other words, we feel that the alterations concerning that tissue can be a valid measure of the malignancy of the lesion, or of its increase. In Figure 6 we show the signals concerning two different tissues: a tissue hit by a malignant micro calcification and a tissue hit by a benign micro calcification.
The comparison between the two signals reveals that a malignant micro calcification produces a bigger alteration (entropy) than the one produced by a benign micro calcification. It is remarkable that the entropy measure of the signal does not require a large amount of operations and is therefore computationally inexpensive. If we compare the results obtained with the entropy measure and with the Fourier transform, we may notice that for the malignant cases the percentages are quite similar. For the benign cases, instead, the results obtained with the Fourier method are better than those obtained with the entropy method. In order to better understand the comparison of the results of the different methods, it is noteworthy that the performance of the Fourier transform was evaluated considering the first 360 (or 480) harmonics, i.e. 720 (or 960) parameters are needed, instead of the 2 necessary when the entropy method is applied. We believe that the results obtained with the wavelet are strictly connected to the behaviour of the signal, characterized by the absence of a carrier wave. If we apply the HER method to those cases that are characterized by a certain degree of uncertainty, Table 2 shows that the results thus obtained are better than those of the other methods considered in our study. In any case, we believe that the application of entropy and HER is less time consuming than the other methods considered.
Figure 6: Crest of a malignant micro calcification signal (top) and of a benign micro calcification signal (bottom) after the application of the spiral method.
5. Conclusions
In this paper we introduced the entropy method, characterised by its speed and simplicity. Entropy plays an important role in visual perception problems, such as the evaluation of the order and disorder of the visual field. When applied as a first screening in breast analysis, entropy seems to work well, while the HER method appears to be more suitable for those cases that are characterised by a certain degree of uncertainty. Each of the methods presented in this paper may be used as a "second opinion" for the physicians.
The authors are grateful to Ms Giulia Napoli for her contribution.

References
1. Brambilla C. et al. (1999): 'Multiresolution wavelet transform and supervised learning for content-based image retrieval', IEEE Int. Conf. on Multimedia Computing and Systems, vol. 1, 1999, pp. 183-188.
2. Petrakis E.G.M. et al. (1997): 'Similarity searching in medical image databases', IEEE Trans. Knowledge and Data Eng., 9, pp. 435-447.
3. Brunelli R. et al. (2000): 'Image retrieval by examples', IEEE Trans. on Multimedia, 2, pp. 164-171.
4. Heath M. et al. (1998): 'Current status of the digital database for screening mammography', (Kluwer Academic Pub.), pp. 457-460.
5. Vitulano S. et al. (2000): 'A hierarchical representation for content-based image retrieval', Jour. of Visual Lang. and Comp., 5, pp. 317-326.
6. Vitulano S. et al. (2002): 'HEAT: Hierarchical Entropy Approach for Texture Indexing in Image Databases', Int. Jour. of Soft. Eng. and Knowl., 12, pp. 501-522.
7. Melloul M., Joskowicz L. (2002): 'Segmentation of microcalcifications in X-ray mammograms using entropy thresholding', CARS 2002, H.U. Lemke et al., editors.
8. Ullman J.D. (1989): 'Principles of database and knowledge-base systems', (Computer Science Press).
Proceedings of the Workshop, 2004
There is a strong need for advances in the fields of image indexing and retrieval and of visual query languages for multimedia databases. Image technology is facing both classical and novel problems in the organization and filtering of increasingly large amounts of pictorial data. Novel kinds of problems, such as indexing and high-level content-based access to image databases, human interaction with multimedia systems, approaches to multimedia data, biometrics, data mining, computer graphics and augmented reality, have grown into real-life issues. The papers in this proceedings volume relate to the subject matter of multimedia databases and image communication. They offer different approaches which help to keep the field of research lively and interesting.
World Scientific
ISBN 981-256-137-4
www.worldscientific.com
9"789812"561374"