edited by
C H Chen Electrical and Computer Engineering Department, University of Massachusetts Dartmouth, N. Dartmouth, MA, USA
L F Pau Ericsson, Sweden
P S P Wang College of Computer Science, Northeastern University, Boston, MA, USA
World Scientific
New Jersey · London · Hong Kong
Published by World Scientific Publishing Co. Pte. Ltd.
P O Box 128, Farrer Road, Singapore 912805
USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
Library of Congress Cataloging-in-Publication Data
Handbook of pattern recognition & computer vision / edited by C. H. Chen, L. F. Pau, P. S. P. Wang -- 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 9810230710 (alk. paper)
1. Pattern recognition systems. 2. Computer vision. I. Chen, C. H. (Chi-hau), 1937- . II. Pau, L.-F. (Louis François), 1948- . III. Wang, Patrick S.-P. (Shen-pai). IV. Title: Handbook of pattern recognition and computer vision.
TK7882.P3H35 1999
006.4--dc21    98-51616
CIP
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library
First published 1999 Reprinted 2001
Copyright © 1999 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
Printed in Singapore by Uto-Print
PREFACE TO THE SECOND EDITION
The progress in pattern recognition and computer vision since the publication of the first edition in 1993 has been enormous. While the first edition is by no means out of date, it is useful now to present a new edition that incorporates new activities in the field as well as the many valuable comments from readers of the first edition. More than half of the pages of this volume are new, reporting a number of new activities. This edition again contains five parts: (1) Basic Methods in Pattern Recognition, (2) Basic Methods in Computer Vision, (3) Recognition Applications, (4) Inspection and Robotic Applications, and (5) Architectures and Technology, with a total of 34 chapters of original work on various topics.

Part 1 starts with a chapter on cluster analysis, which is followed by a chapter on statistical recognition and a chapter on syntactic pattern recognition. Until about twenty years ago, these three areas were what fundamental pattern recognition was about. The re-discovery of artificial neural networks in the mid-eighties has made neural networks an essential topic of modern pattern recognition, and the fourth chapter addresses neural network computing particularly from the viewpoint of implementing pattern recognition algorithms. Additional basic pattern recognition methods are presented in the next two chapters, on Gaussian Markov random field models and on 3-D object pattern representation respectively.

The first two chapters of Part 2 deal with texture analysis and segmentation. We agree with some readers' comments that the first edition placed a lot of emphasis on texture analysis; in fact Chapters 1.5, 2.1, 2.2 and 4.2 of this edition are texture-based. This reflects the fact that more progress has been made on this topic. The third chapter, on color in computer vision, has been fully rewritten from the first edition to present some of the latest progress. The next two chapters are on projective geometry and 3-D motion analysis. The last two chapters of Part 2 present aspects of 3-D shape representation and 3-D surface-based systems.

Part 3 provides a comprehensive study of nine recognition applications: nondestructive testing of materials, speech recognition, remote sensing applications, fingerprint processing using an opto-electronic system, human chromosome classification, document processing, biomedical signal recognition, geographic data analysis, and face recognition. Obviously it is not possible to present all application areas, and readers are kindly requested to refer to the first edition and other publications for additional recognition applications.

The first two chapters of Part 4 are on fish-industry inspection and on textured surface inspection of industrial objects. The next chapter, on context-related issues in image understanding, is of concern to all inspection and robotic applications. The chapter on computer vision in postal automation represents one aspect of robotic application. The last
chapter on vision-based automatic vehicle guidance presents an area of emerging importance in intelligent transportation systems, where again robotic vision can play an important role.

Part 5 deals with several aspects of architecture and technological development. The first chapter, on vision, provides a tutorial discussion of technological issues in computer vision. The next chapter presents highlights of optical pattern recognition. Infrared imagery technology has been around for some time and its application has not been limited to military and medical areas; the third chapter deals with this technology and its use in classification problems. Another major technological development is video content analysis and retrieval, which is discussed in the fourth chapter. A VLSI architecture for computing central moments for pattern recognition and image processing is presented in the final chapter, with applications to breast cancer detection and road pavement distress detection.

We understand that even with all the chapters in this volume, there are still many activities in pattern recognition and computer vision not presented here. Our view is that a handbook is not a dictionary, and it is not desirable to cover all topics superficially. The in-depth treatment of a topic in each chapter can provide readers with a good understanding of the subject, while collectively the chapters capture well all major developments in this field.

The second edition is specially dedicated to the memory of Professor King-Sun Fu. Five chapters are written by Professor Fu's former students. During his career, Professor Fu approached almost every area of recognition application of his time. Following his emphasis on applications, we have expanded the application part considerably. We are fortunate to have Professor Azriel Rosenfeld, who knew Professor Fu well, write on the topic of vision with some speculations. Professor Rosenfeld's words of wisdom will always be helpful to the research and education community. We are also fortunate to have Professor C. C. Li, who was Professor Fu's classmate at the National Taiwan University, prepare a memorial article. We thank both of them for their contributions.

We would like to take this opportunity to express our deep gratitude to all new and old contributors to this handbook series. This volume continues to represent the most comprehensive and up-to-date handbook publication in pattern recognition and computer vision. The book will certainly help us prepare for a new century of dynamic development in this field.
The co-editors
June 1998
PREFACE
The area of pattern recognition and computer vision, after over 35 years of continued development, has now reached its maturity. The theories, techniques and algorithms are mostly well developed. There are a number of applications which are still being explored. New approaches motivated by applications and by newly available computer architectures are still being studied. Also, the recently renewed and intensive efforts on neural networks have had a great and positive impact on pattern recognition and computer vision development. Pattern recognition and computer vision will definitely play a very major role in advanced automation as we enter the 21st century. Amid all of these activities now going on, this new Handbook of Pattern Recognition and Computer Vision is much needed to cover what has been well developed in theory, techniques and algorithms, the major applications of pattern recognition and computer vision, as well as the new hardware/architecture aspects of computer vision and the related development in pattern recognition.

The previous Handbook of Pattern Recognition and Image Processing, edited by T. Y. Young and the late K. S. Fu (Academic Press, 1986), was well received. The progress in pattern recognition and computer vision has been particularly significant in the recent past. We believe this new handbook, which reflects more recent developments especially in computer vision, will serve well the increasingly large community of readers in the area. As students and friends of Prof. Fu, we remember well his vigorous efforts to broaden the frontiers of pattern recognition and computer vision in both theories and applications, to build it as an interdisciplinary area, and to lay down the foundation of intelligent and automated systems based on pattern recognition and computer vision. The book, in keeping with his vision for the area, provides an extensive coverage of major research progress since the publication of Young and Fu's book.

The book is organized into five parts. Part 1 presents a thorough coverage of the basic methods in pattern recognition including clustering techniques, statistical pattern recognition, neural network computing, feature selection, and syntactic, structural and grammatical pattern recognition. Part 2 presents comprehensively the basic methods in computer vision including texture image analysis and model based segmentation, color and geometrical tools, 3-D motion analysis, mathematical morphology, and parallel thinning algorithms. Part 3 presents several major pattern recognition applications in nondestructive evaluation, geophysical signal interpretation, economics and business, underwater signals, character recognition and document understanding, biomedical image recognition and medical image
understanding. Part 4 focuses on unique applications in inspection and robotics with topics on computer vision in the food processing industry, context modeling and position estimation for robots, and related issues. Part 5, on the other hand, examines the broader system aspects, including designing computer vision systems, optical pattern recognition, spatial knowledge representation, neural network architecture for image segmentation, architectures for computer vision and image information systems. More than 85 per cent of the chapters are original and unpublished work, while the remaining reprint chapters provide complementary coverage.

There is no doubt that a single volume handbook like this cannot examine every aspect of pattern recognition and computer vision, nor can it present the contributions of all leading researchers. However, we believe the book has captured both the scope and depth of progress in this highly dynamic and multidisciplinary area. In preparing the book, we are most fortunate to bring together contributors who are among the leaders in the area. We would like to take this opportunity to express our deep gratitude for their unselfish and timely efforts to share their expertise with the readers. We would also like to thank Dr. K. K. Phua and Ms. Jennifer Gan of World Scientific Publishing for their help and encouragement throughout the preparation of this volume.

C. H. Chen
L. F. Pau
P. S. P. Wang
September 1992
FOREWORD

VISION: SOME SPECULATIONS
AZRIEL ROSENFELD
Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA
1. Introduction

The purpose of vision is to extract useful information about the world from images. Computer vision attempts to do this by computer analysis of digitized images. Efforts along these lines have been under way for over 40 years; but even the domains first studied in the 1950s (which included handwriting, photomicrographs of biological specimens, and aerial photographs of built-up areas) still present many unsolved problems. This very brief note presents some comments, many of them intended to be provocative, about the reasons for the slow progress in the field and the ways in which successes are likely to be achieved.

2. Vision and Mathematics
As an homage to Wigner [1], this section might have been entitled "On the unsurprising ineffectiveness of mathematics in the visual sciences". Most vision problems, even those that were first tackled in the 1950s, are mathematically ill-defined (reading handwritten words, counting cells, recognizing buildings). Real-world visual domains do not satisfy simple mathematical (even probabilistic) models. Even when such models are assumed, problems that involve inferring information about a scene from images are often mathematically ill-posed or computationally intractable; but a more serious difficulty is that the models themselves are unrealistic, and are likely to remain so for a long time to come [2].

On a more positive note, mathematical and statistical tools do have their uses in formulating vision problems; in particular, they provide methods of describing image formation processes and image analysis algorithms. There also exist domains (typically involving machine-made scenes: printed documents, integrated circuits, mechanical parts, all with controlled lighting) that can be mathematically modelled quite accurately (as long as the scene is not too dirty!). In these sorts of domains, vision systems can be designed to perform quite successfully, provided they take the model seriously and bring adequate computer power to bear on the task.
3. Vision and Biology

In this section we put mathematics aside and consider what might be called the "multilegged existence theorems" for vision [3]. Animals use vision quite effectively in the real world. Apparently, simple algorithms can be quite useful for extracting useful information about natural environments (for example [4], insects rarely collide with the underbrush even when flying through it at many body lengths per second); though an organism may have trouble coping if its environment changes significantly (for example, a frog surrounded by unmoving flies will probably starve to death [5]). Simple computer vision algorithms can in fact be designed, sometimes with the aid of learning techniques, that will usually perform usefully in real-world environments; but they may fail disastrously if they encounter unusual situations.

For simple organisms, nature overcomes their individual limitations by providing large populations in which there are variations among the individuals. Thus even if many individuals fail, others survive (or learn to survive) and may pass their successful characteristics on to their descendants. (It has been suggested, half-seriously, that this might provide a biological justification for the hundreds of "yet another ..." computer vision algorithms in the literature; but note that the analogy is valid only if these algorithms are compared on a variety of tasks, so their potential advantages can be discovered!) Only a few pieces of research are truly innovative and can be regarded as major mutations; as in nature, such breakthroughs are sometimes lethal, but they may sometimes succeed in producing new successful species of vision systems. This "biological" approach could be used to discover useful computer vision algorithms through extensive experiments with real data; but it does not seem practical to develop robot vision systems in this way, since the cost of the failures would probably be unacceptable. A possible strategy for developing useful vision algorithms, without incurring many costly failures, might be to initially use large sets of recorded real-world data to design and test the algorithms.

Higher-level organisms are more flexible in their uses of vision. Their flexibility may be based on an ability to evaluate and combine the information extracted from the images in multiple visual "areas" or "pathways". This suggests that computer vision systems could benefit from the deliberate use of multiple techniques, an approach that has been generally avoided in the past because of its computational cost. Another possible basis for the successful visual performance(s) of organisms is that they make use of redundant visual data. Computer vision systems often try to reduce computational cost by analyzing only single (sets of) frames, and (for dynamic scenes) using results obtained from earlier frames as predictors to simplify the analysis of later frames. Biological organisms, on the other hand, process all the data that their visual systems provide (though results obtained from earlier processing may serve to direct attention to subsets of the results obtained from later processing). The availability of redundant data allows an organism to discover
processing errors, since they give rise to results that are not persistent. Computer vision systems are just reaching the levels of processing power that will allow them to handle, in real time, amounts of input data comparable to those handled by biological visual systems, and to apply multiple processing techniques to the data. As these levels are reached, vision system performance may significantly improve.

4. Concluding Remarks

Over the past four decades, the performance of computer vision systems has largely kept pace with available computer power. Many domains remain intractable from a theoretical standpoint, but systems that take advantage of data and algorithm redundancy may eventually achieve performances comparable to those of biological organisms, which serve as living demonstrations of the effectiveness of vision in the real world.

References

[1] E. Wigner, The unreasonable effectiveness of mathematics in the natural sciences, Comm. Pure Appl. Math. 13 (1960).
[2] A. Rosenfeld, Some thoughts about image modeling, in K. V. Mardia and G. K. Kanji (eds.), Statistics and Images 1 (Advances in Applied Statistics, Supplement to J. Appl. Stat. 20 (5/6), Carfax Pub. Co., Abingdon, Oxfordshire, UK, 1993) 19-22.
[3] R. A. Kirsch (personal communication, June 1967) used the phrase "two-legged existence theorem" as an argument for the feasibility of computer vision (or, as we were calling it then, "pictorial pattern recognition"). He probably had in mind only featherless bipeds, but an even earlier example (an existence proof for the air-to-ground ATR problem) is Skinner's 1945 demonstration that pigeons could be trained to serve as effective bombsights.
[4] J. A. Albus, personal communication, about 1990.
[5] J. Y. Lettvin, H. R. Maturana, W. S. McCulloch and W. H. Pitts, What the frog's eye tells the frog's brain, Proc. IRE 47 (1959) 1940-1951.
A MEMORIAL TO THE LATE PROFESSOR KING-SUN FU
Twelve years have gone by since the passing away of Professor King-Sun Fu, an eminent scholar and a pioneer in the field of pattern recognition, computer vision, and machine intelligence. This volume is dedicated to his memory by his friends and students as a testimonial to his profound and enduring influence. I am honored to be asked to write a memorial article. Here I will give a biography* of King-Sun as well as a short addendum to the bibliography of his published works compiled in 1985.

*Portions of the biography are reprinted, with permission, from IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-8, No. 3, pp. 291-294, May 1986. © 1986 IEEE.

King-Sun was born on October 2, 1930, in Nanking, China, the second son of his parents, General and Mrs. Tzao-Jen Fu. The Fu family's native city, however, is Hangzhou, Zhejiang. He pursued his middle school education at the Chinese Air Force Youth Preparatory School. After his father died in 1949, his mother took him and two younger brothers to Taiwan, where he matriculated in the National Taiwan University in the Fall of 1949. He was one of the top students in the Electrical Engineering Department. Not only did he excel in mathematics and engineering subjects, but he also cultivated interests in classical music and literature. Perhaps the latter pursuit laid the foundation for his own prolific writing in serial journals and scientific books. His writing was distinct: clearly expressive and succinct. The demands of academia did not serve to stifle his personal exuberance and vitality in life, however, and, in finding a balance to his studies, King-Sun was also active on the basketball and volleyball teams in the school. He was indeed a versatile student on the campus.

After graduating from the National Taiwan University with a B.S.E.E. degree in 1953 and completing one year of ROTC training, King-Sun received a graduate assistantship from the Electrical Engineering Department of the University of Toronto, Canada, and went there in September 1954 to start his graduate study. He wrote his master's degree thesis on dynamic analysis of large electric machines and received his M.A.Sc. degree in the summer of 1955. In September 1955 he transferred to the University of Illinois, Urbana, IL, USA, for his doctoral study in Electrical Engineering, and completed his Ph.D. dissertation in network theory on "An Approximation Method for Both Magnitude and Phase by Rational Functions" in February 1959 under Professor M. E. Van Valkenburg. During his three and a half years at the University of Illinois, King-Sun became immensely interested in statistical methods, information theory, abstract algebra, and modern analysis. He was a top student in Professor J. L. Doob's course
on stochastic processes, which was intended for mathematics graduate students only. This extraordinary background paved the way for his later research in sequential methods of statistical pattern recognition and machine learning. On the Urbana campus, King-Sun fell in love with Miss Viola Ou, then a graduate student in library science. They were married in Urbana, Illinois, on April 7, 1958.

After receiving his Ph.D. degree from the University of Illinois in 1959, King-Sun worked for a year and a half as a research engineer at Boeing Airplane Company, Seattle, Washington, from February 1959 to August 1960. He also taught as a special lecturer at Seattle University during the Spring Semester of 1960. In September 1960 he accepted a faculty position at Purdue University, West Lafayette, Indiana, as an assistant professor in the School of Electrical Engineering. The following semester he was selected by Purdue to be a visiting scientist with the Research Laboratory of Electronics at the Massachusetts Institute of Technology from February to June 1961. During that summer he was with the IBM Thomas J. Watson Research Center at Yorktown Heights, NY. After he returned to Purdue University in September 1961, he began to pursue his research in pattern recognition and machine intelligence - a field in whose development he played a prominent role during the next quarter of a century. He became an associate professor in September 1963 and was promoted to the rank of professor of electrical engineering in September 1966. In 1967 he was a visiting professor of electrical engineering and computer science at the University of California, Berkeley. He was the assistant head for research at the School of Electrical Engineering at Purdue from 1969 to 1972. In 1972 he was awarded a prestigious Guggenheim fellowship and was a visiting professor of electrical engineering at both Stanford University and the University of California, Berkeley. After returning to Purdue, he established the Advanced Automation Research Laboratory in the School. He was named the Goss Distinguished Professor of Engineering at Purdue University in 1975. During the Fall of 1984 he, along with other colleagues at Purdue, initiated the highly innovative program of research in intelligent manufacturing. This program resulted in the startup of the National Science Foundation Engineering Research Center for Intelligent Manufacturing Systems during early 1985. King-Sun was the founding director of the Center. At the peak of his career, his sudden death in April 1985 was a tremendous loss to our scientific and engineering community.

During the earlier years he first focused his study on statistical pattern recognition and learning systems. From 1961 to 1970 he and his students developed sequential methods for feature selection and pattern recognition, non-parametric procedures for pattern classification, a stochastic approximation approach to learning control systems, and stochastic and learning automata. His first research monograph, entitled "Sequential Methods in Pattern Recognition and Machine Learning", was published in 1968. By the late 1960's he began his unique research on syntactic pattern recognition, which was introduced by the earlier efforts of Murray Eden, R. Narasimhan, R. A. Kirsch, Robert S. Ledley, and Alan Shaw. King-Sun initiated and launched in-depth studies on stochastic context-free
programmed languages and stochastic syntax analysis for pattern recognition and image analysis. His book, "Syntactic Methods in Pattern Recognition", was published in 1974. In the ensuing years he and his students made the greatest and foremost impact on syntactic pattern recognition research. His school developed fundamental methodologies of stochastic error-correcting syntax analysis, error-correcting parsers for formal languages and, in particular, for attributed and stochastic tree grammars, and error-correcting isomorphisms of attributed relational graphs for pattern recognition. The syntactic methods for texture analysis, shape recognition, and image modeling were introduced in the late 1970's, and the three-dimensional plex grammar in 1984. Attributed grammars were developed from the viewpoint of combining syntactic and statistical pattern recognition. In the meantime, contextual information was also introduced into statistical pattern recognition. The unification of both syntactic and statistical approaches was always in his thoughts. Inference procedures of context-free programmed grammars, multi-dimensional grammars, transition network grammars, and stochastic tree grammars were developed one after another in the late 1970's through early 1980's. It is probably appropriate to say that all these constitute what we may call Fu's theory of syntactic pattern recognition. His treatise, "Syntactic Pattern Recognition and Applications", published in 1982, made this subject material more easily understandable to researchers and practitioners in various disciplines.

King-Sun and his colleagues also made important contributions to pattern recognition applications. His work on pattern classification of remotely sensed agriculture data (1969) and earth resources (1976) is considered classic in the field. During the mid 1970's through the early 1980's, his biomedical pattern recognition research extended to chest radiographic image analysis, automatic recognition of irradiated chromosomes, nucleated blood cell classification, and Pap smear and cervical cell image analysis and classification. The Moayer-Fu paper on fingerprint pattern recognition based on the syntactic approach received the 1976 outstanding paper award of the IEEE Transactions on Computers. His work on seismic signal discrimination and bright spots detection appeared in 1982 and 1985. His research on industrial automatic inspection and computer vision included IC chip inspection (1980), metal surface inspection (1984), and inspection of industrial assemblies (1985). An expert system was developed by his group for the assessment of structural damage caused by earthquakes (1983).

Since the late 1970's he envisioned the importance of integrated and special computer architectures and parallel algorithms for pattern recognition, image processing, and database management. This led to his work in the 1980's on parallel parsing of tree languages, query languages for image database systems, and VLSI implementation of parallel parsing algorithms and hierarchical scene matching. In the meantime, his research on three-dimensional object representation and shape description, orientation estimation, overlapping workpiece identification, knowledge organization, and robotic vision for path planning laid the foundation for the establishment of
the Engineering Research Center on Intelligent Manufacturing Systems by Purdue University and the National Science Foundation in 1985. As mentioned earlier, King-Sun was the chief architect and the first director of the research center.

He wrote six books, edited or co-edited eighteen books, and authored or co-authored forty-four book chapters and one hundred sixty-two serial journal papers. In addition, he authored and co-authored two hundred forty-eight conference papers. Seventy-two Ph.D. dissertations were completed under his supervision.

His activities in professional societies started in 1965-67 as the chairman of the Institute of Electrical and Electronic Engineers (IEEE) Discrete Systems Committee and the chairman of the Fifth Symposium on Discrete Adaptive Processes. Under his leadership he organized and served as the first chairman (1967-69) of the IEEE Automatic Control Group's Learning and Adaptive Systems and Pattern Recognition Technical Committee. He was on the official American delegation to the International Conference on Artificial Intelligence, Moscow, USSR, in 1967, and an official American delegate to the 1969 International Federation of Automatic Control (IFAC) International Congress held in Warsaw, Poland. He served on the administrative committee of the IEEE Automatic Control Group (1969-71) and later of the IEEE Control System Society (1974-76), was the chairman of the 1969 IEEE International Convention, a director (for IEEE) of the American Automatic Control Council in 1972, and the general chairman of the 1977 IEEE Conference on Decision and Control. He took part in the IEEE Systems, Man, and Cybernetics Society activities, beginning in 1969 when he served as the chairman of the Adaptive Systems Technical Committee of its predecessor, the IEEE Systems Science and Cybernetics Group, and was on the administrative committee of the Group (1970-72). He became the Cybernetics Technical Committee chairman (1972-76) and then the Society's Vice President for Technical Committees (1978-79). He was an associate editor of the IEEE Transactions on Systems, Man, and Cybernetics (1969-1985).

In order to provide an international forum to promote advances in pattern recognition, he and the contemporary leaders in the field organized the first International Conference on Pattern Recognition in Washington, DC, in 1973, for which he served as chairman. The biennial conferences evolved into the formation of the International Association for Pattern Recognition (IAPR) by 1976. He was elected to be its president for 1976-78, a member of its executive committee (1976-80), chairman of its long range planning committee (1979-81), and a member of its governing board (1976-85). (In memory of his distinctive contributions, the IAPR has since 1986 established the "K. S. Fu Award", to be given to a distinguished contributor in the field once every two years.) In the meantime, he reorganized the Pattern Recognition Committee into the Machine Intelligence and Pattern Analysis Technical Committee (later renamed the Pattern Analysis and Machine Intelligence Technical Committee) of the IEEE Computer Society and was its first chairman (1974-77). He was an associate editor of the IEEE Transactions on Computers during 1977-78. His initiative led to the founding of the IEEE Transactions on Pattern
Analysis and Machine Intelligence, and he served as its first editor-in-chief (1978-81) as well as a member of the editorial committee (1981-85). In addition, he served on editorial boards (editor, associate editor, and advisory board) of many other scientific journals. These include Pattern Recognition (associate editor, 1971-85); International Journal on Information Sciences (associate editor, 1970-82; editor 1982-85); Journal of Cybernetics of the American Society of Cybernetics (editorial board, 1970-85); International Journal on Computer and Information Sciences (advisory editor, 1971-85); Journal of Information Processing (editorial advisory committee, 1978-81); Journal of Analytical and Quantitative Cytology (editor, 1978-85); International Journal of Fuzzy Sets and Systems (advisory editor, 1979-1985); International Journal of Cybernetics and Systems (advisory board, 1980-85); Computer Vision, Graphics and Image Processing (associate editor, 1981-85); Pattern Recognition Letters (advisory editor, 1982-85); IEEE Transactions on Geoscience and Remote Sensing (associate editor, 1984-85); IEEE Computer (editorial board, 1983-85); and Journal of Parallel and Distributed Computing (editorial board, 1984-85). He chaired and co-chaired the Engineering Foundation Conference on Pattern Information Processing in 1972, on Algorithms for Image Processing in 1976, and on Algorithms for Image and Scene Analysis in 1978. He was Program Chairman of the 1975 IEEE-ACM Conference on Computer Graphics, Pattern Recognition and Data Structure, of the 1978 IEEE Computer Society Conference on Pattern Recognition and Image Processing, and of the 1979 IEEE Computer Society COMPSAC Conference. He chaired the 1980 IEEE Picture Data Description and Management Workshop, and initiated and chaired the IEEE Computer Society Workshop on Computer Architecture for Pattern Analysis and Image Database Management in 1981 and 1983. He was the general chairman of the 1984 IEEE Workshop on Language for Automation; Honorary Chairman of the 1984 IEEE Workshop on Visual Languages at Hiroshima, Japan; and general chairman of the 1985 IEEE International Conference on Robotics and Automation. His leadership was well recognized for organizing many international joint seminars and workshops. He served as the coordinator of the NSF supported U.S.-Japan Seminar on Learning Processes in Control Systems at Nagoya, Japan, in 1970 and the Second U.S.-Japan Seminar on Learning Control and Intelligent Control at Gainesville, Florida, in 1973; as the vice-coordinator of the US-Japan Seminar on Fuzzy Sets and Their Applications at Berkeley, California, in 1974; as a co-director of the NATO Advanced Study Institute on Pattern Recognition and Applications in 1975; as the co-chairman of the Dahlem Konferenzen on Biomedical Pattern Recognition and Image Processing at Berlin, West Germany, in 1979; and the coordinator of the NSF sponsored U.S.-France Seminar on the Applications of Pattern Recognition and Machine Intelligence to Automatic Testing at Alexandria, Virginia, in 1983. As a guest editor, he helped put together the special issue on “Feature Extractions and Selection in Pattern Recognition” of IEEE Transactions on Computers, September 1971; the special issue on “Syntactic Pattern Recognition” of Pattern Recognition, Part One,
November 1971, and Part Two, January 1972; special issue on “Pattern Recognition” of IEEE Computer, May 1976; and on “Robotics and Automation” of IEEE Computer, December 1982. He was on the IEEE Computer Society Governing Board (1978-81), and served as the Society’s Vice President for Publications and a member of the Executive Committee (1982-83) and Fellow Committee (1972-76, 1984-85). He served as the president of the Chinese Language Computer Society (1983-85). He was the Vice President (1984-85) and President-elect of the then newly formed IEEE Robotics and Automation Council. He was on the IEEE Fellow Committee (1977-79), IEEE TAB Awards and Recognition Committee (1979-78), American Federation of Information Processing Societies (AFIPS) Harry Goode Memorial Award Committee chairman (1982-85), American Society of Engineering Education (ASEE) Award Committee (1983-85), and IEEE Award Board, Education Medal Committee (1983-85). King-Sun was literally showered with honor in recognition of his monumental research contributions and contributions to the profession. He was elected a Fellow of the Institute of Electrical and Electronic Engineers in 1971. He was elected a member of the National Academy of Engineering in 1976 and a member of the Academia Sinica in 1978. He served on the National Science Foundation’s Advanced Automation Panel in 1973, Automation Research Council for 1972-78, and Committee on Cytology Automation of the National Institutes of Health, 197881. Among the many awards which he received are the Herbert N. McCoy Award in 1976 for Contributions to Science; the American Society of Engineering Education Senior Research Award in 1981 for outstanding loyalty and contributions as a pioneer in the contemporary engineering disciplines of pattern recognition, image processing and machine intelligence; IEEE Education Medal in 1982 for contributions to engineering education through inspired teaching and research in computer engineering, system theory and pattern recognition; American Federation of Information Processing Societies Harry Goode Memorial Award in 1982 in recognition of his contributions in pattern recognition and its applications and his leadership in education in information processing; Chinese Institute of Engineers - USA (CIE-USA) Achievement Award in 1983 for leadership in engineering education and contribution to pattern recognition; and the IEEE Centennial Medal in 1984. King-Sun helped Taiwan, China, with his scientific advice in various ways. Over a period of fifteen years (1970-1985), he gave invited lectures there almost every year. He was the Program Chairman of the Academia Sincia International Computer Symposium at Taipei in 1978. He helped found the Institute of Information Science, Academia Sinica and was instrumental in establishing the Microelectronics and Information Science and Technology Research Center at the National Chiao Tung University, Taiwan, in 1984. He nurtured a number of young scholars who have become the principal researchers and engineers for the vital development of computer engineering and information science in Taiwan. Likewise he educated a number of scholars from mainland China during 1979 through 1985. He was invited
to give lectures to the Institute of Automation, Chinese Academy of Sciences in 1979, and was honored as a Distinguished Visiting Professor of Beijing University, an Honorary Professor of Tsinghua University, Beijing, and an Honorary Professor of Fudan University, Shanghai.

King-Sun took great pride in his two sons, Francis and Thomas, and one daughter, June. When they were young he always spent his leisure time playing ball or other sports with them. Together with Mrs. Fu, he provided their children with the best education in the home and at school. They are all grown up now: Francis is a computer engineer; Thomas, an oceanographer; and June, a biochemist. They have their own accomplishments in their respective professions.

In spite of his overwhelming achievements, King-Sun was a modest man with great sensitivity. Considerate and generous to his friends and students, he exemplified the notion of greatness both professionally and in his personal relationships with others. Standing alongside his outstanding contributions to the scientific world, King-Sun's great wisdom and human warmth will always be remembered.

Ching-Chung Li
Department of Electrical Engineering
University of Pittsburgh
Pittsburgh, PA 15261, USA
January 1997
Addendum to "A Bibliography of Published Works of the Late Professor King-Sun Fu," IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. PAMI-8, No. 3, pp. 295-300, May 1986.

At the time of compiling the bibliography, eight items were missed (seven of which appeared later in 1986 and 1987). To make his bibliography complete, these are listed below.
Books:
K. S. Fu, T-X Cai and G-Y Xu, Artificial Intelligence and Its Applications (in Chinese) (Beijing: Tsinghua University Press, 1987).
Book Chapters:
K. S. Fu, M. Ishizuka and J. T. P. Yao, "Application of Fuzzy Sets in Earthquake Engineering," in Fuzzy Set and Possibility Theory, R. R. Yager, Ed. (New York: Pergamon Press, 1982) 504-523.

Serial Journal Articles:
E. K. Wong and K. S. Fu, "A Hierarchical Orthogonal Space Approach to Three-Dimensional Path Planning," IEEE Journal of Robotics and Automation, Vol. RA-2, March 1986, 42-53.
W. C. Lin and K. S. Fu, “A Syntactic Approach to Three-Dimensional Object Recognition,” IEEE Trans. Syst., Man, Cybern., SMC-16, May/June 1986, 405-422.
H-S Don and K. S. Fu, “A Parallel Algorithm for Stochastic Image Segmentation,” IEEE Trans. Pattern Anal. Machine Intell., PAMI-8, Sept. 1986, 594-603.
M. A. Eshera and K. S. Fu, "An Image Understanding System Using Attributed Symbolic Representation in Inexact Graph-Matching," IEEE Trans. Pattern Anal. Machine Intell., PAMI-8, Sept. 1986, 604-618.

K-Y Huang and K. S. Fu, "Decision-Theoretic Approach for Classification of Ricker Wavelets and Detection of Seismic Anomalies," IEEE Trans. Geoscience and Remote Sensing, GE-25, March 1987, 118-123.

S. Basu and K. S. Fu, "Image Segmentation by Syntactic Method," Pattern Recognition, 20, No. 1 (1987), 33-44.
CONTENTS

Preface to the Second Edition    v
Preface    vii
Foreword    ix
A Memorial to the Late Professor King-Sun Fu    xiii
Contents    xxi

PART 1. BASIC METHODS IN PATTERN RECOGNITION
1.1 Cluster Analysis and Related Issues    3
    Richard C. Dubes
1.2 Statistical Pattern Recognition    33
    Keinosuke Fukunaga
1.3 Syntactic Pattern Recognition    61
    Kou-Yuan Huang
1.4 Neural Net Computing for Pattern Recognition    105
    Yoh-Han Pao
1.5 On Multiresolution Wavelet Analysis using Gaussian Markov Random Field Models    143
    C. H. Chen and G. G. Lee
1.6 A Formal Parallel Model for Three-Dimensional Object Pattern Representation    183
    P. S. P. Wang

PART 2. BASIC METHODS IN COMPUTER VISION
2.1 Texture Analysis    207
    Mihran Tuceryan and Anil K. Jain
2.2 Model-Based Texture Segmentation and Classification    249
    R. Chellappa, R. L. Kashyap and B. S. Manjunath
2.3 Color in Computer Vision: Recent Progress    283
    Glenn Healey and Quang-Tuan Luong
2.4 Projective Geometry and Computer Vision    313
    Roger Mohr
2.5 3-D Motion Analysis from Image Sequences using Point Correspondences    339
    John J. Weng and Thomas S. Huang
2.6 Signal-to-Symbol Mapping for Laser Rangefinders    387
    Kenong Wu and Martin D. Levine
2.7 3-D Vision of Dynamic Objects    425
    S.-Y. Lu and Chandra Kambhamettu

PART 3. RECOGNITION APPLICATIONS
3.1 Pattern Recognition in Nondestructive Evaluation of Materials    455
    C. H. Chen
3.2 Discriminative Training - Recent Progress in Speech Recognition    473
    Shigeru Katagiri and Erik McDermott
3.3 Statistical and Neural Network Pattern Recognition Methods for Remote Sensing Applications    507
    Jon Atli Benediktsson
3.4 Multi-Sensory Opto-Electronic Feature Extraction Neural Associative Retriever    535
    H.-K. Liu, Y.-H. Jan, Neville I. Marzwell and Shaomin Zhou
3.5 Classification of Human Chromosomes - A Study of Correlated Behavior in Majority Vote    567
    Louisa Lam and Ching Y. Suen
3.6 Document Analysis and Recognition by Computers    579
    Yuan Y. Tang, M. Cheriet, Jiming Liu, J. N. Said and Ching Y. Suen
3.7 Pattern Recognition and Visualization of Sparsely Sampled Biomedical Signals    613
    Ching-Chung Li, T. P. Wang and A. H. Vagnucci, M.D.
3.8 Pattern Recognition and Computer Vision for Geographic Data Analysis    625
    F. Cavayas and Y. Baudouin
3.9 Face Recognition Technology    667
    Martin Lades

PART 4. INSPECTION AND ROBOTIC APPLICATIONS
4.1 Computer Vision in Food Handling and Sorting    687
    Hordur Arnarson and Magnus Asmundsson
4.2 Approaches to Texture-Based Classification, Segmentation and Surface Inspection    711
    Matti Pietikainen, Timo Ojala and Olli Silven
4.3 Context Related Issues in Image Understanding    737
    L. F. Pau
4.4 Position Estimation Techniques for an Autonomous Mobile Robot - A Review    765
    Raj Talluri and J. K. Aggarwal
4.5 Computer Vision in Postal Automation    797
    G. Garibotto and C. Scagliola
4.6 Vision-Based Automatic Road Vehicle Guidance    817
    Dieter Koller, Quang-Tuan Luong, Joseph Weber and Jitendra Malik

PART 5. ARCHITECTURE AND TECHNOLOGY
5.1 Vision Engineering: Designing Computer Vision Systems    857
    Rama Chellappa and Azriel Rosenfeld
5.2 Optical Pattern Recognition for Computer Vision    869
    David Casasent
5.3 Infra-Red Thermography: Techniques and Applications    891
    M. J. Varga and P. G. Ducksbury
5.4 Viewer-Centered Representations in Object Recognition: A Computational Approach    925
    Ronen Basri
5.5 Video Content Analysis and Retrieval    945
    Hongjiang Zhang
5.6 VLSI Architectures for Moments and Their Applications to Pattern Recognition    979
    Heng-da Cheng, Chen-Yuan Wu and Jaguang Li

Index    1003
PART 1
BASIC METHODS IN PATTERN RECOGNITION
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 3-32
Eds. C. H. Chen, L. F. Pau and P. S. P. Wang
© 1998 World Scientific Publishing Company
CHAPTER 1.1

CLUSTER ANALYSIS AND RELATED ISSUES
RICHARD C. DUBES
Department of Computer Science, Michigan State University, East Lansing, MI 48824-1087, USA

This chapter explains how cluster analysis organizes information in applications such as Computer Vision and Pattern Recognition. Information is represented as points in multidimensional feature spaces where each coordinate represents a measurement. Some tools from exploratory data analysis are discussed, with an emphasis on linear projections derived from the covariance matrix. Two types of clustering are reviewed - hierarchical and partitional. Hierarchical clustering leads to nested partitions of the data. SAHN algorithms for hierarchical clustering are defined and some of the common characteristics are explained. Partitional clustering arranges data in separate clusters, as with the K-Means algorithm. The chapter ends with a discussion of validation that centers on external and internal tests of validity and tests for the number of clusters. A bibliography is provided for further reading.

Keywords: Proximity, exploratory data analysis, projection, hierarchies, dendrograms, K-means, cluster validity, algorithms.
1. Introduction
Organizing information is an essential part of any learning task. Cluster analysis is the formal study of methods and algorithms for objectively organizing numerical data. One finds cluster analysis in the literature of almost all disciplines, including engineering, statistics, psychology, sociology, biology, astronomy, business, medicine, archeology, psychiatry, geography, anthropology, economics, and computer science, to name a few. No single definition of "cluster" is universally accepted. Cluster analysis includes the process of "looking" at data, known as exploratory data analysis, which is a tool for igniting creativity and suggesting alternative models for the data. This chapter views cluster analysis as the initial step in organizing numerical data so as to abstract the essence of the data and describe the data as simply as possible. The discussion is informal and omits mathematical proofs. Computational issues are mentioned only briefly.

Cluster analysis is sometimes called "unsupervised learning" because only actual observations affect the data organization. By contrast, pattern recognition uses a priori labels to "learn" the parameters of models for the categories, or pattern classes, present in the data. A pattern recognition algorithm seeks to define a good
decision rule for labeling patterns of unknown origin, based on information gleaned from labeled patterns. The algorithms of cluster analysis make no decisions but fit various structures, such as partitions and hierarchies, to the data. Although the literature in several fields of application carries papers on clustering, the only journal exclusively devoted to the methodology of clustering is the Journal of Classification, published by the Classification Society of North America since 1984. Some general books on the topic are [1,2,3,4,5].

One application, image segmentation, will help explain the context of this chapter. Each pixel, or each small sub-image, is characterized by a set of numbers [6]. Candidates for the numbers are co-occurrence features, gray-level intensities, measures of fractal dimension, estimates of Markov random field parameters, and other indices popular in the computer vision community. Cluster analysis labels each pixel or sub-image so that regions from the same underlying class, such as land-use category in remote sensing, have the same label and regions from different classes have different labels. The organization is done in the feature space, in which each axis represents one of the measurements. One must develop faith in the clustering algorithm and must formally validate the results. The cluster labels are then transferred to the image for interpretation. This chapter will concentrate on the process of assigning the cluster labels, and not on the choice of features.

Figure 1 is an overview of the most important aspects of a typical cluster analysis. Once data have been gathered and some type of exploratory data analysis has been applied to evaluate the data representation, one can apply clustering tendency algorithms to ensure that the data are not random. This avoids the embarrassment and futility of imposing sophisticated procedures for analyzing data that contain no clusters [4]. This chapter covers two types of clustering: hierarchical clustering creates a complete hierarchy, or nested sequence of partitions; partitional clustering creates one partition of the data. Omitted due to lack of space are treatments of fuzzy clustering [7,8], conceptual clustering [9,10], and any mention of neural nets [11]. The validation step, in which one applies statistical tests to ensure that the structure recovered from the clustering algorithm is "real" in some sense, is the most difficult step of the entire process. The interpretation of the results requires experience and interaction with the expert in the field of application. The entire process, or any part of it, may need to be repeated until one is satisfied with the result. All this effort should reveal the underlying structure of the data so that sharper and more definitive studies can be planned.
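To make the hierarchical/partitional distinction concrete, the following is a minimal sketch, not part of the original chapter, that runs one algorithm of each type on small synthetic data. It assumes NumPy and SciPy are available; the data, the linkage method, and the choice of three clusters are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
# Toy pattern matrix: n = 60 patterns, d = 2 features, drawn around three centers.
X = np.vstack([rng.normal(center, 0.3, size=(20, 2))
               for center in ([0.0, 0.0], [3.0, 0.0], [0.0, 3.0])])

# Hierarchical clustering builds a nested sequence of partitions (a dendrogram);
# cutting the hierarchy at three clusters yields one flat partition.
Z = linkage(X, method="average")                  # (n-1) x 4 merge history
hier_labels = fcluster(Z, t=3, criterion="maxclust")

# Partitional clustering (K-means) produces a single partition directly.
centroids, km_labels = kmeans2(X, 3, minit="points")

print("hierarchical cluster sizes:", np.bincount(hier_labels)[1:])
print("k-means cluster sizes:     ", np.bincount(km_labels))
```

Either result would still have to pass the validation step discussed above before its labels are trusted.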
2. Data

The procedures and algorithms of cluster analysis are geared towards the type and scale of the data, so this section begins by reviewing some basic definitions about data in Section 2.1 and about normalization in Section 2.2.
Fig. 1. Methodology of cluster analysis.
The most important characteristic of a set of data is its dimensionality, which is briefly explained in Section 2.3.

2.1. Representing Numerical Information
2.1.1. Scale and type

Data occur in several types and scales. The simplest unit of data is a number. Vectors and matrices are built from numbers, but all numbers should be on the same scale and have the same type.

The scale of a number refers to its relative significance. Usually recognized are nominal, ordinal, interval, and ratio scales. A number on a nominal scale is simply a numerical tag, such as "1" for "Ford", "2" for "Chevy", and "3" for "Toyota". Numbers on an ordinal scale have significance only in their relative positions. Numbers on nominal and ordinal scales are sometimes called qualitative, whereas numbers on the interval and ratio scales are called quantitative. The gap between numbers has significance when the numbers are on an interval scale. If, for example, a person were asked to state his preference for soda on a scale of "1" to "10" with "10" being most preferred, then responses 1, 5, 9 and 1, 2, 9 would have different meanings on an interval scale, but not on an ordinal scale.
The most important data scale in engineering work is the ratio scale, which is the interval scale with a natural zero. Data from sensors, and numbers which can be placed on the real line, are examples. For example, distance is measured on a ratio scale. Doubling the distance between two towns means using twice as much gas to get between them, whatever the unit of distance. Temperature, on the other hand, is an interval measurement because its significance depends on the unit. Measuring in degrees Kelvin is a ratio-scale measurement, while temperature in degrees Celsius is an interval-scale measurement.

Data type refers to the degree of quantization. The three types recognized here are binary, or two-valued, discrete, or multi-valued, and continuous, or data taken from the mathematical real line. Binary data, also called dichotomous data, are for situations where the possible responses are ("yes", "no"), or ("on", "off"). A discrete type has a small number of values, where "small" depends on the situation. Since instruments have finite resolution, all data measured in the real, as opposed to the mathematical, world are discrete. Calculus ordinarily requires that data be continuous. Thus, we often assume data are continuous and ignore the unpleasant reality.

2.1.2. Patterns and proximity

Whatever the scale and type, data are collected in one of two basic formats, called a pattern matrix and a proximity matrix. A pattern matrix represents each object under examination as a set of measurements. Each measurement is called a feature and a pattern is a set of feature values measured on an object. The set of d measurements forms the feature space, each feature corresponding to one orthogonal axis. A pattern matrix is thus an n x d matrix, where n is the number of patterns and d is the number of features. The notation for the jth feature of pattern i will be x_{ij} and the ith pattern itself will be denoted by the column vector x_i. Letting superscript T denote matrix transpose,

    x_i = [x_{i1} \; x_{i2} \; \cdots \; x_{id}]^T .

We require n >> d and think of the patterns as a swarm of n points floating in a d-dimensional space. The transpose of the n x d pattern matrix X is written as

    X^T = [x_1 \; x_2 \; \cdots \; x_n] .
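As a concrete illustration of this notation (a sketch of mine, with made-up values rather than the speaker data described next), a pattern matrix is simply an n x d array whose rows are the patterns and whose columns are the features:

```python
import numpy as np

# Hypothetical pattern matrix: n = 4 patterns, d = 3 features (values invented).
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.3],
              [6.3, 3.3, 6.0],
              [5.8, 2.7, 5.1]])

n, d = X.shape            # n patterns, d features
x_2 = X[1]                # the pattern vector x_2, a point in d-dimensional feature space
feature_3 = X[:, 2]       # all n values of the third feature
X_T = X.T                 # the d x n transpose, whose columns are the pattern vectors

print(n, d, x_2, feature_3, X_T.shape)
```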
A pattern matrix called the speaker data, consisting of 40 rows (patterns) and 5 columns (features), will be used to demonstrate several procedures. A listing is given in Table 1. Each pattern represents a spoken phrase and each feature represents a measurement taken on the spectrum derived from the phrase. There are eight categories, one for each person involved in the study. Eight male speakers spoke the same phrase five times; patterns 1-5 are from speaker number 1, patterns 6-10 from speaker number 2, and so forth.
Table 1. Speaker data. (The table lists, for each of the 40 patterns, its speaker category (1-8), its pattern number, and its five feature values; there are five patterns per speaker.)
while the features are on a ratio scale since all have natural zeros. The features are assumed to be continuous. A proximity matrix is a square, symmetric matrix. Its rows and columns both correspond to patterns, or t o features. The (2, j) entry contains an index of proximity that denotes the degree of closeness or alikeness between the objects corresponding to row i and column j . If the proximity matrix is a dissimilarity matrix, then the larger the entry (i, j ) , the less items i and j resemble one another, as when Euclidean distance measures the proximity between two patterns. In a similarity matrix, a large value indicates a close resemblance between the two objects, as when a correlation coefficient measures the proximity between two features. 2.1.3. Indices of proximity Clustering algorithms require that an index of proximity he established between all pairs of items. Anderberg [l]defines several such measures of proximity. This chapter covers only the case when proximities are computed from a pattern matrix. An index of dissimilarity d(q, r ) between patterns xq and x, is a real-valued function satisfying the following for all q and r .
An index of similarity satisfies the first two conditions, but replaces the third with d(q, q) >= max_r {d(q, r)}. The most common index of dissimilarity in engineering work is the Minkowski metric (2.4).

    d(q, r) = [ Σ_{j=1}^{d} |x_{qj} - x_{rj}|^m ]^{1/m}    (2.4)
See [12] for a full discussion of this and related indices of dissimilarity. Common parameter values are m = 2, or Euclidean distance, m = 1, or Manhattan distance, also called taxicab and city-block distance, and m -> infinity, or "sup" distance. The Euclidean distance must be carefully separated from the squared Euclidean distance. The Minkowski metric is for continuous data on a ratio scale and is itself on a ratio scale. The Minkowski metric satisfies the triangle inequality.

    d(q, r) <= d(q, s) + d(s, r)  for all (q, r, s)    (2.5)
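For readers who want to experiment, the following is a minimal Python sketch of the Minkowski dissimilarity (2.4) and of building a full dissimilarity matrix from an n x d pattern matrix. It assumes NumPy, which is a choice of this sketch rather than anything prescribed by the chapter, and the function and variable names are illustrative only.

import numpy as np

def minkowski(x, y, m=2.0):
    # Minkowski dissimilarity of order m between two pattern vectors (2.4);
    # m = 2 gives Euclidean distance, m = 1 gives city-block distance.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(m):
        return np.max(np.abs(x - y))        # the "sup" distance (m -> infinity)
    return np.sum(np.abs(x - y) ** m) ** (1.0 / m)

def dissimilarity_matrix(X, m=2.0):
    # Pairwise dissimilarities between the rows (patterns) of an n x d pattern matrix.
    n = len(X)
    D = np.zeros((n, n))
    for q in range(n):
        for r in range(q + 1, n):
            D[q, r] = D[r, q] = minkowski(X[q], X[r], m)
    return D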
A common index of similarity between features u and v is the sample correlation coefficient.
The sample means, m_u and m_v, and the sample standard deviations, s_u and s_v, are defined in (2.7). The sample correlation coefficient indicates the degree of linear dependence between two features, with a value of 0 indicating linear independence. The absolute value measures the degree of resemblance between two features since negative and positive correlations with the same magnitude have the same interpretation. A number of indices of proximity for binary data, such as the simple matching coefficient and the Jaccard coefficient, have been proposed [1,4] but are not discussed here.

2.2. Normalization and Standardization
A pattern matrix is normalized to equalize the contributions of the features to a projection or a clustering. Milligan and Cooper [13] noted that some normalizations that appeal to our intuition may affect performance in unexpected ways. In this section, a "*" superscript will denote the raw, or un-normalized, data, as in x*_{ij}. Throughout the remainder of the chapter, the context must make clear which normalization has been applied. One common normalization is to move the origin of the feature space to the grand mean vector, which is the vector of sample means for the d features. The sample mean m_j and the sample variance s_j^2 for feature j are defined in (2.7).
The origin is shifted to the grand mean vector by defining, for all j = 1, 2, ..., d and i = 1, 2, ..., n:

    x_{ij} = x*_{ij} - m_j .    (2.8)
This shifting of the origin does not affect Euclidean distance and merely simplifies equations. The z-score normalization divides each feature by its standard deviation s_j in (2.9).

    x_{ij} = (x*_{ij} - m_j) / s_j    (2.9)
This normalization stretches or squeezes each of the coordinate axes to equalize the spreads along all axes. A third normalization, called the range method, reduces all features to the range [0, 1] by subtracting the smallest value in each column and dividing by the range of each column.

    x_{ij} = (x*_{ij} - min{x*_{ij}}) / (max{x*_{ij}} - min{x*_{ij}})

The min and max are taken over the jth column of the pattern matrix. Milligan and Cooper [13] found that the range normalization outperformed the z-score normalization when extracting the structure of the data.
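A small Python sketch of the three normalizations just described is given below. It assumes NumPy, uses the population (divide-by-n) standard deviation for the z-score since the chapter's divisor convention is not reproduced here, and the function name is illustrative.

import numpy as np

def normalize(X_raw, method="zscore"):
    # Column-wise normalization of a raw n x d pattern matrix.
    X_raw = np.asarray(X_raw, dtype=float)
    m = X_raw.mean(axis=0)                      # grand mean vector (one m_j per feature)
    if method == "mean":                        # origin shift to the grand mean, as in (2.8)
        return X_raw - m
    if method == "zscore":                      # z-score normalization, as in (2.9)
        return (X_raw - m) / X_raw.std(axis=0)
    if method == "range":                       # range normalization to [0, 1]
        lo, hi = X_raw.min(axis=0), X_raw.max(axis=0)
        return (X_raw - lo) / (hi - lo)
    raise ValueError("unknown method: " + method)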
2.3. Dimensionality
The dimensionality of data refers to the number of independent parameters required to describe the data. These parameters for each pattern are often taken to be the feature values on the d coordinate axes. However, data can sometimes be projected to fewer than d dimensions, as discussed in Sections 3.1 and 3.2. The number of dimensions in the target space is called the intrinsic dimensionality [14,15] since it represents the dimensionality suggested by the data themselves and provides a parsimonious characterization of the data. Dimensionality has taken on new importance with the emergence of fractal geometry as a cross-disciplinary field of study [16]. The true dimensionality of the data, whether fractional or integer, may become an important characteristic of the data in exploratory data analysis. The estimation of fractal dimensionality [17] is beyond the scope of this paper.

3. Exploratory Data Analysis
Cluster analysis includes a variety of heuristic procedures for getting to know data. Whereas clustering algorithms strive to be objective and quantitative, exploratory data analysis is purposefully subjective. The goal is to use whatever tools are available to look at the data, with emphasis on graphs, charts, projections, and any graphical representation that assists the visual system and nudges intuition. Everitt [18] describes several techniques that go beyond simple graphing of functions. Recent advances in graphical computer displays have made the techniques of exploratory data analysis widely available. Two that I have found useful are the S program [19], originally developed at AT&T, and the MacSpin program for the Macintosh. (The S software is available from Statistical Sciences, Inc., 1700 Westlake, Seattle, WA 98109, USA; the MacSpin program is marketed by Abacus Concepts.) A few standard techniques are explained in this section to indicate the flavor of exploratory data analysis.

3.1. Linear Projections
Why represent d-dimensional data in two or three dimensions? One reason is to be able to see the data. A second reason is to simplify the data by eliminating redundancy and isolating the important characteristics of the data. The process of representing data in a new space is sometimes called ordination. No two- or three-dimensional representation can fully capture the intricacies and complexity of, say, ten-dimensional data. A linear projection does little "violence" to the data since relative distances are preserved and certain geometrical characteristics are maintained. Several schemes for representing data have been proposed [20], including discriminant analysis [4]. The transformation discussed here has been called the principal component, the Karhunen-Loeve, and, simply, the eigenvector transformation. It is based on the eigenvalues and eigenvectors of the d x d sample covariance matrix R computed from the given n x d pattern matrix X.
Equation (3.1) assumes that either (2.8) or (2.9) has been applied to the raw data. If (2.8) has been applied, R is a covariance matrix whose diagonal entries are the variances of the columns in X. If (2.9) has been applied, R is a correlation matrix, meaning that all diagonal entries are unity and off-diagonal entries are correlation coefficients between -1 and 1. Most linear algebra texts show that the eigenvalues of R are solutions for the scalar λ to the determinant equation (3.2).

    |R - λI| = 0    (3.2)

Here, I is a unit matrix of order d x d and 0 is a vector of zeros. If R has full rank d, and we denote its (necessarily real and non-negative) eigenvalues by λ_1 >= λ_2 >= ... >= λ_d, then the eigenvectors c_1, c_2, ..., c_d are orthonormal (column) vectors satisfying, for each j from 1 to d:

    (R - λ_j I) c_j = 0    (3.3)
    c_j^T c_k = 0  if j != k    (3.4)
    c_j^T c_k = 1  if j = k .    (3.5)
Eigenvectors are not unique. The eigenvector transformation from d to m <= d dimensions is defined by a coefficient matrix C whose rows are the eigenvectors corresponding to the m largest eigenvalues of R. If x_i is a pattern in the original d-dimensional space, its image y_i is the m-vector defined in (3.6).

    y_i = C x_i ,  where C = [c_1  c_2  ...  c_m]^T    (3.6)
If the rank of R is full and m = d, then (3.6) rotates the coordinate axes and decorrelates the features. That is, the covariance matrix computed from the new vectors {y_i} is a diagonal matrix whose diagonal entries are the eigenvalues λ_1, λ_2, ..., λ_d. Thus, the eigenvalues of R can be interpreted as sample variances in the rotated space. The d-dimensional patterns can be projected to two or three dimensions for viewing by setting m to 2 or 3 in (3.6). Why the top rows? A reasonable criterion for projecting data is square-error, meaning that the target space saves as much of the variance as possible. The total variance is the sum of the diagonal elements of R:

    Σ_{j=1}^{d} r_{jj} = Σ_{j=1}^{d} λ_j .
Since the eigenvalues are ordered by size, it makes sense to save eigenvectors corresponding to the largest eigenvalues. Tou and Heydorn [21] phrase this problem nicely and prove these facts. Figure 2 shows the eigenvector transformation for the speaker data both with pattern numbers and with category labels. Some clustering of the patterns is suggested, especially in Fig. 2(b).
Fig. 2. Eigenvector projections of speaker data; (a) pattern labels; (b) category labels.
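As an illustration of the eigenvector (principal component) transformation described above, the following Python sketch projects an n x d pattern matrix to m dimensions. It assumes NumPy, uses np.cov (which divides by n - 1) for the covariance matrix R, and the names are illustrative.

import numpy as np

def eigenvector_projection(X, m=2):
    # Project the patterns onto the m eigenvectors of R with the largest eigenvalues (3.6).
    Xc = X - X.mean(axis=0)               # shift the origin to the grand mean, as in (2.8)
    R = np.cov(Xc, rowvar=False)          # d x d sample covariance matrix
    evals, evecs = np.linalg.eigh(R)      # eigh returns ascending eigenvalues for symmetric R
    order = np.argsort(evals)[::-1]       # reorder so the largest eigenvalues come first
    C = evecs[:, order[:m]].T             # rows of C are the m leading eigenvectors
    return Xc @ C.T                       # n x m matrix whose rows are the projected patterns

# Example: Y = eigenvector_projection(speaker_data, m=2) would give a plot like Fig. 2,
# where speaker_data is a 40 x 5 array holding Table 1 (not reproduced here).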
3.2. Nonlinear Projections

Nonlinear projections share the goals of linear projections but the details differ. The projection introduced by Sammon [22] illustrates these differences. Sammon's projection begins with an n x n dissimilarity matrix whose entries are Euclidean distances between all pairs of patterns in the d-dimensional feature space. A set of n points is scattered randomly in a portion of the two-dimensional target space, with each point representing one pattern. Given such a configuration of n points, the following stress criterion can be computed.

    E = [ 1 / Σ_{q<r} d(q, r) ]  Σ_{q<r} [d(q, r) - D(q, r)]^2 / d(q, r)
Here, d(q, r) is the Euclidean distance between patterns q and r in the d-dimensional feature space and D(q, r) is the Euclidean distance between the points representing patterns q and r in the two-dimensional target space. The sums are over all pairs of patterns 1 <= q < r <= n. Stress is a function of the 2n coordinates of the points in the target space. The idea is to move the points around so as to minimize E. The configuration at which E is minimum is taken to be the Sammon projection. A configuration for which D(q, r) = d(q, r) for all q and r would be ideal, but it is seldom possible to match all d-dimensional distances in two dimensions. The stress function resembles the criterion function in multidimensional scaling [23]. However, multidimensional scaling begins with ordinal data and Sammon's method begins with ratio data. Since E uses the difference between two distances, one must have ratio-scale dissimilarities between all pairs of patterns. The denominators normalize the stress. The term d(q, r) in the sum on the right weights small distances more heavily than large ones and tends to preserve local structure. The multiplier on the left, which is fixed throughout the minimization procedure, tends to make E insensitive to changes in scale and sample size.

Several algorithms exist for minimizing functions of many variables. Sammon [22] proposed a gradient descent algorithm. Simulated annealing has also been tried [24] but the results were disappointing. Whatever the minimization technique, one encounters the usual problems of stopping at local, rather than global, minima and of dependence on the starting configuration. Any minimization should be run several times and the run achieving the smallest stress should be retained. Our implementation requires that the user supply the maximum number of iterations allowed in the gradient descent algorithm and a "magic factor", which influences the internal stopping criterion.

An advantage of a nonlinear projection over a linear projection is its greater flexibility and ability to "see" complex structures. A disadvantage is that a nonlinear method can distort the data unduly and paint a misleading picture. In addition, extra points cannot be easily located in the target space of a nonlinear projection, whereas the entire feature space can be projected to two dimensions with (3.6). Figure 3 pictures the speaker data with Sammon's projection. The gradient descent algorithm was run ten times and the best of the ten runs is exhibited. The stress values for the ten runs ranged from 0.0009157 to 0.02512. The algorithm always stopped because it had achieved a minimum, not because it had reached the maximum number of iterations. Figures 2 and 3 agree on the structure of the data and provide evidence for the conclusion that the clustering by category is real, and not an artifact of an algorithm. One can try to name the variables on the Sammon plot in Fig. 3 from characteristics of the pattern, but no conclusions about correlation between these new variables can be drawn. The last step in the Sammon projection is to apply the eigenvector rotation to decorrelate the variables.
Fig. 3. Sammon projection of speaker data; (a) pattern labels; (b) category labels.
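The stress criterion is easy to compute once both sets of pairwise distances are in hand. The following Python sketch (assuming NumPy and SciPy; the names are illustrative) evaluates Sammon's stress for a candidate two-dimensional configuration Y of the patterns X; the minimization itself, by gradient descent or otherwise, is not shown.

import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X, Y):
    # d: pairwise Euclidean distances in the d-dimensional feature space.
    # D: pairwise distances between the representative points in the target space.
    d = pdist(X)
    D = pdist(Y)
    keep = d > 0                       # guard against coincident patterns
    d, D = d[keep], D[keep]
    # Each squared error is weighted by 1/d(q, r), and the sum of the d(q, r)
    # normalizes the result, so small distances are emphasized.
    return np.sum((d - D) ** 2 / d) / np.sum(d)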
3.3. Graphical Procedures

Everyone has their favorite graphical representation of data. A few possibilities are mentioned in this section. Both the columns (features) and the rows (patterns) of the pattern matrix can be depicted by graphical methods. Box plots and histograms provide quick looks at the distribution of a feature. Figure 4(a) shows box plots for the first three features of the speaker data. Each box identifies the 25th and 75th percentiles of the data. The horizontal line inside the box is at the median value. The dashed vertical lines are drawn from the minimum value to the 25th percentile and from the 75th percentile to the maximum. A star is drawn to signal that an outlier has occurred according to a particular criterion. Figure 4(b) gives histograms for the same three features. The number of bins is about 7 in this case. The ordinate is the number of feature values in each bin. Box plots are best for quick comparisons among features. Differences in ranges and locations are obvious at a glance. The fact that feature 3 has a larger right tail than left is also obvious. Histograms provide more detailed information and exhibit the modes in multi-modal data.

Graphical representations of the patterns can also be useful. Figure 5 exhibits Chernoff faces [25] for the first 20 patterns from the speaker data. Each face depicts one row in the pattern matrix by associating one characteristic of the face with each feature. Feature 1 is proportional to the area of the face, feature 2, to the shape
Fig. 4. Graphical views of the first three features of the speaker data. (a) Box plots; (b) histograms.
Fig. 5. Chernoff faces of the speaker data.
of the face, feature 3, to the length of the nose, feature 4, to the location of the mouth, and feature 5 is related to the curve of the smile. The faces are arranged in Fig. 5 so that each column contains faces from a different category. Associations
among the patterns can be recognized by mentally clustering the faces. Outliers can sometimes be quickly identified. For example, pattern 19 in Fig. 5 might be considered different from other faces. All features were normalized to fall between 0 and 1 before the faces were drawn.

4. Hierarchical Clustering

Clustering data means grouping either patterns or features such that items in the same group are more alike than are items in different groups. The objective is to abstract the essence of the data by isolating groups, or clusters, which explain the data. A hierarchical clustering is a sequence of nested groupings. In some biological applications, the hierarchical structure itself is fitted to the data. In many engineering problems, one searches the hierarchy for a single significant grouping. This section explains the basic mathematical structure for describing hierarchical clustering and presents some standard clustering algorithms. For more complete treatments, see [1,4,2,26] and Chapter 5 of [27].
4.1. Hierarchies and Dendrograms
A hierarchical clustering method is a mathematical procedure for creating a sequence of nested partitions from a proximity matrix, assumed to be a dissimilarity matrix for patterns. Let X denote the set of n patterns. A partition C = {C_1, C_2, ..., C_m} of X is a set of disjoint, non-empty subsets of X which, taken together, constitute X. That is, if i != j,

    C_i ∩ C_j = ∅ ;  C_1 ∪ C_2 ∪ ... ∪ C_m = X .    (4.1)

where "∩" denotes set intersection, "∪" denotes set union and "∅" denotes the empty set. Each set C_i is called a component of the partition. A clustering is a partition of X; its components are formally called clusters. A hierarchical clustering is a sequence of nested partitions starting with the trivial clustering in which each pattern is in a unique cluster and ending with the trivial clustering in which all patterns are in the same cluster. Partition B is nested into partition C if every component of B is a subset of a component of C, so C is formed by merging components of B. A dendrogram is a binary tree that depicts a hierarchical clustering. Each node in the tree represents a cluster. Cutting the dendrogram horizontally creates a clustering. A dendrogram represents n clusterings of n patterns, including the two trivial clusterings. Figure 6 is a simple dendrogram for five patterns. The five pattern labels can be permuted in several ways without altering the information in the dendrogram. Murtagh [28] enumerated several types of dendrograms. Dendrograms can simply picture the clusterings, as in Fig. 6, or can have a scale showing the level of dissimilarity at which each clustering is formed.
Fig. 6. Dendrogram for five patterns.
4.2. Recovered Structure and Ultrametricity
A hierarchical clustering method tries to fit a dendrogram to the given dissimilarity matrix. The dendrogram imposes a measure of dissimilarity called the cophenetic dissimilarity on the patterns as follows. Let {C_0, C_1, ..., C_{n-1}} be the sequence of partitions in a dendrogram where C_0 is the trivial clustering that puts each pattern in its own cluster and C_{n-1} places all patterns in the same cluster. The clusters in the mth clustering are denoted {C_{m1}, C_{m2}, ..., C_{m(n-m)}}. A level function L(m) is defined on the partitions as the dissimilarity level at which clustering m first forms. Specific level functions are defined by each clustering method. The cophenetic dissimilarity d_C between patterns x_q and x_r is defined in (4.2).

    d_C(q, r) = L(k_{q,r}) ,  where k_{q,r} = min[m : (x_q, x_r) ∈ C_{mt}, some t] .    (4.2)
The cophenetic dissimilarity is a dissimilarity index in the sense of (2.1). However, it also satisfies the ultrametric inequality.

    d_C(q, r) <= max[d_C(q, s), d_C(s, r)]  for all (q, r, s)    (4.3)
The nesting of partitions to form the hierarchy and the monotonicity of the level function ensure that (4.3) is satisfied. This inequality is stricter than the triangle inequality (2.5). For the distances between patterns in a feature space to satisfy the ultrametric inequality, for example, all triples of patterns must form isosceles triangles with the one side being shorter than the two sides of equal length. This demands that many ties in proximity occur at just the right places.
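The ultrametric inequality is easy to test directly. The brute-force Python check below (assuming NumPy; the function name is illustrative) can be applied, for example, to a matrix of cophenetic dissimilarities read off a dendrogram.

import numpy as np

def is_ultrametric(D, tol=1e-9):
    # Returns True if every triple (q, r, s) satisfies the ultrametric inequality (4.3).
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    for q in range(n):
        for r in range(n):
            for s in range(n):
                if D[q, r] > max(D[q, s], D[s, r]) + tol:
                    return False
    return True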
4.3. Hierarchical Clustering Algorithms

A hierarchical clustering algorithm is a process for constructing a sequence of partitions from a proximity matrix. A particular clustering method can be implemented
by several algorithms. Some hierarchical clustering algorithms for constructing dendrograms by hand are based on graph theory [4]. This section is limited to a class of algorithms commonly known as "SAHN" (Sequential, Agglomerative, Hierarchical, Nonoverlapping) algorithms. A SAHN algorithm begins with an n x n dissimilarity matrix [d(q, r)] between patterns. All SAHN algorithms are appropriate when the dissimilarities are on a ratio scale, as with distances in a feature space. In addition to a sequence of clusterings, a SAHN algorithm creates a level function. The algorithm begins with L(0) = 0 and each pattern in a unique cluster. A cluster is denoted (s).
SAHN Algorithm for Hierarchical Clustering
1. Set the sequence number of the clustering: m = 0. Repeat the following steps until m = n.
2. Find the pair of clusters [(s), (t)] for which

       d[(s), (t)] = min{ d[(q), (r)] }    (4.4)

   where the minimum is taken over all pairs of clusters.
3. Increment m by 1. Merge clusters (s) and (t) into a single cluster to define clustering m. Define the level of this clustering as:

       L(m) = d[(s), (t)] .    (4.5)

4. Update the dissimilarity matrix by deleting the rows and columns corresponding to clusters (s) and (t) and adding a row and column for the newly formed cluster (s, t). The dissimilarity between this new cluster and an existing cluster, cluster (k), depends on the clustering method being employed and is given by (4.6).

    d[(k), (s, t)] = α_s d[(k), (s)] + α_t d[(k), (t)] + β d[(s), (t)] + γ |d[(k), (s)] - d[(k), (t)]|    (4.6)
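In practice one rarely codes the matrix-updating loop by hand; SciPy's hierarchy module implements the standard SAHN methods. The Python sketch below (the random pattern matrix is only a stand-in for real data) builds a complete-link hierarchy, cuts it into eight clusters, and recovers the cophenetic dissimilarities.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster, cophenet

X = np.random.default_rng(0).normal(size=(40, 5))   # stand-in for a 40 x 5 pattern matrix
d = pdist(X)                                         # condensed dissimilarity matrix

# method = "single", "complete", "average" (UPGMA), "weighted" (WPGMA),
# "centroid" (UPGMC), "median" (WPGMC) or "ward" selects the update rule of (4.6).
Z = linkage(d, method="complete")

labels = fcluster(Z, t=8, criterion="maxclust")      # cut the dendrogram into 8 clusters
c, coph = cophenet(Z, d)                             # cophenetic correlation and dissimilarities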
4.4. Characteristics of Hierarchies and Algorithms
Day [29] explains the computational complexities of hierarchical and partitional clustering algorithms. Some comments on practical aspects of hierarchical clustering are given below without justification. The primary application of hierarchical clustering in engineering work is to create a representative sequence of clusterings while searching for a single “good” clustering. It is seldom the case that data are hierarchically related on more than a few levels.
Clustering method    α_s                          α_t                          β                         γ
Single-link          1/2                          1/2                          0                         -1/2
Complete-link        1/2                          1/2                          0                         1/2
UPGMA                n_s/(n_s + n_t)              n_t/(n_s + n_t)              0                         0
WPGMA                1/2                          1/2                          0                         0
UPGMC                n_s/(n_s + n_t)              n_t/(n_s + n_t)              -n_s n_t/(n_s + n_t)^2    0
WPGMC                1/2                          1/2                          -1/4                      0
Ward's               (n_s + n_k)/(n_s + n_t + n_k)  (n_t + n_k)/(n_s + n_t + n_k)  -n_k/(n_s + n_t + n_k)  0
Flexible             α                            α                            1 - 2α                    0

Fig. 7. Coefficients for SAHN hierarchical clustering algorithms.
The single-link method is also called the connectedness method, the minimum method, and the nearest neighbor method. The updating equation uses the minimum of the two dissimilarities. Single-link clusters can be "straggly" since the smallest of the pairwise dissimilarities between two clusters determines when the clusters join. Other algorithms for the single-link method involve the minimum spanning tree [30] and graph theory [31]. By contrast, the complete-link method is also called the completeness method and the maximum method. The updating equation translates into taking the maximum of the dissimilarities between pairs of patterns in different clusters. Complete-link clusters often form in small clumps because the largest of the pairwise distances determines when two clusters are merged. Figure 8 shows the complete-link dendrogram for the speaker data. The visual impact of the dendrograms is obvious. It should also be clear why dendrograms are not very useful for more than 200 patterns. The scale on the left is the same as the scale of the dissimilarity matrix. The patterns were normalized by (2.9). Some of the clusters in Fig. 8 can be identified in Figs. 2 and 3. With the single-link and complete-link methods, hierarchies merge only at dissimilarities that occur in the given dissimilarity matrix. Other clustering methods do not have this property. Ward's method [32] is also called the minimum square error method because it merges, at each level, the two clusters which will minimize the square error from among all mergers of pairs of existing clusters. Ward's method does not create the clustering which minimizes square error among all clusterings with that number of clusters, however. Two characteristics of Ward's method should be noted. First, it minimizes square error as described above only when the dissimilarity index is the squared Euclidean distance. Second, the scale on the dendrogram is not the same as the dissimilarity scale between patterns. Hierarchies can merge at levels several hundred times the largest dissimilarity between patterns. The visual effect on dendrograms is to accentuate the "lifetime" of individual clusters.
Fig. 8. Complete-link dendrogram for speaker data.
The acronym "PGM" refers to "pair group method" since clusters are merged in pairs. The prefixes "U" and "W" refer to "unweighted" and "weighted" methods, respectively. A weighted method treats all clusters the same in (4.6), so patterns in small clusters achieve more individual importance than do patterns in large clusters. An unweighted method takes the size of the clusters into account, so the patterns are treated equally. The suffixes "A" and "C" refer to "arithmetic averages" and "centroids", respectively. Thus, the full name for the UPGMA method is "unweighted pair group method using arithmetic averages", sometimes called the group average method. The UPGMC and WPGMC methods have direct geometric interpretations with patterns in a feature space [27]. The UPGMA and WPGMA methods have no simple geometric interpretation. An implicit assumption underlying this entire discussion is that no two entries in any of the matrices encountered in matrix updating are the same. Ties in dissimilarity can produce unexpected and baffling dendrograms for all methods except the single-link method. Jardine and Sibson [33] show that the single-link method is the only method having a continuity property, whereby dendrograms merge smoothly as dissimilarities approach one another. Unfortunately, the single-link method has performed much worse than other methods in extracting the true structure of the data. If the dissimilarity matrix contains ties, either break the ties or use the single-link method. A clustering method is monotone if, when merging clusters (s) and (t) into cluster (s, t), then for all clusters (k) distinct from (s) and (t),

    d[(k), (s, t)] >= d[(s), (t)] .
Monotonicity permits dendrograms to be drawn as binary trees, as in Figures 6 and 8. The ultrametric inequality is satisfied by the cophenetic dissimilarity only for monotone clustering methods. Without monotonicity, crossovers or reversals can occur in which two clusters merge at a level lower than the level at which one of the two clusters was formed. This counter-intuitive phenomenon has nothing to do with the data, but is a characteristic of the clustering method. Interpreting a non-monotone dendrogram is extremely difficult. A simple condition for monotonicity, proved by Milligan [34], is that the coefficients in (4.6) satisfy (4.7).
Which clustering method is best? No clear answer exists, even though several comparative studies have been conducted [35]. If the input dissimilarity matrix satisfies the ultrametric inequality, then the single-link and complete-link dendrograms will be exactly the same. This is usually taken to mean that the data are organized in a perfect hierarchy. The degree to which the two dendrograms resemble one another is an indication of how appropriate a hierarchical structure is for the data. Defining a quantitative measure of similarity between dendrograms is a very difficult task.

5. Partitional Clustering
A dendrogram displays n clusterings, or partitions, of n patterns, but how do we find a single good clustering of the data? A partitional clustering method handles a large number of patterns and applies an objective criterion in an attempt to achieve the "best" clustering. A solution to the clustering problem is easy to state: select a criterion, evaluate it for all clusterings, and save the optimal result. This solution is impractical for two main reasons. First, the number of possible clusterings is astronomical [4]. For example, there are about 11,259,666,000 clusterings of 19 objects into 4 clusters. Evaluating a criterion for all clusterings is out of the question. Even if one could enumerate all clusterings, the choice of a single criterion function raises severe difficulties. Clusters of patterns in d dimensions can have a variety of shapes, from spherical to line-like [36]. No single criterion can search for all such shapes simultaneously. Before choosing a clustering criterion, one must determine what is meant by "cluster" in the application at hand. One general definition is that a cluster is a set of patterns whose inter-pattern distances are smaller than the distances to patterns not in the same cluster. This idea of "cluster" leads to accepting the square-error criterion stated below. One can also reasonably decide that a cluster is a region of high density in the feature space, surrounded by regions of low density. Square-error is not the only criterion. However, square-error is relatively easy to compute, makes good intuitive sense, and the clusters can be interpreted as hyperspheres in the feature space.
Given n patterns {x_1, x_2, ..., x_n} in the d-dimensional feature space, define an indicator function as follows. Let z_ik = 1 if pattern x_i is a member of the kth cluster and 0 if not. Every pattern must belong to one cluster and the clusters are labeled sequentially. The center of the kth cluster is the centroid of the patterns belonging to the cluster.

    m_k = (1/n_k) Σ_{i=1}^{n} z_ik x_i    (5.1)
Here, n_k = Σ_{i=1}^{n} z_ik is the number of patterns in cluster k. The square-error e_k^2 for cluster k is the sum of the squared Euclidean distances between the patterns in cluster k and m_k.

    e_k^2 = Σ_{i=1}^{n} z_ik ||x_i - m_k||^2    (5.2)
The square-error for the entire clustering is the sum of the square-errors for the individual clusters.

    E_K^2 = Σ_{k=1}^{K} e_k^2    (5.3)
The clustering problem can be stated as the problem of minimizing E_K^2, for K fixed, by selecting the binary weights {z_ik} in such a way that the n x K matrix of weights has exactly one "1" in each row and at least one "1" in each column. That is, each pattern can belong to only one cluster and no cluster can be empty. In fuzzy clustering [7], z_ik is taken to be the degree of belonging for pattern i in cluster k, and z_ik in (5.2) is replaced by z_ik^q, where q is chosen empirically and is usually 2. One can approach this optimization in many ways, including simulated annealing [24]. Gordon and Henderson [37] translate the constrained optimization problem into an unconstrained problem to simplify the solution. The computational demands of formal minimization techniques have encouraged the development of simple, heuristic algorithms [1,4]. The most popular of these is the K-means method.
K-Means Algorithm for Partitional Clustering
1. Select an initial clustering of the n patterns with K clusters and initialize the cluster centers. Repeat until the clustering stabilizes:
2. Assign cluster labels to all patterns by finding the closest cluster centers.
3. Compute cluster centers for all clusters with (5.1). Repeat Steps 2 and 3 until no cluster labels change.
4. Apply heuristic splitting and lumping criteria. End the repeat.
5. Compute statistics of the final clustering.
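A bare-bones Python version of Steps 2 and 3 is sketched below (assuming NumPy; splitting, lumping and outlier removal from Step 4 are omitted, and the seeding and names are illustrative). As noted next, the algorithm would normally be run from several starting configurations and the run with the smallest square-error kept.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]     # K random seed patterns
    for _ in range(n_iter):
        # Step 2: label each pattern with its closest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the cluster centers as in (5.1).
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):                  # labels have stabilized
            break
        centers = new_centers
    square_error = sum(((X[labels == k] - centers[k]) ** 2).sum() for k in range(K))
    return labels, centers, square_error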
The clustering with the smallest square-error is retained as the solution. Although the convergence of the algorithm can be proven [38], no assurance that the solution is optimal can ever be given. It is easy to propose data that will defeat the algorithm [4]. One hopes that running the algorithm several times for different starting configurations leads to a reasonable solution. The problem of choosing the number of clusters and of formally validating the results is discussed in Section 6. Steps 2 and 3 constitute a K-means pass. An alternative is to recompute the cluster centers after each cluster label is changed. Step 4 changes the number of clusters by merging small clusters, removing outliers, or splitting large clusters. Criteria for these operations are chosen heuristically. The initial partition can be chosen in several ways [1]. For example, one can choose K patterns at random as seed points. One can also choose K patterns that are reasonably separated from one another as seed points. In any event, the algorithm should be run with different seed points to seek the best clustering.

Any program for implementing a partitional clustering method involves several parameters. The primary ones are K and any parameters associated with splitting clusters, lumping clusters, and identifying outliers. An exception is the CLUSTER program [4], which creates a sequence of clusterings and uses no parameters. Only experience can dictate selection of these parameters. The manner in which the data are normalized must also be considered. Normalization (2.9) can reduce the effects of spreads in the individual variables and equalize the contributions of all variables to Euclidean distance. The result of any partitional clustering algorithm is a set of tables of numbers showing the cluster centers, the square-errors for the individual clusters, the covariance matrices for the clusters, and various statistics which try to quantify the characteristics of the clustering [4]. Clustering algorithms are often run to see if patterns are clustered according to some a priori category information, and a cluster-by-category table is displayed to see if the clusters correspond to categories in any way. The cluster centers can be taken as a sampling of the original data and one can represent the data by projecting the cluster centers to two dimensions. Figure 9 shows the cluster-by-category table from the CLUSTER program with the speaker data, after normalizing by (2.9). Also shown are the square-errors for all clusters. Comparing the table entries to Figs. 2, 3, and 8 may justify some of the entries in the table. For example, categories 1, 2, 3, and 5 are in unique clusters. However, cluster 3 has a much larger square-error than clusters 1, 2, and 5. This suggests that clusters 1, 2, and 5 have smaller dispersions than cluster 3.

6. Validation and Interpretation
Validation refers to the objective assessment of a clustering structure so as to determine whether a structure is meaningful. The structures under consideration are hierarchies, clusterings, and individual clusters. The sense of validity explained
here calls a structure valid if it is unusual under the circumstances of the study. That is, a structure is valid if it cannot reasonably be assumed to have occurred by chance or to be an artifact of a clustering algorithm. Validation is accomplished by carefully applying statistical methods and testing hypotheses.
Fig. 9. Cluster-by-category table for the speaker data produced by the CLUSTER program. The square-errors of the eight clusters are 274.1, 139.5, 441.4, 177.4, 192.4, 438.3, 296.9, and 221.5.
6.1. An Attitude Towards Validation
One might argue that a structure is "valid" if it makes sense or is useful or can be interpreted or provides new insight. The list of ways to justify a result is endless. This section considers only objective measures of validity that can be tested statistically. Formal testing may not be required in every application. One might use experience and judgement to interpret a clustering structure or use clustering in an exploratory manner. In addition to the three types of structure (hierarchy, clustering, cluster), three types of validation studies called external, internal, and relative can be defined [4]. An index must be chosen to reflect the sense of validity being examined, with the structure and type of study in mind. An external assessment of validity objectively compares the recovered structure to an a priori structure and tries to quantify the match between the two. For example, one might test how closely cluster labels match category labels. An internal examination of validity uses no a priori information but tries to determine if the structure is intrinsically appropriate for the data. For example, one might try to determine if a cluster derived from the single-link method is unusually compact or isolated, as compared to other single-link clusters of the same size in random data. A relative test compares two structures and measures their relative merit. For example, one might compare a 4-cluster clustering to a 5-cluster clustering without using any a priori information. Several indices have been proposed for this purpose and some are explained in Section 6.4. The
paradigm for testing the validity of a clustering structure is summarized below.
Validity Paradigm
1. Identify the clustering structure (hierarchy, clustering, cluster) and the type of validation (external, internal, relative).
2. Select an index.
3. Select a hypothesis of "no structure".
4. Obtain (by theory or simulation) the baseline distribution of the index under the "no structure" hypothesis.
5. Compute the index for the structure being tested.
6. Formally test the hypothesis of "no structure" by determining whether the observed index is "unusual".
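Step 4 of the paradigm is usually the expensive part. The Python sketch below (assuming NumPy and scikit-learn, both my choices rather than the chapter's) estimates a baseline distribution by Monte Carlo sampling under one possible "no structure" hypothesis: data drawn uniformly over the bounding box of the observed patterns, clustered by K-means, with the square-error E_K^2 as the index.

import numpy as np
from sklearn.cluster import KMeans

def baseline_square_error(X, K=4, n_monte_carlo=200, seed=0):
    # Distribution of the (best-of-n_init) K-means square-error on random data
    # generated over the bounding box of the observed pattern matrix X.
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    values = []
    for _ in range(n_monte_carlo):
        R = rng.uniform(lo, hi, size=X.shape)
        km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(R)
        values.append(km.inertia_)               # within-cluster square-error E_K^2
    return np.array(values)

# The observed clustering is "unusual" (Step 6) if its square-error falls below,
# say, the 5th percentile of this baseline distribution.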
Establishing the baseline distribution might require extensive statistical sampling such as Monte Carlo estimation or bootstrapping. Although the same index can apply to all three types of tests, the circumstances of the application are very different. External tests of validity are generally much easier to apply than internal tests because hypotheses of randomness are easier to propose and baseline distributions are easier to derive than for internal tests. One example of an external test is the Bailey profile [39,4], which measures the validity of an individual cluster that is defined before the analysis begins. The natural indices of validity for an individual cluster are simple measures of compactness and isolation derived from graph theory. Bailey profiles are restricted to ordinal data and use the hypergeometric distribution to derive baseline distributions. Baseline distributions of compactness and isolation indices cannot be derived when the cluster is obtained by applying a clustering method because the cluster depends on the method itself, among other things. Even though data are purely random, a diligent clustering method might uncover an unusual cluster. An internal test of validity should recognize such clusters as being artifacts. Three validation problems of practical interest are discussed in this section.

6.2. External Tests of Validity for a Partition
Suppose two partitions of n patterns are to be compared. One is from category information, obtained before the analysis is begun. The other is obtained from some clustering method. How well does the clustering match the categories? The labels themselves are not important, so renaming categories or clusters cannot affect the degree of match. Hubert and Arabie [40] studied the Rand index as a means for assessing the match between two such clusterings. Let a denote the number of pairs of objects that are in the same cluster in both clusterings and let d denote the number of pairs that are in different groups in both clusterings. There are
n(n - 1)/2 pairs of objects to check. The Rand index [41] measures the degree of match.

    R = (a + d) / [n(n - 1)/2]
Other statistics have been suggested for this purpose but they are linear functions of one another [4]. A clustering is termed "valid" if R is unusually high, as measured with respect to some baseline distribution. To make R less sensitive to problem parameters, Hubert and Arabie [40] corrected it for chance. The hypergeometric distribution was applied to find E(R), the expected value of R under the baseline distribution. The maximum possible value of R is 1, so the corrected Rand index is:

    R' = [R - E(R)] / [1 - E(R)] .
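The Rand index is straightforward to compute by pair counting, and a chance-corrected version is available in scikit-learn as adjusted_rand_score. The Python sketch below uses small hypothetical label vectors purely for illustration.

from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def rand_index(u, v):
    # Fraction of the n(n-1)/2 pairs treated the same way by both partitions:
    # the "a" pairs (together in both) plus the "d" pairs (apart in both).
    pairs = list(combinations(range(len(u)), 2))
    agree = sum((u[i] == u[j]) == (v[i] == v[j]) for i, j in pairs)
    return agree / len(pairs)

categories = [1, 1, 1, 2, 2, 2]       # hypothetical a priori category labels
clusters   = [1, 1, 2, 2, 2, 2]       # hypothetical clustering labels
print(rand_index(categories, clusters))
print(adjusted_rand_score(categories, clusters))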
Detailed formulas are given elsewhere [40,4]. Using the Rand index, corrected or not, in a test for external validity requires that a baseline distribution be developed. The variance of R' is known when one of the partitions is assumed to be assigned at random. However, the full baseline distribution of R' is required to formally test this hypothesis of randomness.

6.3. Internal Tests of Validity for a Partition
The paradigm for external tests of validity is the same as for internal tests. The hypothesis of randomness and the baseline distributions differ. This has led to some confusion in the literature. For example, the distributions of certain F-statistics and chi-square statistics are listed in standard books on multivariate statistics. These distributions assume that the groups have been chosen without reference to the data, as when one assigns a priori category labels. These distributions are not applicable to the internal validation of clusterings found by sifting through the data. Using them can create misleading results [35]. Milligan [42] compared the performances of 30 internal indices of validity. Three primary difficulties arise in obtaining the baseline distribution needed for an internal test of validity. The first difficulty lies in choosing a hypothesis of "no structure" or "randomness". To create purely random data, one must choose the region of space in which the random data are to be generated. To be fair, this region must match the characteristics of the data. The second difficulty is the necessity to match all the data parameters. This implies that one must estimate the baseline distribution anew in every application, usually by Monte Carlo sampling. Bock [43] has derived asymptotic distributions for some indices, but it is seldom clear when the assumptions of the derivation are satisfied. The third difficulty is a bit subtle. It is not fair to compare a clustering obtained from a clustering algorithm to just any clustering of random data. One should compare it to the best clustering of random data. That is, before calling a result valid, one should be reasonably certain that the same result could not be obtained
from any random data, not simply that the result could not be obtained from some random data. Engelman and Hartigan [44] demonstrated this point in estimating the distribution of the ratio of between to within scatter for one-dimensional data. Given a set of random data (from a normal distribution), they found the ratio for the best separation of the data into two clusters. They published the percentage points for the distribution of the best value of the ratio. However, the result is only applicable to one-dimensional data. The methodology cannot be extended to more than one dimension easily because the number of clusterings to be examined increases exponentially with dimension. These difficulties may explain why more internal validation is not performed.

6.4. Relative Tests of Validity - How Many Clusters?
The problem of determining the "true" number of clusters has been called the fundamental problem of cluster validity [45,46]. This question is particularly important in image segmentation, where the number of categories, such as land-use categories, is not known in advance. The question "How many clusters are there in the data?" can be asked in at least three meaningful ways.

• Do the data contain the number of clusters I expect? This clearly calls for an external test of validation and one might use the procedures of Section 6.2.
• Is it unusual to find this many clusters with data of this sort? This somewhat vague statement can best be answered with an internal criterion, as in Section 6.3. The basis for comparison must be defined and a baseline distribution must be derived - two difficult tasks.
• Which of a few clusterings is best? This is more difficult to answer than the first question, but is more specific than the second. The question implies that several clusterings, such as those derived by cutting a dendrogram at several levels, or running a partitional clustering algorithm several times, be considered as candidates. The number of clusters in the best of these clusterings is taken to be the "correct" number.
This section considers the third question by examining a sequence of clusterings as the number of clusters changes monotonically. For example, one might seek a stopping rule for choosing the best level for cutting a dendrogram. What is a good index? One possibility is to pick the clustering that minimizes square-error (5.3). Square-error is a strong function of the dimensionality, the number of patterns, and the number of clusters [4]. As the number of clusters increases, the square-error has a tendency to decrease whether or not one clustering is better than another. Milligan and Cooper [47] compared 30 indices as stopping rules by applying each to a wide variety of data and ranking them according to how frequently each found the correct answer. The correct answer was known because they generated their own data. Any index that performed well should, logically, be trusted with real data. Dubes [48] made more detailed comparisons of two other indices. Zhang
and Modestino [49] integrate formal estimation of the number of clusters into image segmentation. Three representative indices are defined below; n is the number of patterns, K is the number of clusters in the clustering being evaluated, and E_K^2 is the square-error of the clustering (5.3). To estimate the number of clusters, plot the chosen index as K varies and look for either a peak, a valley, or a knee in the curve, depending on the index. Two underlying assumptions are that the data are not random and at least two clusters exist. The Calinski-Harabasz index, CH(K), was the best of the 30 indices tested by Milligan and Cooper [13].
The index will always be positive and will be zero for K = 0. Its upper bound depends on problem parameters. The value of K that maximizes CH(K) is chosen to estimate K. This index normalizes the square-error and tends to depend less on problem parameters than does the square-error itself.

The C index is a normalized form of the Γ statistic [50] proposed to measure the correlation between spatial observations and time. Let c(q, r) be 1 if patterns x_q and x_r are in the same cluster and 0 if not. Let d(q, r) denote the dissimilarity, or Euclidean distance, between the two patterns. The "raw" Γ statistic is:

    Γ = Σ_{q=1}^{n-1} Σ_{r=q+1}^{n} c(q, r) d(q, r)
The dissimilarities need not be distances. Let a_K be the number of pairs of patterns in which both patterns are in the same cluster. Define the following two statistics as the smallest and largest possible values of Γ for a clustering of the kind being examined.

    min(Γ) = sum of the a_K smallest dissimilarities
    max(Γ) = sum of the a_K largest dissimilarities
The C index is defined as:

    C(K) = [Γ - min(Γ)] / [max(Γ) - min(Γ)] .
The range of the C index is limited to [0, 1]. The value of K that minimizes C(K) estimates the number of clusters.

The Goodman-Kruskal γ statistic [51,4] measures the rank correlation between the Euclidean distances d(q, r) and the function f(q, r) for the clustering being evaluated, where f(q, r) = 1 - c(q, r) is 1 if x_q and x_r are in different clusters and 0 if in the same cluster. In standard notation, S(+) denotes the number of concordant quartets and S(-), the number of discordant quartets. This requires some explanation. A "quartet" is two pairs of numbers. One is a pair of dissimilarities, say [d(q, r), d(s, t)], and the other is the corresponding pair of indicator values, [f(q, r), f(s, t)]. A quartet is concordant either if d(q, r) < d(s, t) and f(q, r) < f(s, t) or if d(q, r) > d(s, t) and f(q, r) > f(s, t). A quartet is discordant either if d(q, r) < d(s, t) and f(q, r) > f(s, t) or if d(q, r) > d(s, t) and f(q, r) < f(s, t). If either pair is tied, the quartet is neither concordant nor discordant. Then,

    γ(K) = [S(+) - S(-)] / [S(+) + S(-)] .
This index is limited to the range [-1, 1]. The value of K that maximizes γ(K) estimates the number of clusters.

Studies [13] have shown that some indices proposed in the literature perform very poorly, while some perform very well. It is impossible to claim optimality for any of them, because the characteristics of the data can affect performance in unknown ways.
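A typical use of such indices is to sweep K over a small range and look for a peak. The Python sketch below (assuming NumPy and scikit-learn; the random pattern matrix is only a stand-in, and calinski_harabasz_score implements the CH-type ratio of between- to within-cluster scatter) illustrates the idea with K-means clusterings.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.random.default_rng(1).normal(size=(40, 5))     # stand-in for a pattern matrix

for K in range(2, 9):
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    print(K, calinski_harabasz_score(X, labels))
# Choose the K at which the index peaks (or shows a clear knee).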
7. Final Comments

Cluster analysis is a valuable tool for organizing, summarizing, and exploring multivariate data. Among the problems that appear in applications are the choice of a clustering criterion that recognizes only clusters of interest, and the choice of a clustering algorithm. Validating the results objectively is the most difficult problem of all. Notwithstanding these real problems, cluster analysis has proved enlightening, especially when invoked by a careful practitioner who is aware of inherent limitations and has the proper computer tools.
Acknowledgements
I acknowledge the support of the National Science Foundation, most recently through grant IRI-8901513 and grant CDA-8806599.

References
[1] M. R. Anderberg, Cluster Analysis for Applications (Academic Press, New York, NY, 1973).
[2] A. D. Gordon, Classification (Chapman and Hall, London, 1981).
[3] J. A. Hartigan, Clustering Algorithms (John Wiley & Sons, New York, NY, 1975).
[4] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data (Prentice-Hall, Englewood Cliffs, NJ, 1988).
[5] L. Legendre and P. Legendre, Numerical Ecology (Elsevier Scientific, Amsterdam, 1983).
[6] J. M. Jolion, P. Meer and S. Bataouche, Robust clustering with applications in computer vision, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 791-802.
[7] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (Plenum Press, New York, NY, 1981).
[8] X. L. Xie and G. Beni, A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 841-847.
[9] R. S. Michalski and R. E. Stepp, Automated construction of classifications: Conceptual clustering versus numerical taxonomy, IEEE Trans. Pattern Anal. Mach. Intell. 5 (1983) 396-410.
[10] G. Matthews and J. Hearne, Clustering without a metric, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 175-184.
[11] Institute of Electrical and Electronics Engineers, Special Issue of Proceedings on Neural Networks, I: Theory and Modeling, Sept. 1990.
[12] J. C. Gower and P. Legendre, Metric and Euclidean properties of dissimilarity coefficients, J. Classification 3 (1986) 5-48.
[13] G. W. Milligan and M. C. Cooper, A study of standardization of variables in cluster analysis, J. Classification 5 (1988) 181-204.
[14] K. Pettis, T. Bailey, A. K. Jain and R. Dubes, An intrinsic dimensionality estimator from near-neighbour information, IEEE Trans. Pattern Anal. Mach. Intell. 1 (1979) 25-37.
[15] N. Wyse, R. Dubes and A. K. Jain, A critical evaluation of intrinsic dimensionality algorithms, in E. S. Gelsema and L. N. Kanal (eds.), Pattern Recognition in Practice (North-Holland, Amsterdam, 1980) 415-425.
[16] K. Falconer, Fractal Geometry (John Wiley & Sons, New York, NY, 1990).
[17] J. Theiler, Estimating fractal dimension, J. Opt. Soc. Am. A 7 (1990) 1055-1073.
[18] B. S. Everitt, Graphical Techniques for Multivariate Data (Elsevier North-Holland, New York, NY, 1978).
[19] R. A. Becker, J. M. Chambers and A. R. Wilks, The New S Language (Wadsworth & Brooks/Cole, Pacific Grove, CA, 1988).
[20] T. Okada and S. Tomita, An optimal orthonormal system for discriminant analysis, Pattern Recogn. 18 (1985) 139-144.
[21] J. T. Tou and R. P. Heydorn, Some approaches to optimum feature extraction, in J. T. Tou (ed.), Computer and Information Sciences II (Academic Press, New York, NY, 1967) 57-89.
[22] J. W. Sammon, A nonlinear mapping for data structure analysis, IEEE Trans. Comput. 18 (1969) 401-409.
[23] J. B. Kruskal, Multidimensional scaling and other methods for discovering structure, in K. Enslein, A. Ralston, and H. S. Wilf (eds.), Statistical Methods for Digital Computers (John Wiley & Sons, New York, NY, 1977) 296-339.
[24] R. W. Klein and R. C. Dubes, Experiments in projection and clustering by simulated annealing, Pattern Recogn. 22 (1989) 213-220.
[25] H. Chernoff, The use of faces to represent points in k-dimensional space graphically, J. Am. Stat. Assoc. 68 (1973) 361-368.
[26] A. D. Gordon, Hierarchical classification, in P. Arabie and L. Hubert (eds.), Clustering and Classification (World Scientific, Singapore, 1992).
[27] P. H. A. Sneath and R. R. Sokal, Numerical Taxonomy (W. H. Freeman and Company, San Francisco, CA, 1973).
[28] F. Murtagh, Counting dendrograms: A survey, Disc. Appl. Math. 7 (1984) 191-199.
[29] W. H. E. Day, Complexity theory: An introduction for practitioners of classification, in P. Arabie and L. Hubert (eds.), Clustering and Classification (World Scientific, Singapore, 1992).
[30] J. C. Gower and G. J. S. Ross, Minimum spanning trees and single-linkage cluster analysis, Appl. Stat. 18 (1969) 54-64.
[31] L. J. Hubert, Some applications of graph theory to clustering, Psychometrika 39 (1974) 283-309.
[32] J. H. Ward, Hierarchical grouping to optimize an objective function, J. Am. Stat. Assoc. 58 (1963) 236-244.
[33] N. Jardine and R. Sibson, Mathematical Taxonomy (John Wiley & Sons, New York, NY, 1971).
[34] G. W. Milligan, Ultrametric hierarchical clustering algorithms, Psychometrika 44 (1979) 343-346.
[35] G. W. Milligan, A review of Monte Carlo tests of cluster analysis, Multivar. Behav. Res. 16 (1981) 379-407.
[36] C. T. Zahn, Graph-theoretical methods for detecting and describing gestalt clusters, IEEE Trans. Comput. 20 (1971) 68-86.
[37] A. D. Gordon and J. T. Henderson, Algorithm for Euclidean sum of squares classification, Biometrics 33 (1977) 355-362.
[38] S. Z. Selim and M. A. Ismail, K-means type algorithms: A generalized convergence theorem and characterization of local optimality, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 81-87.
[39] T. A. Bailey and R. C. Dubes, Cluster validity profiles, Pattern Recogn. 15 (1982) 61-83.
[40] L. J. Hubert and P. Arabie, Comparing partitions, J. Classification 2 (1985) 193-218.
[41] W. M. Rand, Objective criteria for the evaluation of clustering methods, J. Am. Stat. Assoc. 66 (1971) 846-850.
[42] G. W. Milligan, A Monte Carlo study of 30 internal criterion measures for cluster analysis, Psychometrika 46 (1981) 187-195.
[43] H. H. Bock, On some significance tests in cluster analysis, J. Classification 2 (1985) 77-108.
[44] L. Engelman and J. A. Hartigan, Percentage points of a test for clusters, J. Am. Stat. Assoc. 64 (1969) 1647-1648.
[45] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis (John Wiley & Sons, New York, NY, 1973).
[46] B. S. Everitt, Unresolved problems in cluster analysis, Biometrics 35 (1979) 169-181.
[47] G. W. Milligan and M. C. Cooper, An examination of procedures for determining the number of clusters in a data set, Psychometrika 50 (1985) 159-179.
[48] R. C. Dubes, How many clusters are best?-An experiment, Pattern Recogn. 20 (1987) 645-663.
[49] J. Zhang and J. W. Modestino, A model-fitting approach to cluster validation with application to stochastic model-based image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 1007-1009.
[50] L. J. Hubert and J. Schultz, Quadratic assignment as a general data-analysis strategy, British J. Math. Stat. Psychol. 29 (1976) 190-241.
[51] L. A. Goodman and W. H. Kruskal, Measures of association for cross-classifications, J. Am. Stat. Assoc. 49 (1954) 732-764.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 33-60
Eds. C. H. Chen, L. F. Pau and P. S. P. Wang
© 1998 World Scientific Publishing Company
CHAPTER 1.2
STATISTICAL PATTERN RECOGNITION
KEINOSUKE FUKUNAGA
School of Electrical Engineering, Purdue University, West Lafayette, IN 47907, USA

In the introductory Section 1, the problems of statistical pattern recognition are defined, and a flow chart is presented to show how a classifier ought to be designed. In Section 2, the theoretically optimal (Bayes) classifier and its variations are introduced. The discussion is extended to show how the resulting classification (Bayes) error can be computed for some limited cases. Also, the upper and lower bounds of the Bayes error are shown. Section 3 discusses how a classifier is designed in practice. Linear, quadratic and piecewise classifiers are included. These are based on the expected vectors and covariance matrices of the underlying probability distributions. In practice, these vectors and matrices are not given, and must be estimated from an available set of samples. Consequently, the designed classifier varies with the samples and the classification error becomes a random variable. Section 4 discusses how the number of available samples affects the classification performance, and also how to allocate the samples between design and test. In Section 5, nonparametric techniques are presented. They are needed in the estimation of the Bayes error and the structure analysis of data, where a mathematical formula such as Gaussianness cannot be applied. Both the Parzen and k nearest neighbor approaches are discussed. Feature extraction and clustering are discussed in other chapters.

Keywords: Statistical pattern recognition, classifier, probability of error, hypothesis tests, effect of sample size, nonparametric.
1. Introduction

The purpose of statistical pattern recognition is to determine to which category or class a given sample belongs. Through observation and measurement processes, we obtain a set of numbers which make up the measurement vector. The vector is a random vector and its density function depends on its class. The design of a classifier consists of two parts. One is to collect data samples from various classes and to find the boundaries which separate the classes. This process is called classifier design, training, or learning. The other is to test the designed classifier by feeding it samples whose class identities are known. Figure 1 shows a flow chart of how a classifier is designed [1]. After data is gathered, samples are normalized and registered. Normalization and registration are very important processes for a successful classifier design. However, different data require different normalization and registration, and it is difficult to discuss
Fig. 1. Process of designing a classifier. (From [1], reprinted with permission.)
these subjects in a generalized way. Therefore, these subjects are not included in this chapter. After normalization and registration, the class separability of the data is measured. This is done by estimating the Bayes error, the overlap among different class densities, in the measurement space. Since it is not appropriate at this stage to assume a mathematical form for the data structure, the estimation procedure must be nonparametric. If the Bayes error is larger than the final classifier error we wish to achieve (denoted by ε0), it means the data does not carry enough classification information to meet the specification. Selecting features and designing a classifier in the later stages merely increase the classification error. Therefore, we must go back to data gathering and seek better measurements. Only when the estimate of the Bayes error is less than ε0 may we proceed to the next stage of data structure analysis, in which we study the characteristics of the data. All kinds of data analysis techniques are used here. They include feature extraction, clustering, statistical tests, modeling, and so on. Note that each time a
feature set is chosen, the Bayes error in the feature space is estimated and compared with the one in the measurement space. The difference between them indicates how much classification information is lost in the feature selection process. Once the structure of the data is thoroughly understood, the data dictate which classifier must be adopted. Our choice is normally either a linear, quadratic, or piecewise classifier, and rarely a nonparametric classifier. Nonparametric techniques are required in off-line analyses to carry out many important operations such as the estimation of the Bayes error and data structure analysis. However, they are often too complex for any on-line operation. After a classifier is designed, the classifier must be evaluated. The resulting error is compared with the Bayes error in the feature space. The difference between these two errors indicates how much the error is increased by adopting the classifier. If the difference is unacceptably high, we must re-evaluate the design of the classifier. At last, the classifier is tested in the field. If the classifier does not perform as expected, the database used for designing the classifier is different from the test data in the field. Therefore, we must expand the database and design a new classifier. In this chapter, only the boldfaced portions of Fig. 1 will be discussed briefly. More details are found in [1]. Clustering and feature extraction are discussed in other chapters. Also, unless otherwise stated, only the two-class problem is discussed in this chapter, although the results for the more general multi-class problem are listed whenever available. The notations frequently used in this chapter are summarized as follows:
n : Dimensionality
L : Number of classes
N : Number of total samples
N_i : Number of class i samples
ω_i : Class i
P_i : A priori probability of ω_i
X : Vector
X : Random vector
p_i(X) : Conditional density function of ω_i
p(X) : Density function
q_i(X) : A posteriori probability of ω_i given X
M_i : Expected vector of ω_i
M : Expected vector
C_i : Covariance matrix of ω_i
C : Covariance matrix
2. The Bayes Classification

In this section, the classification algorithms and the resulting errors are presented, based on the assumption that p_i(X) and P_i are known. The classification algorithms are also known as hypothesis tests.

2.1. Likelihood Ratio Classifier
The probability of the classification error can be minimized by classifying X into either ω_1 or ω_2, depending on whether q_1(X) > q_2(X) or q_1(X) < q_2(X) is satisfied. That is,

q_1(X) \gtrless_{\omega_2}^{\omega_1} q_2(X)   (Bayes classifier).   (2.1)

The resulting risk at X is

r^*(X) = \min[q_1(X), q_2(X)]   (Bayes risk).   (2.2)

The overall error is obtained by taking the expectation of (2.2) over X:

\varepsilon^* = E\{r^*(X)\} = P_1 \varepsilon_1 + P_2 \varepsilon_2   (Bayes error)   (2.3)

where \varepsilon_1 = \int_{\Gamma_2} p_1(X)\,dX and \varepsilon_2 = \int_{\Gamma_1} p_2(X)\,dX are called the ω_1 and ω_2 errors, respectively, and \Gamma_i is the region where X is classified to ω_i. For the multiclass problem,

q_k(X) = \max_i q_i(X) \Rightarrow X \in \omega_k   (2.4)

\varepsilon^* = E\{1 - \max_i q_i(X)\}.   (2.5)

A more convenient form of the Bayes classifier is obtained by applying the Bayes theorem, q_i(X) = P_i p_i(X)/p(X), and taking the negative logarithm:

h(X) = -\ln[p_1(X)/p_2(X)] \gtrless_{\omega_1}^{\omega_2} \ln[P_1/P_2].   (2.6)

The h(X) combined with a threshold is called the likelihood ratio classifier. When X is Gaussianly distributed with M_i and C_i for ω_i,

-\ln p_i(X) = \frac{1}{2}(X - M_i)^T C_i^{-1}(X - M_i) + \frac{1}{2}\ln|C_i| + \frac{n}{2}\ln 2\pi.   (2.7)
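As an illustration of (2.6)-(2.7), the following sketch (not from the chapter) evaluates the Gaussian log-likelihood ratio with SciPy; the function name and the equal default priors are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def likelihood_ratio_classify(X, M1, C1, M2, C2, P1=0.5, P2=0.5):
    """Bayes (likelihood ratio) rule of (2.6): decide omega_1 when h(X) < ln(P1/P2)."""
    # h(X) = -ln p1(X) + ln p2(X), with each -ln p_i(X) given by (2.7)
    h = (-multivariate_normal.logpdf(X, mean=M1, cov=C1)
         + multivariate_normal.logpdf(X, mean=M2, cov=C2))
    return np.where(h < np.log(P1 / P2), 1, 2)
```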
The threshold of the classifier can be changed according to various requirements as follows.

Bayes classifier for minimum cost. Let c_{ji} be the cost of classifying an ω_i sample into ω_j. The expected cost c_i(X) of classifying X into ω_i is the average of these costs weighted by the a posteriori probabilities q_j(X) (2.8).
The classification rule and the resulting cost are

c_k(X) = \min_i c_i(X) \Rightarrow X \in \omega_k   (2.9)

c^* = E\{\min_i c_i(X)\}.   (2.10)

For the two-class problem,

h(X) = -\ln[p_1(X)/p_2(X)] \gtrless_{\omega_1}^{\omega_2} \ln[(c_{12} - c_{11})P_1/(c_{21} - c_{22})P_2].   (2.11)

This is a likelihood ratio classifier with a new threshold.

Neyman-Pearson test. Let ε_1 and ε_2 be the error probabilities from ω_1 and ω_2, as shown in (2.3). The likelihood ratio classifier minimizes ε_1 subject to ε_2 being equal to a given constant, say ε_0. The threshold value must be selected to satisfy ε_2 = ε_0 and is normally determined empirically. A plot of ε_1 vs. ε_2 for the likelihood ratio classifier with varying threshold is called the operating characteristics and is frequently used as a visual aid to see how the two errors are traded by changing the threshold. In the Neyman-Pearson test, ε_2 = ε_0 is the operating point and the corresponding threshold value is chosen. When ω_2 represents a target to be identified against the other class (ω_1), ε_1, ε_2, and 1 − ε_2 are called the false alarm, the leakage, and the detection probability, respectively.

Minimax test. We can make the expected cost invariant even when P_i varies unexpectedly after the classifier has been implemented. This is done by selecting the threshold of the likelihood ratio classifier to satisfy

(c_{11} - c_{22}) + (c_{12} - c_{11})\varepsilon_1 - (c_{21} - c_{22})\varepsilon_2 = 0.   (2.12)

In particular, when c_{11} = c_{22} and c_{12} - c_{11} = c_{21} - c_{22}, the threshold is chosen to satisfy ε_1 = ε_2. This classifier eliminates the possibility of having an unexpectedly large error due to an unexpected variation of P_i. In all of these three cases, the likelihood ratio classifier is commonly used, and only the threshold varies. This may be interpreted as replacing the true P_i's of (2.6) by artificial P_i's. Therefore, theoretically, all these cases may be treated as the Bayes classifier, assigning a different meaning to P_i for each application.

Some other subjects related to hypothesis tests are as follows.

Independent measurement sets. When X consists of statistically independent measurement sets, X^T = [X_1^T X_2^T ... X_m^T], the Bayes classifier becomes

h(X) = \sum_{j=1}^{m} h_j(X_j) = -\sum_{j=1}^{m} \ln[p_1(X_j)/p_2(X_j)] \gtrless_{\omega_1}^{\omega_2} \ln[P_1/P_2].   (2.13)
This suggests how to combine, for classification, seemingly unrelated information such as radar and infrared signatures.
One-class classifier. When one clearly defined class is classified against all other (sometimes not well-defined) possibilities, the boundary may be determined from the knowledge of one class only. A typical example is a hyperspherical boundary around a Gaussian distribution with M = 0 and C = I. This technique can work when the dimensionality of the data, n, is very low (such as 1 or 2). However, as n increases, the error of this technique increases significantly. The mapping from the original n-dimensional space to a one-dimensional distance space destroys valuable classification information which existed in the original space. For an example with n = 64, the error increases from 0.1% to 8.4%.

Reject. When X falls in the region where the Bayes risk r^*(X) is high, we may decide not to classify the sample. This concept is called reject. The reject region \Gamma_r(t) and the resulting probability of rejection R(t) are specified by the threshold t as

\Gamma_r(t) = \{X : r^*(X) > t\}   (2.14)

R(t) = \Pr\{r^*(X) > t\} = 1 - \Pr\{r^*(X) \le t\}.   (2.15)

Note that \Pr\{r^*(X) \le t\} is the distribution function of the random variable r^*(X). The probability of error with reject is the integration of r^*(X) p(X) outside \Gamma_r, and thus depends on t. The error may be evaluated directly from the reject probability as

\varepsilon(t) = -\int_0^t \tau\, dR(\tau).   (2.16)

The error decreases as the reject probability increases, and vice versa. A plot of ε(t) vs. R(t) is called the error-reject curve, and is used as a visual aid to see how ε(t) and R(t) are traded by changing the threshold t. With the largest possible t = 1 − 1/L for the L-class problem, R(1 − 1/L) = 0 and ε(1 − 1/L) is equal to the Bayes error, ε^*.
Model validation. The distribution function of the random variable r^*(X) is a simple and good parameter to characterize the classification environment, determining both ε(t) and R(t). Thus, when samples are drawn and a mathematical model is assumed, two distribution functions of r^*(X) may be obtained: one empirically from the samples and the other theoretically from the model. The comparison of these two distribution functions can be used to test the validity of the mathematical model.

2.2. The Bayes Error

The Bayes error of (2.3) is generally hard to compute, except for the following two cases.

(1) Gaussian X with C_1 = C_2 = C. For this case, the Bayes classifier becomes a linear function of X as

h(X) = (M_2 - M_1)^T C^{-1} X + \frac{1}{2}(M_1^T C^{-1} M_1 - M_2^T C^{-1} M_2) \gtrless_{\omega_1}^{\omega_2} t.   (2.17)

Since X is Gaussianly distributed, h(X) is also a Gaussian random variable. Therefore,

\varepsilon_1 = \int_t^{\infty} N_h(m_1, \sigma_1^2)\, dh, \qquad \varepsilon_2 = \int_{-\infty}^{t} N_h(m_2, \sigma_2^2)\, dh   (2.18)

where

m_1 = E\{h(X)|\omega_1\} = -\frac{1}{2}(M_2 - M_1)^T C^{-1}(M_2 - M_1)   (2.19)

m_2 = E\{h(X)|\omega_2\} = +\frac{1}{2}(M_2 - M_1)^T C^{-1}(M_2 - M_1)   (2.20)

\sigma_i^2 = \mathrm{Var}\{h(X)|\omega_i\} = (M_2 - M_1)^T C^{-1}(M_2 - M_1) \quad (i = 1, 2).   (2.21)
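Because h(X) of (2.17) is Gaussian under both classes, the Bayes error of (2.18)-(2.21) reduces to evaluations of the normal distribution function. The sketch below is an illustrative implementation (not from the chapter); the function name and the equal default priors are assumptions.

```python
import numpy as np
from scipy.stats import norm

def bayes_error_equal_cov(M1, M2, C, P1=0.5, P2=0.5):
    """Bayes error for two Gaussians sharing covariance C, via (2.18)-(2.21)."""
    d = M2 - M1
    D2 = float(d @ np.linalg.solve(C, d))     # squared Mahalanobis distance
    D = np.sqrt(D2)
    t = np.log(P1 / P2)                       # Bayes threshold from (2.6)
    eps1 = 1.0 - norm.cdf((t + D2 / 2) / D)   # Pr{h > t | omega_1}
    eps2 = norm.cdf((t - D2 / 2) / D)         # Pr{h < t | omega_2}
    return P1 * eps1 + P2 * eps2
```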
(2) Gaussian X with C_1 \ne C_2. The Bayes classifier for this case is

h(X) = \frac{1}{2}(X - M_1)^T C_1^{-1}(X - M_1) - \frac{1}{2}(X - M_2)^T C_2^{-1}(X - M_2) + \frac{1}{2}\ln\frac{|C_1|}{|C_2|}.   (2.22)

The distribution of h(X) is no longer Gaussian, but the errors for a general h(X) can be expressed as double integrals over X and ω, (2.23) and (2.24), where the unspecified integral regions are the entire domain for X and [-\infty, +\infty] for ω. In particular, when h(X) is the quadratic function of (2.22) and the p_i(X)'s are Gaussian, (2.23) and (2.24) can be integrated explicitly with respect to X, leaving one-dimensional integrals over ω, (2.25) and (2.26), whose integrands (2.27)-(2.30) are built from \tan^{-1} terms in ω. The λ's appearing in these expressions are the diagonal components of Λ, obtained by simultaneously diagonalizing C_1 and C_2 as

A^T C_1 A = I \quad \text{and} \quad A^T C_2 A = \Lambda,   (2.31)

and (d_{2i} - d_{1i}) is the i-th component of the vector A^T(M_2 - M_1). Equations (2.25) and (2.26) must be integrated numerically, but they are only one-dimensional integrations.
Upper and lower bounds. The computation of the Bayes error is very complex unless h(X) is linear. Furthermore, since a numerical integration is involved, the Bayes error cannot be expressed explicitly. An alternative is to use upper and lower bounds of the Bayes error as a measure of class separability. Some popular bounds are listed as follows:

E\{q_1(X) q_2(X)\} : 2-nearest-neighbor error   (2.32)

E\{\min[q_1(X), q_2(X)]\} : Bayes error   (2.33)

2 E\{q_1(X) q_2(X)\} : Nearest-neighbor error   (2.34)

-\frac{1}{2\ln 2} E\{q_1(X)\ln q_1(X) + q_2(X)\ln q_2(X)\} : Equivocation   (2.35)

E\{\sqrt{q_1(X) q_2(X)}\} : Bhattacharyya bound.   (2.36)

The inequalities (2.32) ≤ (2.33) ≤ (2.34) ≤ (2.35) ≤ (2.36) hold regardless of the distributions. One of the popular bounds is the Bhattacharyya bound, which has an explicit expression for Gaussian distributions:

\varepsilon^* \le \sqrt{P_1 P_2}\, e^{-\mu}   (2.37)

\mu = \frac{1}{8}(M_2 - M_1)^T \left[\frac{C_1 + C_2}{2}\right]^{-1}(M_2 - M_1) + \frac{1}{2}\ln\frac{|(C_1 + C_2)/2|}{\sqrt{|C_1||C_2|}}.   (2.38)

The first term of (2.38) indicates the class separability due to the mean difference, and the second term gives that due to the covariance difference. When the distributions are non-Gaussian, (2.37) with (2.38) is no longer guaranteed to bound the Bayes error. Still, in order to use (2.38) as an effective class separability measure, one may transform each variable to a Gaussian-like one. For example, power transforms of each variable may be used for causal distributions. Such a variable transformation is useful not only for measuring the class separability, but also for designing a better classifier. Designing the Bayes classifier for Gaussian distributions, even with additional variable transformations, is often easier than designing the Bayes classifier for non-Gaussian distributions.
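The Bhattacharyya distance of (2.38) and the bound of (2.37) are easy to compute from the class means and covariances. A minimal illustrative sketch, with an assumed function name and equal default priors:

```python
import numpy as np

def bhattacharyya_bound(M1, C1, M2, C2, P1=0.5, P2=0.5):
    """Bhattacharyya distance (2.38) and error upper bound (2.37) for two Gaussians."""
    d = M2 - M1
    C = 0.5 * (C1 + C2)
    term_mean = 0.125 * d @ np.linalg.solve(C, d)     # mean-difference term
    term_cov = 0.5 * np.log(np.linalg.det(C)
                            / np.sqrt(np.linalg.det(C1) * np.linalg.det(C2)))
    mu = term_mean + term_cov
    return mu, np.sqrt(P1 * P2) * np.exp(-mu)         # (distance, upper bound)
```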
Scatter measures. The bounds discussed above are valid only for two-class problems. There exist no well accepted extensions of the above bounds to multiclass problems. Scatter measures, introduced here as an alternative, are intuitively derived and simple, but not directly related to the Bayes error. They are for multiclasses. Let us define, for the L-class problem,

S_b = \sum_{i=1}^{L} P_i (M_i - M)(M_i - M)^T   (Between-class scatter matrix)   (2.39)

S_w = \sum_{i=1}^{L} P_i C_i   (Within-class scatter matrix)   (2.40)

S_m = E\{(X - M)(X - M)^T\} = S_b + S_w   (Mixture scatter matrix).   (2.41)

The class separability can be measured by combinations of these matrices:

\mathrm{tr}\, S_w^{-1} S_b, \quad \mathrm{tr}\, S_m^{-1} S_w, \quad \ln|S_m^{-1} S_w|, \quad \text{etc.}   (2.42)

where the trace and log-determinant are used to convert a matrix to a number. The first and second ones are often used for feature extraction and clustering respectively. All combinations, S_w^{-1} S_b, S_b^{-1} S_w, S_m^{-1} S_b, etc., share the same eigenvectors, and their eigenvalues are closely related. The first one of (2.42) measures class separability based on the scatter of the class means, normalized by S_w, and the others are variations of it. Therefore, the scatter measures can be applied only to cases where classes are separated mainly by mean-difference. There are no measures available for multiclass problems mainly separated by covariance-difference.
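The scatter matrices of (2.39)-(2.41) and the first measure of (2.42) can be computed directly from the class means, covariances and priors. A small illustrative sketch (names are assumptions):

```python
import numpy as np

def scatter_measures(means, covs, priors):
    """Between/within-class scatter (2.39)-(2.40) and tr(Sw^-1 Sb) from (2.42)."""
    M = sum(P * Mi for P, Mi in zip(priors, means))           # mixture mean
    Sb = sum(P * np.outer(Mi - M, Mi - M) for P, Mi in zip(priors, means))
    Sw = sum(P * Ci for P, Ci in zip(priors, covs))
    J = np.trace(np.linalg.solve(Sw, Sb))                     # separability measure
    return Sb, Sw, J
```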
3. Classifier Design

Once the structure of the data is studied thoroughly, it is easy to select a proper classifier for the data. This section presents how several typical classifiers can be designed.

3.1. Linear Classifiers
The Bayes classifier becomes linear in the following two cases.

(1) Gaussian X with C_1 = C_2 = C: For this case, the Bayes classifier is expressed by (2.17), which is linear. In particular, when C = I,

h(X) = (M_2 - M_1)^T X + \frac{1}{2}(M_1^T M_1 - M_2^T M_2) \gtrless_{\omega_1}^{\omega_2} t.   (3.1)

This classifier is also known as the distance classifier, in which (X - M_1)^T(X - M_1) \gtrless_{\omega_1}^{\omega_2} (X - M_2)^T(X - M_2), or the correlation classifier, in which M_1^T X \gtrless_{\omega_2}^{\omega_1} M_2^T X with the energy constraint M_1^T M_1 = M_2^T M_2. In both cases, P_1 = P_2 and thus t = 0 is assumed. When C \ne I, the distance or correlation classifier may still be applied, but only after X is linearly transformed to Y = A^T X in order to make C_Y = A^T C A = I.

(2) Binary independent x_j's: For independent x_j's taking either +1 or -1,

p_i(X) = \prod_{j=1}^{n} w_{ij}^{(1+x_j)/2} (1 - w_{ij})^{(1-x_j)/2}\,[\delta(x_j - 1) + \delta(x_j + 1)]   (3.2)
where w_{ij} = \Pr\{x_j = +1 | \omega_i\}. Substituting (3.2) into (2.6) gives

h(X) = -\sum_{j=1}^{n}\left[\frac{1+x_j}{2}\ln\frac{w_{1j}}{w_{2j}} + \frac{1-x_j}{2}\ln\frac{1-w_{1j}}{1-w_{2j}}\right] \gtrless_{\omega_1}^{\omega_2} \ln\frac{P_1}{P_2},   (3.3)

which is again a linear function of the x_j's.

For Gaussian X with C_1 \ne C_2 and for more general non-Gaussian X, a linear classifier is not the best one. However, because of its simplicity and robustness, a linear classifier is frequently adopted. The design procedure is as follows. The classifier is

h(X) = V^T X \gtrless_{\omega_1}^{\omega_2} t.   (3.4)

Equation (3.4) indicates that X is linearly mapped down to a variable h, and the distributions of h for ω_1 and ω_2 are separated by a threshold t. Thus, the optimum V is found by minimizing the probability of error in the h-space. Because of the complexity of the error computation, simpler criteria of the form f(m_1, m_2, \sigma_1^2, \sigma_2^2) are often used, where m_i = E\{h(X)|\omega_i\} = V^T M_i and \sigma_i^2 = \mathrm{Var}\{h(X)|\omega_i\} = V^T C_i V. Typical examples are

f = \frac{(m_2 - m_1)^2}{\sigma_1^2 + \sigma_2^2}   (3.5)

f = \frac{P_1(m_1 - m_0)^2 + P_2(m_2 - m_0)^2}{P_1\sigma_1^2 + P_2\sigma_2^2} \quad \left(\frac{\text{between-class scatter}}{\text{within-class scatter}}\right)   (3.6)

where m_0 = P_1 m_1 + P_2 m_2 is the mixture mean. These criteria measure the class separability of the distributions of h. The solution of \partial f/\partial V = 0 is

V = [s\,C_1 + (1 - s)\,C_2]^{-1}(M_2 - M_1)   (3.7)

where

s = \frac{\partial f/\partial \sigma_1^2}{\partial f/\partial \sigma_1^2 + \partial f/\partial \sigma_2^2}.   (3.8)

That is, the optimum V always takes the form of (3.7) regardless of the functional form of f. The effect of f is observed only in the averaging coefficient s of the covariance matrices. For example, s = 0.5 for (3.5) and s = P_1 for (3.6). V can be found even without specifying f. Since the form of V is known from (3.7), we change s from 0 to 1 with a certain increment, say 0.1, compute the empirical distribution functions of h = V^T X for ω_1 and ω_2 from the given data set, select the value of the threshold, and count the number of misclassified samples. The optimum s is the one which gives the smallest error in this operation.
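The s-sweep just described can be coded directly: compute V of (3.7) for several values of s and keep the one giving the fewest misclassified design samples. The sketch below is illustrative and not the chapter's procedure verbatim; in particular, the simple midpoint threshold is an assumption, whereas the text selects the threshold from the empirical distributions of h.

```python
import numpy as np

def design_linear_classifier(X1, X2, s_values=np.linspace(0.0, 1.0, 11)):
    """Sweep s in (3.7) and return (V, t, s) with the smallest design-set error.

    X1, X2: arrays of shape (N_i, n) holding the design samples of omega_1, omega_2.
    """
    M1, M2 = X1.mean(axis=0), X2.mean(axis=0)
    C1, C2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
    best = None
    for s in s_values:
        V = np.linalg.solve(s * C1 + (1.0 - s) * C2, M2 - M1)   # (3.7)
        h1, h2 = X1 @ V, X2 @ V
        t = 0.5 * (h1.mean() + h2.mean())                       # simple threshold choice
        err = np.sum(h1 > t) + np.sum(h2 < t)                   # misclassified samples
        if best is None or err < best[0]:
            best = (err, V, t, s)
    return best[1], best[2], best[3]
```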
The Bhattacharyya bound of (2.38) gives a simple test to decide whether or not a linear classifier is appropriate. When the first term of (2.38) is dominant, the classifiability comes mainly from the mean difference, and a linear classifier is a proper choice. However, if the second term is significant, the covariance difference plays an important role, and a quadratic classifier is called for.

3.2. Quadratic Classifiers
For Gaussian X, the Bayes classifier becomes quadratic, as shown in (2.7) or (2.22). In practice, the quadratic classifier of (2.22) is widely adopted in many applications, even without checking the Gaussianness of X, and with much success. This is probably the classifier everyone tries first, even before conducting data structure analysis. However, it is not known how to design the optimum quadratic classifier for non-Gaussian distributions in the way the linear classifier was designed. The optimization of f(m_1, m_2, \sigma_1^2, \sigma_2^2) for h = X^T Q X + V^T X with respect to a matrix Q and a vector V is too complex. If the quadratic terms x_j x_k are treated as new variables y_i, then h = \sum\sum q_{jk} x_j x_k + \sum v_j x_j becomes a linear equation h = \sum a_i y_i + \sum v_j x_j. However, for high-dimensional cases, the number of y_i's becomes prohibitively large.
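For reference, the quadratic (Gaussian) classifier of (2.22) with estimated parameters can be written compactly as follows; this is an illustrative sketch with assumed function names.

```python
import numpy as np

def quadratic_classifier(X, M1, C1, M2, C2, t=0.0):
    """Quadratic classifier of (2.22): decide omega_1 for each row of X when h(X) < t."""
    def mahal2(X, M, C):
        d = X - M
        return np.einsum('ij,ij->i', d, np.linalg.solve(C, d.T).T)   # row-wise d^T C^-1 d

    h = (0.5 * mahal2(X, M1, C1) - 0.5 * mahal2(X, M2, C2)
         + 0.5 * np.log(np.linalg.det(C1) / np.linalg.det(C2)))
    return np.where(h < t, 1, 2)
```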
Two-dimensional display. One of the procedures used to improve the performance of the quadratic classifier is to plot X in a two-dimensional display where d_i^2(X) = (X - M_i)^T C_i^{-1}(X - M_i) for i = 1, 2 are used as the x and y axes. If X is Gaussian, the Bayes classifier becomes a 45° line with a proper y-crossing point. When the distribution is not perfectly Gaussian, we can observe in the display that the 45° line is not the best boundary to minimize the number of misclassified samples. Then, visually, we can find a better line to classify samples by changing the slope and the y-crossing point of the line. This corresponds to adjusting α and β of the following quadratic classifier:

d_2^2(X) \gtrless_{\omega_2}^{\omega_1} \alpha\, d_1^2(X) + \beta.   (3.9)

Once samples are plotted and examined, the boundary in the display need not be restricted to a line. Any curve could be drawn. This flexibility is the advantage of seeing the data on the display.

Fourier transform. When a stationary random process is time-sampled, the coefficients of the discrete Fourier transform are uncorrelated, and the covariance matrix of the coefficients becomes diagonal. Thus, the quadratic classifier of the Fourier coefficients y_i is reduced to h = \sum_{i=1}^{n}(q_i |y_i|^2 + v_i y_i) + v_0. This is the Bayes classifier if the y_i's are Gaussian.
Approximation of covariance matrices. If we can assume a structure for a covariance matrix, we can simplify the design of a quadratic classifier. In addition, the classifier becomes less sensitive to the parameter variation caused by estimating the parameters from a finite number of design samples. This will be discussed in the next section. One of the possible structures is the Toeplitz form for the correlation matrix, allowing each variable to have its own distinct variance. In this case, the parameters must be selected to assure that the Toeplitz matrix is positive-definite. In particular, when the correlation coefficient between the ith and jth variables, r_{ij}, is approximated by ρ^{|i-j|} (ρ: a constant), the entire correlation matrix is characterized by the one parameter ρ, and its determinant and inverse matrix may be expressed explicitly.
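A sketch of the one-parameter Toeplitz structure described above, assuming the correlation model r_ij = ρ^|i-j| with per-variable standard deviations (function and argument names are illustrative):

```python
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_covariance(sigmas, rho):
    """Covariance built from the correlation structure r_ij = rho**|i-j|.

    sigmas: per-variable standard deviations; rho: a constant in (-1, 1).
    """
    n = len(sigmas)
    R = toeplitz(rho ** np.arange(n))      # correlation matrix, one free parameter
    return np.outer(sigmas, sigmas) * R
```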
3.3. Piecewise Classifiers

For the multiclass problem, the boundary must have a piecewise structure as follows.

Piecewise quadratic classifiers. If X is Gaussian, the classifier becomes, from (2.7), piecewise quadratic:

\frac{1}{2}(X - M_k)^T C_k^{-1}(X - M_k) + \frac{1}{2}\ln|C_k| - \ln P_k = \min_i \left[\frac{1}{2}(X - M_i)^T C_i^{-1}(X - M_i) + \frac{1}{2}\ln|C_i| - \ln P_i\right] \Rightarrow X \in \omega_k.   (3.10)

The first term of (3.10) is widely used even for non-Gaussian distributions. However, it must be noted that the normalized distance of X from each class mean, M_i, must be adjusted by the two constant terms \ln|C_i| and \ln P_i.

Piecewise linear classifiers. When the C_i's are similar, the quadratic term X^T C_i^{-1} X is eliminated from (3.10) to get a piecewise linear classifier (3.11). Or, replacing C_i by the averaged covariance C = (C_1 + \cdots + C_L)/L,

-M_k^T C^{-1} X + \frac{1}{2} M_k^T C^{-1} M_k - \ln P_k = \min_i \left[-M_i^T C^{-1} X + \frac{1}{2} M_i^T C^{-1} M_i - \ln P_i\right] \Rightarrow X \in \omega_k.   (3.12)

Another possibility is to design the optimal linear classifier for each pair of classes. In this case, L(L-1)/2 classifiers must be designed, instead of the L classifiers in (3.11) or (3.12).

Clustering. In some applications, each class distribution is handled better by dividing it into several clusters. For example, take the signatures of a target viewed from the front and from the side. Since they are so different, it may be more appropriate to separate them into several clusters rather than to treat all of them as one class. Considering each cluster as a class, we can form a new multiclass problem with a significantly increased number of classes. However, the details of designing such a classifier depend very much on how clusters are defined and obtained and how many classes are generated. Therefore, although important, the subject is not discussed in this chapter.

k nearest neighbor (kNN). The kNN classifier forms a piecewise linear boundary, although it is very complex and data dependent. A simpler boundary can be obtained by merging samples into a smaller number of representatives and then applying the kNN classifier to these representatives.
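A minimal sketch of the piecewise quadratic rule (3.10) for L classes follows; the function name and the list-based interface are assumptions.

```python
import numpy as np

def piecewise_quadratic_classify(X, means, covs, priors):
    """Rule (3.10): assign each row of X to the class minimizing the normalized
    distance adjusted by ln|C_i| and ln P_i."""
    scores = []
    for M, C, P in zip(means, covs, priors):
        d = X - M
        mahal2 = np.einsum('ij,ij->i', d, np.linalg.solve(C, d.T).T)
        scores.append(0.5 * mahal2 + 0.5 * np.log(np.linalg.det(C)) - np.log(P))
    return np.argmin(np.stack(scores, axis=1), axis=1)   # class index 0..L-1
```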
3.4. Sequential Classifiers

When m consecutive observation vectors, X_1, ..., X_m, are known to come from the same class, we can use this additional information to reduce the classification error. That is, the number of variables is extended from n for one vector to m × n for m vectors. Thus, we can form a new random vector with m × n components and design a classifier in the (m × n)-dimensional space. However, when these vectors are statistically independent, a simpler formula can be adopted:

\sum_{i=1}^{m} h(X_i) \gtrless_{\omega_1}^{\omega_2} t.   (3.13)

That is, the likelihood ratio classifier is applied to each incoming sample X_i, and the outputs are accumulated. Rewriting the left-hand side of the inequality as

s = \frac{1}{m}\sum_{i=1}^{m} h(X_i),   (3.14)

the expected values and variances of s and h are related by

E\{s|\omega_i\} = E\{h|\omega_i\} \quad \text{and} \quad \mathrm{Var}\{s|\omega_i\} = \frac{1}{m}\mathrm{Var}\{h|\omega_i\}.   (3.15)

Thus, we can reduce the variances of s by increasing m, while maintaining the expected values of s. Furthermore, the density function of s becomes close to a Gaussian by the central limit theorem. Two important properties of the sequential classifier emerge from the above discussion. One is that we can make the error as small as we like by increasing m. The other is that the error is determined by a small number of parameters, E\{h|\omega_i\}, \mathrm{Var}\{h|\omega_i\} and m, and is little affected by the higher-order moments of h. In practice, the true p_i(X)'s are never known, and h(X) = -\ln[p_1(X)/p_2(X)] must be replaced by some function \hat{h}(X). A desired property for \hat{h}(X) is

E\{\hat{h}(X)|\omega_1\} \le 0 \quad \text{and} \quad E\{\hat{h}(X)|\omega_2\} \ge 0   (3.16)

regardless of the distribution of X. As long as (3.16) is satisfied, the random variable \hat{h}(X) carries classification information, however small, regardless of the distribution of X. The classifiable information can be enhanced as much as we like by increasing m in the sequential operation. Two \hat{h}(X)'s are known to satisfy (3.16) for all distributions of X whose expected vectors and covariance matrices are M_1 and C_1 for ω_1 and M_2 and C_2 for ω_2:

\hat{h}(X) = (M_2 - M_1)^T C^{-1} X + \frac{1}{2}(M_1^T C^{-1} M_1 - M_2^T C^{-1} M_2)   (3.17)

\hat{h}(X) = \frac{1}{2}(X - M_1)^T C_1^{-1}(X - M_1) - \frac{1}{2}(X - M_2)^T C_2^{-1}(X - M_2) + \frac{1}{2}\ln\frac{|C_1|}{|C_2|}.   (3.18)
Any positive-definite matrix C in (3.17) satisfies (3.16). However, an averaged covariance matrix such as the one in (3.7) would be a better choice of C for achieving the same performance with a smaller m. Equation (3.17) can be used if the first term is dominant in the Bhattacharyya bound (2.38), but (3.18) is more appropriate otherwise. Note that these equations are the same as (2.17) and (2.22), respectively. One of the most important aspects of classifier design is to make the classifier robust. That is, the performance of the classifier must be maintained even if the distribution of the test samples becomes somewhat different from the one used for design. The sequential technique can compensate for the degradation of the performance of \hat{h}(X_i) \gtrless t by increasing m.
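A small sketch of the sequential decision of (3.13)-(3.15), using the linear \hat{h} of (3.17); names and the zero default threshold are assumptions.

```python
import numpy as np

def h_linear(X, M1, M2, C):
    """The linear h-hat of (3.17); any positive-definite C satisfies (3.16)."""
    return (M2 - M1) @ np.linalg.solve(C, X) + 0.5 * (
        M1 @ np.linalg.solve(C, M1) - M2 @ np.linalg.solve(C, M2))

def sequential_classify(X_seq, M1, M2, C, t=0.0):
    """Average h over m consecutive samples from one class, as in (3.13)-(3.15)."""
    s = np.mean([h_linear(X, M1, M2, C) for X in X_seq])   # variance shrinks as 1/m
    return 1 if s < t else 2
```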
4. Estimation of Classification Errors

So far, we have discussed the design of parametric classifiers, assuming that M_i and C_i are given. In practice, these parameters are estimated from a finite number of available samples, and the estimates are random variables. Consequently, the classifier designed with these estimates varies, and its performance is also random. Therefore, it is important to know how the sample size affects classifier design and performance.

4.1. Effect of Sample Size on Estimation
General formula. First let us consider the problem of estimating f = f(y_1, ..., y_q) by \hat{f} = f(\hat{y}_1, ..., \hat{y}_q), where f is a given function, the y_i's are the true parameter values, and the \hat{y}_i's are their estimates. When the deviation of \hat{y}_i from y_i is small, \hat{f} may be expanded in a Taylor series as

\hat{f} \cong f + \sum_i \frac{\partial f}{\partial y_i}\Delta y_i + \frac{1}{2}\sum_i\sum_j \frac{\partial^2 f}{\partial y_i \partial y_j}\Delta y_i \Delta y_j   (4.1)

where \Delta y_i = \hat{y}_i - y_i. If the estimates are unbiased,

E\{\hat{f}\} \cong f + \frac{1}{2}\sum_i\sum_j \frac{\partial^2 f}{\partial y_i \partial y_j} E\{\Delta y_i \Delta y_j\}   (4.2)

\mathrm{Var}\{\hat{f}\} \cong \sum_i\sum_j \frac{\partial f}{\partial y_i}\frac{\partial f}{\partial y_j} E\{\Delta y_i \Delta y_j\}.   (4.3)

In most parametric cases, the y_i's are the components of M_r and C_r (r = 1, 2), and Y^T = [y_1 \ldots y_q] can be expressed as

Y^T = [m_1^{(1)}, \ldots, m_n^{(1)}, m_1^{(2)}, \ldots, m_n^{(2)}, c_{11}^{(1)}, \ldots, c_{nn}^{(1)}, c_{11}^{(2)}, \ldots, c_{nn}^{(2)}]   (4.4)

where m_i^{(r)} and c_{ij}^{(r)} (i \le j) are the components of M_r and C_r respectively. Their unbiased estimates are obtained by the sample mean and sample covariance matrix as

\hat{M}_r = \frac{1}{N_r}\sum_{i=1}^{N_r} X_i^{(r)}, \qquad \hat{C}_r = \frac{1}{N_r - 1}\sum_{i=1}^{N_r}(X_i^{(r)} - \hat{M}_r)(X_i^{(r)} - \hat{M}_r)^T   (4.5)

where X_i^{(r)} is the ith sample from ω_r. When the X_i^{(r)}'s are drawn from Gaussian distributions, E\{\Delta y_i \Delta y_j\} for the y_i's of (4.4) are known, and (4.2) and (4.3) become the explicit bias and variance expressions (4.6) and (4.7), where both C_1 and C_2 are assumed to be diagonal with \lambda_i^{(r)} (r = 1, 2) as their diagonal components. Without loss of generality, any two covariance matrices can be simultaneously diagonalized. Also, N_1 = N_2 = N is assumed for simplicity. Note that both the bias of (4.6) and the variance of (4.7) are proportional to 1/N. The other terms are determined by the underlying distributions. This is true even when the X_i^{(r)}'s are drawn from non-Gaussian distributions.

Estimation procedure of f. Equation (4.6) can be rewritten as

E\{\hat{f}\} \cong f + \frac{u}{N}.   (4.8)

This equation suggests the following procedure to estimate f: (1) Change N to N_1, ..., N_\ell. For each N_i, compute \hat{M}_r and \hat{C}_r, and subsequently \hat{f}. Repeat this τ times independently, and approximate E\{\hat{f}\} by the sample mean of the τ results. (2) Plot the empirical points of E\{\hat{f}\} vs. 1/N. Then, find the line best fitted to these points. The slope of the line is u, and the y-crossing point is the estimate of f.
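The extrapolation procedure of (4.8) amounts to a straight-line fit of the averaged estimates against 1/N. A minimal illustrative sketch (names are assumptions; f_means holds the averages of the τ repetitions at each sample size):

```python
import numpy as np

def extrapolate_to_infinite_n(sample_sizes, f_means):
    """Fit E{f_hat} against 1/N; the intercept estimates f and the slope estimates u."""
    x = 1.0 / np.asarray(sample_sizes, dtype=float)
    slope, intercept = np.polyfit(x, f_means, 1)
    return intercept, slope      # (estimate of f, estimate of u)
```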
Bhattacharyya distance. The Bhattacharyya distance of (2.38) is a function of M_r and C_r. Thus, the bias of (4.6) can be further reduced by computing the partial derivatives of this function. Treating the first and second terms of (2.38) separately, their biases E\{\Delta\mu_1\} and E\{\Delta\mu_2\} can be computed, (4.9) and (4.10), where M_1 = 0, M_2 = [m_1 \ldots m_n]^T, \lambda_i^{(1)} = 1 (\Lambda_1 = I) and \lambda_i^{(2)} = \lambda_i (\Lambda_2 = \Lambda) are used without loss of generality. For example, when m_i = 0 and \lambda_i = 1, E\{\Delta\mu_1\} \cong n/(4N) and E\{\Delta\mu_2\} \cong n(n+1)/(8N). This is the case where the Bhattacharyya distance is measured between two sample sets generated from the same Gaussian distribution. Although \mu_1 = \mu_2 = 0 for N = \infty, a finite N creates the biases. Note that E\{\Delta\mu_1\} is proportional to 1/k (k = N/n: the ratio of sample size to dimensionality) while E\{\Delta\mu_2\} depends on (n+1)/k. In order to maintain the same amount of bias (E\{\Delta\mu\} = E\{\Delta\mu_1\} + E\{\Delta\mu_2\}), a larger k must be chosen as n increases. For example, to meet E\{\Delta\mu\} \le 0.223, k must be larger than 6.2 and 39.6 for n = 8 and 64 respectively. The variances of (4.7) can also be computed similarly.
4.2. Estimation of Classification Errors
The classification errors of (2.23) and (2.24) are members of the family of functions presented in (4.1) and (4.4), when h(X) and p_i(X) are functions of M_r and C_r. However, in this case, the randomness comes from two sources: the finite design-sample set makes h(X) random, and the finite test-sample set makes p_i(X) random. Since these two affect the error differently, we need to discuss their effects separately.
Effect of test samples. When a finite number of samples is available for testing a given classifier, an error-counting procedure is the only feasible possibility in practice. That is, each sample is tested by the classifier and the number of misclassified samples is counted. Then,

E_t\{\hat{\varepsilon}_r\} = \varepsilon_r \quad \text{and} \quad \mathrm{Var}_t\{\hat{\varepsilon}_r\} = \frac{\varepsilon_r(1 - \varepsilon_r)}{N_r} \quad (r = 1, 2)   (4.11)

where E_t and Var_t indicate that the expectations are taken with respect to the test samples, and N_r is the number of test samples from ω_r. This is an unbiased estimate. Furthermore, (4.11) is valid regardless of the functional forms of h(X) and p_r(X).
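The error-counting estimate of (4.11) and its variance are one line each; a small illustrative helper (names are assumptions):

```python
import numpy as np

def error_count(labels_true, labels_pred):
    """Error-counting estimate and its variance per (4.11)."""
    n = len(labels_true)
    eps_hat = np.mean(np.asarray(labels_true) != np.asarray(labels_pred))
    return eps_hat, eps_hat * (1.0 - eps_hat) / n
```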
Effect of design samples. When a finite number of design samples is used to compute \hat{M}_r, \hat{C}_r, and then h(X), E_d\{\hat{\varepsilon}\} and \mathrm{Var}_d\{\hat{\varepsilon}\} can be obtained through (2.23) and (2.24) for given test distributions p_r(X), where E_d and Var_d indicate the expectations with respect to the design samples. The resulting bias is

E_d\{\hat{\varepsilon}\} \cong \varepsilon + \frac{v}{N}   (4.12)

where \varepsilon = P_1\varepsilon_1 + P_2\varepsilon_2, and v is determined by the underlying distributions of the design samples, the given test distributions, and the functional form of h(X). The number of design samples is denoted by N (= N_1 = N_2) and is to be distinguished from the test sample size. Although v is a complicated function and can be computed
explicitly only for simple special cases, v can be obtained empirically by the estimation procedure of (4.8). When h(X) is the quadratic classifier of (2.22) with C_1 = C_2 = I and P_1 = P_2, v becomes an explicit function v_q of M = M_2 - M_1, proportional to e^{-M^T M/8} (4.13). On the other hand, when h(X) is the linear classifier of (2.17) with C_1 = C_2 = I and P_1 = P_2, v becomes the corresponding function v_\ell of M^T M, again proportional to e^{-M^T M/8} (4.14). When C_1 = C_2, the quadratic h(X) of (2.22) becomes the same as the linear h(X) of (2.17). However, when the estimated covariance matrices \hat{C}_1 \ne \hat{C}_2 are used, the h(X) of (2.22) differs from that of (2.17). As a result, E\{\Delta\varepsilon\} for the quadratic classifier is proportional to n^2/N, while E\{\Delta\varepsilon\} for the linear classifier tends to n/N when n gets large. This implies that many more samples are needed to properly design a quadratic classifier than a linear classifier. More generally, (4.6) suggests that the bias could be proportional to n^2 because of the double summation of the last term. This is due to the fact that n^2 correlations are estimated. This number could be significantly reduced if we assume a structure for the covariance matrix and estimate a smaller number of parameters. As for \mathrm{Var}_d\{\hat{\varepsilon}\}, it is proportional to 1/N^2 for the Bayes classifier and to 1/N otherwise (4.15).
Effect of independent design and test samples. When both the design and test samples are finite and they are independent, the bias and variance of \hat{\varepsilon} are

E\{\hat{\varepsilon}\} \cong E_d\{\varepsilon\}   (4.16)

\mathrm{Var}\{\hat{\varepsilon}_r\} \cong \frac{E_d\{\hat{\varepsilon}_r\}(1 - E_d\{\hat{\varepsilon}_r\})}{N_r} + \mathrm{Var}_d\{\hat{\varepsilon}_r\}   (4.17)

where E and Var indicate that the expectations are taken with respect to both the design and test samples. Note that \mathrm{Var}_t\{\hat{\varepsilon}_r\} of (4.11) is obtained from the first term of (4.17) by replacing E_d\{\hat{\varepsilon}_r\} by \varepsilon_r. Since E_d\{\hat{\varepsilon}_r\} \cong \varepsilon_r + v_r/N, we can conclude as follows:

(1) the bias of the classification error comes entirely from the finite design set, and
(2) the variance comes predominantly from the finite test set.
4.3. Holdout, Leave-One-Out, and Resubstitution Methods

When only one set of samples is available and the performance of the specified classifier is to be estimated, we need to decide how to divide the samples into two groups, design and test.
Upper and lower bounds of the Bayes error. In general, the classification error is a function of two sets of data, the design set \mathcal{P}_D and the test set \mathcal{P}_T, and may be expressed as \varepsilon(\mathcal{P}_D, \mathcal{P}_T), where ε denotes the error obtained by designing on the first set and testing on the second. We assume that both \mathcal{P}_D and \mathcal{P}_T are drawn from the same set of underlying distributions P = \{p_1(X), p_2(X)\}. If P is used for design, the resulting classifier is the Bayes classifier, which produces the Bayes error when tested on P. That is, the Bayes error is expressed by \varepsilon(P, P). Letting \Phi_1 and \Phi_2 be two different sets of samples independently drawn from P, \varepsilon(P, P) can be bounded as

E\{\varepsilon(\Phi_1, \Phi_1)\} \le \varepsilon(P, P) \le E\{\varepsilon(\Phi_1, \Phi_2)\}.   (4.18)

The rightmost term indicates that, as long as the design and test sample sets are independent, the resulting error is larger than the Bayes error in expectation with respect to the test samples. The leftmost term suggests that, if the same set is used for both design and test, the resulting error is smaller than the Bayes error in expectation. These procedures are called the holdout (H) and resubstitution (R) methods respectively. The H method works well if many data sets can be generated by a computer. However, in practice, with only one set of data, we need to divide the available data into two independent groups. This reduces the number of samples available for each of design and test. Also, it must be assured that the distributions of the design and test samples are close. Another problem is how to allocate samples to design and test. This is normally done by balancing the bias due to the design sample size and the variance due to the test sample size.

The leave-one-out (L) method alleviates the above difficulties of the H method. In this method, one sample is excluded, the classifier is designed on the remaining N − 1 samples, and the excluded sample is tested by the classifier. This is repeated N times to test all N samples. The number of misclassified samples is counted to obtain the estimate of the error. Since each test sample is excluded from the design sample set, the design and test sets are independent. Also, since all N samples are tested and N − 1 samples are used for design, the available samples are utilized more effectively. Furthermore, we do not need to worry about dissimilarity between the design and test distributions. Although the L method requires N classifiers (one for each sample), these classifiers may be computed with little extra computer time as perturbations of the classifier designed from all N samples. This will be shown for the quadratic classifier next and for nonparametric cases later.
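A brute-force sketch of the R and L error counts for an arbitrary two-class design procedure follows (function names are assumptions); the perturbation shortcut for the quadratic classifier, (4.20)-(4.22), is discussed next and is not used here.

```python
import numpy as np

def l_and_r_errors(X1, X2, design_and_classify):
    """Leave-one-out (L) and resubstitution (R) error counts for two classes.

    design_and_classify(X1_design, X2_design, X_test) -> predicted labels (1 or 2).
    """
    # R method: design on all samples, test the same samples
    pred1 = np.asarray(design_and_classify(X1, X2, X1))
    pred2 = np.asarray(design_and_classify(X1, X2, X2))
    r_err = int(np.sum(pred1 != 1) + np.sum(pred2 != 2))

    # L method: exclude the tested sample from its own class before designing
    l_err = 0
    for i in range(len(X1)):
        pred = design_and_classify(np.delete(X1, i, axis=0), X2, X1[i:i + 1])
        l_err += int(np.asarray(pred)[0] != 1)
    for i in range(len(X2)):
        pred = design_and_classify(X1, np.delete(X2, i, axis=0), X2[i:i + 1])
        l_err += int(np.asarray(pred)[0] != 2)
    return l_err, r_err
```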
R and L methods for the quadratic classifier. In the R method, all available samples X_i^{(r)} (r = 1, 2; i = 1, ..., N_r) are used to compute \hat{M}_r and \hat{C}_r of (4.5). Then, the same samples are tested as to whether or not the following inequality is satisfied:

\hat{h}(X_k) = \frac{1}{2}\hat{d}_1^2(X_k) - \frac{1}{2}\hat{d}_2^2(X_k) + \frac{1}{2}\ln\frac{|\hat{C}_1|}{|\hat{C}_2|} \gtrless_{\omega_1}^{\omega_2} t \quad \text{for testing } X_k \in \{X_1^{(1)}, \ldots, X_{N_1}^{(1)}, X_1^{(2)}, \ldots, X_{N_2}^{(2)}\}.   (4.19)

Then, the number of misclassified samples is counted. This error is expected to be smaller than the true error of the classifier. On the other hand, in the L method, X_k^{(1)} \in \omega_1 is excluded from the computations of \hat{M}_1 and \hat{C}_1, and the modified \hat{M}_1 and \hat{C}_1 are used in (4.19) to test X_k^{(1)}. Similarly, \hat{M}_2 and \hat{C}_2 are modified for testing X_k^{(2)} \in \omega_2. The resulting quadratic \hat{h}_L(X_k) can be written as the R-method \hat{h}_R(X_k) plus a scalar correction, (4.20)-(4.21), which depends only on the normalized distances

\hat{d}_r^2(X_k) = (X_k - \hat{M}_r)^T \hat{C}_r^{-1}(X_k - \hat{M}_r).   (4.22)

Equation (4.20) indicates that, for X_k \in \omega_1, \hat{h}_L(X_k) is larger than \hat{h}_R(X_k), and the chance of X_k being misclassified is increased. The same is true for X_k \in \omega_2. Therefore, the L error is always larger than the R error. When the R method is used to count the error, \hat{h}_R(X_k) and \hat{d}_r^2(X_k) (r = 1, 2) must be computed for all (N_1 + N_2) samples X_k. The L method requires an additional computation of (4.21) for each X_k. However, since (4.21) is a scalar function, the computation time for this part is negligibly small. Thus, when \hat{h}_R(X_k) is computed and tested for each X_k, \hat{h}_L(X_k) can also be computed and tested at the same time with little additional computer time.
5. Nonparametric Procedures

As Fig. 1 shows, nonparametric procedures are necessary for error estimation in both the measurement and feature spaces and for data structure analysis before a
parametric structure of the data is determined. Nonparametric procedures are based on the estimation of a density function without assuming any mathematical form.
5.1. Estimation of a Density Function

There are two approaches to density estimation: one is the Parzen approach and the other is the k-nearest neighbor (kNN) approach. They have similar statistical properties with minor differences.
Parzen density estimate. When N samples X_1, ..., X_N are given, but no mathematical form can be assumed for the density function, the value of the density function at X may be estimated by

\hat{p}(X) = \frac{1}{N}\sum_{i=1}^{N} \kappa(X - X_i)   (5.1)

where κ(·) is called a kernel function. In practice, the selection of the kernel function is limited to either Gaussian or uniform, particularly in a high-dimensional space. A more general family of kernels κ(·), (5.2), interpolates between these two: it involves a shape parameter m and the gamma function, and reduces to the Gaussian and uniform kernels for m = 1 and m = ∞ respectively. The matrix A, which is called a metric, determines the shape of the hyperellipsoid, and r controls its size. The remaining coefficients are selected to satisfy two conditions: \int \kappa(X)\,dX = 1 (which is required so that \int \hat{p}(X)\,dX = 1) and \int X X^T \kappa(X)\,dX = r^2 A (the covariance matrix of κ(X)). The bias and variance of (5.1) are

E\{\hat{p}(X)\} \cong p(X) + \frac{r^2}{2}\,\mathrm{tr}\{A\,\nabla^2 p(X)\}   (5.3)

\mathrm{Var}\{\hat{p}(X)\} \cong \frac{w\, p(X)}{N}, \qquad w = \int \kappa^2(X)\,dX.   (5.4)

The control parameters of the Parzen density estimate are m, r, A and N, and their optimal choices could be found by minimizing E\{[\hat{p}(X) - p(X)]^2\}, which is (Bias)^2 + Var. However, the optimal selection of parameters for density estimation does not coincide with that for classification.
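A minimal sketch of the Parzen estimate (5.1) with a Gaussian kernel whose covariance is r²A (function and argument names are assumptions):

```python
import numpy as np

def parzen_density(X_eval, samples, r=1.0, A=None):
    """Parzen estimate (5.1) with a Gaussian kernel of covariance r^2 * A."""
    N, n = samples.shape
    A = np.eye(n) if A is None else A
    cov = r ** 2 * A
    inv, det = np.linalg.inv(cov), np.linalg.det(cov)
    norm = 1.0 / np.sqrt((2 * np.pi) ** n * det)
    d = X_eval[:, None, :] - samples[None, :, :]        # (M, N, n) differences
    quad = np.einsum('mni,ij,mnj->mn', d, inv, d)
    return norm * np.exp(-0.5 * quad).mean(axis=1)      # average kernel value
```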
kNN density estimate. In this approach, the kth NN sample of X is found, and the distance d_k (or the corresponding volume v_k) is measured, where d_k^2 = (X_{kNN} - X)^T A^{-1}(X_{kNN} - X), v_k = c|A|^{1/2} d_k^n, A is a metric, and c is a constant. Then, the density estimate at X is

\hat{p}(X) = \frac{k - 1}{N\, v_k(X)}   (5.5)

where v_k is a random variable and a function of X. Defining u as the probability of a sample falling within v_k, the density function of u is known to be the beta density

f(u) = \frac{N!}{(k-1)!\,(N-k)!}\, u^{k-1}(1 - u)^{N-k}.   (5.6)

Since u \cong v_k\, p(X) for a small v_k, the density function of v_k can be computed from (5.6). Thus, the bias and variance of the kNN density estimate are also obtained: the bias of (5.7) is proportional to \mathrm{tr}\{A\,\nabla^2 p(X)/p(X)\}\,[(k-1)/(N c |A|^{1/2} p(X))]^{2/n}, and the variance is

\mathrm{Var}\{\hat{p}(X)\} \cong \frac{p^2(X)}{k - 2}.   (5.8)

The term (\cdot)^{2/n} of (5.7) is d_k^2 by (5.5). Also, \mathrm{Var}\{\hat{p}(X)\} of (5.4) becomes p^2(X)/k for a uniform kernel, in which w is 1/v. That is, the biases and variances of the Parzen and kNN density estimates are very similar. The kNN density estimate can be considered a Parzen estimate with a uniform kernel whose size is adjusted by p(X). The control parameters of the kNN density estimate are k, A and N, and their optimal choices are found by minimizing E\{[\hat{p}(X) - p(X)]^2\}.
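A small sketch of the kNN density estimate (5.5) for the Euclidean metric A = I, using a k-d tree for the neighbor search (names are assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density(X_eval, samples, k=10):
    """kNN density estimate (5.5): p_hat(X) = (k-1)/(N * v_k) with A = I."""
    N, n = samples.shape
    d_k = cKDTree(samples).query(X_eval, k=k)[0][:, -1]   # distance to the kth NN
    unit_ball = np.pi ** (n / 2) / gamma(n / 2 + 1)       # the constant c for A = I
    return (k - 1) / (N * unit_ball * d_k ** n)
```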
Moments of the kNN distance. Since the density function of v_k is known, the moments of d_k can be computed, resulting in

E\{d_k^m\} \cong \frac{\Gamma(k + m/n)}{\Gamma(k)}\,\frac{\Gamma(N + 1)}{\Gamma(N + 1 + m/n)}\,\frac{1}{(c|A|^{1/2})^{m/n}}\int p^{1 - m/n}(X)\,dX   (5.9)

where the integrals for some distributions with covariance matrix C are

Gauss: \int p^{1-m/n}(X)\,dX = (2\pi)^{m/2}|C|^{m/2n}(1 - m/n)^{-n/2}   (5.10)

Uniform: \int p^{1-m/n}(X)\,dX = (2\pi)^{m/2}|C|^{m/2n}\,\Gamma^{-m/n}(1 + n/2)\,(1 + n/2)^{m/2}.   (5.11)

When m/n is small, as in a high-dimensional space, and A is selected as A = C, E\{d_k^m\} is determined predominantly by n and m. The effects of k and N are minimal since \Gamma(k + m/n) \cong \Gamma(k) and \Gamma(N + 1 + m/n) \cong \Gamma(N + 1). Also, E\{d_k^m|X\} is computed by (5.9) without taking the integration, but is little affected by p(X) because of the small power m/n. The variance of d_k is very small and all d_k's are close to the expected value.

Estimation of the local dimensionality. The ratio of two averaged kNN distances depends only on k and n, but not on N and p(X), as follows:

\frac{E\{d_{k+1}\}}{E\{d_k\}} = \frac{k + 1/n}{k} = 1 + \frac{1}{nk}.   (5.12)
The n computed from the d_k's and d_{k+1}'s by (5.12) depends only on neighboring information, and thus indicates the local dimensionality (or intrinsic dimensionality). Generally, the dimensionality plays a dominant role in determining the statistical properties of any nonparametric estimate. For example, E\{d_k^m\} of (5.9) with A = C is predominantly determined by n and m. However, it must be kept in mind that the n of (5.9) means the local dimensionality, not the global one.
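The local dimensionality estimate of (5.12) needs only the averaged kth and (k+1)th NN distances; an illustrative sketch (names are assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

def local_dimensionality(samples, k=5):
    """Intrinsic dimensionality from (5.12): d_{k+1}/d_k = 1 + 1/(n*k)."""
    dists = cKDTree(samples).query(samples, k=k + 2)[0]     # column 0 is the point itself
    d_k, d_k1 = dists[:, k].mean(), dists[:, k + 1].mean()  # averaged kNN distances
    return 1.0 / (k * (d_k1 / d_k - 1.0))
```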
Very large number of classes. When the number of classes is very large, as in the hundreds, we may consider the class expected vectors M_i (i = 1, ..., L) as random vectors drawn from a distribution p(M). The classification error between a pair of classes ω_i and ω_j is determined by the distance between M_i and M_j and the amounts of noise around these M's. The overall error depends on how many neighboring classes contribute to the error. On the other hand, the previous discussion indicates that, in a high-dimensional space, the kNN distance is little affected by k, L and p(M). That is, each class is surrounded by many other neighboring classes at almost equal distances. Thus, almost-equal pairwise errors are added up to form the total error, which could become large. Classification with a very large number of classes must be handled with special care. It is not enough to confirm that each pair of classes can be classified with a reasonably small error.

5.2. Classification
Parzen classifier. Substituting the Parzen density estimates into the likelihood ratio classifier of (2.6),

h(X) = -\ln\frac{\hat{p}_1(X)}{\hat{p}_2(X)} \gtrless_{\omega_1}^{\omega_2} t.   (5.13)

This is called the Parzen classifier and can be used to classify X when a set of samples \{X_1^{(1)}, \ldots, X_{N_1}^{(1)}, X_1^{(2)}, \ldots, X_{N_2}^{(2)}\} is given. Each class may have a distinct kernel function. In order to find the upper and lower bounds of the Bayes error, we may adopt the resubstitution (R) and leave-one-out (L) methods for the Parzen classifier. In the R method, the same samples X_i^{(r)} (r = 1, 2; i = 1, ..., N_r) are tested by (5.13), and the number of misclassified samples is counted. On the other hand, when X_\ell^{(1)} is tested in the L method, X_\ell^{(1)} is excluded from the Parzen density estimate of ω_1. Therefore, the numerator of (5.13) must be replaced by

\hat{p}_{1L}(X_\ell^{(1)}) = \frac{1}{N_1 - 1}\left[N_1\,\hat{p}_1(X_\ell^{(1)}) - \kappa_1(0)\right].   (5.14)

The denominator stays the same, \hat{p}_2(X_\ell^{(1)}). Again, all X_\ell^{(1)} (\ell = 1, ..., N_1) are tested and the number of misclassified samples is counted. Note that the amount subtracted in (5.14), \kappa_1(0), does not depend on \ell. When an ω_2 sample is tested, the denominator of (5.13) is modified in the same way. It can be proved that \hat{p}_{1L}(X_\ell^{(1)}) \le \hat{p}_1(X_\ell^{(1)}) if \kappa(X) \le \kappa(0), which is satisfied for the kernel functions of (5.2). Therefore, the tested sample has more of a chance to be misclassified in the L method than in the R method. Also, note that the L density estimate of (5.14) can be obtained from the R density estimate by simple scalar operations: subtracting \kappa_1(0) and dividing by N_1 - 1. Therefore, the computation time needed to obtain both the L and R density estimates is almost the same as that needed for the R density estimate alone.
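The following sketch counts the L and R errors of the Parzen classifier using (5.13) and the scalar correction (5.14); the kernel is passed in as a function of difference vectors, and all names are assumptions.

```python
import numpy as np

def parzen_classifier_LR_errors(X1, X2, kernel, t=0.0):
    """L and R error counts for the Parzen classifier (5.13)-(5.14)."""
    def dens(Y, S):                                # R-method Parzen estimate (5.1)
        return np.array([kernel(Y[i] - S).mean() for i in range(len(Y))])

    k0 = kernel(np.zeros((1, X1.shape[1])))[0]
    p1_1, p2_1 = dens(X1, X1), dens(X1, X2)        # densities at omega_1 samples
    p1_2, p2_2 = dens(X2, X1), dens(X2, X2)        # densities at omega_2 samples
    N1, N2 = len(X1), len(X2)
    p1_1L = (N1 * p1_1 - k0) / (N1 - 1)            # leave-one-out correction (5.14)
    p2_2L = (N2 * p2_2 - k0) / (N2 - 1)

    h = lambda pa, pb: -np.log(pa / pb)            # likelihood ratio of (5.13)
    r_err = int(np.sum(h(p1_1, p2_1) > t) + np.sum(h(p1_2, p2_2) < t))
    l_err = int(np.sum(h(p1_1L, p2_1) > t) + np.sum(h(p1_2, p2_2L) < t))
    return l_err, r_err
```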
kNN classifier. Using the kNN density estimates of (5.5), the likelihood ratio classifier becomes

h(X) = -\ln\frac{(k_1 - 1)\, N_2\, |A_2|^{1/2}\, (d_{k_2}^{(2)}(X))^n}{(k_2 - 1)\, N_1\, |A_1|^{1/2}\, (d_{k_1}^{(1)}(X))^n} \gtrless_{\omega_1}^{\omega_2} t   (5.15)

where v_k^{(i)} = c|A_i|^{1/2}(d_k^{(i)})^n and (d_k^{(i)})^2(X) = (X_{kNN}^{(i)} - X)^T A_i^{-1}(X_{kNN}^{(i)} - X). That is, in order to test X, the k_1NN from ω_1, X_{k_1NN}^{(1)}, and the k_2NN from ω_2, X_{k_2NN}^{(2)}, are found, the distances from X to these neighbors, d_{k_1}^{(1)} and d_{k_2}^{(2)}, are measured and inserted into (5.15) to test whether the left-hand side is smaller or larger than t. For simplicity, k_1 = k_2 = k is used in this chapter. Normally, the class covariance matrix is used for A_i, and therefore A_1 \ne A_2.

The R and L methods of the kNN classifier are used to bound the Bayes error. In the R method, all samples are included in the list of design samples from which the kNN of the test sample is found, and the same samples are tested. When X_\ell^{(1)} is tested, X_\ell^{(1)} itself is the closest sample in the list. Therefore, d_{k-1}^{(1)}(X_\ell^{(1)}) is inserted into the denominator of (5.15) while d_k^{(2)}(X_\ell^{(1)}) is inserted into the numerator. On the other hand, when X_\ell^{(1)} is tested in the L method, X_\ell^{(1)} must be excluded from the list of design samples. Therefore, d_k^{(1)}(X_\ell^{(1)}) and d_k^{(2)}(X_\ell^{(1)}) are compared. Since d_{k-1}^{(1)}(X_\ell^{(1)}) \le d_k^{(1)}(X_\ell^{(1)}), X_\ell^{(1)} has more of a chance to be misclassified in the L method than in the R method. Also, note that in order to find the kNN, the distances to all samples are computed and compared. When d_k^{(1)}(X_\ell^{(1)}) is obtained, d_{k-1}^{(1)}(X_\ell^{(1)}) is also available. This means that the computation time needed to get both the L and R results is practically the same as the time needed for the R method alone. Similarly, for testing an ω_2 sample X_\ell^{(2)}, d_{k-1}^{(2)}(X_\ell^{(2)}) and d_k^{(2)}(X_\ell^{(2)}) are compared with d_k^{(1)}(X_\ell^{(2)}) in the R and L methods respectively.

Voting kNN classifier. Instead of selecting the kNN from each class separately and comparing the distances, the kNN's of a test sample are selected from the mixture of classes, and the number of neighbors from each class, k_i, among the kNN is counted. The test sample is then classified to the class represented by a majority
56
K. Fukunaga
of the kNN’s. That is,
k,
= max(k1,
..., k L }
+
X E w,
(5.16)
In order to avoid confusion between these two k N N procedures, we will call (5.16) the voting k N N procedure and (5.15) the volumetric kNN procedure. For the voting kNN procedure, it is common practice t o use the same metric to measure the distances to samples from all classes, although each class could use its own metric. Since the kz’s are integers and a ranking procedure is used, it is hard to find a component of (5.16) analogous with the threshold of (5.15). It can be shown that, with t = 0 in (5.15), the volumetric k N N and voting (2k - 1)” procedures give identical classification results for the two-class problem using the same metric for both classes. For example, let k and (2k - 1) be 3 and 5 respectively. In the voting 5NN procedure, a test sample is classified to w1, if 3 , 4, or 5 of the 5”’s belong to w1. This is equivalent to saying that the third N N from w1 is closer to the test sample than the third N N from w2. In the voting k N N classification for the two-class problem, k must be odd. Otherwise, k l = kz could occur and we cannot decide which class the test sample is classified to. This problem may be alleviated by introducing the concept of rejection. That is, when kl = k2 occurs, the test sample is rejected, and not counted as misclassified one. As a result, the classification error becomes smaller than even the Bayes error. This happens because some of the to-be-misclassified samples are rejected and not counted as the error. The asymptotic ( N , = m) performance of the voting k N N is known. For the two-class problem, the risk of the voting k N N classification given X, r k ( X ) , is (5.17)
On the other hand, the Bayes risk given X can also be expanded in ξ(X):

r^*(X) = \min[q_1(X), q_2(X)] = \sum_{i=1}^{\infty}\frac{1}{i}\binom{2i-2}{i-1}\,\xi^i(X).   (5.19)

Using (5.17)-(5.19), it is not difficult to prove that these conditional risks satisfy the inequalities

r_2(X) \le r_4(X) \le \cdots \le r^*(X) \le \cdots \le r_3(X) \le r_1(X).   (5.20)

Taking the expectation of these risks with respect to X, the corresponding errors are obtained. Therefore, these errors also satisfy the inequalities of (5.20). Thus,

\varepsilon_2 \le \varepsilon_4 \le \cdots \le \varepsilon^* \le \cdots \le \varepsilon_3 \le \varepsilon_1   (5.21)

where \varepsilon^* = E\{r^*(X)\} and \varepsilon_k = E\{r_k(X)\}. Equation (5.21) indicates that asymptotically (N_r = \infty) the Bayes error is bounded by the voting kNN errors; the upper bounds are given by odd k's and the lower bounds by even k's.
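A minimal sketch of the voting kNN rule of (5.16), assuming integer class labels and the Euclidean metric (A = I):

```python
import numpy as np
from scipy.spatial import cKDTree

def voting_knn_classify(X_test, X_design, labels, k=3):
    """Voting kNN rule (5.16): majority class among the k nearest design samples."""
    labels = np.asarray(labels)
    idx = cKDTree(X_design).query(X_test, k=k)[1]      # neighbor indices
    if k == 1:
        idx = idx[:, None]
    votes = labels[idx]
    # majority vote per test sample (k should be odd for two classes)
    return np.array([np.bincount(v).argmax() for v in votes])
```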
5.3. Selection of Parameters
As seen in Fig. 1, one of the major objectives of nonparametric classification is to estimate the Bayes error. This is done by finding upper and lower bounds, using the L and R methods. However, nonparametric estimates are normally heavily biased unless the parameters are carefully chosen, and the results might not bound the Bayes error. Simply increasing the sample size is not the way to reduce the biases: the required number of samples could be astronomical, particularly in a high-dimensional space.

Bias of the voting kNN classification. The asymptotic kNN error is obtained by assuming q_r(X_{NN}) = q_r(X), where X_{NN} is the NN of X. However, as (5.9) suggests, X_{NN} and X are not close in a high-dimensional space. Using the better approximation q_r(X_{NN}) \cong q_r(X) + \nabla^T q_r(X)(X_{NN} - X) + \frac{1}{2}\mathrm{tr}\{\nabla^2 q_r(X)(X_{NN} - X)(X_{NN} - X)^T\}, we can compute the simplest case, the bias of the NN error for a finite sample size, resulting in

E\{\hat{\varepsilon}_{NN}\} \cong \varepsilon_{NN} + \beta_1\, E_X\{|A|^{-1/n}\,\mathrm{tr}\{A\,B_1(X)\}\}   (5.22)

where \varepsilon_{NN} is the asymptotic NN error, A is the metric, B_1(X) is a matrix determined by the underlying distributions, and \beta_1 is a constant related to N and n by

\beta_1 \propto \frac{\Gamma(N + 1)}{\Gamma(N + 1 + 2/n)} \cong N^{-2/n}.   (5.23)

For given distributions (given B_1(X) and n), the bias is controlled by N and A. However, the reduction of \beta_1 in (5.23) by increasing N is very slow for a large n. For example, for n = 64, N^{-2/n} is 0.81, 0.65 and 0.52 for N = 10^3, 10^6 and 10^9 respectively. The optimal A should be obtained by minimizing E_X\{\cdot\} of (5.22) for the global metric, and |A|^{-1/n}\mathrm{tr}\{A B_1(X)\} for the local metric. However, since B_1(X) is too complex to compute in practice, we do not know how to select A.
The above result can be extended to the 2NN error as

E\{\hat{\varepsilon}_{2NN}\} \cong \varepsilon_{2NN} + \beta_2\, E_X\{|A|^{-2/n}\,\mathrm{tr}\{A\,B_2(X)\}\}   (5.24)

\beta_2 \propto \frac{\Gamma(N + 1)}{\Gamma(N + 1 + 4/n)} \cong N^{-4/n}   (5.25)

where B_2(X) is another matrix determined by the underlying distributions. In (5.25), \beta_2 is proportional to N^{-4/n} while \beta_1 is proportional to N^{-2/n}. That is, as N increases, the 2NN error converges to its asymptotic value more quickly than the NN error, as if n were half as large. Also, note that \beta_2 is significantly smaller than \beta_1, because the small constant involved (0.071 for n = 64) appears squared. Since their asymptotic errors are related by \varepsilon_{NN} = 2\varepsilon_{2NN} from (5.17) and (5.18), a better estimate of \varepsilon_{NN} could be obtained by estimating \varepsilon_{2NN} first and doubling it.
Parzen classifiers. In the L method, the design and test samples are independent, and the bias of the classification error comes from the finite design samples. In this case, h(X) in (2.23) and (2.24) is replaced by the \hat{h}(X) of (5.13), in which the bias and variance of \hat{p}_i(X) are given by (5.3) and (5.4). Therefore, we can express the bias of the error of the Parzen classifier in terms of r and N as

E\{\Delta\hat{\varepsilon}\} \cong a_1 r^2 + a_2 r^4 + a_3\,\frac{r^{-n}}{N}   (5.26)

where a_1, a_2 and a_3 are determined by the underlying distributions, the metrics A_1 and A_2 for ω_1 and ω_2, the kernel shape parameter m, and the threshold t in (5.13). Recall from (5.3) and (5.4) that the bias of the Parzen density estimate is a function of \nabla^2 p(X), A and r^2, while the variance is a function of p(X), N and w (which is a function of r, A and m). The a_1 r^2 and a_2 r^4 terms indicate how the biases in the density estimates influence the performance of the classifier, while the a_3 r^{-n}/N term reflects the role of the variance of the density estimates. For small values of r, the variance term dominates (5.26), and the observed error rates are significantly above the Bayes error. As r grows, however, the variance term decreases while the a_1 r^2 and a_2 r^4 terms play an increasingly significant role. Thus, in a typical plot of the observed error rate versus r, the error decreases for small values of r until a minimum point is reached, and then increases as the bias terms of the density estimates become more significant. The r^{-n} in the third term of (5.26) is astronomically large for a small r and a large n (for example, r^{-n} = 1.8 × 10^{19} for r = 0.5 and n = 64). It is futile to attempt to bring down this term by selecting a large N and reducing a_3. The optimal r can be obtained by taking the derivative of (5.26) with respect to r and equating it to zero. The more practical solution is to compute the L and R errors for various values of r, plot the curves, and find the optimal r.
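A sketch of the r-sweep just described, reusing the parzen_classifier_LR_errors helper sketched earlier with a Gaussian kernel; the grid of r values and the function name are assumptions.

```python
import numpy as np

def sweep_parzen_radius(X1, X2, r_values):
    """Compute L and R error counts of a Gaussian-kernel Parzen classifier for each r."""
    n = X1.shape[1]
    results = []
    for r in r_values:
        norm = 1.0 / ((2 * np.pi) ** (n / 2) * r ** n)
        kernel = lambda D: norm * np.exp(-0.5 * np.sum(D ** 2, axis=1) / r ** 2)
        results.append((r,) + parzen_classifier_LR_errors(X1, X2, kernel))
    return results    # list of (r, L error count, R error count), for plotting
```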
As r increases, the biases of both the L and R errors increase, and they do not bound the Bayes error. In order to reduce this bias, we must make a_1 and a_2 of (5.26) small by selecting the parameters t, A_1, A_2 and m.

(1) Selection of the decision threshold t. Changing t is a very effective way to reduce a_1 and a_2. Since a_1 and a_2 are very complicated functions, it is more practical to find the optimal value of t experimentally. A different optimal value of t must be computed for each selection of r. Although better but more complex procedures are available, a simple procedure to determine t is as follows: for each value of r, find the value of t which minimizes the R error, and then use this value of t to find the L error. Since the selection of t is isolated from the actual values of the L estimate of the likelihood ratio, this method helps to maintain the independence of the test operation from the design operation.

(2) Selection of the metrics A_1 and A_2. Minimization of a_1 and a_2 with respect to A_1 and A_2 does not give any easy answer. Also, since A_1 and A_2 are n × n matrices, empirical trial-and-error cannot be applied easily. An intuitive selection, which is normally used, is A_i = C_i. But this is far from the optimal solution. Another alternative is A_i = C_i - \gamma_i(X - M_i)(X - M_i)^T with a constant \gamma_i, the properties of which have not been studied extensively. Even if A_i = C_i is a good selection of the metric, the estimation of C_i needs special care. That is, if the same sample set is used both for estimating C_i and for computing the L and R errors, both the L and R errors are severely and optimistically biased, and often they do not bound the Bayes error. In order to avoid this optimistic bias, we must use two independent sample sets, one for each of the above two operations. A better but more complex procedure is to exclude X_\ell^{(r)} from the estimation of C_r when X_\ell^{(r)} is tested by the L and R classifiers.

(3) Selection of m. As m increases, the L error curve is little affected, but the R curve steadily rises closer to the Bayes error line until m = 4, and then stops improving. This means that (1) both the Gaussian and uniform kernels give similar L errors, (2) the uniform kernel is better than the Gaussian kernel for computing the R error, and (3) uniform-like kernels are obtained by selecting m = 4.
kNN classifiers. The parameters of the kNN classifier, k, t, A_1 and A_2, are chosen in the same way as those of the Parzen classifier. That is, k and t are determined empirically. We do not know how to select A_1 and A_2. The samples used to compute the L and R errors must be independent of the samples used to estimate the metrics.

References

[1] K. Fukunaga, Introduction to Statistical Pattern Recognition, Second Ed. (Academic Press, San Diego, CA, 1990).
[2] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis (Wiley, New York, 1973).
[3] P. R. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach (Prentice-Hall, Englewood Cliffs, New Jersey, 1982).
[4] A. K. Agrawala (ed.), Machine Recognition of Patterns (IEEE Press, New York, 1977).
[5] P. R. Krishnaiah and L. N. Kanal (eds.), Handbook of Statistics 2: Classification, Pattern Recognition and Reduction of Dimensionality (North-Holland Publ., Amsterdam, 1982).
[6] T. Y. Young and K. S. Fu (eds.), Handbook of Pattern Recognition and Image Processing (Academic Press, San Diego, 1986).
[7] B. V. Dasarathy (ed.), Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques (IEEE Computer Society Press, Los Alamitos, CA, 1991).
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 61-103. Eds. C. H. Chen, L. F. Pau and P. S. P. Wang. © 1998 World Scientific Publishing Company
CHAPTER 1.3  SYNTACTIC PATTERN RECOGNITION
KOU-YUAN HUANG
Department of Computer and Information Science, National Chiao Tung University, Hsinchu, Taiwan
Sabbatical Leave at Rice University and University of Houston

In pattern recognition problems, the structural information that describes the pattern is important, so we can use syntactic methods to recognize the pattern. The pattern can be described as a representation, i.e., a string, a tree, a graph, an array, a matrix, or an attributed representation. We can parse the representation and assign the pattern to its correct class. In this chapter, we present the algorithms of finite-state error-correcting parsing for the recognition of 1-D strings; the modified error-correcting Earley's parsing and the parsing using a match primitive measure for the recognition of 1-D attributed strings; and the weighted minimum-distance structure preserved error-correcting tree automaton and the modified maximum-likelihood structure preserved error-correcting tree automaton for syntactic parsing of 2-D patterns.
Keywords: Syntactic pattern recognition, pattern representation, string distance, attributed string, finite-state automaton, tree automaton, error-correcting parsing, minimum distance, maximum likelihood.
1. Introduction
Syntactic pattern recognition has been developed over two decades, has received much attention, and has been applied widely to many practical pattern recognition problems, such as (1) English and Chinese character recognition, (2) fingerprint recognition, (3) speech recognition, (4) remote sensing data analysis, (5) biomedical data analysis in chromosome images, carotid pulse waves, EEG signals, etc., (6) scene analysis, (7) texture analysis, (8) 3-D object recognition, (9) two-dimensional mathematical symbols, (10) spark chamber pictures, (11) chemical structures, and (12) geophysical seismic signal analysis [3,4,8-11,13-17,21,26,28,30,35-37,39-46,51,52,57-60,64-67,70,74,76,77,85,86,89]. In pattern recognition problems, besides the statistical approach, the structural information that describes the pattern is important, so we can use syntactic methods to recognize the pattern. A pattern can be decomposed into simpler subpatterns, and each simpler subpattern can be decomposed again into even simpler subpatterns, and so on. The simplest subpatterns are called primitives (symbols, terminals). A pattern can be described as a representation, i.e. a string of primitives, a tree, a graph, an array, a matrix, or an attributed string, etc. [23,32,
44-46,48,69,84]. We can parse the representation and assign the pattern to its correct class. A basic block diagram of the syntactic pattern recognition system is shown in Fig. 1 [30]. The system consists of two major parts: training and recognition. The training part consists of primitive (and relation) selection, grammatical inference, and automata construction from the training patterns, while the recognition part consists of preprocessing, segmentation or decomposition, primitive (and relation) recognition, construction of the pattern representation, and syntactic parsing analysis for the input testing pattern.
Fig. 1. Block diagram of a syntactic pattern recognition system.
The finite-state grammar, context-free grammar and context-sensitive grammar of the formal language are adopted in the description of the 1-D string representation of the pattern [2,30]. The 1-D string grammars also include programmed grammar, indexed grammar, grammar of picture description language, transition network grammar, operator precedence grammar, pivot grammar, plex grammar, attributed grammar, etc. [11,25,26,28,30,37,40,68,70,73,75,78]. The syntactic parsing analyses include finite-state automata, pushdown automata, top-down parsing, bottom-up parsing, Cocke-Younger-Kasami parsing, Earley's parsing, etc. [2,30]. The description power can be extended from 1-D string grammars to high-dimensional pattern grammars for the analysis of 2-D and 3-D patterns. The high-dimensional pattern grammars include tree grammar, array grammar, web grammar, graph grammar, shape grammar, matrix grammar, etc. [12,23,30,32,46,48,66,69,72,84]. The syntactic parsing analyses include tree automata, array automata, etc. For consideration of substitution, insertion, and deletion errors in the pattern, the automata can be expanded to error-correcting automata to accept noisy or distorted patterns [1,44,45,63,78,79,82,88]. The 1-D string grammars and high-dimensional pattern grammars also include stochastic grammars, languages, and the corresponding parsers [20,29,33,65,78,81]. In this chapter, we present some fundamental algorithms for parsing 1-D strings, 1-D attributed strings, and 2-D patterns.
2. Grammar and String Distance Computation
A grammar is a 4-tuple G = (V_N, V_T, P, S), where V_N is the nonterminal set, V_T is the terminal set, P is a finite set of production rules, and S ∈ V_N is the starting symbol. From the starting symbol S, a string (sentence) is derived by using the production rules of P: S ⇒ α0 ⇒ α1 ⇒ ... ⇒ αm = α. The language generated by G is defined as L(G) = {α | α is in V_T* and S ⇒* α}, where ⇒* represents several derivation steps using production rules in G [2,30]. Depending on the form of the production rules, the grammars can be divided into finite-state grammar, context-free grammar, context-sensitive grammar, and unrestricted grammar. We can use a string to represent a pattern. Due to noise and distortion in the practical applications of syntactic pattern recognition, misrecognition of primitives is regarded as substitution errors, and segmentation errors are regarded as deletion and insertion errors [30]. For a given input string y and a given grammar G, we can find the minimum distance between y and x using a parsing technique, where string x is in L(G). The parsing technique is called a minimum-distance error-correcting parser (MDECP) [1,30]. We use the MDECP in the finite-state parsing, the attributed string parsing and the tree automaton.
Definition 2.1. Error transformations [30]. For two strings x, y ∈ V_T*, we can define three error transformations T : V_T* → V_T* such that y ∈ T(x), where w1, w2 ∈ V_T*:
(1) Substitution error transformation: w1 a w2 → w1 b w2, for all a, b ∈ V_T, a ≠ b.
(2) Deletion error transformation: w1 a w2 → w1 w2, for all a ∈ V_T.
(3) Insertion error transformation: w1 w2 → w1 b w2, for all b ∈ V_T.
Definition 2.2. Distance between two strings [1,30,54,83]. The distance between two strings x, y ∈ V_T*, d_L(x, y), is defined as the smallest number of error transformations required to derive y from x.

Example 2.1. Given a sentence x = abcd and a sentence y = accbd, then x = abcd → accd → accbd = y (a substitution followed by an insertion), so d_L(x, y) = 2.
Definition 2.3. Weighted error transformations [30]. For two strings x, y ∈ V_T*, we can define three weighted error transformations T : V_T* → V_T* such that y ∈ T(x), where w1, w2 ∈ V_T*:
(1) Weighted substitution error transformation: w1 a w2 → w1 b w2 with cost S(a, b), for a, b ∈ V_T, a ≠ b, where S(a, b) is the cost of substituting a by b. Let S(a, a) = 0.
(2) Weighted deletion error transformation: w1 a w2 → w1 w2 with cost D(a), for a ∈ V_T, where D(a) is the cost of deleting a from w1 a w2.
(3) Weighted insertion error transformation: w1 w2 → w1 b w2 with cost I(b), for b ∈ V_T, where I(b) is the cost of inserting b.
Definition 2.4. Weighted distance between two strings. The weighted distance between two strings x, y ∈ V_T*, d_W(x, y), is defined as the smallest cost of weighted error transformations required to derive y from x.

Algorithm 2.1: Weighted distance between two strings [83]
Input: Two strings x = a1 a2 ... an and y = b1 b2 ... bm, substitution error cost S(a, b) with S(a, a) = 0, deletion error cost D(a), and insertion error cost I(a), a, b ∈ V_T.
Output: d(x, y).
Method:
Step 1. D(0, 0) = 0.
Step 2. Do i = 1, n: D(i, 0) = D(i-1, 0) + D(a_i). Do j = 1, m: D(0, j) = D(0, j-1) + I(b_j).
Step 3. Do i = 1, n; do j = 1, m:
    e1 = D(i-1, j-1) + S(a_i, b_j)
    e2 = D(i-1, j) + D(a_i)
    e3 = D(i, j-1) + I(b_j)
    D(i, j) = min(e1, e2, e3)
Step 4. d(x, y) = D(n, m). Exit.
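As an illustration, Algorithm 2.1 can be coded directly as the following minimal Python sketch (the function and parameter names are mine, not from the original text); with unit costs it reduces to the ordinary Levenshtein distance and reproduces d(abcd, accbd) = 2 from Example 2.1.

```python
def weighted_distance(x, y, S, D, I):
    """Weighted distance d(x, y) of Algorithm 2.1 via dynamic programming.

    x, y: sequences of terminal symbols.
    S(a, b): substitution cost with S(a, a) = 0; D(a): deletion cost;
    I(b): insertion cost."""
    n, m = len(x), len(y)
    # d[i][j] = weighted distance between x[:i] and y[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + D(x[i - 1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + I(y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e1 = d[i - 1][j - 1] + S(x[i - 1], y[j - 1])  # substitution (or match)
            e2 = d[i - 1][j] + D(x[i - 1])                # deletion of x[i]
            e3 = d[i][j - 1] + I(y[j - 1])                # insertion of y[j]
            d[i][j] = min(e1, e2, e3)
    return d[n][m]

# Example 2.1 with unit costs: prints 2
print(weighted_distance("abcd", "accbd",
                        S=lambda a, b: 0 if a == b else 1,
                        D=lambda a: 1,
                        I=lambda b: 1))
```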
We may also consider context-deletion and context-insertion errors; then a deletion cost Del(a, b), for deleting a in front of b or after b, and an insertion cost I(a, b), for inserting b in front of a or after a, must be included. The relation between error probability and string distance has previously been presented in the detection of wavelets [41,42].

3. Finite-State Grammar and Error-Correcting Finite-State Automata
In a 1-D training wavelet (signal), pattern representation performs pattern segmentation and primitive recognition to convert a wavelet into a string of primitives (symbols, terminals). Grammatical inference infers a finite-state grammar from a set of training strings. The finite-state grammar is expanded to contain three types of error symbol transformations: deletion, insertion, and substitution errors.
Fig. 2. A classification system of seismic wavelets using error-correcting finite-state parsing.
The automaton can be constructed from the error-correcting finite-state grammar. Then, the minimum-distance error-correcting finite-state automaton can perform syntactic parsing and classification of input wavelets. The system in Fig. 2 is an example that can be used for the recognition of 1-D seismic Ricker wavelets [45]. We describe the techniques in the following sections.

3.1. Grammatical Inference
3.1.1. Inference of finite-state grammar
After the primitive recognition, each training wavelet can be represented by a string of primitives (symbols, terminals). A basic canonical definite finite-state grammar can be inferred from a set of training strings [30]. The canonical definite finite-state grammar G_c associated with the positive sample set S+ = {x1, x2, ..., xn} is defined as follows:
G_c = (V_N, V_T, P, S), where S is the starting symbol, and V_N, V_T, and P are generated using the following steps.
(1) Check each string x_i ∈ S+ and identify all of the distinct terminal symbols used in the strings of S+. Call this set of terminal symbols V_T.
(2) For each string x_i ∈ S+, x_i = a_i1 a_i2 ... a_in_i, generate the corresponding production rules:
S → a_i1 Z_i,1,  Z_i,1 → a_i2 Z_i,2,  ...,  Z_i,n_i-1 → a_in_i.
Each Z_i,j represents a new nonterminal symbol.
66
K.- Y. Huang
(3) The nonterminal symbol set VN consists of S and all the distinct nonterminal symbols Zi,j produced in Step (2). The set P consists of all the distinct production rules generated in Step (2).
Example 3.1. Given a training string abbc, the inferred canonical definite finite-state grammar G_c(V_N, V_T, P, S) is as follows:
V_N = {S, A, B, C}, V_T = {a, b, c}, S = {S}.
The production rule set P:
S → aA,  A → bB,  B → bC,  C → c.
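The inference of the canonical definite finite-state grammar is easy to mechanize. The following Python sketch is only an illustration; the helper name and the Z_i,j-style nonterminal labels are my own (Example 3.1 simply renames Z_1,1, Z_1,2, Z_1,3 to A, B, C).

```python
def infer_canonical_grammar(samples):
    """Infer the canonical definite finite-state grammar G_c of Sec. 3.1.1.

    samples: list of training strings (sequences of terminal symbols).
    Returns (V_N, V_T, P, 'S'), where a production (X, a, Y) stands for
    X -> aY and (X, a, None) stands for X -> a."""
    V_T = {a for x in samples for a in x}
    V_N, P = {'S'}, []
    for i, x in enumerate(samples, start=1):
        prev = 'S'
        for j, a in enumerate(x):
            if j == len(x) - 1:
                P.append((prev, a, None))          # Z_{i,n_i-1} -> a_{i,n_i}
            else:
                nt = f'Z{i},{j + 1}'               # new nonterminal Z_{i,j+1}
                V_N.add(nt)
                P.append((prev, a, nt))            # prev -> a Z_{i,j+1}
                prev = nt
    P = list(dict.fromkeys(P))                     # keep only distinct rules
    return V_N, V_T, P, 'S'

# Example 3.1: the single training string abbc
print(infer_canonical_grammar(["abbc"])[2])
# [('S', 'a', 'Z1,1'), ('Z1,1', 'b', 'Z1,2'), ('Z1,2', 'b', 'Z1,3'), ('Z1,3', 'c', None)]
```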
3.1.2. General expanded finite-state grammar
Due to noise and distortion problems, three types of error symbol transformations may occur in the strings. After the inference of the finite-state grammar, the grammar is expanded to include the three types of error production rules, and the resulting grammar is called the general expanded finite-state grammar. The following are the construction steps.
(1) The original production forms of a finite-state grammar are A → aB or A → a. Change A → a to A → aF, where F is the new nonterminal with ending terminal a.
(2) The production forms added to account for substitution errors are A → bB, where b ≠ a.
(3) The production forms added to account for insertion errors are A → aA.
(4) The production forms added to account for deletion errors are A → λB, where λ is the empty terminal.
We can assign weights to the production rules, in particular different weights for the three types of error production rules. The following is the algorithm.
Algorithm 3.1: Construction of the general expanded finite-state grammar.
Input: A finite-state grammar G = (V_N, V_T, P, S).
Output: The general expanded finite-state grammar G' = (V_N', V_T', P', S'), where P' is a set of weighted production rules.
Method:
(1) For each production rule in P with the form A → a in the grammar G, change the rule to the form A → aF, where F is a new nonterminal.
(2) Set V_T' = V_T ∪ {λ}, V_N' = V_N ∪ {F}, S' = S.
(3) Let the production rule set P' = P with a weight of zero for each original production rule.
(4) For each nonterminal A in V_N', add the production A → λA (with weight 0) to P'.
(5) Substitution error production: For each production A → aB in P do: for each terminal b in V_T do: if A → bB is not in P', then add the production A → bB (with weight 1) to P'.
(6) Insertion error production: For each nonterminal A in V_N' do: for each terminal a in V_T do: if A → aA is not in P', then add the production A → aA (with weight 1) to P'.
(7) Deletion error production: For each production A → aB in P (A ≠ B) do: if A → λB is not in P', then add the production A → λB (with weight 1) to P'.
(8) Add the production F → λ (with weight 0) to P'.
(9) Output G'.
Equal unit weight is assigned to each error production rule in the algorithm. Different weights may be assigned in Steps (5), (6) and (7).
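For concreteness, a minimal Python sketch of Algorithm 3.1 follows. The data representation is an assumption of this sketch: a production A → aB is the triple (A, a, B), A → a is (A, a, None), and the empty string '' plays the role of λ.

```python
def expand_grammar(V_N, V_T, P, S):
    """General expanded finite-state grammar of Algorithm 3.1 (a sketch).

    P: productions as triples (A, a, B) for A -> aB, or (A, a, None) for A -> a.
    Returns weighted productions (A, a, B, weight); '' plays the role of lambda."""
    LAM, F = '', 'F'
    base = [(A, a, (B if B is not None else F)) for (A, a, B) in P]  # step (1)
    V_Np, V_Tp = set(V_N) | {F}, set(V_T) | {LAM}                    # step (2)
    Pp = [(A, a, B, 0) for (A, a, B) in base]                        # step (3)

    def add(rule):
        # add only if the (lhs, symbol, rhs) part is not already present
        if rule[:3] not in {(A, a, B) for (A, a, B, _) in Pp}:
            Pp.append(rule)

    for A in V_Np:                                  # step (4): lambda self-loops
        add((A, LAM, A, 0))
    for (A, a, B) in base:                          # step (5): substitution errors
        for b in V_T:
            if b != a:
                add((A, b, B, 1))
    for A in V_Np:                                  # step (6): insertion errors
        for a in V_T:
            add((A, a, A, 1))
    for (A, a, B) in base:                          # step (7): deletion errors
        if A != B:
            add((A, LAM, B, 1))
    add((F, LAM, None, 0))                          # step (8): F -> lambda
    return V_Np, V_Tp, Pp, S

# Example 3.1's grammar expands to the 37 weighted rules listed in Example 3.2
_, _, rules, _ = expand_grammar({'S', 'A', 'B', 'C'}, {'a', 'b', 'c'},
                                [('S', 'a', 'A'), ('A', 'b', 'B'),
                                 ('B', 'b', 'C'), ('C', 'c', None)], 'S')
print(len(rules))  # 37
```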
Example 3.2. From the above Example 3.1, given the training string abbc, the inferred general expanded finite-state grammar G'(V_N', V_T', P', S') using Algorithm 3.1 is as follows:
V_N' = {S, A, B, C, F}, V_T' = {a, b, c, λ}, S' = {S}.
The production rule set P' is:
(0) S → aA, 0    (1) A → bB, 0    (2) B → bC, 0    (3) C → cF, 0
(4) S → λS, 0    (5) A → λA, 0    (6) B → λB, 0    (7) C → λC, 0    (8) F → λF, 0
(9) S → bA, 1    (10) S → cA, 1   (11) A → aB, 1   (12) A → cB, 1
(13) B → aC, 1   (14) B → cC, 1   (15) C → aF, 1   (16) C → bF, 1
(17) S → aS, 1   (18) S → bS, 1   (19) S → cS, 1
(20) A → aA, 1   (21) A → bA, 1   (22) A → cA, 1
(23) B → aB, 1   (24) B → bB, 1   (25) B → cB, 1
(26) C → aC, 1   (27) C → bC, 1   (28) C → cC, 1
(29) F → aF, 1   (30) F → bF, 1   (31) F → cF, 1
(32) S → λA, 1   (33) A → λB, 1   (34) B → λC, 1   (35) C → λF, 1
(36) F → λ, 0
Production rules (9) to (16) handle the substitution errors, rules (17) to (31) handle the insertion errors, and rules (32) to (35) handle the deletion errors. The corresponding error-correcting finite-state automaton is shown in Fig. 3.
Fig. 3. Finite-state transition diagram of the general expanded grammar of Example 3.2.
3.1.3. Restricted expanded finite-state grammar
For an insertion error, we can insert an error terminal symbol before or after some arbitrary terminal symbol. We can then expand the production rule A → aB with restricted insertion errors as follows:
A → bB1, B1 → aB   (insert b in front of a), or
A → aB2, B2 → bB   (insert b after a).
The proposed algorithm is described in the following.
Algorithm 3.2: Construction of the restricted expanded finite-state grammar.
Input: A finite-state grammar G = (V_N, V_T, P, S).
Output: The restricted expanded finite-state grammar G' = (V_N', V_T', P', S'), where P' is a set of weighted production rules.
Method:
(1) For each production rule in P with the form A → a in the grammar G, change the rule to the form A → aF, where F is a new nonterminal.
(2) Let P' = P with the weight 0 for each original production rule.
(3) Substitution error production: For each production A → aB in P do: for each terminal b in V_T do: if A → bB is not in P', then add the production A → bB (with weight 1) to P'.
(4) Insertion error production: For each production A → aB in P do:
    {Insert b in front of a} For each terminal b in V_T do: add the production A → bB1 (with weight 1) to P' and add the production B1 → aB (with weight 0) to P'; and
    {Insert b after a} For each terminal b in V_T do: add the production A → aB2 (with weight 0) to P' and add the production B2 → bB (with weight 1) to P'.
(5) Deletion error production: For each production A → aB in P (A ≠ B) do: if A → λB is not in P', then add the production A → λB (with weight 1) to P'.
(6) Set S' = S, V_T' = V_T ∪ {λ}, V_N' = all the distinct nonterminal symbols in P'.
(7) For each nonterminal A in V_N' do: add the production A → λA (with weight 0) to P'.
(8) Add the production F → λ (with weight 0) to P'.
(9) Output G'.
Example 3.3. From Example 3.1, given the training string abbc, the inferred restricted expanded finite-state grammar G''(V_N'', V_T'', P'', S'') using Algorithm 3.2 is as follows:
V_N'' = {S, A, B, C, D, E, F, G, H, I, J, K, L}, V_T'' = {a, b, c, λ}, S'' = {S}.
The production rule set P'' is:
(0) S → aA, 0     (1) A → bB, 0     (2) B → bC, 0     (3) C → cF, 0
(4) S → bA, 1     (5) S → cA, 1     (6) A → aB, 1     (7) A → cB, 1
(8) B → aC, 1     (9) B → cC, 1     (10) C → aF, 1    (11) C → bF, 1
(12) S → λA, 1    (13) A → λB, 1    (14) B → λC, 1    (15) C → λF, 1
(16) S → aD, 1    (17) S → bD, 1    (18) S → cD, 1    (19) D → aA, 0
(20) S → aE, 0    (21) E → aA, 1    (22) E → bA, 1    (23) E → cA, 1
(24) A → aG, 1    (25) A → bG, 1    (26) A → cG, 1    (27) G → bB, 0
(28) A → bH, 0    (29) H → aB, 1    (30) H → bB, 1    (31) H → cB, 1
(32) B → aI, 1    (33) B → bI, 1    (34) B → cI, 1    (35) I → bC, 0
(36) B → bJ, 0    (37) J → aC, 1    (38) J → bC, 1    (39) J → cC, 1
(40) C → aK, 1    (41) C → bK, 1    (42) C → cK, 1    (43) K → cF, 0
(44) C → cL, 0    (45) L → aF, 1    (46) L → bF, 1    (47) L → cF, 1
(48) S → λS, 0    (49) A → λA, 0    (50) B → λB, 0    (51) C → λC, 0
(52) D → λD, 0    (53) E → λE, 0    (54) F → λF, 0    (55) G → λG, 0
(56) H → λH, 0    (57) I → λI, 0    (58) J → λJ, 0    (59) K → λK, 0
(60) L → λL, 0    (61) F → λ, 0
The transition diagram of the error-correcting finite-state automaton is similar to Fig. 3.
3.2. Minimum-Distance Error-Correcting Finite-State Parsing
Input testing pattern strings can be analyzed by a finite-state automaton that accepts the strings derived by a finite-state grammar. Given a finite-state grammar G, there exists a finite-state automaton M such that the set of strings accepted by M is equal to the set of strings L(G) derived by the finite-state grammar G. The automaton M can be represented by a finite-state transition diagram [30]. The production rule A → aB in the finite-state grammar corresponds to the transition δ(A, a) = B in the automaton. An input string can go from the initial state to the final state of the automaton if the string is accepted by the automaton. Here, each transition of the error-correcting finite-state automaton has two attributes, an input terminal symbol and a weight value. For the production rule A → aB with weight w, we use C_AB(a) = w as the cost representation in the transition of the automaton. We want to parse the input string from the initial state to the final state with minimum cost. The following algorithm is proposed to compute the minimum cost (distance) by using the dynamic programming technique [30].
Algorithm 3.3: Minimum cost of error-correcting finite-state parsing with two attributes in each transition.
Input: An error-correcting finite-state automaton with n nodes numbered 1, 2, ..., n, where node 1 is the initial state and node n is the final state; two attributes, a terminal symbol and its cost function C_ij(a), for 1 ≤ i, j ≤ n, a ∈ (V_T ∪ {λ}), with C_ij(a) ≥ 0 for all i and j; and an input testing string S.
Output: M_1n, the minimum cost of the path from node 1 to node n when the parsing sequence is equal to the terminal sequence of the input string S.
Method:
(1) M_11 = 0; M_1j = maxint (a large number) for 1 < j ≤ n.
(2) For 1 ≤ k ≤ n do: for 1 ≤ j ≤ n do: M_1j = min{M_1k + C_kj(λ), 1 ≤ k ≤ n}.
(3) Set h = 1.
(4) Let M'_1k = M_1k, 1 ≤ k ≤ n. For all 1 ≤ j ≤ n do: M_1j = min{M'_1k + C_kj(b), 1 ≤ k ≤ n}, where b = S(h), the hth terminal symbol of S.
(5) For 1 ≤ k ≤ n do: for 1 ≤ j ≤ n do: M_1j = min{M_1k + C_kj(λ), 1 ≤ k ≤ n}.
(6) If h < |S|, then increase h by 1 and go to Step (4). If h = |S|, then go to Step (7). (|S| is the number of terminal symbols in S.)
(7) Output M_1n, which is the minimum cost from node 1 to node n following the terminal sequence of the input string S. Then stop.
This algorithm is a minimum-cost error-correcting finite-state parsing when insertion, deletion, and substitution error transformations are considered. The cost function C_ij(a) denotes the cost of moving from state i to state j when the input symbol is 'a'. It is noted that C_ij(a) = maxint, a large number, if there is no direct path from state i to state j under the symbol 'a'. The value of C_ij(a) comes from the production rules of the expanded grammar. M_1j is the minimum cost from state 1 to state j. Steps (2) and (5) handle the λ transitions (i.e. deletion errors). Step (4) handles the substitution and insertion errors.
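A compact Python sketch of Algorithm 3.3 is given below. It assumes the automaton is supplied as a cost function cost(i, j, a) built from the weighted productions of the expanded grammar, with state 0 as the initial state and the last state as the final state (the algorithm above numbers states 1..n); these names and this interface are illustrative only.

```python
INF = float('inf')

def min_parse_cost(n_states, cost, s):
    """Minimum cost of error-correcting finite-state parsing (Algorithm 3.3).

    n_states: number of automaton states; state 0 is initial, n_states-1 final.
    cost(i, j, a): transition cost C_ij(a), INF if no transition; a == '' is lambda.
    s: the input terminal string."""
    def lambda_closure(M):
        # steps (2) and (5): relax along lambda transitions (deletion errors)
        for _ in range(n_states):
            M = [min(M[j], min(M[k] + cost(k, j, '') for k in range(n_states)))
                 for j in range(n_states)]
        return M

    M = [0.0] + [INF] * (n_states - 1)          # step (1)
    M = lambda_closure(M)                        # step (2)
    for b in s:                                  # steps (3), (4), (6)
        prev = M
        M = [min(prev[k] + cost(k, j, b) for k in range(n_states))
             for j in range(n_states)]           # step (4): consume symbol b
        M = lambda_closure(M)                     # step (5)
    return M[n_states - 1]                        # step (7)
```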
Example 3.4. From the above Examples 3.1, 3.2, and 3.3, given the training string abbc and its inferred error-correcting finite-state automata from the above two expanded grammars, an input testing string aabc is parsed by using Algorithm 3.3, and the parsing results from both automata with minimum cost are shown in Figs. 4 and 5. The minimum parsing cost is 1 for both automata.
Fig. 4. Parsing result of test string aabc using Algorithm 3.3 according to the state transition of general expanded grammar of Example 3.2 in Fig. 3. Minimum cost is 1.
Fig. 5. Parsing result of test string aabc using Algorithm 3.3 according to the state transition of restricted expanded grammar of Example 3.3. Minimum cost is 1.
4. Attributed Grammar and Minimum-Distance Error-Correcting Earley's Parsing
4.1. Introduction
In the syntactic approach, after segmentation and primitive assignment of the digital signal, the same primitive may be repeated several times. This often makes the size of the pattern strings and inferred grammars unnecessarily large. Instead of keeping track of all these identical primitives, we can use one syntactic symbol to represent the type of primitive with an attribute to indicate the length of that primitive. This leads to the application of the length attribute to seismic and other similar digital signal analyses [44,59]. Attributed grammars have been applied in pattern recognition. You and Fu [87] implemented attributed grammars in shape recognition. Shi and Fu [76] proposed the use of attributed graph grammars. A grammar whose productions are associated with a set of semantic rules is called an attributed grammar [30]. Both the inherited and the synthesized attributes often lead to significant simplification of grammars [30]. The advantages of using attributed grammars for pattern recognition are two-fold. First, the inclusion of semantic information increases flexibility in pattern description. Second, it reduces the syntactic complexity of the pattern grammar. The resulting grammar size for each pattern class is reduced. Similarity and dissimilarity measures between two strings have been discussed in many articles [49,50,54,83]. Here, a similarity measure between two attributed strings is proposed and is called the match primitive measure (MPM). Two parsing methods for attributed strings are proposed. One is the modified minimum-distance error-correcting Earley's parsing and the other is the parsing algorithm using the match primitive measure (MPM). The system of Fig. 6 shows an example of the two parsing methods that can be used for the recognition of seismic wavelets [44].
Fig. 6. A classification system for seismic wavelets: attributed string representations of the located wavelets are parsed by error-correcting Earley's parsing or by parsing with the match primitive measure, using an attributed grammar inferred from the training wavelets, to produce the classification results.
4.2. Attributed Primitives and Strings
In this chapter, each primitive is accompanied by a length attribute. That is, each pattern primitive a can be represented by a two-tuple a = (s, y), where s is a syntactic symbol denoting the basic segment, and y represents the length of a. For example, a pattern string is
aaadgggggeeaaagg.
It can be simplified by merging the identical symbols. Thus, the above string becomes
(a,3)(d,1)(g,5)(e,2)(a,3)(g,2)
as the attributed string, where each number represents the number of duplications of that symbol. This idea leads to some storage improvement in string representation.
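The merging step itself is a one-liner in Python (an illustrative sketch; the function name is mine):

```python
from itertools import groupby

def to_attributed_string(primitives):
    """Merge runs of identical primitives into (symbol, length) pairs (Sec. 4.2)."""
    return [(s, len(list(run))) for s, run in groupby(primitives)]

print(to_attributed_string("aaadgggggeeaaagg"))
# [('a', 3), ('d', 1), ('g', 5), ('e', 2), ('a', 3), ('g', 2)]
```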
4.3. Definition of Error Transformations for Attributed Strings
In minimum-distance error-correcting parsing for context-free languages, three different types of error transformations (insertion, deletion, and substitution) have been defined [1,54,83]. In order to handle these three error transformations in the parsing of attributed strings, error transformations for attributed strings have been defined. Errors can be classified as global and local deformations. For global deformations, errors consist of insertion and deletion errors, and each of them can deform syntactically and semantically.
(1) A syntactic insertion error is the replacement of a null string λ (the length of λ is zero) by a syntactic symbol, i.e. (λ, 0) → (s, y).
(2) A semantic insertion error is the addition of a length attribute, i.e. (s, y1) → (s, y2), where y1 < y2.
When a syntactic insertion error has occurred, the associated semantic length is also added. However, a semantic insertion error can occur without any syntactic insertion error.
(3) A syntactic deletion error is the replacement of a syntactic symbol by the null string λ, i.e. (s, y) → (λ, 0).
(4) A semantic deletion error is the removal of an attribute from a semantic length, i.e. (s, y1) → (s, y2), where y1 > y2. When a syntactic deletion error has taken place, the corresponding semantic length is also deleted. A semantic deletion error can occur without a syntactic deletion error.
For local deformations, a substitution error can take place.
(5) A syntactic substitution error is defined as the replacement of primitive s by another primitive t, i.e. (s, y) → (t, y).
A semantic substitution error is not defined, because it may be counted as a semantic insertion or deletion error.

4.4. Inference of Attributed Grammar
An attributed context-free grammar is a 4-tuple G = (Vn, Vt, P, S), where Vn is the nonterminal set, Vt is the terminal set, P is a finite set of production rules, and S ∈ Vn is the starting symbol. In P, each production rule contains two parts: one is a syntactic rule, and the other is a semantic rule. Each symbol X ∈ (Vn ∪ Vt) is associated with a finite set of attributes A(X); A(X) is partitioned into two disjoint sets, the synthesized attribute set A0(X) and the inherited attribute set A1(X). The syntactic rule has the following form:
X_k,0 → X_k,1 X_k,2 ... X_k,nk,
where k means the kth production rule. The semantic rule maps values to the attributes of X_k,0, X_k,1, X_k,2, ..., X_k,nk. The evaluation of synthesized attributes is based on the attributes of the descendants; therefore it proceeds in a bottom-up fashion. On the other hand, the evaluation of inherited attributes is based on the attributes of the ancestors; therefore it proceeds in a top-down fashion. To explain the inference procedure, let us consider the example of the previous string
aaadgggggeeaaagg,
where each primitive has a unit length attribute 1. First it will be converted into the following string by merging identical primitives:
(a,3)(d,1)(g,5)(e,2)(a,3)(g,2).
Let the upper case characters be nonterminals and the lower case characters be terminals. Then, we can infer the following attributed grammar.
Syntactic rules      Semantic rules
(1) S → ADGEAG       L(A1) = 3, L(D) = 1, L(G1) = 5, L(E) = 2, L(A2) = 3, L(G2) = 2
(2) A → aA           y(A1) = y(a) + y(A2)
(3) A → a            y(A) = y(a)
(4) D → dD           y(D1) = y(d) + y(D2)
(5) D → d            y(D) = y(d)
(6) E → eE           y(E1) = y(e) + y(E2)
(7) E → e            y(E) = y(e)
(8) G → gG           y(G1) = y(g) + y(G2)
(9) G → g            y(G) = y(g)
where L denotes the inherited length attribute, and y denotes the synthesized length attribute. The number right after the nonterminal symbol in the semantic rules is used to distinguish between occurrences of the same nonterminal. For example, in production rule (2), A1 represents the nonterminal A on the left side and A2 represents the nonterminal A on the right side of the syntactic part. It is noted that the inherited length attribute L is not passed down to the descendants as it usually is; rather, it is used to maintain the semantic information of the training string and as a reference for comparison in parsing. For simplicity, let y(a) = 1 for all a ∈ Vt. Consider the second input string
aakkdddffeeeea. We convert it into
(a,2)(k,2)(d,3)(f,2)(e,4)(a,1)
and add the following productions to the inference grammar.
Syntactic rules      Semantic rules
S → AKDFEA           L(A1) = 2, L(K) = 2, L(D) = 3, L(F) = 2, L(E) = 4, L(A2) = 1
K → kK               y(K1) = y(k) + y(K2)
K → k                y(K) = y(k)
F → fF               y(F1) = y(f) + y(F2)
F → f                y(F) = y(f)
For the new input string, there is no need to add production rules such as A → aA, A → a, etc. One production rule is created for each input string, i.e. the first production rule in the above example. In fact, there are (2m + n) production rules for a set of n training strings, where m is the number of nonterminal symbols. We now formulate the inference algorithm of the attributed grammar which uses the length attribute [59].
Algorithm 4.1: Inference of an attributed context-free grammar. Input: A set of training strings. Output: An inferred attributed context-free grammar. Method:
(1) Convert each input string to an attributed string by merging identical primitives.
(2) For each input attributed string a1 a2 a3 ... ak, add the production S → A1 A2 A3 ... Ak to the inference grammar, where Ai is the nonterminal corresponding to terminal ai, with the semantic rule L(Ai) = yi, 1 ≤ i ≤ k, where yi is the length attribute of primitive ai.
(3) For each primitive a, add the production rules A → aA, y(A1) = y(a) + y(A2), and A → a, y(A) = y(a), to the inference grammar, if they are new production rules.
This inferred grammar will generate excessive strings if we only apply the syntactic rules. However, we can use the semantic rules (inherited attributes) to restrict the grammar so that no excessive strings are generated.
4.5. Minimum-Distance Error-Correcting Earley's Parsing for Attributed Strings
A modified Earley's parsing algorithm is developed here for attributed context-free languages. Here, errors of insertion, deletion, and substitution transformations are all considered in the derivation of Earley's item lists. Let the attributed grammar G = (Vn, Vt, P, S) be a CFG (context-free grammar), and let z = b1 b2 ... bn be an input string in Vt*. The form [A → α · β, x, y, i] is called an item for z if A → αβ is a production in P and 0 ≤ i ≤ n [2,22]. The dot in α · β between α and β (we use α . β in the program output) is a metasymbol not in Vn or Vt which represents the parsing position; x is a counter for local syntactic deformation which accumulates the total cost of substitutions of terminal symbols. When A = S, y is used as a counter for global deformation which records the total cost of insertion and deletion errors. On the other hand, if A ≠ S, then y is used as the synthesized attribute of A. The index i is the starting parsing position of the string, and it is the same pointer as in the conventional item of Earley's parsing algorithm. The parsing algorithm for an input string z is shown in the following.

Algorithm 4.2: Minimum-distance error-correcting Earley's parsing for an attributed string.
Input: An attributed grammar G = (Vn, Vt, P, S) and a test string z = b1 b2 ... bn in Vt*.
Output: The parse lists I0, I1, ..., In, and the decision whether or not z is accepted by the grammar G, together with the syntactic and semantic deformation distances.
Method:
(1) Set j = 0 and add [S → · α, 0, 0, 0] to Ij if S → α is a production in P.
(2) Repeat Steps (3), (4) and (5) until no new items can be added to Ij.
(3) If [B → ξ ·, x1, y1, i] is in Ij, B ≠ S, and
    (a) if [A → α · Bβ, x2, y2, k] is in Ii and A ≠ S, then add item [A → αB · β, x1 + x2, y1 + y2, k] to Ij;
    (b) if [S → α · Bβ, x2, y2, k] is in Ii, then add item [S → αB · β, x1 + x2, y2 + |L(B) - y1|, k] to Ij;
    (c) if [S → α · Cβ, x2, y2, k] is in Ii, C ≠ B, then add item [S → α · Cβ, x2, y1 + y2, k] to Ij;
    (d) if [S → α ·, x2, y2, k] is in Ii, then add item [S → α ·, x2, y1 + y2, k] to Ij.
(4) If B → ξ is a production in P, and if [A → α · Bβ, x, y, i] is in Ij, then add item [B → · ξ, 0, 0, j] to Ij.
(5) If [S → α · Bβ, x, y, i] is in Ij, then add item [S → αB · β, x, y + L(B), i] to Ij.
(6) If j = n, go to Step (8); otherwise increase j to j + 1.
(7) For each item [A → α · aβ, x, y, i] in Ij-1, add item [A → αa · β, x + S(a, bj), y + y(a), i] to Ij, where y(a) is the synthesized attribute of a. For simplicity, let y(a) = 1 for all a in Vt. S(a, bj) is the substitution cost, and S(a, a) = 0. Go to Step (2).
(8) If item [S → α ·, x, y, 0] is in In, then string z is accepted by grammar G, where x is the local deformation distance and y is the global deformation distance; otherwise, string z is not accepted by grammar G. Exit.
In the above algorithm, Step (3b) handles the semantic insertion and deletion errors, Steps (3c) and (3d) handle the syntactic insertion errors, Step (5) handles the syntactic deletion errors, and Step (7) handles the syntactic substitution errors. It is possible for a collision to occur in the process of developing a new item, i.e. the old item is already in the list when a new item is to be put in the list. In this situation, the one with the minimum distance (minimum of x + y) is selected for that item. Actually, collisions may occur only with items related to S-productions, because insertion and deletion transformations are allowed for those items only.

Training string: abbc. The attributed grammar G(Vn, Vt, P, S) is as follows: Vn = {S, A, B, C}, Vt = {a, b, c}, S = {S}. The production set P lists the syntactic rules with their corresponding semantic rules.
Fig. 7. Training string abbc and its inferred attributed grammar for Earley’s parsing.
Fig. 8. Item lists of the Earley's attributed parsing on the test string aabc.
Since the error-correcting grammar is ambiguous, the time complexity is O(n^3) and the space complexity is O(n^2), where n is the length of the input string. The parsing is inefficient if the length of the input string is large.

4.6. Experiment
Given a training string abbc and using Algorithm 4.1, the inferred attributed grammar is shown in Fig. 7. An input test string aabc is parsed by Algorithm 4.2. The corresponding item lists are shown in Fig. 8. As we can see from the derived item lists, all three kinds of errors are considered. The corresponding items are generated for each possible error transformation. Because the item [S → ABC ·, 1, 0, 0] is in the I4 list, the string aabc is accepted with local syntactic deformation distance 1 and global deformation distance 0.

5. Parsing of Attributed Strings Using the Match Primitive Measure (MPM)
Although the modified Earley's parsing algorithm considers all three types of errors, the parsing speed is inefficient. Here, the parsing of an attributed string using the match primitive measure (MPM) is proposed. The similarity measure of attributed strings is discussed in the following.
Fig. 9. The partial MPM f[i, j] computed from f[i, j-1] and f[i-1, j].
5.1. Match Primitive Measure: Similarity Measure for Attributed String Matching
The match primitive measure (MPM) is defined as the maximum number of matched primitives between two strings. Here, the similarity measure between two attributed strings is proposed. The computation of the MPM between two length-attributed strings can be implemented by the dynamic programming technique on grid nodes as shown in Fig. 9. For each node, three attributes are associated, i.e.
(f, h, v). Let a be an attributed string, where a[i] denotes the ith primitive in a; a[i].s and a[i].y denote the syntactic symbol and length attribute of a[i], respectively. Let (i, j) indicate the position in the grid. f[i, j] represents the MPM value from point (0, 0) to (i, j), i.e. the MPM value between the two attributed substrings (a[1].s, a[1].y)(a[2].s, a[2].y)...(a[i].s, a[i].y) and (b[1].s, b[1].y)(b[2].s, b[2].y)...(b[j].s, b[j].y) of attributed strings a and b. h[i, j] and v[i, j] represent the residual length attributes of primitives a[i] and b[j], respectively, after the match primitive measure between these two attributed substrings. The partial MPM f[i, j] can be computed from the partial MPMs f[i-1, j] and f[i, j-1] as shown in Fig. 9. The following algorithm is proposed to compute the MPM between two attributed strings.

Algorithm 5.1: Computation of the match primitive measure (MPM) between two attributed strings.
Input: Two attributed strings a and b. Let a = (a[1].s, a[1].y)(a[2].s, a[2].y)...(a[m].s, a[m].y) and b = (b[1].s, b[1].y)(b[2].s, b[2].y)...(b[n].s, b[n].y), where m, n are the numbers of primitives of a and b, respectively.
Output: The maximum MPM S(a, b).
Method:
(1) f[0, 0] := 0; h[0, 0] := 0; v[0, 0] := 0;
(2) for i := 1 to m do begin f[i, 0] := 0; h[i, 0] := a[i].y; v[i, 0] := 0; end;
(3) for j := 1 to n do begin f[0, j] := 0; h[0, j] := 0; v[0, j] := b[j].y; end;
(4) for i := 1 to m do
      for j := 1 to n do begin
        nod1 := hmove(i, j);
        nod2 := vmove(i, j);
        if nod1.f > nod2.f then node[i, j] := nod1 else node[i, j] := nod2;
      end;
(5) output S(a, b) := f[m, n] / sqrt(y1 * y2), where y1 = Σ_i a[i].y and y2 = Σ_j b[j].y.

Functions hmove and vmove are written as follows:

function hmove(i, j): node_type;  {node (i-1, j) → node (i, j)}
begin
  if a[i].s <> b[j].s then d1 := 0
  else d1 := min(v[i-1, j], a[i].y);
  hmove.f := f[i-1, j] + d1;
  hmove.h := a[i].y - d1;
  hmove.v := v[i-1, j] - d1;
  return(hmove);
end;

function vmove(i, j): node_type;  {node (i, j-1) → node (i, j)}
begin
  if a[i].s <> b[j].s then d1 := 0
  else d1 := min(h[i, j-1], b[j].y);
  vmove.f := f[i, j-1] + d1;
  vmove.h := h[i, j-1] - d1;
  vmove.v := b[j].y - d1;
  return(vmove);
end;
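A Python rendering of Algorithm 5.1 may make the recurrence easier to follow. This is only a sketch, under the assumption that an attributed string is given as a list of (symbol, length) pairs; on Example 5.1 below (training abbc, test aabc) it returns the normalized value 0.75.

```python
from math import sqrt

def mpm(a, b):
    """Normalized match primitive measure S(a, b) of Algorithm 5.1.

    a, b: attributed strings as lists of (symbol, length) pairs."""
    m, n = len(a), len(b)
    # f: partial MPM; h, v: residual lengths of a[i] and b[j]
    f = [[0] * (n + 1) for _ in range(m + 1)]
    h = [[0] * (n + 1) for _ in range(m + 1)]
    v = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        h[i][0] = a[i - 1][1]
    for j in range(1, n + 1):
        v[0][j] = b[j - 1][1]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            si, yi = a[i - 1]
            sj, yj = b[j - 1]
            # hmove: come from node (i-1, j)
            d1 = min(v[i - 1][j], yi) if si == sj else 0
            hm = (f[i - 1][j] + d1, yi - d1, v[i - 1][j] - d1)
            # vmove: come from node (i, j-1)
            d2 = min(h[i][j - 1], yj) if si == sj else 0
            vm = (f[i][j - 1] + d2, h[i][j - 1] - d2, yj - d2)
            f[i][j], h[i][j], v[i][j] = hm if hm[0] > vm[0] else vm
    y1 = sum(y for _, y in a)
    y2 = sum(y for _, y in b)
    return f[m][n] / sqrt(y1 * y2)

# Example 5.1: training abbc = (a,1)(b,2)(c,1), test aabc = (a,2)(b,1)(c,1)
print(mpm([('a', 1), ('b', 2), ('c', 1)], [('a', 2), ('b', 1), ('c', 1)]))  # 0.75
```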
Algorithm 5.2: Inference of attributed grammar. Input: A set of training strings. Output: An attributed grammar. Method: (1) Convert each input string to the attributed string by merging identical primitives. (2) For each input attributed string a1a2a3 . . . ak, add to the grammar the production S -+AlA2As . . . Ak, where Ai is the nonterminal corresponding to terminal a,; and the semantic rule L(Ai) = yi, 1 5 i 5 k , where yi is the length attribute of primitive ai. (3) For each primitive a , add the production rule A + a , y(A) = y ( a ) and y(a) = y, where y is the length attribute of primitive a. The example is shown in Fig. 11
Training string : abbc The attributed grammar G(Vn,Vt,P,S) is as follows Vn = {A,B, C, SJ Vt = {a, b, c ) s ={S) The production set P is as follows : Syntactic rules Semantic rules
Fig. 11. Training string abbc and its inferred attributed grammar for the M P M parsing.
1.3 Syntactic Pattern Recognition 83 5.3. Top-down Parsing Using M P M Given an attributed grammar G and input attributed string z , the value of MPM between z and L(G), the language generated by the grammar G, is calculated. Consider an S-production rule in the grammar, which has the form
For each nonterminal a t the right-hand side of S-production rule, two attributes are associated with it. f [ k ]denotes the MPM value calculated from the beginning up to the parse of kth nonterminal. h [ k ]is a kind of residual attribute used for the calculation later on. The proposed algorithm to compute the MPM between z and L(G) is described in the following.
Algorithm 5.3: Top-down parsing using the MPM. An attributed grammar G = (Vn,Vt, P, S) and an input string z . Let m = the number of primitives in z . n = the length of z = z [ i ]. y. Output: The maximum MPM between z and L(G). Method:
Input:
xi
(1) Set N = the number of S-production rules, and max-MPM = 0. (2) Set f [ O ] = 0 and h[O]= 0. (3) For all 1 5 k 5 N do Steps (4)to (10). (4) Apply the kth S-production rule with the form Sk + Ak,lAk,Z. . . Ak,,, , where m k is the number of nonterminals at the right-hand side of the kth S-production rule to do Steps (5) to (8). (5) For all 1 5 i 5 mk do
f [;I = 0; Wl = L(Ali,z); 1. (6) For all 1 5 j 5 m do Steps (7) and (8). (7) Set v0 = z [ j ]. y and v = v0. (8) For all 1 5 i 5 mk do Apply production rule Ak,i t Uk,i. (a) if z [ j ]. s = ak,i, then d l = min(y(ak,i),ZI) else d l = 0; f l = f [ i - 11 d l ; hl = Y ( a k , i ) - d l ; w l = Y - dl; (b) if z [ j ]. s = u k , i , then d l = min(h[i],v0) else d l = 0; f2 = f [ i ] dl; h2 = h[i]- d l ; v2 = vo - d l ;
+
+
84 K.-Y. Huang
(c) if f l
> f 2 then
{ f[iI = f l ; h[i]= h l ; v = vl; } else{ f[i] = fa; h[i] = h2; v = v2; } (9) MPM = f [ m k ] / d x , where l k = CzlL(Ak,i). (10) If MPM>max-MPM, then max-MPM = MPM. (11) Output max-MPM. Here the normalized MPM is calculated. Algorithm 5.3 is obtained from the comparison between the input string and the string generated by the S-production rule.
Example 5.1. The training string abbc and its inferred attributed grammar are shown in Fig. 11. One input string aabc has been tested, and the parsing result is shown in Fig. 12. The MPM value is 0.75 after normalization. Test string : aabc = (a,2)(b,l)(c,l) 1
2
3
0
0 1
1 1
0
1
0 1
1
2
1
1
3
1
1 2 2
2 3
Fig. 12. Parsing result using the MPM for the test string aabc; the MPM value is 0.75.
6. Tree Grammar and Automaton
6.1. Introduction
Tree grammars and the corresponding recognizers, tree automata, have been successfully used in many applications such as English character recognition, LANDSAT data interpretation, fingerprint recognition, classification of bubble chamber photographs, and texture analysis [30,57,64,67,89]. Fu pointed out that "By the extension of one-dimensional concatenation to multidimensional concatenation, strings are generalized to trees." [30]. Compared with other high-dimensional pattern grammars (web grammar, array grammar, graph grammar, plex grammar, shape grammar, etc. [30]), tree grammar is easy and convenient for describing a pattern using the data structure of a tree, especially in tree traversal and in the substitution, insertion, and deletion of a tree node.
Fig. 13. A tree automaton system for seismic pattern recognition.
An example of applying tree grammars and automata to recognize 2-D synthetic seismic patterns is presented. The system of the tree automaton is shown in Fig. 13. In the training part of the system, the training seismic patterns of known classes are constructed into their corresponding tree representations. Trees can infer tree grammars [5,7,55,56,65]. Several tree grammars are combined into one unified tree grammar. Tree grammars can be used to generate the error-correcting tree automaton. In the recognition part of the system, each input testing seismogram passes through preprocessing [38,47], pattern extraction, and tree representation of the seismic pattern. Then each input tree is parsed and recognized by the error-correcting tree automaton into the correct class. Three kinds of tree automata are adopted in the recognition: the weighted minimum-distance structure preserved error-correcting tree automaton (SPECTA), the modified maximum-likelihood SPECTA, and the minimum-distance generalized error-correcting tree automaton (GECTA). We have some modifications on the methods of the weighted minimum-distance SPECTA and the maximum-likelihood SPECTA.

6.2. Tree Grammar and Language
A tree domain (tree structure) is shown below [30]. Each node has its ordering position index and is filled with a terminal symbol. 0 is the root index of the tree.

                       0
           /           |           \
         0.1          0.2          0.3   ...
        /    \
    0.1.1    0.1.2   ...
    /     \
0.1.1.1   0.1.1.2   ...
Each node has its own children, except the bottom leaves of the tree. The number of children at each node is called the rank of the node. Although there are different kinds of tree grammars [30], we use the expansive tree grammar in this study because of the following theorem.

Theorem 6.1. For each regular tree grammar Gt, one can effectively construct an equivalent expansive grammar Gt', i.e. L(Gt') = L(Gt) [30].

An expansive tree grammar is a four-tuple Gt = (V, r, P, S), where V = VN ∪ VT, VN is the set of nonterminal symbols, VT is the set of terminal symbols, S is the starting nonterminal symbol, r is the rank of a terminal symbol, i.e. the number of children at the tree node, and each tree production rule in P is of the form
(1) X0 → x(X1 X2 ... Xr(x)), i.e. a node x with children X1, X2, ..., Xr(x), or
(2) X0 → x,
where x ∈ VT and X0, X1, X2, ..., Xr(x) ∈ VN. For convenience, the tree production rule (1) can be written as X0 → xX1X2...Xr(x) [30]. From the starting symbol S, a tree is derived by using the tree production rules of P: S ⇒ α0 ⇒ α1 ⇒ ... ⇒ αm = α. The tree language generated by Gt is defined as L(Gt) = {α is a tree | S ⇒* α in Gt}, where ⇒* represents several derivation steps using tree production rules in Gt.
6.3. Tree Automaton
The bottom-up replacement functions of the tree automaton are generated from the expansive tree production rules of the tree grammar. The expansive tree grammar is Gt = (V, r, P, S), and the tree automaton is Mt = (Q, f, S), where Q is the set of states, f is the set of replacement functions, and S becomes the final state.
If the tree production rule X0 → x(X1 X2 ... Xn) is in P, then the bottom-up replacement function in the tree automaton can be written as x(X1 X2 ... Xn) → X0, or f_x(X1, X2, ..., Xn) → X0.
The tree automaton is an automatic machine to recognize the tree and has the tree bottom-up replacement functions which are the reverse direction of the tree production rules. Tree grammar uses forward and top-down derivation to derive the tree. The tree automaton uses a backward replacement of the states from the bottom to the root of the tree. If the final replacement state is in the set of the final
states, then the tree is accepted by the automaton of the class. Otherwise the tree is rejected.
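This bottom-up acceptance test is straightforward to state in code. The following Python sketch is illustrative only: trees are nested (symbol, children) pairs, and the transition table used in the demonstration is the set of replacement functions of Example 6.1 below (fa = qA, fb = qB, fa(qA, qB) = qA, fb(qA, qB) = qB, f$(qA, qB) = qS, with final state qS).

```python
def run_tree_automaton(tree, trans, finals):
    """Bottom-up run of a tree automaton (a sketch).

    tree: (symbol, [subtrees]); a leaf is (symbol, []).
    trans: dict mapping (symbol, tuple of child states) -> state.
    finals: set of accepting (final) states.
    Returns True iff the root is replaced by a final state."""
    def state(node):
        sym, children = node
        child_states = tuple(state(c) for c in children)
        if None in child_states:
            return None
        return trans.get((sym, child_states))   # None if no replacement applies
    return state(tree) in finals

# Replacement functions of Example 6.1, final state qS
trans = {('a', ()): 'qA', ('b', ()): 'qB',
         ('a', ('qA', 'qB')): 'qA', ('b', ('qA', 'qB')): 'qB',
         ('$', ('qA', 'qB')): 'qS'}
t = ('$', [('a', [('a', []), ('b', [])]), ('b', [('a', []), ('b', [])])])
print(run_tree_automaton(t, trans, {'qS'}))   # True
```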
Example 6.1. The following tree grammar Gt = (V, r, P, S), where V = {S, A, B, $, a, b}, VT = {$, a, b}, r(a) = {2, 0}, r(b) = {2, 0}, r($) = 2, and P:
(1) S → $(A, B)   (2) A → a(A, B)   (3) B → b(A, B)   (4) A → a   (5) B → b
can generate the patterns, for example, using productions (1), (4), and (5), or using productions (1), (2), (3), (4), (5), (4), and (5):

      $                  $
     / \                / \
    a   b              a   b
                      / \   / \
                     a   b a   b

The tree automaton which accepts the set of trees generated by Gt is Mt = (Q, fa, fb, f$, F), where Q = {qA, qB, qS}, F = {qS}, and f:
fa = qA,  fb = qB,  fa(qA, qB) = qA,  fb(qA, qB) = qB,  f$(qA, qB) = qS.
Example 6.2. The following tree grammar can be used to generate trees representating L-C networks.
Gt
=
(V, T , P, S), where V = {S, A, B, D, E, $, Vin, L, C, W}, VT = {$, Vin, L, C,W}, 2, r(Vin) = 1,r(L) = {1,2},r(C) = 1,T(W) = 0, and P:
T($) =
(1) S + $
(2) A
/ \
A
B
+ Vin I E
(3) B
+L
(4) B
/ \
D
B
+L I
D
(5) D
+C I
E
(6) E
+W
88
K.-Y. Huang
For example, after applying productions (l),(2), (3), ( 6 ) , (5), (4), (6), (5), and (6), the following tree is generated.
$
I\ i'n L I I\ w C L I
I
ni'
w c
The tree automaton which accepts the set of trees generated by Gt is Mt =
(Q, fw, fc, f ~fv,,, , f$,F), where Q = {QE, 4 D , QB, q.4, qs}F = { q s } , and f:
fW()=qE,fC(qE)=qD,
fL(qD)=qB,fL(qD,QB)=qB,
fV,,(qE)=qA,
f$(qA,qB)=qS-
Example 6.3. Tree representation and tree grammar of seismic bright spot pattern. (A) Bright spot pattern: The seismogram of the primary reflection of bright spot is generated from a geologic model [19,71]. The geologic model is shown in Fig. 14 and the seismogram Distance
0.0
Density=2 0 gm/cm**3 Velocity=2.0km/sec
D=2.3 V=2.3
-1.0
I
/
D12.270 V=2.225
D=2.8 V=2.8
.." Fig. 14. Geologic model.
\
1 . 3 Syntactic Pattern Recognition
89
Fig. 15. Bright spot seismogram.
is shown in Fig. 15. After preprocessing, thresholding, and compression in the vertical direction, the bright spot pattern can be shown below. We can scan the pattern from left to right, then top t o bottom. The segments (branches) can be extracted in the tracing. Eight directional Freeman’s chain codes [27] are used to assign primitives t o the segments. From expanding the segments (branches), the tree representation of the seismic bright spot pattern is constructed. And the tree can infer the production rules of the tree grammar. $0
X
xxxxx xxxxx I xxx xxx xxx xxxxxxxxxxxx xxx 5 xxx 0 xxx 7 xx xxxxxxxxxxxxxxxxyxxxxxx xx 5 xxxxx 0 xxxxx 7 xx xx 5
Primitives: eight directional Freeman’s chain codes [27] and terminal symbol 0 (the neighboring segment have already been expanded), 3
2
1
4 1% O
5
--
-6 7 Primitives
90
K.-Y. Huang
(B)(1) Tree representation of bright spot after scanning and tree construction: $
I \ 5 7 I\ I \ 5 0 0 7 I \ I \
5
0
@ 7
(2) Corresponding tree node’s position: 0
/ \ 0.1 I \ 0.1.1 0.1.2 I\ 0.1.1.10.1.1.2
0.2 I \
0.2.1 0.2.2 I\ 0.2.2.1 0.2.2.2
(C) Tree grammar: Gt = (V, r, P, S), where V = set of terminal and nonterminal symbols = {$, 0, 5, 7 , 0, S, A, B, C, D, E, F, GI HI I, J > , VT = the set of terminal symbols = {$, 0, 5, 7, Q},$: the starting point (root) of the tree, @:represents that the neighboring segment has already been expanded, S: the starting nonterminal symbol, T : r ( 5 ) = r(7) = { 2 , 0 } , r ( $ )= 2 , r ( @ ) = O,r(O) = 0, and P: (2)A -+ 5 (3)Bj 7 (4)C+ 5 I \ I \ I \ I \ G H C D E F A B (7)F-+ 7 (8)G-+5 (9)HjO (lO)I+@ I \
(1) S-+ $
I
(6)E-+@
(5)D+O
(11)J-+7
J
The tree derivation steps are as follows:
j
I\ A
B
(4,5 , 6, 7, 8, 9, 10, 11)
(3)
(2)
(1) S + $
$
I \ 5 B I \ C D
$
-+
I 5
_ _ - - . _ - - . -> -._ $- . . - - . -
\
7 I \ I \ C D E F
I
\
5 7 I \ I \ 5 0 @ 7 I \ I \ 5 0 @ 7
1.3 Syntactic Pattern Recognition 91 So following the steps as described in (A) and (B), each seismic pattern can be represented as a tree. From the steps (B) to (C), a tree can infer tree production rules. The tree production rules can derive trees. Each tree corresponds t o its pattern class. (D) Tree automaton from tree production rules of (C): A tree automaton generated by Gt is Mt = (Q, f$, fo, f5, f7, f ~S),, where Q = {S, A, B, C, D, E, F, G, H, I, J}, S: the final state, and the bottom-up replacement function f: (11) f7 + J (10) f~ + 1 (9) fo + H (8) f5 + G (7) f7(I,J) + F (6) f~ + E (5) fo + D (4) f5(G,H) -+ c (3) f 7 ( W + B (2) f5(C,D) + A (1) f$(A,B) + s.
The number on the left-hand side of each bottom-up replacement function corresponds to the number of the production rule of the tree grammar, and the bottom-up replacement function is the reverse of the corresponding production rule. The above tree in (B) can be replaced by the replacement functions step by step from the bottom to the root of the tree and is accepted by this tree automaton Mt as the seismic bright spot pattern.

6.4. Tree Representations of Patterns
In the tree automaton system, patterns must be extracted from image data and constructed as the tree representations. In order to construct the tree representation of a pattern automatically, a scanning algorithm is proposed. The following Algorithm 6.1 is the construction of tree representation from scanning an input pattern. The algorithm works for both four-neighbor and eight-neighbor connectivity. The scanning is from left t o right and top to bottom on the binary image. In the algorithm, breadth-first tree expansion is adopted to construct the tree representation such that the depth of the tree will be shorter and the parsing time of the input tree by tree automaton will also be shorter in parallel processing.
Algorithm 6.1. Construction of a tree representation from a pattern. Input: Image of a pattern after thinning. Output: Tree representation of a pattern. Method: (1) While scanning image from left to right and then top to bottom, (a) If the scanning reaches a point of the pattern, then the point is the root (node) of a tree. (b) Trace all following branches (segments) from a node. And assign the terminal symbol to each branch (segment) by the chain code. (c) If the lower end of each branch (segment) has sub-branches (sub-segments), then go to Step (b), trace the sub-branches (sub-segments) and expand all children nodes from the left-most child node in the same tree level. After the whole children nodes in the same level are expanded, then go t o Step (b) to expand the descendants from the left-most in the next down tree level.
92
K.-Y. Huang
Expand level by level until there is no node to be expanded. Then a pattern is extracted and its corresponding tree representation is constructed. There may exist several patterns in the image data. Algorithm 6.1 can extract a pattern. The following Algorithm 6.2 can extract all the entity patterns in the image and construct their tree representations of all the patterns in the image.
Algorithm 6.2. Extract all patterns and construct their tree representations from the binary image. Input: Image data after thinning. Output: Tree representations of all patterns. Method: (1) Scan image from left t o right, then top to bottom. (2) When the scan reaches a point of a pattern, follow Algorithm 6.1 to extract one pattern and construct its corresponding tree representation, then erase the position of current pattern from the image. (3) Go to Step (l),continue to scan, extract the next pattern, and construct the tree representation until there is no pattern to be extracted.
6.5. Inference of Expansive Tree Grammar

In the training part of the tree automaton system of Fig. 13, training patterns must be given in order to infer the production rules. The following Algorithm 6.3 is presented to infer an expansive tree grammar from the tree representation of a pattern.
Algorithm 6.3. Inference of expansive tree grammar. Input: Tree representation of a pattern. Output: Expansive tree grammar. Method:
(1) From top to bottom of the tree, for every node of the tree, derive a tree production rule

X → a(X1, X2, ..., Xn)

where X is the nonterminal symbol assigned to the node, a is the primitive (terminal) of the node, and X1, X2, ..., Xn are the nonterminals of the direct descendants (children) that cover the next-level subtrees.
(2) Go to Step (1) to handle the other nodes in the same level of the tree, until every node is reached. Handle the nodes level by level of the tree. This algorithm can be implemented in one recursive procedure.
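A corresponding Python sketch of Algorithm 6.3, using the tree representation produced by the previous sketch, is given below. The naming of the fresh nonterminals (X1, X2, ...) and the use of the symbol $ as the root terminal are assumptions made for illustration only.

```python
def infer_expansive_grammar(root, tree):
    """Algorithm 6.3 sketch: walk the tree level by level and emit one
    expansive production rule  X -> a(X1, ..., Xn)  per node.
    The terminal a of a node is the chain code of the branch leading to it;
    the root is given the special terminal '$' (an assumption)."""
    primitive = {root: "$"}
    for node, edges in tree.items():
        for code, child in edges:
            primitive[child] = str(code)
    counter = 0
    nonterminal = {root: "S"}            # the root nonterminal is the start symbol
    rules, level = [], [root]
    while level:                          # handle the nodes level by level
        next_level = []
        for node in level:
            children = [child for _, child in tree.get(node, [])]
            for child in children:
                counter += 1
                nonterminal[child] = f"X{counter}"
            rules.append((nonterminal[node], primitive[node],
                          [nonterminal[c] for c in children]))
            next_level.extend(children)
        level = next_level
    return rules
```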
6.6. Weighted Minimum-Distance SPECTA

Due to noise and distortion there is some possibility that terminals may be recognized as neighboring terminals in the process of primitive recognition. The tree may then contain substitution-error terminals, so the tree automaton must be expanded to recognize error trees. In the tree W described below, the node b at position 0.1 is substituted by a node x to give W'; the trees W and W' have the same tree structure. This substitution error is written as W ⇒ W'. Tree W has root a at position 0 with children b at position 0.1 and c at position 0.2, and b has a child d at position 0.1.1; substituting the node b at position 0.1 by x produces W', whose structure is otherwise identical.
For a given tree grammar Gt or tree automaton Mt, the minimum-distance SPECTA is formulated to accept the input tree and to generate a parse that consists of the minimum number of substitution-error tree production rules or error replacement functions, while the tree structure is still preserved. Assume that W' is an input tree. The parsing of the input tree by minimum-distance SPECTA is a search for the minimum distance; it is a backward procedure that constructs a tree-like transition table in which all candidate states and their corresponding costs are recorded from the leaves to the root of W'. For each tree node a (index position in the tree), there is a corresponding transition box t_a, which consists of triplet items (X, #k, c) in the transition table, where X is the state, #k is the kth production rule or replacement function, and c is the accumulated error cost. An example is shown as follows.

Example 6.4. Parsing by minimum-distance SPECTA. Given the tree production rules P: (1) S → $(A, B), (2) A → 5, (3) B → 7, and given an input tree with root $ and leaves 5 and 7, we can generate the parsing of the input tree by minimum-distance SPECTA as below.
The transition box t_{0.1} includes state A (nonterminal) in the triplet (A, 2, 0) and state B (nonterminal) in the triplet (B, 3, 1). The triplet item (A, 2, 0) represents that we can use A in production rule (2) A → 5 to derive terminal 5 with 0 substitution errors, because the input terminal is 5; the triplet item (B, 3, 1) represents that we can use state B in production rule (3) B → 7 to derive terminal 7 with 1 substitution error, because the input terminal is 5. The box t_{0.2} has a similar explanation. The box t_0 has state S in the triplet item (S, 1, 0), so that we can use S in production rule (1) to derive $(A, B) with 0 substitution errors; the 0 in (S, 1, 0) is the sum of the 0 from (A, 2, 0) and the 0 from (B, 3, 0). Although there are other derivations from the combinations of the candidate states (nonterminals) A and B of the triplets, i.e. $(A, A), $(A, B), $(B, A) and $(B, B), only S in production rule (1), which derives $(A, B), gives the minimum error; the other combinations are counted as larger errors and neglected.

In general, if X is a candidate state of the tree node at position index a, then a triplet item (X, #k, c) is added to the box t_a, where #k specifies the kth production rule or bottom-up replacement function used, and c is the accumulated minimum number of substitution errors from the leaves to the node a in the subtree of W' rooted at node a when tree node a is represented by state X. The algorithm of the minimum-distance SPECTA is given in Fu [30]. For consideration of weighted substitution error costs of pairs of terminals, Basu and Fu [6] presented the minimum weighted-distance SPECTA. Here we make some modification: we expand each production rule to cover the substitution error production rules and embed a cost in each production rule, so that an expanded tree grammar with weighted error costs is generated. A production rule X → x is expanded to the substitution error production rules X → y with error cost c for every terminal y ≠ x, and X → x with c = 0; a production rule X → x(X1, X2, ..., Xn) is expanded to X → y(X1, X2, ..., Xn) for every terminal y, with c = 0 if y = x.
Initially the expanded tree grammar with error costs must be generated. Then the algorithm of weighted minimum distance SPECTA is presented as follows.
Algorithm 6.4. Weighted minimum-distance SPECTA.
Input: An expanded tree grammar Gt (or a tree automaton Mt) with error costs, and a tree W'.
Output: Parsing transition table of W' and the minimum distance.
Method:
(1) Replace each bottom leaf of the input tree W' by a state. If the bottom leaf (terminal) of the input tree is x, i.e. r[W'(a)] = 0 (the rank of bottom node a in tree W' is 0) and W'(a) = x (the terminal at bottom node a in tree W' is x), then for every expanded production rule (#k) X → x with error cost c, replace the leaf (terminal) x of the input tree by the state X and store (X, #k, c) in the box t_a.
Do Steps (2) and (3) until there is no replacement left to make up to the root of the tree.
(2) Replace a subtree by a state using a bottom-up replacement function. If the subtree consists of a node with terminal x whose subtrees have already been replaced by the states X1, X2, ..., Xn, i.e. r[W'(a)] = n > 0 (the rank of node a in tree W' is n) and W'(a) = x (the terminal at node a in tree W' is x), then for every expanded production rule (#k) X → x(X1, X2, ..., Xn) with error cost c, replace the subtree x(X1, X2, ..., Xn) by the state X and store (X, #k, c') in the table box t_a, where c' = c1 + c2 + ... + cn + c and ci is the cost in table box t_{a.i} for state Xi, i = 1, ..., n.
(3) Whenever more than one item (X, #k, c) in t_a has the same state X, keep the item with the smaller error cost and delete the items with larger error costs.
(4) If items (S, #k, c) are in t_0, choose the item with minimum distance c; the input tree is then accepted with distance c. If no item of the form (S, #k, c) associated with the starting nonterminal S is in t_0, then the input tree is rejected.
Example 6.5. Parsing of an error bright spot pattern by minimum-distance SPECTA.
(A) Bright spot pattern with primitive errors: the thinned bright-spot image, in which some of the primitives are corrupted, is not reproduced here.
(B) Tree representation with primitive errors: Using the eight-directional Freeman chain codes [27] and the symbol @, the tree representation of the bright spot with primitive errors has the root terminal $ and, because of the substitution errors, contains terminals (such as @) that differ from those of the error-free bright-spot tree.
(C) Parsing by minimum-distance SPECTA: Using the tree automaton inferred from the bright spot pattern in Example 6.3, the transition table for parsing the tree of the error bright spot pattern of Example 6.5 by minimum-distance SPECTA can be generated. Here the costs of the terminal substitution errors are all set equal to 1. The explanation of each box obtained by bottom-up replacement is the same as that of Example 6.4.
6.7. Modified Maximum-Likelihood SPECTA

When the probabilities of the tree production rules and the substitution deformation probabilities on pairs of terminal symbols are available, the maximum-likelihood SPECTA can be used for the recognition of patterns [30]. The stochastic expansive tree grammar Gs = (V, T, P, S) has production rules in P of the form

(1) X0 →(p) x(X1, X2, ..., Xr(x))   or   (2) X0 →(p) x,

where p is the probability of the production rule, x ∈ VT and X0, X1, X2, ..., Xr(x) ∈ VN.
The major steps of the maximum-likelihood SPECTA in Fu [30] are as follows. Given the stochastic expansive tree grammar Gs, the terminal substitution probabilities q(y/x) and an input tree W':
(1) Replace each leaf of the input tree by a state. If r[W'(a)] = 0 (the rank of node a in tree W' is 0), W'(a) = y (the symbol at node a in tree W' is y), and X →(p) x is the kth production rule in P, then add (X, #k, p') to t_a, where p' = p × q(y/x).
(2) Replace a subtree by a state using a bottom-up replacement function. If r[W'(a)] = n > 0 (the rank of node a in tree W' is n), W'(a) = y (the symbol at node a in tree W' is y), and X →(p) x(X1, X2, ..., Xn) is the kth production rule in P, then add (X, #k, p') to the table box t_a, where p' = p1' × p2' × ... × pn' × p × q(y/x) and pi' is the probability in table box t_{a.i} for state Xi, i = 1, ..., n.
The probability p' is thus calculated as the product of the production rule probability p and the terminal substitution probability q(y/x). Instead of this calculation by the multiplication p × q(y/x), a modification is proposed here. Similar to the previous expanded tree grammar with error costs, each production rule is expanded to cover the substitution error production rules together with their probabilities; i.e. if X →(p) x(X1, X2, ..., Xn) is in P, then it is expanded to X → y(X1, X2, ..., Xn) for all terminals y, y = x or y ≠ x. The summation of the probabilities from the expansion of one tree production rule is 1. There is some possibility that terminals may be recognized as neighboring terminals in the process of primitive recognition; the value of the substitution probability of a pair of terminals is taken to be inversely proportional to the angle between the pair of terminals. Each tree production rule of the tree grammar Gt is expanded to cover the substitution error production rules with probabilities. Based on the expanded grammar, the maximum-likelihood SPECTA is modified here for the recognition of seismic patterns. The algorithm is presented as follows.
Algorithm 6.5. Modified maximum-likelihood SPECTA.
Input: (1) An expanded tree grammar Gt (or a tree automaton Mt) with a probability on each production rule, (2) an input tree W'.
Output: Parsing transition table of W' and the maximum probability.
Method:
(1) Replace each bottom leaf of the input tree W' by a state. If r[W'(a)] = 0 (the rank of node a in tree W' is 0), W'(a) = x, and X →(p) x is the kth production rule in P, then add (X, #k, p') to t_a with p' = p.
Do Steps (2) and (3) until there is no replacement left to make up to the root of the tree.
(2) Replace a subtree by a state using a bottom-up replacement function. If r[W'(a)] = n > 0 (the rank of node a in tree W' is n), W'(a) = x (the terminal at node a in tree W' is x), and X →(p) x(X1, X2, ..., Xn) is the kth production rule in P, then add (X, #k, p') to the table box t_a, where p' = p1' × p2' × ... × pn' × p and pi' is the probability in table box t_{a.i} for state Xi, i = 1, ..., n.
(3) Whenever more than one item (X, #k, p') in t_a has the same state X, keep the item with the larger probability and delete the items with smaller probabilities.
(4) If items (S, #k, p') are in t_0, choose the item with maximum probability p'; the input tree W' is then accepted with probability p'. If no item in t_0 is associated with the starting nonterminal S, then the input tree W' is rejected.
6.8. Minimum Distance GECTA

Due to noise, distortion, and interference of the wavelets, the tree may have an error structure. The error may leave the tree structure either preserved or non-preserved. If the tree structure is preserved, then the weighted minimum-distance SPECTA and the modified maximum-likelihood SPECTA can be applied in the recognition of patterns. If the tree structure is not preserved, then the minimum-distance GECTA [30] can be applied in the recognition of patterns. The syntax errors between two trees may include substitution, deletion, and insertion errors. The insertion error includes three types of errors: stretch, branch, and split errors. In total there are five types of syntax errors on a tree. The distance between two trees is defined to be the least-cost sequence of error transformations needed to transform one into the other [30,61,80]. Because there are five possible error transformations to transform one tree into the other, each production rule of the tree grammar must be expanded to cover all five syntax errors on trees. Then the expanded grammar can generate a recognizer, i.e. the minimum-distance generalized error-correcting tree automaton (GECTA). Similar to the weighted minimum-distance SPECTA, the parsing of an input tree W' using minimum-distance GECTA also constructs a tree-like transition table in which all candidate states and their corresponding costs are recorded. The procedure works backward from the leaves to the root of W' for the least-cost solution [30].
7. Conclusions and Discussions

Theoretical studies in syntactic pattern recognition have been effective in handling abstract and artificial patterns [30,89]. We need simultaneous progress in both theoretical studies and real-data applications in the future. Combining the syntactic and the semantic approaches can expand the power of syntactic pattern recognition. Semantic information often provides spatial information, relations, and reasoning between primitives, subpatterns, and patterns, and can be expressed syntactically, for example, by attributed strings and attributed graphs. The attributed 2-D and
3-D pattern grammars, such as attributed tree, graph, and shape grammars, may be the subject of future study [9,31,76,86-88]. The distance computation between two attributed patterns (attributed strings, attributed trees, attributed graphs, etc.) may also be studied in the future [44]. The error-correcting finite-state parsing, Earley's parsing, tree automaton, etc. may also be expanded for attributed strings, trees, etc. [44,45,59]. The distance can be computed between an input pattern y and a language L(G), or between an input pattern and a training pattern. Using a distance or similarity measure, clustering methods, such as the minimum-distance classification rule, the nearest neighbor classification rule, the K-nearest neighbor classification rule and the method of hierarchical clustering, can be easily applied to syntactic patterns [30,34,61,62,89]. If the pattern has an inherent structural property, globally we can use the syntactic approach to recognize the pattern. Locally we can use neural network techniques in the segmentation and the recognition of primitives, so that syntactic pattern recognition will improve and become more robust against noise and distortions. In the study of the certainty effect, besides the probability approach, fuzzy logic may be considered in grammars and automata, for example, the fuzzy tree automaton [53]. Parallel parsing algorithms can speed up the parsing time [12,18,20]. For example, the tree automaton can be parsed from the bottom leaves to the top root of the tree in parallel. Further, a syntactic approach to time-varying pattern recognition may also become one of the research topics in the future [24].
References
[1] A. V. Aho and T. G. Peterson, A minimum distance error-correcting parser for context-free languages, SIAM J. Comput. 1 (1972) 305-312.
[2] A. V. Aho and J. D. Ullman, The Theory of Parsing, Translation, and Compiling, Vol. 1: Parsing (Prentice-Hall, Englewood Cliffs, NJ, 1972).
[3] F. Ali and T. Pavlidis, Syntactic recognition of handwritten numerals, IEEE Trans. Syst. Man Cybern. 7 (1977) 537-541.
[4] K. R. Anderson, Syntactic analysis of seismic waveforms using augmented transition network grammars, Geoexploration 20 (1982) 161-182.
[5] A. Barrero, Inference of tree grammars using negative samples, Pattern Recogn. 24 (1991) 1-8.
[6] S. Basu and K. S. Fu, Image segmentation by syntactic method, Pattern Recogn. 20 (1987) 33-44.
[7] J. M. Brayer and K. S. Fu, A note on the K-tail method of tree grammar inference, IEEE Trans. Syst. Man Cybern. 7 (1977) 293-299.
[8] I. Bruha and G. P. Madhavan, Use of attributed grammars for pattern recognition of evoked potentials, IEEE Trans. Syst. Man Cybern. 18 (1988) 1046-1089.
[9] H. Bunke, Attributed programmed graph grammars and their application to schematic diagram interpretation, IEEE Trans. Pattern Anal. Mach. Intell. 4 (1982) 574-582.
[10] H. Bunke and A. Sanfeliu (eds.), Special Issue: Advances in Syntactic Pattern Recognition, Pattern Recogn. 19, 4 (1986).
[11] H. Bunke and A. Sanfeliu (eds.), Syntactic and Structural Pattern Recognition - Theory and Applications (World Scientific, 1990).
[12] N. S. Chang and K. S. Fu, Parallel parsing of tree languages for syntactic pattern recognition, Pattern Recogn. 11 (1979) 213-222.
[13] C. H. Chen (ed.), Special Issue: Seismic Signal Analysis and Discrimination, Geoexploration 20, 1/2 (1982).
[14] C. H. Chen (ed.), Special Issue: Seismic Signal Analysis and Discrimination III, Geoexploration 23, 1 (1984).
[15] C. H. Chen, L. F. Pau and P. S. Wang (eds.), Handbook of Pattern Recognition and Computer Vision (World Scientific, 1993).
[16] C. H. Chen (ed.), Special Issue: Artificial Intelligence and Signal Processing in Underwater Acoustic and Geophysics Problems, Pattern Recogn. 18, 6 (1985).
[17] J. C. Cheng and H. S. Don, A graph matching approach to 3-D point correspondences, Int. J. Pattern Recogn. Artif. Intell. 5 (1991) 399-412.
[18] Y. C. Cheng and S. Y. Lu, Waveform correlation by tree matching, IEEE Trans. Pattern Anal. Mach. Intell. 7 (1985) 299-305.
[19] Y. T. Chiang and K. S. Fu, Parallel parsing algorithm and VLSI implementations for syntactic pattern recognition, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 302-314.
[20] M. B. Dobrin and C. H. Savit, Introduction to Geophysical Prospecting, 4th edn. (McGraw-Hill, New York, 1988).
[21] H. S. Don and K. S. Fu, A parallel algorithm for stochastic image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 8 (1986) 594-603.
[22] D. Dori, A syntactic/geometric approach to recognition of dimensions in engineering machine drawings, Comput. Vision Graph. Image Process. 47 (1989) 271-291.
[23] J. Earley, An efficient context-free parsing algorithm, Commun. ACM 13 (1970) 94-102.
[24] M. A. Eshera and K. S. Fu, A graph distance measure for image analysis, IEEE Trans. Syst. Man Cybern. 14 (1984) 398-408.
[25] T. I. Fan and K. S. Fu, A syntactic approach to time-varying image analysis, Comput. Graph. Image Process. 11 (1979) 138-149.
[26] T. Feder, Plex languages, Inf. Sci. 3 (1971) 225-241.
[27] G. Ferrate, T. Pavlidis, A. Sanfeliu and H. Bunke (eds.), Syntactic and Structural Pattern Recognition (Springer-Verlag, 1988).
[28] H. Freeman, On the encoding of arbitrary geometric configurations, IEEE Electron. Comput. 10 (1961) 260-268.
[29] K. S. Fu, Syntactic Methods in Pattern Recognition (Academic Press, New York, 1974).
[30] K. S. Fu, Syntactic image modeling using stochastic tree grammars, Comput. Graph. Image Process. 12 (1980) 136-152.
[31] K. S. Fu, Syntactic Pattern Recognition and Applications (Prentice-Hall, Englewood Cliffs, NJ, 1982).
[32] K. S. Fu, A step towards unification of syntactic and statistical pattern recognition, IEEE Trans. Pattern Anal. Mach. Intell. 5 (1983) 200-205.
[33] K. S. Fu and B. K. Bhargava, Tree systems for syntactic pattern recognition, IEEE Trans. Comput. 22 (1973) 1087-1099.
[34] K. S. Fu and T. Huang, Stochastic grammars and languages, Int. J. Comput. Inf. Sci. 1 (1972) 135-170.
[35] K. S. Fu and S. Y. Lu, A clustering procedure for syntactic patterns, IEEE Trans. Syst. Man Cybern. 7 (1977) 734-742.
[36] J. E. Gaby and K. R. Anderson, Hierarchical segmentation of seismic waveforms using affinity, Pattern Recogn. 23 (1984) 1-16.
[37] P. Garcia, E. Segarra, E. Vidal and I. Galiano, On the use of the morphic generator grammatical inference (MGG) methodology in automatic speech recognition, Int. J. Pattern Recogn. Artif. Intell. 4 (1990) 667-685.
[38] R. C. Gonzalez and M. G. Thomason, Syntactic Pattern Recognition (Addison-Wesley, Reading, MA, 1978).
[39] K.-Y. Huang, Branch and bound search for automatic linking process of seismic horizons, Pattern Recogn. 23 (1990) 657-667.
[40] K.-Y. Huang, Pattern recognition to seismic exploration, in Automated Pattern Analysis in Petroleum Exploration, eds. I. Palaz and S. K. Sengupta (Springer-Verlag, New York, 1992) 121-154.
[41] K. Y. Huang, W. Bau and S. Y. Lin, Picture description language for recognition of seismic patterns, Soc. Exploration Geophysicists Int. 1987 Mtg., New Orleans, 326-330.
[42] K. Y. Huang and K. S. Fu, Syntactic pattern recognition for the classification of Ricker wavelets, Geophysics 50 (1985) 1548-1555.
[43] K. Y. Huang and K. S. Fu, Syntactic pattern recognition for the recognition of bright spots, Pattern Recogn. 18 (1985) 421-428.
[44] K. Y. Huang, K. S. Fu, S. W. Cheng and Z. S. Lin, Syntactic pattern recognition and Hough transformation for reconstruction of seismic patterns, Geophysics 52 (1987) 1612-1620.
[45] K. Y. Huang and D. R. Leu, Modified Earley parsing and MPM method for attributed grammar and seismic pattern recognition, J. Inf. Sci. and Eng. 8 (1992) 541-565.
[46] K. Y. Huang and D. R. Leu, Recognition of Ricker wavelets by syntactic analysis, Geophysics 60 (1995) 1541-1549.
[47] K. Y. Huang and T. H. Sheen, A tree automaton system of syntactic pattern recognition for the recognition of seismic patterns, 56th Annu. Int. Mtg., Soc. Expl. Geophys. (1986) 183-187.
[48] K. Y. Huang, T. H. Sheen, S. W. Cheng, Z. S. Lin and K. S. Fu, Seismic image processing: (I) Hough transformation, (II) Thinning processing, (III) Linking processing, Handbook of Geophysical Exploration: Section I. Seismic Exploration, 20, Pattern Recognition & Image Processing, ed. F. Aminzadeh (1987) 79-109.
[49] K. Y. Huang, J. J. Wang and V. M. Kouramajian, Matrix grammars for syntactic pattern recognition, 1990 Telecomm. Symp., Taiwan, 576-581.
[50] J. W. Hunt and T. G. Szymansky, A fast algorithm for computing longest common subsequences, Commun. ACM 20 (1977) 350-353.
[51] S. Kiram and C. Pandu, A linear space algorithm for the LCS problem, Acta Informatica 24 (1987) 353-362.
[52] A. Koski, M. Juhola and M. Meriste, Syntactic recognition of ECG signals by attributed finite automata, Pattern Recogn. 28 (1995).
[53] L. H. T. Le and E. Nyland, An application of syntactic pattern recognition to seismic interpretation, in Computer Vision and Shape Recognition, eds. A. Krzyzak, T. Kasvand and C. Y. Suen (World Scientific, 1988) 396-415.
[54] E. T. Lee, Fuzzy tree automata and syntactic pattern recognition, IEEE Trans. Pattern Anal. Mach. Intell. 4 (1982) 445-449.
[55] V. I. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals, Sov. Phys. Dokl. 10 (1966) 707-710.
[56] B. Levine, Derivatives of tree sets with applications to grammatical inference, IEEE Trans. Pattern Anal. Mach. Intell. 3 (1981) 285-293.
[57] B. Levine, The use of tree derivatives and a sample support parameter for inferring tree systems, IEEE Trans. Pattern Anal. Mach. Intell. 4 (1982) 25-34.
[58] R. Y. Li and K. S. Fu, Tree system approach for LANDSAT data interpretation, Symp. Mach. Process. Remotely Sensed Data, West Lafayette, Ind., June 29-July 1, 1976.
[59] W. C. Lin and K. S. Fu, A syntactic approach to 3-D object representation, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 351-364.
[60] H. H. Liu and K. S. Fu, A syntactic approach to seismic discrimination, Geoexploration 20 (1982) 183-196.
[61] S. W. Lu, Y. Reng and C. Y. Suen, Hierarchical attributed graph representation and recognition of handwritten Chinese characters, Pattern Recogn. 24 (1991) 617-632.
[62] S. Y. Lu, A tree-to-tree distance and its application to cluster analysis, IEEE Trans. Pattern Anal. Mach. Intell. 1 (1979) 219-224.
[63] S. Y. Lu and K. S. Fu, A sentence-to-sentence clustering procedure for pattern analysis, IEEE Trans. Syst. Man Cybern. 8 (1978) 381-389.
[64] S. Y. Lu and K. S. Fu, Error-correcting tree automata for syntactic pattern recognition, IEEE Trans. Comput. 27 (1978) 1040-1053.
[65] S. Y. Lu and K. S. Fu, A syntactic approach to texture analysis, Comput. Graph. Image Process. 7 (1978) 303-330.
[66] S. Y. Lu and K. S. Fu, Stochastic tree grammar inference for texture synthesis and discrimination, Comput. Graph. Image Process. 9 (1979) 234-245.
[67] W. Min, Z. Tang and L. Tang, Using web grammar to recognize dimensions in engineering drawings, Pattern Recogn. 26 (1993) 1407-1416.
[68] B. Moayer and K. S. Fu, A tree system approach for fingerprint pattern recognition, IEEE Trans. Comput. 25 (1976) 262-274.
[69] R. Mohr, T. Pavlidis and A. Sanfeliu (eds.), Structural Pattern Recognition (World Scientific, 1990).
[70] T. Pavlidis, Linear and context-free graph grammars, J. ACM 19 (1972) 11-12.
[71] T. Pavlidis, Structural Pattern Recognition (Springer-Verlag, New York, 1977).
[72] C. E. Payton (ed.), Seismic Stratigraphy - Applications to Hydrocarbon Exploration (AAPG Memoir 26, Tulsa, OK, Amer. Assn. Petroleum Geologists, 1977).
[73] J. L. Pfaltz and A. Rosenfeld, Web grammars, Proc. 1st Int. Joint Conf. Artif. Intell., Washington, D.C. (1969) 609-619.
[74] A. Rosenfeld, Picture Languages (Academic Press, New York, 1979).
[75] A. Sanfeliu, K. S. Fu and J. Prewitt, An application of a graph distance measure to the classification of muscle tissue patterns, Int. J. Pattern Recogn. Artif. Intell. 1 (1987) 17-42.
[76] A. C. Shaw, The formal picture description scheme as a basis for picture processing systems, Inf. Control 14 (1969) 9-52.
[77] Q. Y. Shi and K. S. Fu, Parsing and translation of (attributed) expansive graph languages for scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. 5 (1983) 472-485.
[78] L. Stringa, A new set of constraint-free character recognition grammars, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 1210-1217.
[79] P. H. Swain and K. S. Fu, Stochastic programmed grammars for syntactic pattern recognition, Pattern Recogn. 4 (1972).
[80] E. Tanaka and K. S. Fu, Error-correcting parsers for formal languages, IEEE Trans. Comput. 27 (1978) 605-616.
[81] E. Tanaka and K. Tanaka, The tree-to-tree editing problem, Int. J. Pattern Recogn. Artif. Intell. 2 (1988) 221-240.
[82] M. G. Thomason, Generating functions for stochastic context-free grammars, Int. J. Pattern Recogn. Artif. Intell. 4 (1990) 553-572.
[83] M. G. Thomason and R. C. Gonzalez, Error detection and classification in syntactic pattern structures, IEEE Trans. Comput. 24 (1975) 93-95.
[84] R. A. Wagner and M. J. Fischer, The string to string correction problem, J. ACM 21 (1974) 168-173.
[85] P. S. P. Wang (ed.), Special issue on array grammars, patterns and recognizers, Int. J. Pattern Recogn. Artif. Intell. 3, 3&4 (1989).
[86] G. Wolberg, A syntactic omni-font character recognition system, Int. J. Pattern Recogn. Artif. Intell. 1 (1987) 303-322.
[87] A. K. C. Wong, S. W. Lu and M. Rioux, Recognition and shape synthesis of 3-D objects based on attributed hypergraphs, IEEE Trans. Pattern Anal. Mach. Intell. 11 (1989) 279-290.
[88] K. C. You and K. S. Fu, A syntactic approach to shape recognition using attributed grammars, IEEE Trans. Syst. Man Cybern. 9 (1979) 334-345.
[89] K. C. You and K. S. Fu, Distorted shape recognition using attributed grammars and error-correcting techniques, Comput. Graph. Image Process. 13 (1980) 1-16.
[90] T. Y. Young and K. S. Fu (eds.), Handbook of Pattern Recognition and Image Processing (Academic, New York, 1986).
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 105-142
Eds. C. H. Chen, L. F. Pau and P. S. P. Wang
© 1998 World Scientific Publishing Company
CHAPTER 1.4
NEURAL NET COMPUTING FOR PATTERN RECOGNITION
YOH-HAN PAO
Electrical Engineering and Computer Science, Case Western Reserve University, 10900 Euclid, Cleveland, Ohio 44106-7221, USA

In this chapter we discuss Artificial Neural Net computing from the viewpoint of its being an enabling methodology for pattern recognition research and practice. The four functionalities of clustering, learning functional mappings, classification through associative recall, and optimization are discussed in a comparative manner relative to other practices in pattern recognition and also relative to each other. In addition to references, two bibliographies, one for books and the other for journals, are provided as guides for further reading.
Keywords: Neural net computing, ART, Hopfield net, optimization, functional mapping, supervised learning, Boltzmann machine, simulated annealing, functional-link net, associative memory.
1. Introduction

In this chapter, we address Artificial Neural Net (ANN) computing from the perspective of its being a tool for implementing pattern recognition algorithmic practices. The primary context of our discussion is that of pattern recognition, but the topic of specific interest is how neural net computing can be used for attaining pattern-based information processing objectives, especially those which have been established over the years to be of central interest and importance to the pattern recognition research and practitioner communities. Researchers in information processing have long recognized the strikingly different information processing propensities of serial digital computers and of biological systems. The former rely on speed and accuracy and on ability to execute vast amounts of detailed programmed instructions precisely. But they are, nevertheless, easily overwhelmed by algorithmic tasks which are of exponential or greater complexity. Unfortunately most real-world perception/cognition tasks, if approached in a direct manner, are of such a nature. In contrast, the nature of biological systems is that of distributed parallel processing systems, made up of large numbers of interconnected elemental processors of rather slow processing speed. In addition, information processing seems to depend on the ability to discern what is cogent and relevant, and to focus on that while
sustaining a minimal degree of maintenance on other matters. Situations, circumstances, and events seem to be evaluated on the basis of the "pattern-ness" of things, on similarities between patterns, and on associations between patterns. This is in marked contrast to the operational strategies of the high-speed, general-purpose, serial-digital computers. At the risk of overstating the case, it almost seems that in approaching the performance of a task, serial-digital computer algorithms tend to search all of the system space to find a reasonably good path from start state to goal state. We know such approaches are doomed to failure because of the combinatorial explosion in the number of paths to be tried. In contrast to the systematic, frontal-attack approach, biological systems seem to rely more on experience and education, so that any good path or even a segment of a good path is remembered, and that knowledge is transmitted through generations, either genetically or through education. In this latter mode of information processing, individual operations are of limited significance, but patterns, both spatial and temporal, are of central importance. The significance of patterns is established by associations between a pattern (or a set of patterns) and other patterns (or sets of patterns). Accordingly, the formation of such associations and the activation of such linkages are matters of critical importance. One of the practical objectives of pattern recognition researchers has always been the ability to design and implement machine systems which are able to perform perception tasks competently, to degrees of proficiency comparable to that of biological systems. To date it cannot be said that progress in that respect has been as substantial as desired or as expected. If we try to identify reasons for this relative lack of success, we might include the following. It would seem that detailed studies of information processing architectures and procedures in actual biological neuronal systems are so difficult that progress comes at a very slow pace, indeed. Therefore guidance from that source, though much valued, is limited. In addition, tragically, one of the few initial attempts at artificial neural net computing was so thoroughly discredited at its onset that no academic research in that topic could be sustained for the past decades, until recently. For example, pattern recognition texts have always taught the Widrow-Hoff algorithm [1] as a procedure for learning a linear discriminant but never with any suggestion that it might also be considered to be a representation of a net capable of learning functional mappings. These matters and others contributed to the absence of a coherent body of commonly-shared knowledge of adaptive and associative pattern-based information processing practice, even when it was clear that such knowledge and activity were critical to further progress in pattern recognition research. The most recent resurgence in artificial neural net computing is due to initiatives from the cognitive psychology sciences and from researchers interested in biological information processing matters.
It is a huge and high-risk jump to go from well-accepted, highly professional psychological or biological studies to the dubious practice of postulating some drastically simplified "neuronal" computational models and to try to establish some relevancy between the two types of endeavors. However, at any rate, as is well known, such initiatives were carried out over the past decade and have stimulated a powerful resurgence of interest and activity in artificial neural net computing [2]. Of primary significance to us is the outcome that, regardless of whether the artificial neural net computing paradigm models biology or not, it is of intrinsic value to information processing researchers, especially pattern recognition researchers who are interested in the "pattern-ness" of matters and in the rapid distributed parallel processing of associated nets of such patterns [2]. Currently there is not only interest in basic matters in artificial neural net computing, but also extensive activity in the application of this technology to practical tasks, with reports of considerable success. This chapter is primarily in the nature of an annotated guide to the knowledge which comprises the core of the state-of-the-art in this field at this time. The guide is, therefore, selective rather than comprehensive, and the annotation reflects our personal biases and viewpoints, as indeed must be the case for the annotation to be meaningful. The organization of our presentation of materials is described in Section 2. The topical matters themselves are discussed in subsequent sections. These rather sparse schematic discussions are supplemented by a section of comments and bibliographic remarks and by a list of titles for further reading.

2. Organization of Chapter

In a manner consonant with accepted practice, we divide the architectures and algorithms of artificial neural net computing into four parts characterized by the headings of unsupervised learning, supervised learning, associative memory, and optimization. In addition we list a fifth area, which addresses systems level issues. In Table 1 we list for each such topical area some typical architectures, algorithms, and functionalities supported by the algorithms, and the corresponding activities and results in traditional pattern recognition research. We believe that Table 1 indicates that neural net computing does indeed address issues of interest to pattern recognition and might indeed provide effective means for realizing the computational objectives of pattern recognition. There are aspects of artificial neural net computing which have been well discussed in the literature and even in books. There is no need for us to repeat such discussions in this brief chapter. For accepted background material, we refer the reader to the referenced works and also to the additional bibliographies. In the following sections, we discuss each of these areas.
Table 1. Neural net computing and pattern recognition.

Unsupervised Learning. Representative algorithms: ART 1 & 2; LVQ; topologically correct mapping; Max nets. Functionality: forming clusters; modifying clusters; topologically correct representation; classification. Traditional pattern recognition issues: K-means and ISODATA; data reduction; feature extraction. Comments: new concepts such as the vigilance factor; neural net computing does not deal with feature extraction.

Supervised Learning. Representative algorithms: generalized delta rule/backpropagation-of-error; functional-link net. Functionality: learning a functional mapping from a set of examples; classification. Traditional pattern recognition issues: learning discriminants; non-parametric estimation (usually limited to estimation of density distribution functions).

Associative Memory. Representative algorithms: Hopfield net. Functionality: associative recall; restoration of corrupted patterns; classification. Traditional pattern recognition issues: distributed matrix associative memories; associative memory. Comments: underdeveloped in pattern recognition.

Optimization. Representative algorithms: Hopfield and Tank approach. Functionality: optimal activation for complex problems; gradient search. Traditional pattern recognition issues: no direct correspondence. Comments: undeveloped in pattern recognition.

System Level Issues. Representative algorithms: ART 1, 2 or 3; Pao and Hafez algorithm for concept formation. Functionality: feature extraction; concept formation; inductive learning; discerning regularities in data. Comments: this area is of great importance but underdeveloped in traditional pattern recognition.
3. Unsupervised Learning

We can distinguish between three types of unsupervised learning represented by the algorithms of the types of ART [3-5], LVQ [6], and topologically correct mapping [7,8]. To some this area of neural net computing contributes the least to pattern recognition, because in a sense nothing significantly new is added to the principal functionality of cluster formation. Indeed, it might be argued that existing methods, such as the K-means [9] algorithm or the ISODATA [10] algorithm, can do just as well if not better than the corresponding neural net algorithms. However to others it is exactly this close correspondence which is satisfying and stimulating. Currently in neural computing, clustering is established on the basis of some metric defined in the actual pattern space in question. This means that we establish a rule for calculating the "distance" between two patterns, and decide whether they should be considered sufficiently similar to be grouped within one and the same cluster or whether they should be in different clusters. This is illustrated in Fig. 1(a) for some geometric but not necessarily isotropic metric. If the metric is isotropic, meaning that the rule for calculating distances is the same regardless of the direction in which we look from any one pattern, then the result is a partitioning of pattern space into distinctive nonoverlapping hyperspherical regions or clusters, as shown in Fig. 1(b), for a Euclidean distance metric, in two dimensions. Even in such a straightforward simple procedure, we can introduce variety by specifying different cluster radius thresholds for different regions of the pattern space. That can and in general does result in the need for special procedures for resolving conflict and for ensuring convergence. To date no neural net algorithm provides the capability of shaping clusters of the form shown in Fig. 1(c) in a meaningful and adaptive manner.

3.1. ART
The well-accepted ART algorithms might seem to differ from the above Euclidean distance approach but actually deviate from it only slightly, being exactly the Euclidean distance approach if all of the vectors are of the same length. As shown in Fig. 2, in the ART algorithm each input pattern vector x is projected on each and all of the prototype vectors b_j, and the cluster (prototype) node with the largest projection sum y_j = Σ_i b_ji x_i is identified with the use of the MAXNET. The proposition that the input vector x belongs to cluster j is then checked by forming the vigilance factor Σ_i t_ji x_i. If that exceeds a threshold value (say) ρ, then the vector x is accepted as an additional new member of that jth cluster, and the values b_j are updated. The top-down vigilance factor components are updated also. The ART algorithms are well explained in the literature [3-5], but we advocate and practice a slightly modified version of these, especially in so far as updating is concerned [11].
Fig. 1. Formation of clusters in unsupervised learning: (a) essentials of a cluster, (b) formation of distinctive non-overlapping clusters, and (c) more general cluster formation.
The projection procedure is adequate as long as both the b vectors and the input vectors are all of the same length. Under such circumstances, the scalar product of the two vectors b and x does, indeed, provide a measure of the similarity. Also we note that the square of the Euclidean distance between the b_j vector and the x vector is

|b_j - x|² = b_j^T b_j - 2 b_j^T x + x^T x .    (3.1)

Clearly the larger the value of b_j^T x, the smaller the Euclidean distance between b_j and x or, in other words, the more similar they are. Also clearly all the previous remarks are also valid for the case of binary valued features as in ART 1.
In general, however, we advocate the practice described in Box 1, which is compatible with standard pattern recognition practice and with ART 2.
Box 1
1. Activate all output nodes j, j = 1, 2, ..., J.
2. Initialize weights b_ji = ε_ji, where the ε_ji are random numbers (-1 < ε_ji < 1).
3. Input pattern {x_i}, i = 1, ..., N.
4. Calculate the square of the Euclidean distance ED_j² = Σ_i (b_ji - x_i)².
5. Determine that j for which ED_j² ≤ ED_k² for all k = 1, 2, ..., J; k ≠ j.
6. Assign pattern {x_i} as belonging to node j if ED_j² also is equal to or less than ED²(limit), where ED²(limit) is a more or less arbitrarily chosen limiting radius beyond which patterns are not considered to be of that cluster.
7. Update b_ji(n + 1) = [n/(n + 1)] b_ji(n) + [1/(n + 1)] x_i (n = 0 at initialization). Therefore, after the input of the first pattern, b_ji(1) = x_i.
8. Input the next pattern, determine to which unsupervised learning node it belongs, and update the corresponding {b_ji}.
We note that the cluster centers, the b_j vectors, are only slightly perturbed by the inclusion of a new member, especially if the cluster already contains a number of members. The updating of b_j is weighted so that

b_j(n + 1) = [n/(n + 1)] b_j(n) + [1/(n + 1)] x

when the jth cluster with n members adds an additional member x to the cluster. The top-down vigilance vector t_j is then taken to be equal to b_j and is updated in the same manner. There are different ways of exercising this algorithm, depending on whether one should activate all cluster prototype nodes initially or activate additional ones only as needed. In contrast to concerns which might dominate if we were endeavoring to build models of the brain, in artificial neural net computing it would seem that the latter practice, that of activating each additional new prototype cluster as needed, is more reasonable, and usually convergence to stable cluster centers occurs in a straightforward manner. Also in the case of artificial neural net computing, there may be circumstances where determination of maximum similarity might be carried out more simply than with use of the MAXNET [12]. This type of algorithm corresponds closely to the K-means and ISODATA algorithms, and more to the former than to the latter. Our interest in the ART type of algorithm lies in the suggested net architecture and in the fact that the procedure corresponds to that of the K-means algorithm.
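A minimal Python sketch of the Box 1 procedure is given below. It activates a new prototype node only when the nearest existing centre is farther away than the chosen limiting radius, and updates centres as running means; the data layout and the toy usage are assumptions for illustration only.

```python
def art_like_clustering(patterns, radius_limit):
    """Sketch of the Box 1 procedure: assign each pattern to the nearest
    existing cluster centre (Euclidean distance); if the nearest centre is
    farther than `radius_limit`, activate a new cluster node.  Centres are
    updated as running means, b(n+1) = n/(n+1) b(n) + 1/(n+1) x."""
    centres, counts, labels = [], [], []
    for x in patterns:
        if centres:
            d2 = [sum((bi - xi) ** 2 for bi, xi in zip(b, x)) for b in centres]
            j = min(range(len(centres)), key=d2.__getitem__)
        if not centres or d2[j] > radius_limit ** 2:
            centres.append(list(x)); counts.append(1); labels.append(len(centres) - 1)
            continue
        n = counts[j]                              # running-mean update of centre j
        centres[j] = [(n * bi + xi) / (n + 1) for bi, xi in zip(centres[j], x)]
        counts[j] = n + 1
        labels.append(j)
    return centres, labels

# toy usage: two well-separated groups of 2-D patterns
pats = [(0.1, 0.0), (0.0, 0.2), (5.0, 5.1), (5.2, 4.9)]
print(art_like_clustering(pats, radius_limit=1.0))
```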
Fig. 2. Some aspects of the ART algorithm. (a) Schematic illustration of the ART net: input nodes x_1, ..., x_N feed output nodes y_1, ..., y_M, and a MAXNET determines cluster membership (not all connections are shown; the top-down t_ji links, used in verification, are also not shown). (b) Two-dimensional illustration of the equivalence of the projection and distance measures when all pattern vectors are of the same length.
The ART algorithm has been extended to hierarchical ART structures [5] in work which addresses systems issues in the use of such algorithms. What is a little disappointing is the lack of opportunity to shape the clusters and to merge or split them as in the case of ISODATA, as illustrated in Fig. 1(c).

3.2. Learning Vector Quantization
The Learning Vector Quantization (LVQ) algorithm builds on the ART type of algorithm [6] and mixes supervised learning with cluster formation. In a manner similar to that of the Widrow-Hoff [1] algorithm of pattern recognition, it refines the structure of a cluster by examining the class membership of each of the members in turn. The assumption is that nearly all of the cluster members belong to one and the same class. Now as each member is examined in turn, the cluster prototype is modified to move closer to the current member under consideration or away from it, depending on whether that member is or is not a member of the majority class.
That is,

m(n + 1) = m(n) + α(x - m(n))   if x is of the class of the cluster   (3.3)

or

m(n + 1) = m(n) - α(x - m(n))   if x is not of that class   (3.4)
where n is the number of cluster members already checked, and m is the vector denoting the cluster center. The parameter α is a fractional quantity which decreases with n so that there is convergence. This situation is depicted schematically in Fig. 3.
Fig. 3. The Learning Vector Quantization (LVQ) algorithm incorporating supervised classification into the cluster procedures. In the figure, x denotes patterns of class c (the majority), o denotes patterns not of class c, and the marked point is the center of the cluster of class c patterns (biased by the presence of the non-c patterns); the cluster center adapts by moving by Δx.
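The LVQ refinement of Eqs. (3.3) and (3.4) can be sketched in Python as follows; the particular decreasing schedule for α is an assumption, the only requirement being that α is a fractional quantity that decreases with n.

```python
def lvq_update(m, x, same_class, alpha):
    """One LVQ step (Eqs. (3.3)-(3.4)): move the cluster centre m toward x
    if x belongs to the cluster's class, otherwise away from it."""
    sign = 1.0 if same_class else -1.0
    return [mi + sign * alpha * (xi - mi) for mi, xi in zip(m, x)]

def lvq_refine(m, members, classes, majority, alpha0=0.3):
    """Refine a cluster centre by examining each member in turn; the step
    size decreases with the number of members already checked (assumed schedule)."""
    for n, (x, c) in enumerate(zip(members, classes), start=1):
        alpha = alpha0 / n                      # decreasing alpha ensures convergence
        m = lvq_update(m, x, c == majority, alpha)
    return m
```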
3.3. Topologically Correct Mapping

The topologically correct mapping approach to clustering allows us to investigate relationships between the following matters:
• patterns defined in an N-dimensional positional space X,
• a process (or metric) defined on X and on the patterns described in X,
• nodes (or neurons) located spatially in an M-dimensional positional space Y, and
• an interactional process (or metric) defined for the neurons described in Y.
These entities are described in Fig. 4 where we show pattern vectors {x} defined in pattern space X . The ordering process we impose on top of the patterns in that space is not limited to a determination of the inter-pattern Euclidean distance, but can be quite general indeed. Given the patterns in X and the ordering process, we ask how the consequences of that ordering might be reflected in another space. In particular, for instance, for an array of fixed position neurons in space Y with
Fig. 4. Schematic illustration of a hypothetical instance of a useful topologically correct mapping: an N-dimensional X space, together with an ordering process which imposes a "meaning" on the pattern-ness of the members of X space, is mapped onto a lower-dimensional Y space suitable for representing the essence of the original order (e.g. a 2-D mapping of utterances, with regions such as high frequency, sibilants, fricatives, and voiced).
Despite our attempts to rationalize our ready acceptance of the results of those illustrations, it is true that very little has been said about the theory of such mapping processes and work remains. In our fanciful illustration exhibited in Fig. 4, we suggest that in speech processing some weighting of formant-time values of utterances, together with an interneuron interaction in display space, might result in a meaningful topologically correct mapping, in which the underlying order, always present, is now made manifest. Such mappings might provide some model of how biological systems organize themselves, but also would be interesting for neural net computing and pattern recognition purposes.

4. Supervised Learning

In neural net computing, the notion of "supervised learning" corresponds to the inductive learning of a functional mapping from R^N to R^M, given a set of examples of instances of that mapping. In other words, if we know that the vectors x_i map into vectors y_i for i = 1, 2, 3, ..., I, can we construct a network computational structure which will accurately map all other x vectors in the N-dimensional X space into the corresponding correct image y vectors in the M-dimensional Y-space? This situation is illustrated schematically in Fig. 5.
Fig. 5. The concept of learning a functional mapping from observation of examples of such mappings.
This type of activity corresponds most closely to the pattern recognition task of learning a discriminant function for the purposes of classification. It is interesting to note that the task of quantitative estimation is not addressed in pattern recognition except for estimation of density distribution functions, and even there the nature of the task is closer to synthesizing an analytical representation of known (measurable) densities rather than the inductive learning of functional mapping identified only through a set of examples. In this section, we comment on the backpropagation-of-error algorithm, briefly and schematically, because it is well known and the details of the algorithm have been widely disseminated [2].
In so far as learning procedures are concerned, we briefly describe two others in the following: the Boltzmann machine (with simulated annealing) [13] and the functional-link net approach [11,14].
4.1. Backpropagation-of-Error Learning Algorithm

The feedforward net is illustrated schematically in Fig. 6 for a functional mapping R^N → R. The input to such a net is a vector in N-dimensional space and the output is a single real number. It is assumed that there is a functional mapping y = f(x), instances of which are known, {y_p = f(x_p)}, and the learning task consists of determining the values of the weights {A_ji} and {β_j} and the thresholds {b_j} so that the mean of the squares of the error, Σ_p (f̂(x_p) - f(x_p))², is minimized. There is no loss of generality in omitting a nonlinear transform at the single output node. In the general case, there would be more than a single output and there could be more than one hidden layer. The weights and thresholds are determined on the basis of minimizing the overall system error averaged over all the training sets. That is, the quantity Σ_k Σ_p (Ô_k(x_p) - O_k(x_p))² is minimized, where Ô_k(x_p) is the desired (or target) output at the kth node for the pth pattern, and O_k(x_p) is the actual computed value of the kth output for the same pattern.
Fig. 6. A feedforward neural net with hidden layer and no intra-layer node interactions, used with the backpropagation-of-error algorithm. Shown for R^N → R.
In the learning process, the weights β_j (or β_kj in the multi-output case) are readily learned because we have a direct measure of the error (Ô_k(x_p) - O_k(x_p)) at each and all outputs, for all of the training set patterns. However, for the hidden nodes, there is no direct measure of the relevant error ascribable to a particular hidden node, and so the output pattern error has to be propagated backwards and interpreted appropriately to serve as a measure of guidance for improving the values of the weights leading into the hidden-layer node.
Although the overall learning procedure of the backpropagation-of-error algorithm is that of gradient search in weight space and that protocol is adhered to in all cases, there are, nevertheless, many variations on the adaptation scheme, primarily on how to improve the rate of convergence to the point of least error. There exist a number of papers which prove that a multilayer feedforward net can serve as a universal approximator, from a computational point of view, of quite general functional mappings. In other words, provided the spaces X and Y are measurable spaces and the known function is well behaved, then a net of the type shown in Fig. 6 can, indeed, reproduce the known mapping [15-17] and even the derivatives of the functional mapping [18]. Furthermore, even nets with only a single hidden layer can serve as a universal approximator provided the activation functions are of an appropriately constrained form. The multilayer feedforward net depicted in Fig. 6 has linear links and nonlinear activation functions at the nodes. The theoretical proofs of the adequacy of this computational model assure us that the known mapping, as made evident by the set of examples {x_i → y_i}, can, indeed, be computed by that type of net. However, in pattern recognition and in applications of pattern recognition, interest in supervised learning goes beyond the question of whether known instances of mappings can be duplicated or not. In fact, the primary interest is whether the net can inductively learn a representation of the presumed functional mapping which is valid for samples of x not included in the training set. In other words, as in other cases of pattern recognition, the interest is in whether the learned mapping is valid for the test set (of x vectors) also. The critical issue is the validity of the generalization. From a signal processing point of view, the generalized delta rule (GDR) multilayer feedforward net is a complex system. If we want to represent the functionality of such a net in terms of a transfer function, we would find that perhaps the best we could do would be to give instances of the effective small-signal transfer function at different signal regions. Even then there remain questions of the efficiency of learning and the quality of the learning achieved with use of different learning procedures. We will discuss these latter issues again briefly in the following in the context of the functional-link net.
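For concreteness, a Python sketch of the forward computation and the error criterion of the net of Fig. 6 is given below. The sigmoid activation at the hidden nodes, the linear output node, and the option of training only the output weights (anticipating the functional-link view of Section 4.3) are illustrative assumptions rather than the full generalized delta rule.

```python
import math

def forward(x, A, b, beta):
    """Forward pass of the single-hidden-layer net of Fig. 6 (sketch):
    hidden activations h_j = sigmoid(sum_i A[j][i]*x[i] - b[j]);
    output f_hat = sum_j beta[j]*h_j (linear output node)."""
    h = [1.0 / (1.0 + math.exp(-(sum(a * xi for a, xi in zip(Aj, x)) - bj)))
         for Aj, bj in zip(A, b)]
    return sum(bj * hj for bj, hj in zip(beta, h)), h

def mse(samples, A, b, beta):
    """Overall system error averaged over the training set {(x_p, y_p)}."""
    return sum((y - forward(x, A, b, beta)[0]) ** 2 for x, y in samples) / len(samples)

def train_output_weights(samples, A, b, beta, lr=0.05, epochs=200):
    """Gradient steps on the output weights only (the functional-link view:
    hidden-layer weights A and thresholds b left fixed, only beta learned)."""
    for _ in range(epochs):
        for x, y in samples:
            y_hat, h = forward(x, A, b, beta)
            err = y - y_hat
            beta = [bj + lr * err * hj for bj, hj in zip(beta, h)]
    return beta
```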
4.2. The Boltzmann Machine and Simulated Annealing

An alternate approach to learning an optimal set of weight and threshold values is to "generate and test". In this alternate approach [13,19,20], different states in weight space can be generated statistically, and each newly proposed state is evaluated and accepted or not, depending on whether the LMS error is decreased or whether the increase in the magnitude of the LMS error is within a tolerable amount. In the simulated annealing approach, we evaluate the change in the magnitude of the error, Δε = ε(n + 1) - ε(n), as we generate the (n + 1)th state. We also generate a random number p in the interval [0, 1].
If Δε < 0, then we accept the new set of weights as a better set and go on to generate yet another (hopefully) even better set. In this way we let the system migrate to an optimum state in weight space. However, if Δε > 0, we do not necessarily reject the new state. Instead, we compare exp(-Δε/c) with the random number p.
If exp(−Δε/c) > ρ, we accept the new set of weights even though there is an increase in error. However, if exp(−Δε/c) ≤ ρ, we reject the new state and go on to generate another trial state.
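The acceptance rule just described translates directly into a short generate-and-test loop. The sketch below is illustrative rather than taken from the chapter: `lms_error` stands for whatever LMS error measure is in use, and the Gaussian perturbation, step size and cooling factor are placeholder choices.

```python
import numpy as np

def anneal_weights(weights, lms_error, c0=1.0, alpha=0.99, n_steps=2000, step=0.05, seed=0):
    """Generate-and-test search over weight space with the annealing acceptance rule.

    A proposed state is always accepted if it lowers the LMS error; otherwise it is
    accepted with probability exp(-delta_eps / c), where c is gradually reduced.
    """
    rng = np.random.default_rng(seed)
    c = c0
    eps = lms_error(weights)
    for _ in range(n_steps):
        trial = weights + step * rng.standard_normal(weights.shape)  # propose a new state
        trial_eps = lms_error(trial)
        delta_eps = trial_eps - eps
        if delta_eps < 0 or np.exp(-delta_eps / c) > rng.random():
            weights, eps = trial, trial_eps                          # accept the trial state
        c *= alpha                                                   # "cooling" schedule
    return weights, eps
```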
In simulated annealing the "temperature" parameter c is, at first, taken to be quite large so that Δε/c is liable to be quite small, and exp(−Δε/c) large, and the state of the system can wander quite a bit in weight space. As c → 0, large increases in error become less and less tolerated and the overall effect is to cause the state of the system to search for and to diffuse toward regions of lower and lower error. If and when carried out well, the simulated annealing procedure allows the system to explore, at first, wide regions of weight space and to avoid being trapped in narrow local minima. Use of the expression exp(−Δε/c) is inspired by an analogy to the Boltzmann distribution of energy states in a classical (nonquantum-mechanical) system in thermodynamic equilibrium at some temperature. The gradual lowering of the "temperature" parameter corresponds to annealing, hence the term "simulated" annealing. It is amusing to note that in practice we often have simulated quenching working quite well also [20]. The term Boltzmann machine generally refers to network structures other than the feedforward net, but does not exclude the feedforward architecture. Indeed, it is used frequently for nets which have bidirectional excitatory and inhibitory internode interactions [13]. The procedure we have just described can also be considered to be an instance of the "generate and test" approach to learning, in contrast to the gradient search approach. In practice, use of the Boltzmann machine comprises two separate tasks, one being the choice of an appropriate structure and the other the learning of the values of the weights. To illustrate this and other points we have made, we discuss briefly the task of training a Boltzmann machine digit recognizer [11]. The numerical digits are represented in terms of the segments of a seven-segment display as shown in Fig. 7 and the input/output relationships of the Boltzmann machine are shown in Fig. 8.
For this case no “hidden” nodes are needed, and the structure of the machine is that shown in Fig. 9. An important point is that there are extensive intra-layer node-to-node interactions. However it is not always true that “hidden” nodes can be avoided.
Fig. 7. A seven-segment display format for numerical digits [11].
Fig. 8. Input/output relationship for a Boltzmann machine classification net [11] (each digit 0-9 is represented by a seven-segment input code and a ten-element one-of-ten output code).
Fig. 9. Structure of the digit recognition Boltzmann machine [11].
4.3. The Functional-Link Net
Experience with the backpropagation-of-error algorithm indicates that the algorithm is often slow and does not extrapolate well to high dimensions or to large training sets. However, users often find that ease of learning can be greatly enhanced by appropriate "preprocessing". It is because of that type of experience that we
initially advocated a functional-link net approach to supervised learning. Instead of using a multilayer feedforward net with backpropagation of error, we advocated enhancing the input vector with functional links gj(x) to yield a description of the input in an extended pattern space, with additional dimensions [11,14]. The functions g(x) are functions of the entire input pattern vector x and not just functions of any one component xi. In one version of that approach, the one which approaches the backpropagation-of-error algorithm the closest, our approach consists in simply claiming that the first-layer weights Aji and thresholds bj in the feedforward net of Fig. 6 need not be learned. Subject to rather general and easily satisfied constraints, only the output weights βj need to be learned. This is easily demonstrated. For illustration purposes we consider a functional mapping R → R; namely, both the input and output spaces are one-dimensional. There is no loss of generality. We choose the one-dimensional case because of the ease of displaying results graphically. In other words, we assume that there is a mapping y = f(x). Given instances {yn = f(xn)}, can we learn the functional mapping sufficiently well so that we can interpolate and extrapolate to values of x not encountered in the training set? This is a question of utmost importance and interest. Both the BP net and the functional-link net are illustrated in Fig. 10 for this case. Let there be J hidden-layer nodes in a BP net. Then the value of the output is
o(x) = Σj βj g(Aj x + bj)    (4.1)

where g( ) is the activation function. Let there be N training set patterns so that the entire set of N simultaneous equations to be solved can be written as

Gβ = f    (4.2)

or

Σj gnj βj = fn,   for n = 1, 2, 3, ..., N.    (4.3)

Equation (4.1) can be expressed in component form as follows:

    [ g11  g12  ...  g1J ] [ β1 ]   [ f1 ]
    [ g21  g22  ...  g2J ] [ β2 ] = [ f2 ]    (4.4)
    [  :    :         :  ] [  : ]   [  : ]
    [ gN1  gN2  ...  gNJ ] [ βJ ]   [ fN ]
Fig. 10. Comparison of backpropagation and functional-link nets.
where gj(xn) = gnj = g(Aj xn + bj) and xn is the value of the input for the nth training set pattern. It is clear that the nature of the solutions of Eq. (4.4) depends critically on the rank of the G matrix, and the values of the individual components are to some extent immaterial. That is, instead of the feedforward net of Fig. 10(a), we advocate the net illustrated in Fig. 10(b), in which the initial input x is enhanced in dimensions and has the additional components gj(Aj x + bj). If G is exactly of the correct rank, then a unique solution exists. If there are too many constraints, then there is only an LMS solution or possibly a degenerate set of such solutions. If there is an insufficient number of constraining equations, then there may be an infinite number of solutions for the weights {βj}. The point we make is that the weights Aj (and thresholds bj) can be randomly generated with no loss of generality.
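A minimal sketch of this random-vector version is given below. It is an illustration only: the tanh enhancement, the scale of the random vectors and the least-squares solver are assumptions not specified in the text. The first-layer weights and thresholds are drawn at random and only the output weights β are obtained, here by solving Gβ = f in the LMS sense.

```python
import numpy as np

def fit_random_functional_link(X, f, n_enhancements=50, scale=1.0, seed=0):
    """Fit output weights beta for enhancements g_j(x) = tanh(A_j . x + b_j) with random A_j, b_j."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = scale * rng.standard_normal((n_enhancements, d))   # random first-layer weights (not learned)
    b = scale * rng.standard_normal(n_enhancements)        # random thresholds (not learned)
    G = np.tanh(X @ A.T + b)                                # N x J matrix of g_nj values
    beta, *_ = np.linalg.lstsq(G, f, rcond=None)            # LMS solution of G beta = f
    return A, b, beta

def predict(X, A, b, beta):
    """Evaluate the learned mapping on new inputs X."""
    return np.tanh(X @ A.T + b) @ beta
```

The `scale` parameter reflects the precaution discussed below: the random vectors must be scaled so that the enhancement outputs are neither all saturated nor nearly linearly dependent.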
For a mapping Rⁿ → R, the input is a vector rather than a scalar, but the argument remains the same. The equations to be solved are then

Σj βj g(Aj · xn + bj) = fn,   n = 1, 2, ..., N.    (4.5)

Our point is that the vectors Aj and thresholds bj may be generated randomly and only the weights βj need to be learned. The function gj( ) is a function of the entire input pattern vector x and not just a function of any one component xi. The function g( ) is not learned but is a "hardwired" functional transform of the input vector constituting a preprocessing step, so to speak. In view of these findings, we advocate regarding a supervised learning net to be essentially a linear net with the input augmented with extra (nonlinear) nodes. A large set of experiences indicates that our view is correct and that very large improvements in the fidelity of representation and in the rate of learning can be achieved with use of the random vector version of the functional-link net. However, there is a very important precaution to be observed. The range of the amplitude of the "random" vectors Aj needs to be scaled so that the functional outputs g(Aj · x + bj) are not all saturated, nor all so small that the additional components are linearly dependent. In general this precaution is not difficult to deal with. A normalizing scaling of the range of the input vectors and of the norms of the random vectors would be sufficient. The situation is changed significantly if we insist that both the derivative of the function f(x) as well as the function itself be approximated well. Under such circumstances the vectors Aj and thresholds bj do indeed need to be learned and the two sets of equations to be satisfied are
Eqs. (4.6) and (4.7), where the notation of Eqs. (4.8) and (4.9) represents differentiation with respect to the ith component of the vector x. It is difficult to solve Eqs. (4.6) and (4.7) simultaneously for a set of {βj}, {Aji}, and {bj} values which will approximate the derivative as well as the function. But good mappings can be learned, nevertheless, by retreating to our simple functional-link net approach and taking sets of points near each training set input vector so
that in essence something is learned about how the function varies in different directions. The result of the use of this "random vector" version of the functional-link net approach is that we can achieve the learning of rather complex functional mappings in moderate lengths of time. The nets used are linear nets with the input augmented with extra nonlinear functional transforms of the input vector. We present and discuss some experimental results in the following subsection. These results are suggestive, but we refrain from generalizing too optimistically on the basis of these partial findings.

4.4. Experimental Results in Support of the Functional-Link Net
For rather straightforward training tasks, the "random vector" functional-link net outperforms the BP algorithm, principally in terms of the rapidity with which learning is achieved. However, we are also concerned about the ability to inductively learn a mapping of which we know a few instances. To explore the interpolation and extrapolation capabilities of such nets, we revert initially to the one-dimensional case and consider the task of learning a function y = f(x) where both x and y are scalar quantities. Of course, sometimes correct interpolation cannot be achieved because not enough information was available in the first place. Given the training set of Fig. 11, there is simply no way for either the BP net or the functional-link net to guess what nature had in mind. Actually, in practice, both nets learned the smooth function.
Fig. 11. An ambiguous training set of patterns for R → R mapping.
However we might add a few (non-random) instances of the function to provide further information as shown in Fig. 12, in which case both the functional-link net and the backpropagation net interpolated well, but in different ways. The BP net
Fig. 12. An augmented training set.
took a very long time and a large number of iterations. The functional-link net used a large number of augmentations, but learned rapidly in a small fraction of the time required by the BP net. These matters are illustrated in Figs. 13 and 14. To explore comparable circumstances for higher dimensions, we also considered a two-input and two-output learning task. This can be visualized as learning two surfaces in a three-dimensional space. In every instance the input is a pair of coordinates (x, y) and the outputs are f1(x, y) and f2(x, y), representing the upper and lower surfaces of a bounded region. The net configurations are depicted in Fig. 15 and the surfaces to be learned are shown in Figs. 16 and 17. Given a reasonably uniform and representative sampling of the two surfaces, both the BP net and the functional-link net do learn the two surfaces and interpolate reasonably well. However, the functional-link net again learns much more rapidly. The interpolations achieved by the BP net and the FLN net for the upper surface are shown in Figs. 18 and 19, respectively. The interpolations achieved by the BP net and the FLN net for the lower (and smoother) surface are shown in Figs. 20 and 21, respectively. To the eye, the FLN results look more irregular, but actually the estimated results are more accurate. Again, the FLN net is faster by a factor of about 10³. We mention, in passing, that in the case of the BP net the hidden-layer nodes serve both outputs. The hidden-layer nodes are therefore constrained. There are advantages and disadvantages. One disadvantage is that any change in the input/output relationships at any single output will have widespread and severe repercussions throughout the entire net. One advantage is that such severe interactions might indeed force the hidden layer to take on the form of a meaningful internal representation. But that would be attained only by paying a high price in the form of the difficulty of learning!
Fig. 13. Demonstration of interpolation achieved with a BP net for R → R mapping (training set and estimated output shown). BP net: 20 hidden-layer nodes; number of iterations: 556,245; system error: 0.000025; training time (486 PC): approx. 6 hrs.
In contrast, in the case of the functional-link net, each output would be served by its own net and learning is rapidly achieved. However, this does not mean that interactions between components of the input vector are neglected.

5. Associative Memories

The term associative memories is used in different ways within neural net computing practice, as well as vis-à-vis pattern recognition.
Fig. 14. Demonstration of interpolation achieved with an FLN net for R → R mapping (training set and estimated output shown). FLN net: auto expansions: 500; number of iterations: 1,160; system error: 0.000025; training time (486 PC): approx. 5 min.
For example, the matrix associative memory was studied by Nakano [21], Kohonen [22], Willshaw [23], Pao [24], and others, primarily as models of distributed content-addressable memories which were forgiving of errors or distortion in the cue and also forgiving of local damage to the memory. Hopfield [25] accentuated the perspective of viewing such distributed content-addressable devices as nets. In the Hopfield net, there is no learning per se, just memorization, and the net computes a more nearly correct output pattern in response to a possibly distorted input pattern. These nets are fully and widely described in the literature.
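For concreteness, a minimal Hopfield-style sketch is given below (an illustration, not the specific formulation of the references): bipolar patterns are memorized through an outer-product weight matrix, and a possibly distorted cue is updated asynchronously until it settles on a stored pattern.

```python
import numpy as np

def hopfield_store(patterns):
    """Memorize bipolar (+1/-1) patterns with the outer-product rule; no learning per se."""
    P = np.asarray(patterns, dtype=float)
    n = P.shape[1]
    W = (P.T @ P) / n
    np.fill_diagonal(W, 0.0)              # no self-connections
    return W

def hopfield_recall(W, cue, n_iters=20):
    """Asynchronously update a (possibly distorted) cue until it settles on a stored pattern."""
    v = np.array(cue, dtype=float)
    for _ in range(n_iters):
        for i in np.random.permutation(len(v)):
            v[i] = 1.0 if W[i] @ v >= 0 else -1.0
    return v
```

Note that, as remarked above, an N-neuron net of this kind requires on the order of N² links.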
Fig. 15. Illustration of a two-input/two-output net: (a) the BP net configuration and (b) the FLN configuration.
Fig. 16. Output 1 of the two-output net.
Fig. 17. Output 2 of the two-output net.
Fig. 18. 2-D mapping learned with a BP net: the upper surface f1(x).
In a series of papers, Kosko [26] explored the question of whether bidirectional associative memories could be synthesized with the memory still in distributed matrix form, but with a nonlinear transformation at each end so that
M S(y) = x    (5.1)
Fig. 19. 2-D mapping learned with an FL net: the upper surface f1(x).
Fig. 20. The surface f2(x) as learned with a BP net.
and
Mᵀ S(x) = y.    (5.2)

Such memories can indeed be achieved, but often only with great difficulty. They are, nevertheless, noteworthy because they are not limited to being auto-associative memories.
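A schematic sketch of such bidirectional (hetero-associative) recall is shown below. It is an illustration of the idea rather than Kosko's construction: the memory is a simple outer-product matrix M, and a hard sign nonlinearity stands in for the transformation S.

```python
import numpy as np

def bam_store(X, Y):
    """M = sum_k x_k y_k^T for bipolar pattern pairs, so that M S(y) ~ x and M^T S(x) ~ y."""
    return np.asarray(X, dtype=float).T @ np.asarray(Y, dtype=float)

def bam_recall(M, x, n_iters=10):
    """Bounce a cue back and forth between the two layers until the pair stabilizes."""
    sgn = lambda z: np.where(z >= 0, 1.0, -1.0)
    for _ in range(n_iters):
        y = sgn(M.T @ x)     # forward pass:  x-layer -> y-layer
        x = sgn(M @ y)       # backward pass: y-layer -> x-layer
    return x, y
```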
Fig. 21. The surface f2(x) as learned with an FL net.
In contrast to such associative memories, we also have nets such as ART 1, ART 2 and ART 3 [3-5], which are also considered to be associative memories. These are probably closer in spirit to the associative memory models such as ACT [27] and ACT* [28], models devised by psychologists in attempts to mirror the workings of human memory. As far as pattern recognition is concerned, a memory such as the Hopfield net might be considered to be an excellent pattern recognizer capable of accepting partial or distorted cue patterns and returning a fully restored correct pattern. But, in practice, such fully connected nets are inefficient, with the need for N² links for an N-neuron net and with low storage capacity.
6. Optimization

In pattern recognition research, interest in the topic of optimization manifests itself somewhat indirectly in the search for optimal values of decision functions, usually for classification purposes. This type of interest is different from the optimization concerns in systems or controls research, where a typical task is to find that system state or control path for which an appropriately defined objective function has an optimal value, subject to certain constraints on the system. Despite this large difference in the degree and mode of involvement, we discuss briefly a widely known but somewhat controversial approach to optimization which is part of neural net computing practice.
Fig. 22. Associative recall as optimization. The memory state vector x evolves along the energy function E as a distorted cue activates the correct stored pattern and is retrieved as such.
We start by going back to the Hopfield net auto-associative memory. For that memory, storage of a pattern may be likened to the creation of a (local) minimum in an energy function, as indicated schematically in Fig. 22. In associative retrieval the Hopfield net is activated by an input which might be a distorted version of the stored pattern. The algorithm of the net is such that the system evolves to that state (that pattern) which corresponds to the energy function being at the (local) minimum. In this manner, retrieval with the Hopfield associative memory is equivalent to a set of optimization tasks. We know, from Hopfield and Tank [29], that given any initial state v the system evolves to the state corresponding to a minimum in the energy, if we let the system update itself iteratively in accordance with the equation
where ui is the input to the ith neuron and vi = g(ui) is the output of the ith neuron. For such a system, we have for the temporal evolution of the energy
And we see that E, indeed, evolves to a minimum if g(ui), the neuron activation function, is a nondecreasing function of ui. Hopfield and Tank suggested that the well-studied Traveling Salesman Problem (TSP) be encoded in the following manner. As illustrated in Fig. 23, a five-city/five-day planning task would be represented by the values of a set of 25 neurons. A
Fig. 23. Encoding the TSP problem for neural net computing: an example tour and the corresponding city-by-day encoding.
neuron would represent a specific city visited on a specific day, and that neuron would have an output value of 1 if that combination were part of the salesman's tour and a value of 0 if it were not. Hopfield and Tank synthesized an energy function in analytical form which represented not only the length of the salesman's path, but also imposed penalties for nonvalid solutions, such as a city being visited twice or the salesman being at two different places at the same time. In their approach, although the ultimate acceptable values for the neuron outputs were restricted to 1 or 0, they were treated as continuous variables in the processing. In this manner, a combinatorial optimization problem was converted into a gradient search task for neural net computing. In our opinion this constitutes both the strength and the weakness of this approach. A gradient search approach is advantageous because it obviates the necessity for devising some algorithm for generating new trial states. However, the advantage is real only if one can be reasonably assured of a smooth descent into the minimum state or into one of the sets of acceptable minima.
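As an illustration of this style of encoding (with placeholder penalty weights, not the coefficients used by Hopfield and Tank), the energy of a city-by-day assignment matrix V can be written as the tour length plus penalty terms for invalid assignments:

```python
import numpy as np

def tsp_energy(V, D, a=500.0, b=500.0, c=200.0):
    """Energy of a city-by-day assignment matrix V (entries in [0, 1]) for the TSP encoding.

    The first two terms penalize a city being visited twice or the salesman being in two
    places on the same day (each row and column of V should sum to 1); the last term is
    the tour length.  D is the inter-city distance matrix; a, b, c are illustrative weights.
    """
    row_penalty = np.sum((V.sum(axis=1) - 1.0) ** 2)   # each city visited exactly once
    col_penalty = np.sum((V.sum(axis=0) - 1.0) ** 2)   # exactly one city per day
    V_next = np.roll(V, -1, axis=1)                    # assignment on the following day (cyclic)
    tour_length = np.sum(D * (V @ V_next.T))           # sum_{i,j,t} d_ij V[i,t] V[j,t+1]
    return a * row_penalty + b * col_penalty + c * tour_length
```

Gradient descent on such an energy is exactly where the difficulties discussed next arise: the constraint terms themselves create many spurious local minima.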
The hypothetical advantage is more than offset by the real difficulties if there is a tremendous number of spurious local minima in the energy function, so that the search procedure almost immediately comes to a halt in the nearest spurious local minimum. We believe that the imposition of constraints in the energy function results in the creation of very large numbers of such minima. Some researchers, nevertheless, have employed this methodology to good effect for subclasses of optimization tasks. Takefuji [30], for example, has used "hill-climbing" terms in the energy function to eject the system state out of a local minimum if the state is not one which is acceptable as a solution. Other innovations, such as the "maximum neuron", have also been helpful for certain other circumstances. In other words, Takefuji and his collaborators found that the neural net version of gradient search for optimization is viable if certain acceptability conditions are known for the solution. These should not be incorporated into the energy function but can be used to activate "hill-climbing" terms if one or more of the validation conditions are not satisfied.
7. Representation, Feature Extraction and Concepts

In our opinion the most valuable aspect of neural net computing lies in none of the above four computational capabilities. It is rather in the promise of its being a useful (perhaps even the correct) tool for research in a very murky area of information processing research. This is an area which is only dimly perceived by many, denied by others, and partitioned and vehemently defended in isolated parts by yet others, but, nevertheless, tantalizing and beguiling to almost all. We speak of some underlying mysteries of human behavior, perhaps ultimately attributable to the nature of our "hardware", in this case the left and right halves of the brain, with different functionalities and propensities and not overly communicative with each other. In human behavior we have the dichotomy of perception and action on the one hand and language and reason on the other. In some human cultures, the importance of an internal "knowing" is elevated above all other considerations. Thus, one behaves well not necessarily through elaborate reasoning, but because we "know" that it is the correct behavior. One aims an arrow most accurately when one almost feels that no deliberate aiming is being done. In such behavior, perception and action are everything, and all matters can proceed smoothly and rapidly in an easy flow. To the modern-day information processing researcher, it is interesting and entertaining to go back into the history of philosophy and psychology to see that other human cultures have glorified the "pure light of reason". Language and reason are then supreme. There is even the hypothesis that humans are rational beings and always act to optimize attainment of their goals [28].
Most of us will admit that, in practice, our cultures and our behavior comprise admixtures of both aspects of behavior and both are important. The perception and action channel allows us to carry out intricate actions appropriately and rapidly in response to rapidly changing external conditions with no time for explicit cognitive deliberations. Thus, we can ride bicycles even on highly uneven roadways. Incidentally, we note that some bears can also be trained to ride bicycles, but they do not articulate their skill at all, as far as we can tell. However, humans can articulate some aspects of the bike-riding skill in language and help teach other humans that skill through use of the language and reasoning channel as well. Interactions between the two channels are, indeed, of great importance and value. One demonstration of the value and effectiveness of such interaction is provided by the example of an athletics coach being able to produce significant improvements in the performance of a star athlete even though the performance capabilities of the coach himself may be substantially below those of the athlete. In another instance it is found that improvements in foundry practice are obtained when experiences are carefully documented and the information shared with the foundry shop community through the language and symbols channel. It seems to us that one bottleneck to communications between the channels lies in the discovery and articulation of concepts. This same matter is also encountered under the guises of feature extraction, knowledge representation and so on. We believe that the perception and action channel draws a veil over its workings, so to speak, and in that mode of information processing, both for humans and in algorithms, uniqueness in representation or in feature selection might not be critical. To use an analogy, it is as if matrix operations proceed equally well in general representations as in eigenfunction representation. To pursue the analogy, matters are very different when one wants to describe matters in terms of concepts expressible in terms of linguistic symbols. In artificial intelligence, researchers do strive to learn concepts and there is success when concepts are "learned" in the sense of being inferred from other sets of concepts. However, the bridge between the linguistic symbolic world and that of perception is weak. These matters are very important not only because the subject matter is so interesting, but also because there may be the opportunity to fashion computer aids in ways to compensate for inadequacies due to the idiosyncrasies of human physiology. We believe that neural net computing provides a tool for capturing and manipulating the "pattern-ness" of things and also for encoding and articulating such matters into the linguistic symbol world. Some work by Pao and Hafez [31] addresses these matters. Hinton, McClelland, Rumelhart, Touretzky and others [32,33] have addressed the relationships between distributed associative processing and linguistic symbolic, logic-based information processing. But the emphasis has been to ask how the same type of rule-based processing, entirely with linguistic symbols on both the
antecedent and consequent sides of the rule, might be carried out in a "connectionist" representation. We believe what is just as interesting, perhaps more so, is to investigate what can be achieved if the distributed coarse-coding connectionist scheme is used as an interface between the pattern-ness of things and the extracted linguistic symbolic entities. In terms of human behavior, we would be striving to understand how we learn to dance a fast waltz not only by example but also aided by spoken instructions. Or, in a related matter, how do we "internalize" perceptual experiential knowledge so that we can verbalize that information and reason with it. Finally, it can be said that it is not that neural net computing is relevant to pattern recognition and to computer vision. It is rather that neural net computing might turn out to be an essential tool for unifying our pieces of knowledge in the fragmented bastions of research endeavor known presently as artificial intelligence, pattern recognition, fuzzy logic, computer vision, and so on.

References

[1] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis (John Wiley, NY, 1973).
[2] D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vols. 1 & 2 (MIT Press, Cambridge, MA, 1986).
[3] G. A. Carpenter and S. Grossberg, A massively parallel architecture for a self-organizing neural pattern recognition machine, Comput. Vision Graph. Image Process. 37 (1987) 54-115.
[4] G. A. Carpenter and S. Grossberg, ART2: Self-organization of stable category recognition codes for analog input patterns, Appl. Opt. 26 (1987) 4919-4930.
[5] G. A. Carpenter and S. Grossberg, ART3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures, Neural Networks 3 (1990) 129-152.
[6] T. Kohonen, An introduction to neural computing, Neural Networks 1 (1988) 3-16.
[7] T. Kohonen, Self-organized formation of topologically correct feature maps, Biol. Cybern. 43 (1988) 59-69.
[8] T. Kohonen, Clustering, taxonomy, and topological maps of patterns, in Proc. Sixth Int. Conf. on Pattern Recognition, Silver Spring, MD (IEEE Computer Society Press, 1982) 114-128.
[9] C. H. Chen, Statistical Pattern Recognition (Hayden, Washington, DC, 1973).
[10] G. H. Ball and D. J. Hall, ISODATA, an iterative method of multivariate data analysis and pattern classification, in Proc. IEEE Int. Communication Conf., Philadelphia, PA, Jun. 1966.
[11] Y. H. Pao, Adaptive Pattern Recognition and Neural Networks (Addison-Wesley, Reading, MA, 1988).
[12] R. P. Lippman, B. Gold and M. L. Malpass, A comparison of Hamming and Hopfield neural nets for pattern classification, MIT Lincoln Laboratory Technical Report TR-769, Massachusetts Institute of Technology, Cambridge, MA, 1987.
[13] E. Aarts and J. Korst, Simulated Annealing and Boltzmann Machines (John Wiley, New York, 1989).
[14] Y. H. Pao and Y. Takefuji, Functional-link net computing, IEEE Computer 3 (1992) 76-79.
[15] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems 2 (1989) 303-314.
[16] K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 359-366.
[17] K. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural Networks 2 (1989) 183-192.
[18] K. Hornik, M. Stinchcombe and H. White, Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks, Neural Networks 3 (1990) 551-560.
[19] G. E. Hinton and T. J. Sejnowski, Analyzing cooperative computation, in Proc. Fifth Annual Conf. of the Cognitive Science Society, Rochester, NY, May 1983.
[20] D. S. Touretzky and G. E. Hinton, Pattern matching and variable binding in a stochastic neural network, in L. Davis (ed.), Genetic Algorithms and Simulated Annealing (Morgan Kaufmann, Inc., Los Altos, CA, 1987).
[21] K. Nakano, Associatron — A model of associative memory, IEEE Trans. Syst. Man Cybern. 2 (1972) 380-388.
[22] T. Kohonen, Associative Memory: A System-Theoretical Approach (Springer-Verlag, New York, 1977).
[23] D. J. Willshaw, Model of distributed associative memory, unpublished doctoral dissertation, Department of Machine Intelligence, University of Edinburgh, Edinburgh, 1971.
[24] Y. H. Pao and G. P. Hartoch, Fast memory access by similarity measure, in J. Hayes, D. Michie and Y. H. Pao (eds.), Machine Intelligence 10 (Wiley, New York, 1982).
[25] J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proc. Nat. Acad. Sci. 79 (1982) 2554-2558.
[26] B. Kosko, Neural Networks and Fuzzy Systems (Prentice-Hall, Englewood Cliffs, NJ, 1992).
[27] J. R. Anderson and G. H. Bower, Human Associative Memory (V. H. Winston, Washington, DC, 1973) (distributed by the Halsted Press, Division of Wiley, NY).
[28] J. R. Anderson, The Adaptive Character of Thought (Lawrence Erlbaum Associates, Hillsdale, NJ, 1990).
[29] J. J. Hopfield and D. W. Tank, Neural computation of decisions in optimization problems, Biol. Cybern. 52 (1985) 144-152.
[30] Y. Takefuji, Neural Network Parallel Computing (Kluwer Academic, Boston, MA, 1992).
[31] Y. H. Pao and W. Hafez, Analog computational models of concept formation, Analog Integrated Circuits and Signal Processing, Special Neural-Net Issue on Analog VLSI Neural Networks 2 (1992) 3-10.
[32] G. E. Hinton, J. L. McClelland and D. E. Rumelhart, Distributed representations, in D. E. Rumelhart and J. L. McClelland (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1 (Bradford Books, Cambridge, MA, 1986).
[33] D. S. Touretzky, BoltzCONS: Reconciling connectionism with the recursive nature of stacks and trees, in Proc. Eighth Annual Conf. of the Cognitive Science Society, Amherst, MA, Aug. 1986.
Appendix A A list of texts, monographs and edited volumes which tnight contain detailed information of interest to readers: Adaptive Pattern Recognition and Neural Networks. AUTHOR: Pao, Yoh-Han. PUBLISHER: Reading, MA: Addison-Wesley, 1989. ISN/OTHER No.: 0201125846.0 Advanced Neural Computers. EDITOR: Rolf Eckmiller. PUBLISHER: Amsterdam; New York: North-Holland, 1990. ISN/OTHER No.: 0444884009 (US.) Analog VLSI: Implementation of Neural Systems. EDITORS: Carver Mead and Mohammed Ismail. PUBLISHER: Boston: Kluwer Academic Publishers, 1989. SERIES: The Kluwer international series in engineering and computer science; SECS 80. ISN/OTHER No. 0792390407 Artificial Neural Networks for Computer Vision. AUTHORS: Yi-Tong Zhou and Rama Chellappa. PUBLISHER: New York: Springer-Verlag, 1992. SERIES: Research ISN/ OTHER No.: 0387976833 (New York), 3540976833 (Berlin) Artificial Neural Networks: Theoretical Concepts. AUTHOR: V. Vemuri. PUBLISHER: Washington, D.C.: IEEE Computer Society Press, 1988. SERIES: Neural networks. Computer Society Press technology series. ISN/OTHER No.: 0818608552 Artificial Neural Systems: Foundations, Paradigms, Applications, and Implementations. AUTHOR: Patrick K. Simpson. PUBLISHER: New York: Pergamon Press, 1990. SERIES: Neural networks, research and applications. ISN/OTHER No.: 0080378951, 0080378943 (pbk.) Code Recognition and Set Selection with Neural Networks. AUTHOR: Clark Jefiies. PUBLISHER: Boston: Birkhauser, 1991. SERIES: Mathematical modeling (Boston, MA); no. 7. ISN/OTHER No.: 0817635858 (acid-free paper), 3764335858 (acid-free paper) Cognizers: Neural Networks and Machines That Think. AUTHOR: R. Collin Johnson and Chappell Brown; illustrated by Lisa Metzger. PUBLISHER: New York: Wiley, 1988. SERIES: Wiley science editions. ISN/OTHER No.: 0471611611 Cognitive Psychology: A Neural-Network Approach. AUTHOR: Colin Martindale. PUBLISHER: Pacific Grove, CA: Brooks/Cole Pub. Co., 1991. ISN/OTHER No.: 23654900, 0534141307 Common LISP Modules: Artificial Intelligence in the Era of Neural Networks and Chaos Theory. AUTHOR: Mark Watson. PUBLISHER: New York: Springer-Verlag, 1991. ISN/OTHER No.: 0387976140, 3540976140 Competitively Inhibited Neural Networks for Adaptive Parameter Estimation. AUTHOR: Michael Lemmon; foreword by B. V. K. Vijaya Kumar. PUBLISHER: Boston: Kluwer Academic, 1991. SERIES: The Kluwer international series in engineering and computer science; SECS 111. Knowledge representation, learning, and expert systems. ISN/OTHER No.: 0792390865 Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. AUTHORS: Sholom M. Weiss and Casimir Kulikowski. PUBLISHER: San Mateo, CA: M. Kaufmann Publishers, 1990. ISN/ OTHER No.: 1558600655
Connectionist Modeling and Brain Function: The Developing Interface. EDITORS: Stephen Jose Hanson and Carl R. Olson. PUBLISHER: Cambridge, MA: MIT Press, 1990. SERIES: Neural network modeling and connectionism. ISN/OTHER No.: 0262081938 DARPA Neural Network Study: October 1987-February 1988. AUTHOR: DARPA Neural Network Study (U.S.). PUBLISHER: Fairfax, VA: AFCEA International Press, 1988. LC Card Number: 88031655//r90 ISBN No.: 0-916159-17-5 Exploring the Geometry of Nature: Computer Modeling of Chaos, Fractals, Cellular Automata, and Neural Networks. AUTHOR: Ed Rietman. PUBLISHER: Blue Ridge Summit, PA: Windcrest, 1989. SERIES: The advanced programming technology series. ISN/OTHER No.: 0830691375, 0830631372 (pbk.) Hebbian Neural Network Simulation: Computer Program Documentation. AUTHORS: Robert G. Day and Lee J. White. PUBLISHER: Columbus, OH: Computer and Information Science Research Center, Ohio State University, 1969. SERIES: Ohio State University, Columbus, Computer and Information Science Research Center, Technical report series; OSU-CISRC-TR-69-19 Introduction to Artificial Neural Systems. AUTHOR: Jacek M. Zurada. PUBLISHER: St. Paul, New York, Los Angeles, San Francisco: West Publishing Company, 1992. ISN/ OTHER NO.: ISBN 0-314-93391-3 A n Introduction to Fuzzy Logic Applications i n Intelligent Systems. EDITORS: Ronald R. Yager and Lotfi A. Zadeh. PUBLISHER: Boston: Kluwer Academic, 1992. SERIES: The Kluwer International series in engineering and computer science; SECS 165. ISN/OTHER No.: 0792391918 A n Introduction to Neural Computing. AUTHOR: Igor Aleksander and Helen Morton. PUBLISHER: London: Chapman and Hall, 1990. ISN/OTHER No.: GB90-14110, 0412377802 (pbk) Introduction to Neural Networks. AUTHORS: Jeannette Stanley and Evan Bak. EDITOR: Sylvia Luedeking. PUBLISHER: Sierra Madre, CA 91024: California Science Software, 1988 The Metaphorical Brain 2 : Neural Networks and Beyond. AUTHOR: Michael A. Arbib. PUBLISHER: New York: Wiley, 1989. ISN/OTHER No.: 0471098531 Modeling Brain Function: The World of Attractor Neural Networks. AUTHOR: Daniel J. Amit. PUBLISHER: New York: Cambridge University Press, 1989. ISN/OTHER No.: 0521361001 Models of Neural Networks. EDITORS: E. Domany, J. L. van Hemmen and K. Schulten. PUBLISHER: Berlin, New York: Springer-Verlag, 1991. SERIES: Physics of neural networks. ISBN 0387511091 Nested Neural Networks [microform]. AUTHOR: Yoram Baram. PUBLISHER: Moffett Field, CA.: National Aeronautics and Space Administration, Ames Research Center (Springfield, VA: For sale by the National Technical Information Service, 1988). SERIES: NASA technical memorandum; 101032. ISN/OTHER No.: N 88-30373 NASA., 0830-d (MF), GOV DOC No.: NAS 1.15:101032 Neural and Automata Networks: Dynamical Behavior and Applications. AUTHOR: Eric Goles Servet Martinez. PUBLISHER: Dordrecht, Boston: Kluwer Academic 1990. SERIES: Mathematics and its applications (Kluwer Academic Publishers); Vol. 58. ISN/ OTHER No.: 0792306325 (alk. paper) Neural and Intelligent Systems Integration: Fifth and Sixth Generation Integrated Reasoning Information Systems. AUTHORS: Branko Soucek and the IRIS Group. PUB-
LISHER: New York: Wiley, 1991. SERIES: Sixth-generation computer technology series. ISN/OTHER No.: 0471536768 Neural and Massively Parallel Computers: The Sixth Generation. AUTHORS: Branko Soucek and Marina Soucek. PUBLISHER: New York: Wiley, 1988. ISN/OTHER No.: 0471635332 Neural Computation and Self-organizing Maps: A n Introduction. AUTHORS: Helge Ritter, Thomas Martinez and Klaus Schulten. PUBLISHER: Addison-Wesley Publishing CO., 1992. ISN/OTHER NO.: ISBN 0-201-55443-7 (hbk.), 0-201-55442-9 (pbk.) Neural Computers. EDITORS: Rolf Eckmiller and Christoph v.d. Malsburg. CONFERENCE: NATO Advanced Research Workshop on Neural Computers (1987: Neuss, Germany) PUBLISHER: Berlin, New York: Springer-Verlag, 1989. SERIES: NATO AS1 Series (Advanced Science Institute Series) F, Computer and systems sciences; vol. 41. ISN/OTHER No.: 0387508929 (U.S.) Neural Computing: An Introduction. AUTHORS: R. Beale and T. Jackson. PUBLISHER: Bristol: Hilger, 1990. ISN/OTHER No.: GB90-35434, 0852742622 Neural Computing: Theory and Practice. AUTHOR: Philip D. Wasserman. PUBLISHER: New York: Van Nostrand Reinhold, 1989. ISN/OTHER No.: 0442207433 Neural Dynamics of Adaptive Sensory-motor Control. AUTHORS: Stephen Grossberg and Michael Kuperstein. EDITION: Expanded ed. PUBLISHER: New York: Pergamon Press, 1989. SERIES: Neural networks, research and applications. ISN/OTHER No.: 008036828X, 0080368271 (pbk.) Neural Models and Algorithms for Digital Testing. AUTHORS: Srimat T. Chakradhar, Vishwani D. Agrawal and Michael L. Bushnell. PUBLISHER: Boston: Kluwer Academic Publishers, 1991. SERIES: The Kluwer international series in engineering and computer science; SECS 140. VLSI, computer architecture, and digital signal processing. ISN/OTHER No.: 0792391659 (acid-free paper) Neural Network Application to Aircraft Control System Design [microform]. AUTHORS: Terry Troudet, Sanjay Garg and Walker C. Merrill. PUBLISHER. Washington, DC: National Aeronautics and Space Administration; [Springfield, VA: For sale by the National Technical Information Service, 19911. SERIES: NASA technical memorandum; 105151. ISN/OTHER No.: N 91-27167 NASA. 0830-D (MF), GOV DOC No.: NAS 1.15:105151 Neural Networks Architectures: An Introduction. AUTHOR: Judith E. Dayhoff. PUBLISHER: New York: Van Nostrand Reinhold, 1990. ISN/OTHER No.: 0442207441 Neural Network Design and the Complexity of Learning. AUTHOR: J. Stephen Judd. PUBLISHER: Cambridge, MA: MIT Press, 1990. SERIES: Neural network modeling and connectionism. ISN/OTHER No.: 0262100452 Neural Network Models in Artificial Intelligence. AUTHOR: Matthew Zeidenberg. PUBLISHER: New York: Ellis Horwood, 1990. SERIES: Ellis Horwood series in artificial intelligence. ISN/OTHER No.: 0136121853, 0745806007 Neural Network Parallel Computing. AUTHOR: Yoshiyasu Takefuji. PUBLISHER: Boston: Kluwer Academic publishers, 1992. The Kluwer international series in engineering and computer science; SECS 0164. ISN/OTHER No.: 079239190X (acid-free paper) Neural Networks: An Introduction. AUTHOR: B. Muller and J. Reinhardt. EDITION: Corr. 2nd print. PUBLISHER: Berlin, New York: Springer-Verlag, 1991. SERIES: Physics of neural networks. ISN/OTHER No.: 3540523804 (Berlin: alk. paper), 0387523804 (New York: alk. paper)
Neural Networks and Natural Intelligence. EDITOR: Stephen Grossberg. PUBLISHER: Cambridge: MIT Press, 1988. ISN/OTHER No.: 026207107X Neural Networks and Speech Processing. AUTHORS: David P. Morgan and Christopher L. Scofield; foreword by Leon N. Cooper. PUBLISHER: Boston: Kluwer Academic publishers, 1991. SERIES: The Kluwer international series in engineering and computer science. VLSI, computer architecture, and digital signal processing. ISN/OTHER No.: 0792391446 (alk. paper) Neural Networks: Concepts, Applications, and Implementations. EDITORS: Paolo Antognetti and Veljko Milutinovic. PUBLISHER: Englewood Cliffs, NJ: Prentice Hall, 1991. SERIES: Prentice Hall advanced reference series. Engineering. ISN/OTHER No.: 0136125166 (Vol. l ) , 0136127630 (Vol. 2) Neural Networks f o r Computing, Snowbird, U T , 1986. EDITOR: John S . Denker. PUBLISHER: New York: American Institute of Physics, 1986. AIP conference proceedings; no. 151. ISN/OTHER No.: 088318351X Neural Networks for Control. EDITORS: W. Thomas Miller, 111, Richard S. Sutton and Paul J. Werbos. PUBLISHER: Cambridge, MA: MIT Press, 1990. SERIES: Neural network modeling and connectionism. ISN/OTHER No.: 0262132613 Neural Networks for Perception. EDITOR: Harry Wechsler. PUBLISHER: Boston: Academic Press, 1992. ISN/OTHER No.: 0127412514 (Vol. 1: acid-free paper), 0127412522 (Vol. 2: acid-free paper) Neural Networks: Theoretical Foundations and Analysis. EDITOR: Clifford Lau. PUBLISHER: New York: IEEE Press, 1992. ISN/OTHER No.: 0879422807 Neural Networks: Theory and Applications. EDITORS: Richard J. Mammone and Yehoshua Y. Zeevi. PUBLISHER: Boston: Academic Press, 1991. ISN/OTHER No.: 0124670504 (alk. paper) Neurale Netuaerk. In English: Neural Networks: Computers with Intuition. AUTHORS: Soren Brunak and Benny Lautrup. PUBLISHER: Singapore: World Scientific Pub. Co., 1988. ISN/OTHER No.: 9971509385, 9971509393 (pbk.) NeuralSource: The Bibliographic Guide to Artificial Neural Network. AUTHORS: Philip D. Wasserman and Roberta M. Oetzel. PUBLISHER: New York: Van Nostrand Reinhold, 1990. ISN/OTHER No.: 0442237766 Neurocomputing. AUTHOR: Robert Hecht-Nielsen. PUBLISHER: Reading, MA: Addison-Wesley, 1990. ISN/OTHER No.: 0201093553 Neurocomputing: Foundations of Research. EDITORS: James A. Anderson and Edward Rosenfeld. PUBLISHER: Cambridge, MA: MIT Press, 1988. ISN/OTHER No.: 0262010976 New Developments in Neural Computing: Proceedings of a meeting on neural computing sponsored by the Institute of Physics and the London Mathematical Society held in London, 19-21 April 1989. EDITORS: J. G. Taylor and C. L. T. Mannion. PUBLISHER: Bristol [England], New York: A. Hilger, 1989. ISN/OTHER No.: 0852741936 Orthogonal Patterns in Binary Neural Networks [microfom]. AUTHOR: Yoram Baram. PUBLISHER: Moffett Field, CA: National Aeronautics and Space Administration, Ames Research Center; (Springfield, VA: For sale by the National Technical Information Service, 1988). SERIES: NASA technical memorandum; 100060. ISN/OTHER No.: A-88068., 0830-D (MF), GOV DOC No.: NAS 1.15: 10060
Pattern Recognition by Self-organizing Neural Networks. EDITORS: Gail A. Carpenter and Stephen Grossberg. PUBLISHER: Cambridge, MA: MIT Press, 1991. ISN/OTHER No.: 0262031760 The Perception of Multiple Objects: A Connectionist Approach. AUTHOR: Michael C. Mozer. PUBLISHER: Cambridge, MA: MIT Press, 1991. SERIES: Neural network modeling and connectionism. ISN/OTHER No.: 0262132702 (hc) Physical Models of Neural Networks. AUTHOR: Tamas Geszti. PUBLISHER: Singapore: World Scientific, 1990. ISN/OTHER No.: 9810200129 A Real Time Neural Net Estimator of Fatigue Life [microform]. AUTHORS: T. Troudet and W. Merrill. PUBLISHER: Washington, DC: National Aeronautics and Space Administration; [Springfield, VA: For sale by the National Technical Information Service, 1990]. SERIES: NASA technical memorandum; 103117. ISN/OTHER No.: N 90-21564 NASA., 0830-D (MF), GOV DOC No.: NAS 1.15:103117 Recursive Neural Networks for Associative Memory. AUTHORS: Yves Kamp and Martin Hasler. PUBLISHER: Chichester, New York: John Wiley and Sons, 1990. SERIES: Wiley-Interscience series in systems and optimization. ISN/OTHER No.: 0471928666 Simulation Tests of the Optimization Method of Hopfield and Tank Using Neural Networks [microform]. AUTHOR: Russell A. Paielli. PUBLISHER: Moffett Field, CA: National Aeronautics and Space Administration, Ames Research Center; [Springfield, VA: For sale by the National Technical Information Service, 1988]. SERIES: NASA technical memorandum; 101047. ISN/OTHER No.: A-88275 Structure Level Adaptation for Artificial Neural Networks. AUTHOR: Tsu-Chang Lee; foreword by Joseph W. Goodman. PUBLISHER: Boston: Kluwer Academic Publishers, 1991. SERIES: The Kluwer international series in engineering and computer science; SECS 133. Knowledge representation, learning, and expert systems. ISN/OTHER No.: 0792391519 VLSI Design of Neural Networks. EDITOR: Ulrich Ramacher. PUBLISHER: Boston: Kluwer Academic Publishers, 1991. ISN/OTHER No.: 0792391276
Appendix B A list of names of journals wholly or partially devoted to publishing neural net computing articles: Advances in Connectionist and Neural Computation Theory. Frequency: Irregular. PUBLISHER: Ablex Publishing Corp., 355 Chestnut St., Norwood, NJ 07648. Tel: (201) 767-8450. EDITOR: John Barnden. Biological Cybernetics. Frequency: monthly. PUBLISHER: Springer-Verlag, Heidelberger Platz 3, D-1000 Berlin 33, Germany (also in New York). Tel: 030-8207-1. EDITOR: W. Reichardt. IEEE Transactions on Neural Networks. Frequency: Bi-monthly. PUBLISHER: IEEE, Inc., 345 E. 47th St., New York, NY 10017-2394. Tel: (212) 705-7366. Subscriptions to 445 Hoes Lane, Box 1331, Piscataway, NJ 08855-1331. Tel: (908) 562-3948. EDITOR: Herbert Rauch. IEEE Transactions on Pattern Analysis and Machine Intelligence. Frequency: monthly. PUBLISHER: IEEE, Inc., 345 E. 47th St., New York, NY 10017-2394. Tel: (212) 705-7366. Subscriptions to 445 Hoes Lane, Box 1331, Piscataway, NJ 08855-1331. Tel: (908) 562-3948. EDITOR: Anil K. Jain.
International Journal of Neural Networks. Frequency: quarterly. PUBLISHER: Learned Information, Inc., 143 Old Marlton Pike, Medford, NJ 08055. Tel: (609) 654-6266. EDITOR: Kamal Karna and Ian Croall. Journal of Parallel and Distributed Computing. Frequency: monthly. PUBLISHER: Academic Press, Inc., JOURNAL Division, 1250 Sixth Ave., San Diego, CA 92101. Tel: (619) 230-1840. EDITOR: Kai Hwang and Howard Siegel. Neural Computation. Frequency: quarterly. PUBLISHER: MIT Press, 55 Hayward St ., Cambridge, MA 02142. T e l (617) 253-2889. EDITOR: Terence Sejnowski, Salk Institute, Box 85800, San Diego, CA 92138. Neural Network Review. Frequency: quarterly. PUBLISHER: Lawrence Erlbaum Associates, Inc., 365, Broadway, Hillsdale, NJ 07642. Tel: (201) 666-4110. EDITOR: Craig Will. Neural Networks. Frequency: Bi-monthly. PUBLISHER: Pergamon Press, Inc., JOURNALS Division, Maxwell House, Fairview Park, NY 10523. Tel: (914) 592-0770. Neurocomputing. Frequency: Bi-monthly. PUBLISHER: North Holland (Subsidiary of Elsevier Science Publishers B. V.), P.O. Box 211, 1000 AE Amsterdam, Netherlands. EDITOR: V. David Sanchez. Pattern Recognition. Frequency: monthly. PUBLISHER: Pergamon Press, Inc. , JOURNALS Division, Maxwell House, Fairview Park, NY 10523. Tel: (914) 592-0770. EDITOR: Robert Ledley. Progress i n Neural Networks. Frequency: annual. PUBLISHER: Ablex Publishing Corp., 355 Chestnut St., Norwood, NJ 07648. Tel: (201) 767-8450. EDITOR: Omid Omidvar.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 143-181. Eds. C. H. Chen, L. F. Pau and P. S. P. Wang. © 1998 World Scientific Publishing Company
CHAPTER 1.5
ON MULTIRESOLUTION WAVELET ANALYSIS USING GAUSSIAN MARKOV RANDOM FIELD MODELS
C. H. CHEN and G. G. LEE
Department of Electrical and Computer Engineering, University of Massachusetts Dartmouth, N. Dartmouth, MA 02747, USA

This chapter presents a novel algorithm for image segmentation and classification via the use of multiresolution wavelet analysis and the expectation maximization (EM) algorithm. The development of a multiresolution wavelet feature extraction scheme is based on the Gaussian Markov random field (GMRF) assumption in image modeling. Real-world images are hierarchically decomposed into different resolutions. In general, larger image components are characterized by coarser resolutions whereas higher resolutions show finer and more detailed structures. These hierarchical variations in the anatomical features displayed by multiresolution decomposition are further quantified through the application of the Gaussian Markov random field. Due to its uniqueness in locality, adaptive features based on the nonstationary assumption of the GMRF are defined for each pixel of the image. Distinct image regions are then segmented via the Fuzzy c-Means (FCM) algorithm using these localized features. Subsequently the segmentation results are further enhanced via the introduction of a maximum a posteriori (MAP) segmentation estimation scheme based on the Bayesian learning paradigm. Gibbs priors or Gibbs random fields have also been incorporated into the learning scheme with very effective outcomes. In this chapter, the expectation maximization (EM) algorithm for MAP estimation will be formulated. The EM algorithm provides an iterative and computationally simple algorithm based on the incomplete data concept. The algorithm presented is applied to digital mammograms and wafer inspection images.
Keywords: Image segmentation, multiresolution wavelet analysis, Gaussian Markov random field, adaptive local features, EM algorithm, textured images, digital mammograms, wafer inspection.
1. Introduction

In recent years, considerable interest has arisen concerning new transform techniques and image models for the analysis of real-world images [1]. Presently, much of the work of scientists, engineers and even medical doctors relies significantly upon successful extraction of information embedded in different image modalities originating from biomedical, manufacturing, and remotely sensed applications. Many of these intelligent behaviors have been automated by computers. Indeed, automated analysis and interpretation of information conveyed by images have become very important tools in several aspects of science, engineering, and medicine. Moreover, due to limitations of even the most recent and advanced image capturing
systems, the recorded images are inevitably degraded in many ways. Thus image analysis algorithms based on new image transform techniques and image modeling paradigms for segmentation and feature extraction are now strongly sought after. In this chapter we discuss the issues of segmentation and classification of natural images using the spectral-spatial properties of the multiresolution wavelet transform in conjunction with the contextual information provided by the Markov random field models. Image segmentation via automated machine recognition algorithms has aroused considerable interest in recent machine vision literature [1]. In the segmentation problem, each disjoint region of the image is assumed to be a different class. The task of an image segmentation algorithm is to find an optimal classification which best characterizes the regions of the image. In the deterministic approach, a significant emphasis has been placed on the study of textures based on wavelet analysis [2]. Multi-channel and/or tree-structured wave-packets have also been developed for classification purposes. Furthermore, many researchers have discovered significant advantages in the use of the multiresolution concept [3,4]. The concepts of multiresolution wavelet analysis (MWA) will be discussed in Section 2. As a result of cross-fertilization of innovative ideas from image processing, spatial statistics and statistical physics, a significant amount of research activity on image modeling and segmentation has also been concentrated on the 2-D Markov Random Field (MRF). Although much of the potential of the MRF had been envisioned by the early works of Levy, McCormick, and Abend et al. [5-7], exploitation of the powers of the MRF was not possible until significant recent advances had been made in the appropriate mathematical and computational tools. Kashyap and Chellappa [8] successfully applied the non-causal autoregressive (NCAR) model or the Gaussian Markov random field (GMRF) to the characterization of real-world textural images. Woods [9] also reported on the issues of two-dimensional discrete Markovian fields. As a result of the MRF and Gibbs distribution (GD) equivalence [10,11], Cross and Jain, Geman and Geman, and Derin and Elliott have also demonstrated substantial successes in image segmentation and restoration [12-14]. The fundamental concepts and issues concerning the 2-D MRFs will be described in Section 3. In Section 4, we introduce a novel multiresolution wavelet analysis and MRF based algorithm for image segmentation. In many of today's real-world applications such as the low contrast mammograms or wafer images, it is necessary that the recognition algorithm be capable of categorizing the complex 2-D signals with considerably high accuracy. Thus many of the real-world image processing paradigms necessitate the combination of both the deterministic and stochastic approaches of image studies. The task of automated image recognition can in general be achieved in two steps. The first step involves a careful selection of features that best characterize the class membership of the patterns. The second step requires that a good classifier be designed to differentiate the given or measured patterns based on the information provided by the selected features with minimum error. The performances of classifiers in image recognition vary with the nature of the
images to which they are applied. Thus characterization or modeling of the class of images that are to be processed or analyzed provides not only profound understanding of image structures but also renders a means of feature extraction. The mammographic and wafer images considered in this chapter are most effectively represented by both the NCAR and GD models. The nonstationary GMRF model was used in the extraction of adaptive local features. Moreover, in computer vision problems, grayscale values of the images are generally insufficient for the tasks of differentiation and interpretation. Many types of natural patterns or primitives consist of hierarchical structures that reveal different information under different resolutions and are not generally available in a single resolution. In general, coarse resolution images reveal macrostructures whereas microstructures can be observed in finer resolutions. These hierarchical pattern variations can be systematically quantified by the GMRF applied at different resolutions of the image. Chellappa and Krishnamachari [15] introduced Markov filters for wavelet analysis which retain the Markovianity of the lower resolution images after decomposition. In our previous works [16-18] we have shown in both the 1-D and 2-D cases that it is necessary to analyze the signals and select stochastic features from different resolutions for classification. In this chapter, a novel statistical feature extraction scheme using MRF under a multiresolution wavelet decomposition framework is introduced. Many of the recent efforts have also been made on the development of unsupervised segmentation techniques. Several authors [19] have reported on the successful applications of K-means clustering for unsupervised image segmentation. The outcomes of the K-means algorithm are based primarily on hard decisions, which are in general less efficient than soft decisions from an information point of view. Thus the soft-decision-making classifier, or the fuzzy c-means (FCM) algorithm, introduced by Bezdek [20,21] will be used for classification or initial image label estimations. The multiresolution wavelet analysis and MRF feature extraction scheme together with the FCM classifier thus provide a novel paradigm for unsupervised segmentation. This unsupervised segmentation paradigm is outlined in Section 4. In the study of real-world images, prior experiences or a priori data provide significant information for segmentation. It is therefore crucial that the a priori expertise be incorporated into the learning processes of the recognition algorithms. A maximum a posteriori (MAP) segmentation estimation scheme based on the Bayesian learning paradigm will be introduced in recognition of the potential contribution of machine learning to the development of robust computer vision algorithms. The goal of this learning task is aimed at the development of perceptive capabilities of the computers in decision making based on information acquired from past experiences. Model-based and knowledge-based segmentation have been studied by many authors [14,22] with much success. The essence of these MAP estimations relies primarily on first problem formulation under a Bayesian framework followed subsequently by an estimation or optimization of the a posteriori distribution. Having assumed some prior understanding or information on the image by the specification of an a priori distribution, the Bayesian framework allows
the computer to learn and accumulate experiences from the observed or measured image data for decision making. Gibbs priors have been incorporated into the learning schemes of many researchers for the purpose of segmentation with very effective results [13,14,23]. The MAP estimation problem usually results in the minimization of the energy function contained in the a posteriori distribution. The main difficulty of the problem is primarily due to the existence of nonconvexity in the energy function so obtained. Thus it is not impossible that the estimations be trapped in local minima of the energy function. As a remedy to this difficulty, different approaches for global optimization have also been introduced by several authors. Inspired by applications in statistical physics [24], Geman and Geman [13] have demonstrated significant success via the stochastic relaxation technique of simulated annealing. To avoid the pitfall of dropping into local minima, stochastic simulated annealing incorporates random fluctuations that free the estimations from these stable points. Blake and Zisserman [25] introduced the graduated nonconvexity (GNC) algorithm in which an approximating convex function to the original energy or objective function is assumed. The construction of these convex approximations however is effected by the specification of the controlling parameters of the system. Geiger and Yuille [26] reported on the use of mean field theory
[Figure: block diagrams — (a) multiresolution wavelet decomposition feeding a Gaussian Markov random field stage; (b) standalone, non-supervised algorithm.]
Fig. 1. (a) MWA & GMRF Scheme for Microcalcification Detection; and (b) Image Segmentation Scheme for Subtle Mass Localization.
and tree approximations have also been investigated [27]. Most of the current research is aimed at the reduction of computational complexity and of the sensitivity to the selection of model parameters during MAP estimation. In this chapter the expectation maximization (EM) algorithm for MAP estimation will be formulated. The EM algorithm is an iterative and computationally simple algorithm based on the incomplete data concept. In this formulation, the observed image pixels are assumed to be the incomplete data whereas the class label or status of each pixel is assumed to be unknown. The initial labels estimated from the previously mentioned unsupervised FCM classification will then serve as an initial estimate for the EM algorithm. The EM algorithm is formulated in Section 5 for MAP estimation. The block diagram of the overall novel algorithm is shown in Fig. 1.

2. Wavelet Analysis

The main concept of multiresolution wavelet analysis (MWA) can be understood either via the top-down algorithm from the inner product viewpoint, or it can alternatively be considered as the fast wavelet herringbone algorithm providing a bottom-up interpretation of the transform. In the top-down algorithm, the coarser resolution wavelet coefficients are computed before the finer resolution wavelet coefficients. On the contrary, the herringbone bottom-up algorithm starts with the computation of the finer resolution wavelet coefficients. In the following two subsections, we provide discussions of the top-down and bottom-up interpretations of the wavelet transform respectively.

2.1. Top-Down Algorithm

In the top-down algorithm of wavelet analysis, a function can be represented as a superposition of wavelet basis functions [28,29]. From a single finite duration mother wavelet $\psi(x)$, a family of orthonormal basis functions can be obtained through dilations and translations:
$$\psi_{m,n}(x) = 2^{-m/2}\,\psi(2^{-m}x - n) \qquad (2.1)$$
where m and n are integers. By variations of the integer n, the translational process transforms the function into a separate vector space with distinct spatial orientation. Similarly, by changing m, the dilation process amounts to a transformation of the original function into another vector space possessing a different resolution. Representing the vector spaces by $V_m$'s, where m is the resolution, we have,
A function f(x) can then be decomposed by the equation:
$$f(x) = \sum_{m,n} c_{m,n}\,\psi_{m,n}(x) \qquad (2.3)$$
where the $c_{m,n}$'s are the wavelet coefficients calculated by the inner product via
$$c_{m,n} = \langle f(x), \psi_{m,n}(x) \rangle = \int_{-\infty}^{\infty} f(x)\,\psi_{m,n}(x)\,dx. \qquad (2.4)$$
Each wavelet coefficient $c_{m,n}$ therefore represents the resemblance of the function f(x) to the wavelet basis $\psi_{m,n}(x)$ at a specific resolution m and translation n. In the construction of the mother wavelet, it is necessary that an appropriate scaling function $\phi(x)$ be first chosen. According to the work of Daubechies [28] this scaling function should satisfy the two-scale difference equation,
$$\phi(x) = \sum_{k} h_1(k)\,\sqrt{2}\,\phi(2x - k). \qquad (2.5)$$
Thus, once the scaling function is specified, estimation of the mother wavelet is straightforward via the equation
$$\psi(x) = \sum_{k} h_2(k)\,\sqrt{2}\,\phi(2x - k). \qquad (2.7)$$
The impulse responses $h_1(k)$ and $h_2(k)$ constitute a pair of conjugate quadrature mirror filters (QMF) with frequency responses $H_1$ and $H_2$ respectively. The QMF filters can be constructed with perfect reconstruction or synthesis of the original signal if certain conditions on $h_1(k)$ and $h_2(k)$ are satisfied. Thus in the top-down algorithm of the wavelet transform, one computes first the coarse resolution coefficients via the inner product of Eq. (2.4).

2.2. The Herringbone Bottom-Up Interpretation

In the herringbone interpretation, the discrete wavelet transform can be considered an efficient application of two-band subband coding in an iterative fashion. The first step in the wavelet decomposition starts with a simple two-band subband coding scheme in which the signal is initially convolved or filtered by the half-band lowpass filter with impulse response $h_1(k)$ followed by subsequent sub-sampling. Similarly the same input function is also convolved with the half-band highpass filter $h_2(k)$ followed by subsequent sub-sampling. The respective outputs $g_1(k)$ and $g_2(k)$ of the lowpass and highpass filters can then be represented as,
$$g_1(k) = \sum_{n} h_1(2k - n)\, f(n) \qquad (2.8)$$
$$g_2(k) = \sum_{n} h_2(2k - n)\, f(n) \qquad (2.9)$$
where f(n) represents the discrete sequence of the continuous signal f(x). Thus if
$$H_1^2 + H_2^2 = 1, \qquad (2.10)$$
f(n) can be reconstructed perfectly using the equation,
$$f(n) = \sum_{k=-\infty}^{\infty} \bigl[\, g_1(k)\,h_1(2k - n) + g_2(k)\,h_2(2k - n) \,\bigr]. \qquad (2.11)$$
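To make the two-band iteration concrete, the short sketch below runs one analysis/synthesis pass corresponding to Eqs. (2.8)-(2.11). It uses the PyWavelets package as a stand-in for the conjugate QMF pair $h_1(k)$, $h_2(k)$; the package, the wavelet choice and the variable names are illustrative assumptions, not part of the original chapter.

```python
# One analysis/synthesis pass of the two-band subband coding scheme of
# Eqs. (2.8)-(2.11), sketched with PyWavelets (an assumed, illustrative choice).
import numpy as np
import pywt

f = np.random.rand(64)            # discrete sequence f(n)

# Analysis: half-band lowpass/highpass filtering followed by sub-sampling,
# playing the role of g1(k) and g2(k) in Eqs. (2.8) and (2.9).
g1, g2 = pywt.dwt(f, 'db2')

# Synthesis: Eq. (2.11) -- the original sequence is recovered from g1 and g2.
f_rec = pywt.idwt(g1, g2, 'db2')
print(np.allclose(f, f_rec[:len(f)]))   # True: perfect reconstruction
```

Iterating the analysis step on the lowpass output g1 until a single sample remains reproduces the herringbone, bottom-up scheme described above.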
In the second subband coding iteration of the discrete wavelet transform, the lowpass signal $g_1(k)$ is the input to another two-band subband coding scheme. This process is continued until the lowpass signal consists of only a single point. The original signal f(n) can of course be reconstructed by putting the lowpass and highpass signals back into the subband coding scheme iteratively using Eq. (2.11). It is evident from the iterative subband coding scheme of the discrete wavelet transform that the finer resolution coefficients are computed prior to the coarser resolution coefficients. As in the herringbone structure, each set of wavelet coefficients is computed via the convolution of f(n) repeatedly with $h_1(k)$ and then once with $h_2(k)$. Thus in a way analogous to the top-down scheme described from the inner product point of view, the orthonormal basis functions of the present herringbone scheme are $h_2(k)$ and other functions resulting from convolution with $h_2(k)$. Observing the equivalent structures of Eqs. (2.5) and (2.8), and Eqs. (2.7) and (2.9), the highpass impulse response can be interpreted as the mother wavelet, where the set of wavelet basis functions is derived from the convolution with the highpass filter. Thus in the top-down algorithm, the scaling functions in a sense are an intermediate step for the generation of the wavelet basis as in the herringbone scheme. By the same token, in two dimensional image analysis, by choosing the analyzing wavelet $\psi(x, y)$ to be localized in space, spatial information such as shapes and orientations of the images can be emphasized. The multiresolution wavelet representation scheme shown in Fig. 2 allows an image to be decomposed into a hierarchy of localized subimages at different spatial frequencies. This representation divides the 2-D frequency spectrum of an image Y into a lowpass subband image $Y_0^L$ and a set of bandpass subimages $Y_j^i$, where $i = 1, \ldots, L$ and $j = 0, 1, 2, 3$. The integer i represents the number of resolution levels used while j represents the number of orientations. Thus the $Y_j^i$'s for $j = 1, 2, 3$ represent the detail subimages obtained as a result of the 2-D wavelet decomposition outlined in Fig. 2. $H_1$ and $H_2$ represent respectively the lowpass and highpass conjugate quadrature mirror filters (QMF). The notation
$$Y_j^i = W_{i,j}[Y] \qquad (2.12)$$
represents the wavelet decomposition or the subimage of Y at the ith level and spatial orientation j. Thus by tacitly choosing the resolution level i and spatial orientation j, the multiresolution wavelet scheme provides a natural hierarchy for the embodiment of an iterative paradigm for accomplishing spectral-spatial feature analysis. One may then visualize the analyzed image on the computer from a coarse to fine matching strategy. The transform scheme thus enables us to first visualize the coarse features of the images embedded in the lower frequency components or
[Figure: at each level the image is convolved with the filters H1 and H2 along rows and columns, keeping one row/column out of two.]
Fig. 2. (a) Two dimensional wavelet decomposition of the image $Y^i$ by a two-channel QMF. $Y_0^{i+1}$ represents the resulting low resolution subimage while $Y_1^{i+1}$, $Y_2^{i+1}$, $Y_3^{i+1}$ denote the different spectral-spatial detail subimages; and (b) Disposition of the lower resolution and detail subimages shown in this paper.
resolutions of the transformed images, followed by examination of the finer details contained in the higher frequency levels or resolutions. The spectral-spatial and multiresolution properties of the wavelet transform have been reported to be intrinsic to the human visual system [30]. Specialized cortical neurons are known to respond specifically to stimuli within certain spatial orientations and frequencies. As will be shown in Section 6, these properties of the MWA provide highly efficient tools for the extraction of localized spectral-spatial features such as the microcalcifications in digital mammograms [31].
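As a rough illustration of the decomposition of Fig. 2, the sketch below produces the lowpass subimage and the three spectral-spatial detail subimages for one and two levels; PyWavelets and the 'db4' filter are assumptions made purely for illustration, not the authors' implementation.

```python
# Rough illustration of the decomposition of Fig. 2: one and two levels of the
# 2-D two-channel QMF pyramid, using PyWavelets with the 'db4' filter (both assumed).
import numpy as np
import pywt

Y = np.random.rand(256, 256)                     # image Y

# One pass: a lowpass subimage plus three detail subimages with distinct
# spatial orientations (horizontal, vertical, diagonal edges).
Y_low, (Y_h, Y_v, Y_d) = pywt.dwt2(Y, 'db4')

# Two resolution levels of the hierarchy Y_j^i, coarsest level first.
Y_low2, details2, details1 = pywt.wavedec2(Y, 'db4', level=2)
print(Y_low.shape, Y_low2.shape)
```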
3. Markov Random Field for Image Modeling

Pertinent contextual information contained in natural images is not generally provided by the two-channel wavelet transform. Contextual information extracted via a Gaussian Markov random field (GMRF) and incorporated into the localized spectral-spatial details of MWA can provide remarkable improvements in image segmentation. This section presents a brief overview of the Gaussian Markov random field and the Gibbs distribution. Gibbs priors are used in subsequent sections for the expectation maximization. The readers are referred to [13,14,32,33] for more details on the topic.
3.1. Gaussian Markov Random Field

Let an image Y be modeled by a finite lattice Gaussian Markov random field (GMRF) where $Y = \{y_{ij} : 0 \le i \le M,\ 0 \le j \le M\}$ and $L = \{(i, j) : 0 \le i \le M,\ 0 \le j \le M\}$. The positions or sites of the pixels in the M by M square lattice L are denoted by (i, j). For notational convenience, the pixel sites are also represented as s, r, t, etc. A neighborhood system of the given lattice L is any collection of subsets of L described as $\eta = \{\eta_{ij} : (i, j) \in L,\ \eta_{ij} \subset L\}$ such that (i) (i, j) is not an element of $\eta_{ij}$, and (ii) if (k, l) is an element of $\eta_{ij}$, then (i, j) is an element of $\eta_{kl}$, given any (i, j) which is an element of L.
The systematic and sequential ordering of the neighborhood system commonly used in image modeling is $\eta^o$, where $o = \{1, 2, 3, \ldots\}$ represents the order of the neighborhood system. The relative positions of the pixels in the ordered neighborhood system are shown in Fig. 3(a). Due to the finite lattice approximation, sums of pixel sites, say (i, j) + (k, l), are evaluated modulo M, which is equivalent to the assumption of L being a toroidal lattice. In modeling the image Y as a GMRF with respect to a certain neighborhood system $\eta$ and reshaping Y into a single vector $\mathbf{y} = [y_1, y_2, \ldots, y_{M^2}]^T$ in the lexicographic order,$^a$ the image Y will then be assumed to be a set of jointly Gaussian random variables that also possess the Markov property. Thus the joint probability density function of the random variables constituted by the pixel values in $\mathbf{y}$ has the form,
$$p(\mathbf{y}) = \frac{1}{(2\pi)^{M^2/2}\,|\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2}\,\mathbf{y}^T \Sigma^{-1} \mathbf{y} \right) \qquad (3.1)$$
where $\Sigma$ is the covariance matrix of $\mathbf{y}$. In addition, the Markovianity property requires that
$$p(y_{ij} \mid y_{pq},\ (p, q) \neq (i, j)) = p(y_{ij} \mid y_{(i,j)+(k,l)},\ (k, l) \in \eta_{ij}). \qquad (3.2)$$
$^a$ T represents matrix transpose.
Fig. 3. (a) Ordered neighborhood system $\eta^o$ for the MRF. The numbers indicate the order of the model relative to x; and (b) The clique types in the second order neighborhood system, $\eta^2$.
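To make the ordered neighborhood system of Fig. 3 concrete, the following sketch enumerates the second order neighborhood $\eta^2$ on a toroidal lattice; the helper names are hypothetical, and the offsets simply spell out the eight nearest sites and the corresponding pair-clique directions.

```python
# Second order neighborhood system eta^2 around a site (i, j): the four axial
# offsets of the first order system plus the four diagonal ones (cf. Fig. 3(a)).
FIRST_ORDER = [(-1, 0), (1, 0), (0, -1), (0, 1)]
SECOND_ORDER = FIRST_ORDER + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def neighbors(i, j, M, offsets=SECOND_ORDER):
    """Neighbour sites of (i, j) on an M x M toroidal lattice (indices modulo M)."""
    return [((i + di) % M, (j + dj) % M) for di, dj in offsets]

# Pair cliques of eta^2 (cf. Fig. 3(b)): one representative offset per direction.
PAIR_CLIQUE_OFFSETS = [(0, 1), (1, 0), (1, 1), (1, -1)]

print(neighbors(0, 0, M=8))
```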
In the above Markovian expression, the two-dimensional pixel site notation was used whereas the joint Gaussian distribution expression assumed a lexicographic ordering of the pixel sites. Thus under the GMRF assumption of image modeling, any pixel (i, j) can be predicted by a linear combination of the pixels contained in the neighborhood. That is,
$$y_{ij} = \sum_{(k,l) \in \eta_{ij}} \theta_{kl}\, y_{(i,j)+(k,l)} + \sqrt{\rho}\, w_{ij}, \qquad (i, j) \in L \qquad (3.3)$$
where $\theta_{kl}$ are the parameters of the linear combination and $w_{ij}$ is a white Gaussian noise with zero mean and unit variance [32]. Due to the definition and choice of the neighborhood system, it can be seen that the above linear combination scheme results in a noncausal autoregressive (NCAR) prediction. It is clear from the above discussion that the key to image modeling via the GMRF lies in the specification of the neighborhood system together with the parameters of the linear combination characterizing the image. In our example in the present chapter, we have chosen the second order neighborhood system, $\eta^2$. Innovative applications of the NCAR MRF were documented in [8,34]. Natural textures were modeled using the GMRF and the issue of the proper selection of
the neighborhood system was also discussed. In the present context, the GMRF was chosen due to its close resemblance to real world textural structure. This is of crucial importance for texture analysis in mammograms and wafer images as significant statistical information can be obtained via MRF modeling. Nonstationarity will require the adaptation of features to the local variations of pixel gray levels. The issue of the choice of the neighborhood system will be discussed further in a subsequent section on multiresolution feature extraction.

3.2. Gibbs Distribution
The Gibbs distribution provides a powerful tool for the formulation of the Bayesian learning framework. The origin of the GD lies primarily in physics and statistical mechanics. In this subsection, we will briefly discuss the class of GD used in this chapter. The readers are referred to [10,13] for more detailed and thorough explanations of the GD. In the specification of a GD, it is necessary that the cliques be defined in association with the neighborhood system as depicted in the last subsection. Denoted by c, a clique of the graph $(L, \eta)$ is defined to be a subset of the lattice L such that all the pairs of individual sites in c are also neighbors of each other. The clique types for the second order neighborhood system $\eta^2$ are shown in Fig. 3(b). The set of cliques associated with a specific order of neighborhood system is denoted as C. Let the random field or image $Y = \{x_s\}$ be defined over the graph $(L, \eta)$. Then Y is an MRF with respect to $\eta$ if and only if its joint distribution is of the form
$$P(Y = \mathbf{y}) = Z^{-1} \exp[-U(\mathbf{y})] \qquad (3.4)$$
where
$$U(\mathbf{y}) = \sum_{c \in C} V_c(\mathbf{y}) \qquad (3.5)$$
is the energy function and $V_c(\mathbf{y})$ is the potential associated with the clique c. In addition,
$$Z = \sum_{\mathbf{y}} \exp[-U(\mathbf{y})] \qquad (3.6)$$
represents the partition function, which is a normalizing constant obtained by the summation over all the gray levels $G = \{g_1, g_2, \ldots, g_N\}$. The clique potentials $V_c(g)$ are constituted only from the pixel values within the clique c. As with its GMRF counterpart, the specification of a neighborhood system is sufficient for the definition of a Gibbs distribution for the image at hand. That is, by a proper choice of the clique potentials, a wide variety of images can be characterized efficiently. Once again, the second order neighborhood system is assumed in the present chapter. However only the single pixel and double pixel cliques are considered. This results in the following energy function,
$$U(\mathbf{y}) = \beta_0\, y_{ij} + \beta_1 [y_{i-1,j} + y_{i+1,j}] + \beta_2 [y_{i,j-1} + y_{i,j+1}] + \beta_3 [y_{i-1,j+1} + y_{i+1,j-1}] + \beta_4 [y_{i-1,j-1} + y_{i+1,j+1}]. \qquad (3.7)$$
The parameters in the energy function control the extent of pixel clustering in the specific directions described by the corresponding cliques. From the above energy function, image intensity is controlled by the quantity $\beta_0$, whereas $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ control the degree of clustering of the image pixels in the vertical, horizontal, cross-diagonal and diagonal directions respectively. This idea, which originated from the modeling of ferromagnetism in statistical physics [34], was applied to the modeling of textural images in [12,34]. As a result of the Hammersley-Clifford theorem, which was proved independently by several researchers [10,11], there is a one-to-one correspondence between the MRF and the Gibbs random field (GRF). Image modeling via both the MRF and GRF assumptions provides a powerful tool in image analysis from both the mathematical and computational perspectives. Due to the nonhomogeneous property of the real world images studied in this chapter, it is essential that a nonstationary GMRF assumption be made for the extraction of locally adaptive features. On the other hand, since the GD is basically exponential in nature, it can be used with more convenience in the formulation of MAP segmentation estimations and in the EM algorithm. Thus the utilization of both assumptions has revealed significant results in the present study.

4. Multiresolution Feature Extraction Scheme and Fuzzy Clustering for Image Segmentation

In this section we will introduce the multiresolution wavelet analysis (MWA) and Gaussian Markov random field (GMRF) based feature extraction scheme. By using this feature extraction scheme, adaptive local GMRF features are extracted from different resolutions of the wavelet decomposed image. Each pixel in the original image is characterized by a vector of discriminant features. We assume that the constituent regions in natural images are characterized by distinctive textures. Together with FCM clustering, a novel unsupervised segmentation algorithm is presented for image labeling.

4.1. Wavelet Decomposition and GMRF for Feature Extraction
In the segmentation of real-world images, the multiresolution hierarchical framework of the wavelet transform can be used with significant advantage. Although the Fourier transform provides one of the earliest ways of signal or image decomposition, it performs best primarily for periodic signals and images. As a result of the infinite duration of the complex sinusoids serving as the basis function of the inner product in the transform integral, Fourier decomposed signals do not display the locality
Fig. 4. (a) Simple Image; (b) Corresponding Image; (c) 1st pass in Wavelet; and (d) Corresponding Subimages.
property in the spatial domain. Due to the choice of the mother wavelet and the corresponding QMF's described in Section 2, the conjugate filters $H_1$ and $H_2$ serve as a two-channel QMF representation of the image. As a result of their finite duration, both frequency and spatial localities can be observed in the wavelet decomposed image. Since the detail image contains the higher frequency components of the original image in different orientations, abrupt changes in grayscale such as edge information can be observed. This can be seen from the simple example shown in Fig. 4. Figure 4(a) shows a 64 x 64 image with a simple background having a grayscale of approximately 100. The inner square consists of pixels having grayscales in the vicinity of 200. Figure 4(c) is the result of the first pass of the wavelet decomposition scheme shown in Fig. 2. It can be seen that $Y_1^1$ emphasizes the edge of the square in the horizontal direction whereas $Y_2^1$ shows the edge in the vertical direction. Thus the spatial and frequency domain localization property of the wavelet transform makes it a very efficient tool for image segmentation. However, Fig. 4(a) is a simple image with only two uniform grayscale regions. Figure 5(a) shows a more complex picture with a square texture in the middle of a square background with a different texture. Although still placing emphasis on the horizontal and vertical directions, $Y_1^1$ and $Y_2^1$ of Fig. 5(c) were not able to show
Fig. 5. (a) Natural Textile; (b) Corresponding Image; (c) 1st Pass in Wavelet; and (d) Corresponding Subimages.
clear cut edges of the square as they did in Fig. 4(c). Thus the two-channel QMF does not in general provide sufficient statistical information about real world images necessary for segmentation purposes. This necessitates the introduction of the Gaussian Markov random field in conjunction with wavelet decomposition for image analysis. The locally interactive pixels and statistically dependent properties of the GMRF model provide a remedy to the nonuniformity of natural textures in real-world images. As was described in Section 3, each pixel in the GMRF can be predicted by a linear combination of pixels in a carefully chosen neighborhood system. The parameters in the linear combination can be found from the least squares estimate over the entire image [8,34]. In natural images, each pixel and region can display significant variations and complexities that are neither periodic nor homogeneous. The inherent assumption of uniform discriminatory properties at local regions or pixels of real world images such as mammograms is in general inadequate. Thus local features which adapt to each of the individual pixels of the image are selected. In other words, the least squares estimation is performed on only a small window centered at the corresponding pixel rather than on the entire image. In this chapter, the second order neighborhood system or the first eight neighbors of each pixel are
used for the estimation of the linear combination parameters. That is if we let,
the least square estimates of the parameters of the eight neighbors of each pixel of y i j are
where the variance estimate at each pixel is
with w being the size of the window. The mean estimates at each pixel are then taken to be,
Therefore each pixel of the image at a specific resolution can be characterized by a set of three features. The first two features arise from the mean and variance estimates and the last is the rotation-invariant feature estimated from the sum of the parameters of the linear combination estimated from the pixels in the chosen window. Each of the parameters estimated in $\hat{\theta}_{ij}$ represents the influence of the corresponding pixel in the neighborhood system on the pixel under study from one of the eight directions. Thus the effects of all eight pixels in the neighborhood from the eight different directions are combined by summing the eight parameters. As a result of the estimation from the windows defined around each pixel, the features will adaptively reflect the local variations or properties of the textured image. Therefore different regions of the image will be characterized by discriminant local features essential for segmentation or region labeling purposes. Pixels in the same region will be characterized by similar features whereas different regions are modeled by different discriminant features. It will be shown that greater variations of the features will be observed across different image regions whereas smaller variations of the features can be observed from within a region. These changes or variations of the features within the image are measured by the local variances calculated from the pixels of the feature maps, that is
In the former equation, w represents again the size of the window used in estimating the feature variance, and $f_{ij}^{(k)}$ represents the (i, j) pixel of the kth feature extracted from the nonstationary GMRF described above. The mean of this variance estimate is,
At each resolution of the image, the variance in effect reveals the relative importance of the corresponding feature. Thus every extracted feature at each image pixel is multiplied by its corresponding variance estimate, which functions as the weight of that specific feature. Intuitively, discontinuities of gray levels can exist at the edges of the constituent regions in the image. Therefore, if we begin by examining a coarse resolution of the studied image, larger structures will be observed with more abrupt changes and larger steps in grayscale. In the course of gradually increasing the resolution, finer details and smaller patterns can then be seen. The changes in the gray levels at the edges at the higher resolution levels of the wavelet pyramid are indeed smaller. Thus the coarser the resolution, the more abrupt the changes in the gray levels of the image. These changes of gray levels at the edges between different regions at different resolutions of the image are quantified by the application of the nonstationary GMRF at each resolution of the wavelet pyramid. Given the original image Y, wavelet decomposition is first applied to find the lower resolution images. Subsequently, features are extracted at each resolution of the image via the nonstationary GMRF model. The features estimated from the lower resolutions are then restored to the original resolution size via simple upsampling. Thus each pixel of the image is characterized by a single discriminant vector consisting of GMRF features estimated from every resolution of the wavelet decomposed image. That is, each pixel of the image is characterized by the feature vector,
$$\mathbf{F}_{ij} = \bigl[\, f_{ij,R}^{(k)} : k = 1, 2, 3;\ R = 0, 1, \ldots \,\bigr] \qquad (4.7)$$
where $f_{ij,R}^{(k)}$ is the kth feature of the image pixel (i, j) at resolution R. In this study, only two resolutions were used. That is, features are extracted from the three images $Y^0$, $Y_0^1$, and $Y_0^2$. It can be seen from the pixel feature vector that features from different resolutions are embedded. In this manner, information from lower resolution images is propagated to the higher resolution. The importance of each of the features is weighted by the corresponding variance. This scheme provides a means of careful scrutiny of the individual image pixels, which is of great significance for natural images. The inclusion and/or exclusion of a certain pixel in a specific region is then decided by the FCM algorithm based on the discriminatory information conveyed by the GMRF features extracted from each resolution of the wavelet pyramid. These adaptive features will be shown to have highly effective outcomes especially for nonhomogeneous images such as mammograms and wafer inspection images. The FCM algorithm will be described in the following subsection.
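Since the printed forms of Eqs. (4.1)-(4.4) are not reproduced here, the sketch below gives one hedged reading of the windowed least-squares feature extraction described in the text: for every pixel, the eight second-order neighbours act as regressors for the centre value, and the local mean, the residual variance, and the sum of the fitted parameters form the three features. The window size, padding, and estimator details are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

# Offsets of the second order neighborhood used as regressors for the centre pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def gmrf_features(img, w=15):
    """Per-pixel nonstationary GMRF features: local mean, residual variance of the
    eight-neighbour least-squares prediction, and the sum of the fitted parameters,
    all estimated over a w x w window centred at the pixel."""
    M, N = img.shape
    half = w // 2
    pad = half + 1
    Y = np.pad(img, pad, mode='reflect')
    feats = np.zeros((M, N, 3))
    for i in range(M):
        for j in range(N):
            ci, cj = i + pad, j + pad
            sites = [(ci + a, cj + b) for a in range(-half, half + 1)
                                       for b in range(-half, half + 1)]
            q = np.array([[Y[r + di, c + dj] for di, dj in OFFSETS] for r, c in sites])
            y = np.array([Y[r, c] for r, c in sites])
            theta, *_ = np.linalg.lstsq(q, y, rcond=None)   # linear-combination params
            var = np.mean((y - q @ theta) ** 2)             # residual variance
            feats[i, j] = (y.mean(), var, theta.sum())      # three features per pixel
    return feats
```

The same routine applied to each level of the wavelet pyramid, after upsampling the coarse feature maps back to the original size, yields a stacked feature vector in the spirit of Eq. (4.7).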
4.2. Fuzzy-C-Means (FCM) Clustering
Having found the characteristic indicators for the different regions in the image represented by pixel gray levels in Section 3, the subsequent task is the partitioning of the set of pixels into the corresponding regional sets. Thus given the set $Y = \{y_\ell : 0 \le \ell \le m^2 - 1 = n\}$, we intend to find the optimal partition of Y exhibiting categorically homogeneous subsets.$^b$ This can be achieved by minimizing a cost [20,21] defined from the similarity measures of the feature vectors selected for each of the pixels in the set Y. Before proceeding to the definition of the cost function, it is necessary that we review some basic definitions of fuzzy reasoning. Let P(Y) denote the algebra of the power set of Y. In other words, P(Y) is the set of subsets of Y. Let S be an element of P(Y); then the function $u_S$, which maps the set of elements Y into the set $\{0, 1\}$ and is defined as
$$u_S(y) = \begin{cases} 1, & y \in S \\ 0, & \text{otherwise} \end{cases} \qquad (4.8)$$
is the characteristic function of the hard subset $S \subset Y$. Only one characteristic function $u_S$ corresponds to every S which is an element of P(Y). The fuzzy subset of Y is thus a function u which maps the set Y into the inclusive interval [0, 1]. Now suppose that the set Y is made up of mutually exclusive and exhaustive hard subsets denoted by,
$$Y = S_1 \cup S_2 \cup \cdots \cup S_c. \qquad (4.9)$$
Following this notation, the characteristic functions of the $S_q$'s, where $q = \{1, 2, \ldots, c\}$, can be represented as $u_q$. (4.10) Having defined the basic notation, a hard c-partition of Y can be defined as the set
$$M_c = \{\, U \in \mathbb{R}^{cn} : u_{q\ell} \in \{0, 1\} \,\} \qquad (4.11)$$
such that
$$\sum_{q=1}^{c} u_{q\ell} = 1 \quad \text{for any } \ell \qquad (4.12)$$
and
$$0 < \sum_{\ell=1}^{n} u_{q\ell} < n \quad \text{for any } q. \qquad (4.13)$$
$^b$ The index $\ell$ represents one of the coordinate pairs in the M by M square lattice. By using this notation, we have again assumed a lexicographical ordering of the image pixels as in the GMRF. The number n here is the number of pixels in the image and is different from the integer n used in the wavelet decomposition scheme.
In words, the hard c-partition of a set Y is the set of c by n matrices with each element of the matrix being zero or one such that the previous two conditions hold. The fuzzy partition of Y is then defined to be,
$$M_{fc} = \{\, U \in \mathbb{R}^{cn} : u_{q\ell} \in [0, 1] \,\} \qquad (4.14)$$
such that
$$\sum_{q=1}^{c} u_{q\ell} = 1 \qquad (4.15)$$
and
$$0 < \sum_{\ell=1}^{n} u_{q\ell} < n \quad \text{for any } q. \qquad (4.16)$$
Now let the cost function $J_e$ be defined as a function which maps the c by p matrices into the real interval $[0, \infty)$, that is,
$$J_e : M_{fc} \times \mathbb{R}^{cp} \to \mathbb{R}^{+}, \qquad J_e(U, \mathbf{c}) = \sum_{\ell=1}^{n} \sum_{q=1}^{c} (u_{q\ell})^e\, (d_{q\ell})^2. \qquad (4.17)$$
In the above equation, U is an element of $M_{fc}$ and is thus a fuzzy c-partition of Y. The vectors $\mathbf{c}_q$ are a set of p-tuples which are the cluster centers of the characteristic functions $u_q$, where q is between 1 and c. The similarity measures $(d_{q\ell})^2$ are the Euclidean norm on $\mathbb{R}^p$. The parameter e is the weighting exponent of the cost function $J_e$ defined above and is in the range $[1, \infty)$. The task of the partition is then the minimization of $J_e$ with weighting exponent e. This leads to the Fuzzy c-Means clustering algorithm introduced by Bezdek [20]. According to Bezdek's theorem, let e be in the range $(1, \infty)$ and for any $\ell$ define
$$I_\ell = \{\, i : 1 \le i \le c \text{ and } d_{i\ell} = 0 \,\} \qquad (4.18)$$
$$\tilde{I}_\ell = \{1, 2, \ldots, c\} - I_\ell. \qquad (4.19)$$
Then if
$$I_\ell = \emptyset \;\Longrightarrow\; u_{q\ell} = \left[ \sum_{i=1}^{c} \left( \frac{d_{q\ell}}{d_{i\ell}} \right)^{2/(e-1)} \right]^{-1} \qquad (4.20)$$
or
$$I_\ell \neq \emptyset \;\Longrightarrow\; u_{q\ell} = 0 \ \text{ for any } q \in \tilde{I}_\ell \qquad (4.21)$$
and
$$\sum_{q \in I_\ell} u_{q\ell} = 1, \qquad (4.22)$$
together with the cluster center update
$$\mathbf{c}_q = \frac{\sum_{\ell=1}^{n} (u_{q\ell})^e\, \mathbf{f}_\ell}{\sum_{\ell=1}^{n} (u_{q\ell})^e}, \qquad (4.23)$$
there may exist a global minimum for the cost function $J_e$. Here $\mathbf{f}_\ell$ represents the feature vector at pixel $\ell$ under a 1-D vector notation representation of the image. The FCM algorithm can then be described by the following steps:
(i) Decide on the required number of classes to partition the given set into. In our application we used two classes. Choose also a means of inner product norm; the Euclidean norm was used in this research. Choose any $U \in M_{fc}$.
(ii) Calculate the cluster centers $\mathbf{c}_q$ and U at each step via Eq. (4.23).
(iii) Update U using Eq. (4.21).
(iv) Calculate the matrix norm of the difference between the new U and the previous U; if the result is smaller than a certain preset threshold, stop, else repeat steps (ii) and (iii).
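A compact sketch of steps (i)-(iv), written directly from the standard Bezdek updates of Eqs. (4.20)-(4.23); the variable names, stopping rule and random initialization are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def fcm(F, c=2, e=2.0, tol=1e-4, max_iter=100, seed=0):
    """Fuzzy c-means on an (n, p) feature matrix F; returns the (c, n) membership
    matrix U and the (c, p) cluster centres."""
    rng = np.random.default_rng(seed)
    n = F.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)                    # columns sum to one, Eq. (4.15)
    for _ in range(max_iter):
        Um = U ** e
        centres = (Um @ F) / Um.sum(axis=1, keepdims=True)        # centre update
        d2 = ((F[None, :, :] - centres[:, None, :]) ** 2).sum(-1) + 1e-12
        U_new = d2 ** (-1.0 / (e - 1.0))
        U_new /= U_new.sum(axis=0, keepdims=True)                 # membership update
        if np.abs(U_new - U).max() < tol:                         # step (iv)
            U = U_new
            break
        U = U_new
    return U, centres

# U.argmax(axis=0) gives the hard (crisp) labels used as the initial estimate
# for the MAP/EM stage of Section 5.
```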
4.3. Unsupervised Segmentation

In our work with digital mammograms, a primary goal is to segment the textural regions of tumors from the textures of the background tissues. Each pixel of the image is characterized by a distinguishing feature vector estimated from the scheme formulated in Section 4.1. From this set of indicative feature vectors, an initial estimate of the label field can be found using the FCM clustering algorithm as mentioned above. This application results in an optimized fuzzy partition U in which each column of the matrix contains information concerning the belongingness of the corresponding pixel to a certain regional class. This amounts to a fuzzy decision on the class of the pixels, as opposed to the conventional k-means clustering algorithm. Since each element within a column of the matrix U gives a measure of membership of the corresponding pixel to a certain class, the pixel class can be determined from the element with the maximum value. This constitutes an unsupervised segmentation algorithm for the calculation of class labels. From an information processing point of view, fuzzy decisions have been found to be more accurate than hard or crisp decision based algorithms. In addition, the fuzzy partition matrix U can be used as a good initial estimate for the EM algorithm to be discussed below.

5. Expectation Maximization (EM) for MAP Segmentation

In this section, we will formulate the maximum a posteriori (MAP) segmentation algorithm for the estimation of the image labels or states via expectation maximization. Bayesian learning or MAP estimation using the Gibbs distribution has been applied to the problem of image restoration and reconstruction by many authors and researchers. The basic idea is that the images are modeled by a lattice of random variables in which each random variable describing a pixel is characterized only by its dependency on a carefully chosen neighborhood system. Due to the equivalence of the Markov random field and the exponential Gibbs distribution or GRF according to Hammersley and Clifford [11], many recent attempts in the
study of image segmentation have been concentrated on the study of MAP estimation under the Bayesian formulation. One major effort of the MAP estimation is the minimization of the energy function so obtained. In what follows, the EM algorithm [36,37,22,38] will be described briefly and it will be incorporated into the MAP segmentation estimation.
5.1. Basic Concepts of EM

In the EM algorithm, it is assumed that one has a measure space X of complete data and a measurable mapping $X \to Y(X)$ to a space Y of incomplete data. Let a member of the parametric family of density functions defined on the complete data space be denoted by $f(\mathbf{x}|\Phi)$ and a member from the incomplete data space be $f(\mathbf{y}|\Phi)$. As in maximum likelihood estimation (MLE), the final goal is the maximization of the likelihood function defined on the density function $f(\mathbf{y}|\Phi)$. However, since this contains incomplete data, it is in many instances easier and more convenient to maximize the complete data space density function $f(\mathbf{x}|\Phi)$. The maximization of the probability density function (pdf) $f(\mathbf{y}|\Phi)$ is ultimately achieved via the exploitation of the relationship between the density functions $f(\mathbf{x}|\Phi)$ and $f(\mathbf{y}|\Phi)$, which is the central idea of the EM algorithm. Presently we assume that the original image is characterized by a set of random vectors $\mathbf{x} = (\mathbf{y}, \mathbf{z})$. Using this notation, $\mathbf{y} = \{y_1, y_2, \ldots, y_{M^2}\}$ represents the incomplete data set of observable pixel intensities while $\mathbf{z} = \{z_1, z_2, \ldots, z_{M^2}\}$ is the unobservable data set which represents the state or label of each pixel. For the formulation here, the $z_i$'s are represented by standard unit vectors. Assume that there are K distinct regions or classes in the image, $1 \le k \le K$. If a specific pixel belongs to class k, then $z_i = e_k$, where $e_k$ is the standard unit vector which has the kth element equal to one and all other elements equal to zero. As in the GMRF case, the image pixels are again arranged in the lexicographic order for notational convenience. Representing the pdf of the incomplete data $\mathbf{y}$ by $f(\mathbf{y}|\Phi)$ and of the complete data $\mathbf{x}$ by $f(\mathbf{x}|\Phi)$, where $\Phi = (\Phi_y, \Phi_z)$ is a vector$^c$ of parameters characterizing the density, the incomplete pdf of $\mathbf{y}$ can be related to the a posteriori density $f(\mathbf{x}|\Phi)$ via Bayes rule,
$$f(\mathbf{x}|\Phi) = f(\mathbf{y}, \mathbf{z}|\Phi) = f(\mathbf{y}|\mathbf{z}, \Phi)\, f(\mathbf{z}|\Phi). \qquad (5.1)$$
Having specified the definitions above, the EM algorithm can be considered an iterative algorithm consisting primarily of two steps:
E-Step) Find $Q(\Phi|\hat{\Phi}^{(I)}) = E[\log f(\mathbf{x}|\Phi) \mid \mathbf{y}, \hat{\Phi}^{(I)}]$ \qquad (5.2)
M-Step) Find $\hat{\Phi}^{(I+1)} = \arg\max_{\Phi} Q(\Phi|\hat{\Phi}^{(I)})$ iteratively. \qquad (5.3)
$^c$ We assume that the parameters characterizing the observable data $\mathbf{y}$ are independent of the parameters characterizing the non-observable data $\mathbf{z}$.
In the notation used above, I represents the number of iterations. In the EM algorithm, the expectation or E-step consists of finding the expectation of the log likelihood function of $f(\mathbf{x}|\Phi)$ given the observed or measured image pixel values $\mathbf{y}$ and some initial estimate of the parameter vector $\hat{\Phi}^{(I)}$. The maximization or M-step requires that the Q function found in the E-step be maximized.

5.2. MAP Segmentation
As can be seen, the relationship between the incomplete and complete data obtained from Bayes rule requires a priori knowledge of the label field $\mathbf{z}$. In the present formulation, we assume both $f(\mathbf{y}|\mathbf{z}, \Phi)$ and $f(\mathbf{z}|\Phi)$ to be GRF's. That is, we let
$$f(\mathbf{z}|\Phi_z) = Z_z^{-1} \exp[-U_z(\mathbf{z}|\Phi_z)] \qquad (5.4)$$
and
$$f(\mathbf{y}|\mathbf{z}, \Phi) = Z_y^{-1} \exp[-U_y(\mathbf{y}|\mathbf{z}, \Phi)]. \qquad (5.5)$$
$Z_z$ and $Z_y$ represent the partition functions of the Gibbs distributions for $\mathbf{z}$ and $\mathbf{y}$ respectively. $U_z$ and $U_y$ represent respectively the energy functions for the Gibbs distributions of $\mathbf{z}$ and $\mathbf{y}$. Having made the GRF assumptions above, the log likelihood function can then be expressed as,
$$\log f(\mathbf{x}|\Phi) = \log f(\mathbf{y}, \mathbf{z}|\Phi) = \log f(\mathbf{y}|\mathbf{z}, \Phi) + \log f(\mathbf{z}|\Phi) = -U_y(\mathbf{y}|\mathbf{z}, \Phi) - \log Z_y - U_z(\mathbf{z}|\Phi_z) - \log Z_z. \qquad (5.6)$$
The application of conditional expectation of this log likelihood function results in the Q function of the E-step described previously, that is,
$$Q(\Phi|\hat{\Phi}^{(I)}) = -E\bigl[\, U_y(\mathbf{y}|\mathbf{z}, \Phi) + \log Z_y \mid \mathbf{y}, \hat{\Phi}^{(I)} \bigr] - E\bigl[\, U_z(\mathbf{z}|\Phi_z) + \log Z_z \bigr]. \qquad (5.7)$$
In order to evaluate the expectation above, it is necessary that $\mathbf{z}$ be separated from $\mathbf{y}$. This is possible even if we have assumed a GRF for $f(\mathbf{y}|\mathbf{z}, \Phi)$. For example, consider a second order $\eta$ GRF having the energy function
$$U_y(\mathbf{y}|\mathbf{z}) = \sum_{i=1}^{M^2} V_s(y_i|z_i) + \sum_{(i,j) \in C,\, i \neq j} V_d(y_i, y_j|z_i, z_j), \qquad (5.8)$$
where $V_s$ and $V_d$ represent the clique functions of the cliques with single and double pixels respectively. $\Phi$ is omitted here for notational convenience. $\mathbf{z}$ can then be separated from the clique functions resulting in,
where T represents matrix transpose and Vs(yi), Vd(yi, yj) are represented by
Applying the expectation in the E-step, we obtain for y,
An unsupervised algorithm for obtaining the parameters results if the above equation is maximized [22]. The parameters constituting the Gibbs distribution of both classes or regions can be estimated from the scheme outlined by Derin [14]. The different regions are estimated initially via the unsupervised segmentation mentioned above or from experienced users. Our present interest lies in the improvement of segmentation results after interaction with the expertise of trained personnel. Thus we are interested in finding the expectation $E[z_i \mid \mathbf{y}, \hat{\Phi}^{(I)}]$:
$$E[z_i \mid \mathbf{y}, \hat{\Phi}^{(I)}]_k = f(z_i = e_k \mid \mathbf{y}, \hat{\Phi}^{(I)}) \propto f(\mathbf{y} \mid z_i = e_k)\,\hat{\pi}_i^{(k)} \qquad (5.13)$$
where
$$\hat{\pi}_i^{(k)} = f(z_i = e_k). \qquad (5.14)$$
The next task is then to estimate $\hat{\pi}_i^{(k)}$ and $f(\mathbf{y} \mid z_i = e_k)$. In the following paragraphs, we will first discuss the estimation of the a priori density of $\mathbf{z}$ followed by the estimation of $f(\mathbf{y} \mid z_i = e_k)$. The a priori density of $\mathbf{z}$, arranged for notational convenience in the lexicographic order,$^d$ can be estimated by the pseudo-likelihood function introduced by
$^d$ This 1-D notation is assumed so that the pseudo-likelihood function can be estimated without confusion during the derivation. The parameter vector $\Phi_z$ has been neglected in the conditional density function for convenience.
Besag [39]. That is,
$$f(\mathbf{z}) = \prod_{j=1}^{M^2} f(z_j \mid z_t,\ t \in \eta_j) \qquad (5.15)$$
where $\eta_j$ is the predefined GRF local neighborhood of the pixel j. Thus each pixel label can be expressed as,
(5.16)
(5.17)
If we represent the previously estimated values of the pixel labels in the neighborhood as $\hat{z}_t$, the equation above can be rewritten as,
$$= f(z_j \mid \hat{z}_t,\ t \in \eta_j). \qquad (5.18)$$
The density function of the entire data record $\mathbf{y}$ can be estimated as [39],
$$f(\mathbf{y} \mid z_i) = f(y_i \mid z_i, \mathbf{y}_{\{i\}^*})\, f(\mathbf{y}_{\{i\}^*} \mid z_i) \qquad (5.19)$$
where $\{i\}^*$ represents the set of all pixel sites in the image except i. Making the approximation that $\mathbf{y}_{\{i\}^*}$ is independent of $z_i$, we obtain,
$$f(\mathbf{y}_{\{i\}^*} \mid z_i) = f(\mathbf{y}_{\{i\}^*}). \qquad (5.20)$$
Assuming that the image is also an MRF, we can write,
$$f(y_i \mid z_i, \mathbf{y}_{\{i\}^*}) = f(y_i \mid z_i, \mathbf{y}_{\eta_i}) \qquad (5.21)$$
where $\eta_i$ is again the local neighborhood defined. After substituting, we obtain,
$$f(\mathbf{y} \mid z_i) = f(y_i \mid z_i, \mathbf{y}_{\eta_i})\, f(\mathbf{y}_{\{i\}^*}). \qquad (5.22)$$
Thus assuming that the same neighborhood system is defined for both z and y , the components of the label vector can be rewritten as,
$\hat{z}_i^{(k)}$ represents the probability that the image pixel belongs to class k.
Now the MAP estimates of the label vectors are defined to be
$$\hat{z}_i = \arg\max_{z_i} f(\mathbf{y} \mid \mathbf{z}, \hat{\mathbf{z}}^{(r)})\, f(\mathbf{z} \mid \hat{\mathbf{z}}^{(r)}). \qquad (5.24)$$
This can be found from the largest of the $\hat{z}_i^{(k)}$'s, $k = 1, \ldots, K$, which amounts to a form of hard decision made on the soft probability estimates. In the present chapter, the initial estimates of the label fields are found from the optimal fuzzy partition matrix U of the FCM algorithm. Together with the parameter estimates, this MAP segmentation algorithm provides a soft label estimation scheme based on the EM algorithm. The EM algorithm in conjunction with the MWA and FCM clustering constitutes a highly efficient tool for the analysis or segmentation of real world images. In the subsequent sections, we will describe the application of the present novel algorithm to digital mammogram screening and wafer inspection.
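The sketch below gives one hedged reading of the resulting label update loop: a Gaussian class likelihood stands in for $f(\mathbf{y} \mid z_i = e_k)$, an Ising-like neighbour-agreement term stands in for the Gibbs prior of Eq. (5.15), and the hard decision of Eq. (5.24) is the arg-max over the soft posteriors. The specific likelihood, potential, and parameter values are assumptions for illustration only, not the authors' exact formulation.

```python
import numpy as np

def map_em_labels(img, init_labels, K=2, beta=1.0, n_iter=10):
    """Iterative MAP-style relabelling: Gaussian class likelihoods plus an
    Ising-like neighbour-agreement prior, followed by a hard arg-max decision."""
    labels = init_labels.copy()
    M, N = img.shape
    for _ in range(n_iter):
        # Refit class statistics from the current labels (M-step-like update).
        means, stds = [], []
        for k in range(K):
            vals = img[labels == k]
            means.append(vals.mean() if vals.size else img.mean())
            stds.append(vals.std() + 1e-6 if vals.size else img.std() + 1e-6)
        padded = np.pad(labels, 1, mode='edge')
        log_post = np.zeros((K, M, N))
        for k in range(K):
            log_lik = -0.5 * ((img - means[k]) / stds[k]) ** 2 - np.log(stds[k])
            agree = sum((padded[1 + di:M + 1 + di, 1 + dj:N + 1 + dj] == k)
                        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)])
            log_post[k] = log_lik + beta * agree        # likelihood + prior term
        labels = log_post.argmax(axis=0)                # hard decision, cf. Eq. (5.24)
    return labels
```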
6. Application to Digital Mammography

Breast cancer is the second leading cause of cancer-related mortality among American women [40], with predictions indicating increased incidence in the future. Considerable evidence has shown that early diagnosis and proper treatment of breast cancer will increase the chances of survival significantly. Recently, mammography screening has been recommended as the most reliable and non-invasive detection method as compared to other diagnostic methods which are currently available. Early diagnosis of breast cancer is based upon the radiographic visualization of microstructures such as microcalcifications and the localization of suspicious mass regions on the mammogram. Due to non-invasive concerns, the reduced radiation dose of mammographic images degrades the contrast and overall visibility of the microcalcifications and tumorous mass regions against their surrounding tissue. The extremely minute and elongated salt-like particles of microcalcifications are sometimes no larger than 0.1 mm in size and are responsible for the detection of approximately half (43%-49%) of all cancers detected mammographically. Localization of suspicious mass lesions has also been reported to be notoriously difficult due to the high degree of variability associated with normal and abnormal mammographic appearances, ranging from relative uniformity to patterns of bright streaks and blobs. It has been reported that negative mammograms were observed in 10% to 30% of the women who actually have breast cancer [41,42] and 40% of these misdiagnosed cancers appear as masses on the mammograms [42]. Consequently computer-aided-diagnostic (CAD) methods for breast cancer screening have been assessed as the most cost effective way to assist radiologists in x-ray interpretation. Serving as a second opinion to the radiologists, automation of the film screening process also ensures intra-consistency as it is not subject to fatigue. Computer-aided diagnosis is based upon the radiographic enhancement and visualization of microstructures such as microcalcifications and suspicious mass regions on
the input mammogram. The rudimentary process of many CAD methods therefore involves the detection of target microcalcifications and segmentation of suspicious mass regions on the mammographic images. A number of computer vision methods have been proposed for both microcalcification detection and mass region segmentation. Qian et al. [43,44] reported on the extraction of microcalcifications via M-channel quadrature mirror filters (QMF) and classification of these extracted microcalcifications via an artificial neural network. Giger et al. [45] described a supervised method for the detection of microcalcifications. Brzakovic and Neskovic [46] also reported on the use of fuzzy pyramid linking for mass localization. Recently, Petrick et al. [47] reported on a two-stage adaptive density-weighted contrast enhancement (DWCE) technique for breast mass detection. In this section we apply the novel multiresolution wavelet analysis (MWA) and nonstationary Gaussian Markov random field (GMRF) technique for the identification of microcalcifications with high accuracy. The hierarchical multiresolution wavelet information in conjunction with the contextual information of the images extracted from the GMRF provides a highly efficient technique for microcalcification detection. A Bayesian learning paradigm realized via the expectation maximization (EM) algorithm will also be introduced for edge detection or segmentation of subtle mass lesions recorded on the mammograms. The effectiveness of the present approach has been extensively tested with a number of biopsy proven mammographic images. The relative performances of the MWA and GMRF algorithm and traditional wavelet methods in microcalcification detection are assessed via the receiver operating characteristic (ROC) curve. The accuracy of the EM algorithm in mass region segmentation is also presented via simple visualization.
Analysis and Gaussian Markov Random Field Scheme The detection of microcalcification on mammographic images is notoriously difficult due to the minute sizes of these microstructures which reveal themselves as clusters of extremely tiny white dots. In computer visualization of digital mammograms] it is therefore essential that these localized features be extracted for diagnostic interpretations. In this section, the MWA and GMRF algorithm introduced in Section 4 is used for the detection of biopsy proven microcalcifications. Initially, mammographic images are hierarchically decomposed into different resolutions using the two-channel wavelet transform. In general, larger breast lesions are characterized by coarser resolutions whereas higher resolutions show finer and more detailed anatomical structures. By hierarchically and systematically studying the variations of detailed anatomical structures discerned at a hierarchy of resolutions, extremely fine details of the images such as microcalcifications can be observed. In addition, during the process of resolution reduction] the difference in image details or image information between two consecutive resolution levels is also recorded by the wavelet transform. This difference in information or the so-called detailed
168
C. H. Chen & G. G. Lee
signals are important for the purpose of image detail comparison during microcalcification detection. Wavelet analysis is very appropriate in the study of images with more localized features such as the microcalcifications. The two-channel wavelet transform does not in general provide sufficient contextual information contained in different textures such as the mammograms under study. Thus in addition to examining the image details at different resolutions, contextual information from the wavelet decomposed image is also studied using a nonstationary GMRF with remarkable improvements in detection accuracy. By using this concept, the computer is programmed exhaustively to scrutinize each image pixel based on the information provided by a small neighborhood of pixels in the immediate vicinity. In addition, the hierarchical variations in the anatomical features displayed by multiresolution wavelet decomposition is further quantified through the application of GMRF. Since GMRF models natural textures such as mammograms well and because of its uniqueness in locality, incorporation of GMRF in the MWA scheme is expected to provide a highly efficient tool for microcalcification detection. In the detection of microclacifications, the spectral-spatial components of the mammographic images are initially found from the multiresolution decomposition scheme of MWA. Originating from the concept of the pyramidal hierarchy of resolutions, the original high resolution X-ray image consists of all the information including fine details that are necessary for the interpretation of diagnostic symptoms. As the resolution of the mammograms is reduced, the finer details of the images possibly consisting of microcalcifications are decomposed out but recorded in the detailed subband subimages. Thus by a systematic examination of the detailed information lost in each hierarchical resolution, one can resort to the spectral and spatially oriented detailed signals for the position and frequency components where the microcalcifications reside. The emphasis of these frequency components at different spatial orientations can also be observed in the subimages which consist of different subbands of the frequency spectrum in the spatial domain. However, due to the variations in real world textural images and great variability in medical images existing amongst patients, the cutoff frequencies in the two-channel wavelet transform may miss the frequency components which characterize the microcalcifications at certain sites of the image lattice. These essential localized characteristics can however be provided by the contextual information embedded around the vicinity of the calcification candidates. Therefore, by the use of nonstationary GMRF, scrutinzation of the image pixels in the close neighborhood provides pertinent information for the detection of the microcalcifications. After MWA and denoting the original image as Y ' , the absolute difference between Yf and Y: will provide localized spectral-spatial properties. Subsequently, the localized variance feature of each pixel in the resulting image is estimated using Eq. (4.3). As can be seen, each pixel in the image lattice is being characterized by a variance whose estimation is based upon the linear combination parameters in the neighborhood estimated from Eq. (4.2). Since each of the estimated parameters signifies the relative influence of
1.5 O n Multiresolution Wavelet Analysis Using. . . 169 the corresponding neighborhood pixels to the pixel under study, contextual information has been embedded in the pixel variance estimate described by Eq. (4.3). Finally, the pixel variances estimated from the nonstationary GMRF, i?ij are normalized via the sigmoid function expressed below,
where a is the scaling constant and b is the slope parameter of the logistic function. This contextual information has been incorporated into the spectral-spatial properties of MWA with remarkable improvements in detection accuracy. Representative images of the digitized mammograms and the results of microcalcifications are shown in Figs. 6 to 11. Healthy mammograms which consist of no microcalcifications and subtle mass regions are presented in Figs. 12. The a and c portions of Figs. 6 to 10 display the result of mammogram digitization with varying sizes and shapes of microcalcifications. In each figure, the a and c parts represent regions of interest (ROI) taken from the mammograms of the same patient from
Fig. 6. (a) RCC; (b) Result of Microcalcification Detection; (c) RMLO; and (d) Result of Microcalcification Detection.
Fig. 7. (a) RCC; (b) Result of Microcalcification Detection; (c) RMLO; and (d) Result of Microcalcification Detection.
different angles during X-ray acquisition. The results of microcalcification detection are shown in parts (b) and (d) of each figure. In each of Figs. 6 to 10, part (b) corresponds to the result of microcalcification detection for part (a) whereas part (d) corresponds to part (c). In the course of X-ray acquisition, two views, the craniocaudad (CC) and medio-lateral-oblique (MLO), are taken of each breast. The first letter in each mammogram label denotes the right or left breast. Figure 11 shows the case of a patient with the development of microcalcifications in a cancerous mass region. In Fig. 11(a), the tumors with microcalcifications are seen from the left auxiliary (LAUX) view. Figures 11(c) and 11(e) show the calcified tumor from the left-caudal (LCAUD) and left-medio-lateral (LMEDLAT) views respectively. As can be seen from the mammograms of patients 1 to 5, the background gray levels and tissue textures vary greatly amongst different patients. The variations in the background gray levels of different patients are consequences of the different dosages and results of film processing during each patient's clinical visit. The variability of the tissue textures and the low perceptibility of target microcalcifications account for the major difficulties of mammogram screening. The MWA and GMRF microcalcification detection algorithm described in this chapter achieved a sensitivity of 94% with a specificity of 88%. The conventional
Fig. 8. (a) RCC; (b) Result of Microcalcification Detection; (c) RMLO; and (d) Result of Microcalcification Detection.
wavelet transform, however, achieved only a sensitivity of 80% with a specificity of 80%. A comparison of the relative performance of the MWA and GMRF scheme and the conventional wavelet transform is made using the receiver operating characteristic (ROC) curve shown in Fig. 13. The solid line in Fig. 13 represents the ROC curve for MWA and GMRF whereas the dotted line shows the result of microcalcification detection using the conventional wavelet transform.

6.2. Segmentation of Subtle Mass Regions
Segmentation of subtle mass regions constitutes another difficult and important context of digital mammography screening. Due to the low contrast of mammograms, images containing fuzzy edges with low signal to noise ratio and complex textural backgrounds are observed. In the present segmentation problem, each disjoint region of the image is assumed to be a different class. The task of the image recognition algorithm is thus to find the optimal classification which best characterizes the regions of the image. The maximum a posteriori (MAP) segmentation estimation scheme based on the Bayesian learning paradigm introduced in Section 5 will be used due to the prospective potential of machine learning to contribute to the development of robust computer vision algorithms in mammography. Moreover,
Fig. 9. (a) LCC; (b) Result of Microcalcification Detection; (c) LMLO; and (d) Result of Microcalcification Detection.
the experience and expertise of the radiologist can also be incorporated into the learning paradigm, not only for higher segmentation accuracy but also for constructing a fundamental framework for future comparison and screening of prospective suspicious mass regions. The most difficult task in the segmentation of subtle masses is the detection of the low signal-to-noise ratio edges or boundaries between the mass regions and the surrounding textures of the breast tissues. Due to non-invasive concerns, the radiation dose applied in the acquisition of mammograms in general degrades the contrast between prospective suspicious mass regions and their surrounding textural backgrounds. It is necessary again that adaptive localized features be extracted via the multiresolution wavelet analysis scheme to defuzzify the uncertainties existing at the borders of different regions. In a manner similar to that described for the detection of microcalcifications, the mammograms with subtle mass candidates are first decomposed into a hierarchy of different resolutions. As was mentioned, the gross outlines of edge information can be observed in the coarser or lower resolutions of the pyramidal scheme whereas the finer or more fuzzy details of the boundaries are revealed in the higher resolutions of the pyramid [48]. Thus a hierarchy of mammographic image resolutions systematically archives the gradual variations of the
Fig. 10. (a) LCC; (b) Result of Microcalcification Detection; (c) LMLO; and (d) Result of Microcalcification Detection.
edge information, which is quantified in this chapter via the application of the GMRF. The nonstationary GMRF is applied to each of the resolution levels obtained via the two-channel wavelet decomposition. Pixels in the same region will therefore be characterized by similar features whereas different regions are modeled by different discriminant features. It can be shown that greater variations of the features will be observed across different image regions whereas smaller variations of the features can be observed from within a region. This feature extraction scheme is especially efficient for the characterization of the blurry or fuzzy pixels at the boundaries between the subtle mass regions and the background tissues. Based on the multiresolution and contextual information embedded in each feature vector, determination of the relative membership or belongingness of the pixel (i, j) to a certain textural region or class is built upon the fuzzy decision concept of FCM clustering as discussed in Section 4. For the purpose of subtle mass localization, the result of this non-supervised algorithm is used as the initial estimate for the MAP estimation. In the formulation of the MAP estimate, Gibbs priors or Gibbs random fields have also been incorporated into the learning scheme of the present research with very effective results. The MAP estimation problem usually results in the
Fig. 11. (a) LAUX; (b) Result of Microcalcification Detection; (c) LCAUD; (d) Result of Microcalcification Detection; ( e ) LMEDLAT; and (f) Result of Microcalcification Detection.
minimization of the energy function contained in the a posteriori distribution. The main difficulty of the problem is due primarily to the existence of non-convexity in the energy function so obtained. Thus it is possible for the estimates to become trapped in local minima of the energy function. The result of localization of the subtle mass regions is shown in Figs. 14 to 17. Figure 14(a) shows the result of digitizing a ROI from patient 7 having fibroadenoma
Fig. 12. Mammograms with Healthy Tissues
Fig. 13. ROC Curve; the solid line represents the result for the MWA & GMRF Algorithm while the dashed line represents the conventional wavelet methods.
Fig. 14. (a) RCC; (b) Result of Segmentation via EM Algorithm; (c) LCC; and (d) Result of Segmentation via EM Algorithm Using 3 Classes, c = 3.
Fig. 15. (a) ROI taken from LAUX; (b) Result of Segmentation Via EM Algorithm; and ( c ) Superposition of Microcalcification Detection and Mass Region Segmentation.
in the right breast. The result of image segmentation using the EM algorithm after FCM clustering is shown in Fig. 14(b). In this image, only two textural regions or classes are assumed where one of the classes characterizes the mass regions and the second class characterizes the background tissue. The cancerous mammogram
Fig. 16. (a) ROI Taken From LCAUD; (b) Result of Segmentation Via EM Algorithm; and (c) Superposition of Microcalcification Detection and Mass Region Segmentation.
Fig. 17. (a) ROI Taken From LMEDLAT; (b) Result of Segmentation Via EM Algorithm; and (c) Superposition of Microcalcification Detection and Mass Region Segmentation.
of the left breast of patient 7 is seen in Fig. 14(c). The result of FCM and EM segmentation superimposed on the original mammographic image is presented in Fig. 14(d). For this image, three classes or textural regions were assumed, resulting in the contour lines observed in the segmentation. In this figure, each of the contoured or segmented regions represents an area of different density on the tumor. In Figs. 15(a), 16(a), and 17(a), the mammograms for patient 6 from the LAUX, LCAUD, and LMEDLAT views are repeated because of the interesting presence of microcalcifications within the mass region. The results of their FCM and EM segmentation are shown in Figs. 15(b), 16(b), and 17(b) respectively. The results of both microcalcification detection and mass region localization are shown superimposed in Figs. 15(c), 16(c) and 17(c).
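To make the clustering step concrete, the sketch below (added here as an illustration; it is not the authors' implementation, and the feature array, the class count c = 2 and all identifiers are placeholder choices) shows a plain fuzzy c-means routine of the kind used to obtain the initial labels that seed the MAP/EM refinement with the Gibbs prior described above. In the chapter's pipeline, the rows of the feature array would be the per-pixel multiresolution GMRF feature vectors.

import numpy as np

def fuzzy_c_means(features, c=2, m=2.0, iters=50, eps=1e-5):
    # Plain fuzzy c-means on an (N, d) array of per-pixel feature vectors.
    # Returns the (N, c) membership matrix and the c cluster centers.
    rng = np.random.default_rng(0)
    u = rng.random((features.shape[0], c))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(iters):
        um = u ** m
        centers = (um.T @ features) / um.sum(axis=0)[:, None]
        dist = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        u_new = 1.0 / (dist ** (2.0 / (m - 1.0)) *
                       np.sum(dist ** (-2.0 / (m - 1.0)), axis=1, keepdims=True))
        if np.abs(u_new - u).max() < eps:
            u = u_new
            break
        u = u_new
    return u, centers

features = np.random.rand(1000, 4)      # stand-in for multiresolution GMRF features
u, centers = fuzzy_c_means(features, c=2)
labels = u.argmax(axis=1)               # hard labels used to initialize the EM/MAP step

The hard labels obtained from the fuzzy memberships play the role of the unsupervised initial estimate; the EM iterations with the Gibbs prior then refine them toward the MAP segmentation.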
7. Application to Wafer Inspection

Another real-world application of the computer vision algorithm presented in this chapter is in the detection and isolation of defects in wafers in the integrated circuit (IC) industry [49]. It is well known in the IC industry that there is a
continuous drive for shrinking line widths and increasing wafer sizes. In the production of DRAM, 40-60 steps were required in the manufacturing process of an average semiconductor device five years ago. Presently, 230 processing steps are required for the production of a 16-megabit DRAM and 590 steps are required in a 256-megabit device. As a result of this increasing demand in complexity of manufacturing processes, it is inevitable that the IC industry requires process control tools with higher accuracy, faster speed and greater sensitivity in inspection. These demands however depend on the design of good computer vision algorithms for the inspection of wafers. In the process of semiconductor manufacturing, wafer inspections are used for the detection of defects and to subsequently provide feedback information to the process control system. Throughput and sensitivity are the two basic trade-offs of a wafer inspection system in an endeavor to increase productivity. Presently two technical approaches, namely the "darkfield" and "brightfield" imaging techniques, have been introduced for the wafer inspection problem. The first method uses brightfield imaging techniques for the comparison of images in a die-to-die or cell-to-cell process. The brightfield imaging technique was assessed to be more sensitive but slower. The darkfield approach, based on illumination in a darkfield and the detection of light scattering using CCD imaging, is very fast and sensitive. This technique however is more susceptible to noise. The MWA and GMRF algorithm can be applied for the achievement of sufficient sensitivity and noise immunity in wafer inspection for the IC industry. Initially a darkfield color image was converted into grayscale images in preparation for image analysis. In the manner described in Section 4, the MWA and GMRF algorithm was applied subsequently. Figure 18(a) shows an original image with a defect and background both characterized by complex textures. Thus simple edge detectors and region labeling techniques are not applicable. The result of using the MWA and GMRF algorithm is presented in Fig. 18(b). It can be seen that the defect is extracted from the background with clear-cut edges.
Fig. 18. (a) Original Wafer Inspection Image; and (b) Segmentation Using MWA & GMRF
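As a rough illustration of the front end of this pipeline (added here; the wavelet choice, the number of levels and the simple energy features are assumptions of this sketch and only stand in for the actual two-channel decomposition and GMRF features of the MWA and GMRF algorithm), the following code converts a color darkfield image to grayscale and builds a multiresolution wavelet pyramid using the PyWavelets package.

import numpy as np
import pywt  # PyWavelets

def multiresolution_features(color_image, levels=2, wavelet='haar'):
    # Grayscale conversion followed by a 2-D wavelet decomposition; the
    # detail-coefficient energies per level act here as a simplified
    # stand-in for the per-level GMRF features described in the text.
    gray = color_image.mean(axis=2) if color_image.ndim == 3 else color_image
    coeffs = pywt.wavedec2(gray, wavelet, level=levels)
    approx, details = coeffs[0], coeffs[1:]
    energies = [sum(float(np.mean(d ** 2)) for d in level) for level in details]
    return approx, energies

wafer = np.random.rand(64, 64, 3)       # placeholder for a digitized darkfield wafer image
approx, energies = multiresolution_features(wafer)
print(approx.shape, energies)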
As a concluding remark, the MWA and GMRF algorithm presented in this chapter can indeed achieve a superior image segmentation performance because of effective utilization of contextual information at different resolution levels of a textured image, as demonstrated by real-world examples. The significant computational cost remains a price to be paid and needs to be greatly reduced.

References
[1] B. Bhanu and T. Poggio, Introduction to the special section on learning in computer vision, IEEE Trans. PAMI 16, 9 (1994) 865-868.
[2] H. Chang and C. C. J. Kuo, Texture analysis and classification with tree-structured wavelet transform, IEEE Trans. Image Processing 2, 4 (1993) 429-444.
[3] K. C. Chou, S. A. Golden and A. S. Willsky, Multiresolution stochastic models, data fusion and wavelet transform, Signal Processing 34 (1993) 257-282.
[4] M. Unser and M. Eden, Multiresolution feature extraction and selection for texture segmentation, IEEE Trans. PAMI 11, 7 (1989) 717-728.
[5] P. Levy, A special problem of Brownian motion and a general theory of Gaussian random functions, in Proc. 3rd Berkeley Symp. Math. Statist. and Prob. 2 (1956).
[6] B. McCormick and S. N. Jayaramamurthy, Time series model for texture synthesis, Int. J. Comput. Inform. Sci. 3 (1974) 329-343.
[7] K. Abend, T. Harley and L. N. Kanal, Classification of binary random patterns, IEEE Trans. Inform. Theory 11 (1965) 538-544.
[8] R. Chellappa and R. Kashyap, Texture synthesis using 2-D noncausal autoregressive models, IEEE Trans. ASSP 33, 1 (1985) 194-203.
[9] J. Woods, Two-dimensional discrete Markovian fields, IEEE Trans. Inform. Theory 18, 2 (1972) 232-240.
[10] J. Besag, Spatial interaction and statistical analysis of lattice systems (with discussion), J. Royal Statistical Society B36 (1974) 192-236.
[11] F. Spitzer, Markov random fields and Gibbs ensembles, Amer. Math. Mon. 78 (1971) 142-154.
[12] G. Cross and A. K. Jain, Markov random field texture models, IEEE Trans. PAMI 5, 1 (1983).
[13] S. Geman and D. Geman, Stochastic relaxation, Gibbs distribution, and Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6, 6 (1984) 721-741.
[14] H. Derin and H. Elliott, Modeling and segmentation of noisy and textured images using Gibbs random fields, IEEE Trans. PAMI 9, 1 (1987).
[15] S. Krishnamachari and R. Chellappa, GMRF models and wavelet decomposition for texture segmentation, IEEE Proc. ICIP (1995) 568-571.
[16] C. H. Chen and G. G. Lee, Neural networks for ultrasonic NDE classification using time-frequency analysis, IEEE ICASSP'93 Proceedings.
[17] C. H. Chen and G. G. Lee, Wavelet analysis in IR image feature extraction, Progress in Electromagnetics Research Symposium (1995).
[18] C. H. Chen and G. G. Lee, Multiresolution wavelet analysis based feature extraction for neural network classification, IEEE Proc. on ICNN (1996).
[19] T. N. Pappas, An adaptive clustering algorithm for image segmentation, IEEE Trans. Signal Processing 40 (1992) 901-914.
[20] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (Plenum Press, 1981).
[21] J. C. Bezdek, A review of probabilistic, fuzzy, and neural models for pattern recognition, in Fuzzy Logic and Neural Network Handbook, C. H. Chen (ed.), Ch. 2 (McGraw-Hill, 1996).
[22] J. Zhang and J. W. Modestino, Maximum-likelihood parameter estimation for unsupervised stochastic model-based image segmentation, IEEE Trans. Image Processing 3, 4 (1994) 404-420.
[23] C. H. Wu and P. C. Doerschuk, Cluster expansion for the deterministic computation of Bayesian estimators based on Markov random fields, IEEE Trans. PAMI 17, 3 (1995) 275-293.
[24] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, Optimization by Simulated Annealing (IBM Thomas J. Watson Research Center, Yorktown Heights, NY, 1982).
[25] A. Blake and A. Zisserman, Visual Reconstruction (MIT Press, 1987).
[26] D. Geiger and A. Yuille, A common framework for image segmentation, Int. J. Comp. Vision 6, 3 (1991) 227-243.
[27] C. H. Wu and P. C. Doerschuk, Tree approximations to Markov random fields, IEEE Trans. PAMI 17, 4 (1995).
[28] I. Daubechies, The wavelet transform, time-frequency localization and signal analysis, IEEE Trans. Inform. Theory 36 (1988) 961-1005.
[29] S. Mallat, A theory for multiresolution signal decomposition: The wavelet representation, IEEE Trans. Patt. Anal. Mach. Intell. 11 (1989) 674-693.
[30] T. N. Wiesel, Postnatal development of visual cortex and the influence of environment (Nobel Lecture), Nature 299 (1982) 583-591.
[31] C. H. Chen and G. G. Lee, A Gaussian Markov random field and multiresolution wavelet analysis of digital mammography, IEEE Proc. Int. Conf. on Pattern Recognition (1996).
[32] R. L. Kashyap and R. Chellappa, Estimation and choice of neighborhoods in spatial interaction models of images, IEEE Trans. Inform. Theory 29 (1983) 60-72.
[33] R. Chellappa and R. L. Kashyap, Model based texture segmentation and classification, in Handbook of Pattern Recognition & Computer Vision, C. H. Chen, L. F. Pau and P. S. P. Wang (eds.) (World Scientific, 1992) 277-310.
[34] E. Ising, Zeitschrift Physik 31 (1925) 253.
[35] M. Hassner and J. Sklansky, The use of Markov random fields as models of textures, Comput. Graph. Image Process. 12 (1980) 357-370.
[36] R. A. Redner and H. F. Walker, Mixture densities, maximum likelihood and the EM algorithm, SIAM Review, Society for Industrial and Applied Math. 26, 2 (1984).
[37] L. Bedini, E. Salerno and A. Tonazzini, Edge-preserving tomographic reconstruction from Gaussian data using Gibbs prior and a generalized expectation-maximization algorithm, Int. J. Imaging Systems and Technology 5, 3 (1994).
[38] A. P. Dempster and N. M. Laird, et al., Maximum likelihood from incomplete data via the EM algorithm, J. Royal Statistical Society 39 (1977) 1-38.
[39] J. Besag, On the statistical analysis of dirty pictures, J. Royal Statistical Society B48, 3 (1986) 192-236.
[40] C. C. Boring, T. S. Squires, T. Tong and S. Montgomery, Cancer statistics, A Cancer J. Clinicians 52, 1 (1994) 7-26.
[41] J. E. Martin, M. Moskowitz and J. R. Milbrath, Breast cancer missed by mammography, Am. J. Roentgenol. 132 (1979) 737-739.
[42] R. E. Bird, T. W. Wallace and B. C. Yankaskas, Analysis of cancers missed at screening mammography, Radiology 184 (1992) 613-617.
[43] W. Qian, L. P. Clark, H. D. Li, R. Clark and M. L. Silbiger, Digital mammography: M-channel quadrature mirror filters (QMFs) for microcalcification extraction, Computerized Medical Imaging and Graphics 18, 5 (1994) 301-314.
[44] W. Qian, L. P. Clark, H. D. Li, R. Clark and M. L. Silbiger, Digital mammography: Mixed feature ANN for automatic detection of microcalcifications, WCNN 2 (1995) 849-853.
[45] H. Yoshida and W. Zhang et al., Optimized wavelet transform based on supervised learning for detection of microcalcifications in digital mammography, IEEE Proc. ICIP (1995) 152-155.
[46] D. Brzakovic and M. Neskovic, Mammogram screening using multiresolution-based image segmentation, Int. J. Patt. Recog. and Artificial Intell. 7, 6 (1993) 1437-1460.
[47] N. Petrick, H. P. Chan and D. Wei, An adaptive density-weighted contrast enhancement filter for mammographic breast mass detection, IEEE Trans. Medical Imaging 15, 1 (1996) 59-67.
[48] A. Rosenfeld (ed.), Image Modeling (Academic Press, 1981).
[49] B. E. Dom and V. Brecher, Recent advances in automatic inspection of integrated circuits for pattern defects, Machine Vision and Applications 8, 1 (1995) 5-19.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 183-203 Eds. C. H. Chen, L. F. Pau and P. S. P. Wang @ 1998 World Scientific Publishing Company
CHAPTER 1.6
A FORMAL PARALLEL MODEL FOR THREE-DIMENSIONAL OBJECT PATTERN REPRESENTATION
P. S. P. WANG College of Computer Science, Northeastern University, Boston, MA 02115, USA A new model for three-dimensional object pattern representation is introduced. It uses parallel techniques and significantly reduces the time required for dealing with three-dimensional image analysis problems. The fundamental properties, the concept of finite representations, the tools of three-dimensional feature extraction and segmentation are investigated and several interesting examples are illustrated. In addition to its importance in theoretical study, the model can also be applied to three-dimensional object recognition, image processing, and computer vision in industries, and the military and medical fields.
Keywords: Computer vision, image processing, 3-D array grammars, universal array grammars, parallel derivation and generation, object pattern representation.
1. Introduction Three-dimensional vision and image processing problems have attracted wide attention among pattern recognition researchers. Because of its complexity and the large number of pixels involved in a 3-D image, sequential methods normally take too much time and are not very practical. One way to overcome such difficulty is to use “parallel processing”, i.e. to handle several pixels at the same time (simultaneously) rather than one at a time. As mentioned in [12,13] by Rosenfeld, pictorial patterns often consist of subpatterns of simple(r) types that are combined in particular ways. The subpatterns in turn may consist of still simpler sub-subpatterns, and so on. This method of describing patterns in terms of subpatterns, sub-subpatterns, etc. is analogous to describing sentences in terms of clauses, phrases, etc. Since one can determine the syntactic structure of a sentence by parsing it in accordance with grammatical rules, this suggests that it should be possible to determine the structure of a pictorial pattern by “parsing” it in accordance with the rules of a “picture grammar”. Ever since such an idea was first raised by Minsky [lo], there have been many syntactic and structural methods developed for solving scene analysis, image understanding, and pattern recognition problems [5,13]. But most of them are sequential and are limited to two-dimensional space only. Today, in dealing with more and
more complicated problems, there is a need to establish a more general model for higher-dimensional images and patterns. In this chapter, we introduce such a formal model known as "array grammar", which has several advantages over others. It is a powerful pattern generative model generalized from Chomsky's phrase structure grammar [3]; it is sufficiently flexible to be extended to higher dimensions [6,18]; it has been shown to be more accurate than some other methods for clustering analysis [22]; it can be highly parallel and as powerful as tessellation or cellular automata [6,7]; and it can provide a sequential/parallel model that serves as a compromise between a purely sequential model, which takes too much time for large arrays, and a purely parallel one, which normally requires too much hardware for large digital patterns [20]. Besides, it provides a good setting to gain insight and in-depth views of multi-dimensional parallel computation, automata, and language theory [11,14,15]. Part of this research was motivated by the work done at MIT [8,9,10] and a preliminary version of the idea was presented at the SPIE Conference on Intelligent Robotics and Computer Vision [2].

2. Preliminaries, Notations, Definitions and Examples

Let us take a look at the two objects in Fig. 1. From what we see, how do we describe them and what are the differences (and similarities) between the two objects? One probable answer is, "Both are wire-like objects, but Fig. 1(a) has
(a) three-branch wire-like object
(b) six-branch wire-like object
Fig. 1. Two multi-branch wire-like objects.
three branches while Fig. 1(b) has six, and they are of various lengths." More specifically, if we look at these objects from location (0,0,0), then (a) has a line segment stretching m units along the y direction, reaching location A; then from A it stretches p units in the x direction, and n units in the z direction. Likewise, object (b) can be perceived, understood, described, analyzed, memorized, and recognized in a similar way. This is from the human point of view. But what about the point
of view of the computer? What is a mechanical way of describing an object like in Fig. 1, and can the computer understand it? This involves a very important concept in pattern recognition, known as pattern representation. Here we propose a structural approach using a universal array grammar for three-dimensional object representation. We adapt the basic definitions and notations of array grammars from earlier work in the literature [2,18,24]. The concept of three-dimensional (3-D) array grammars will be introduced, which can be considered as an extension or generalization of their two-dimensional (2-D) counterparts. It also retains the basic properties of array grammars, i.e. all productions (generating, rewriting, or derivation rules) are isometric, i.e. both sides of each rule are geometrically identical. This is to avoid the shearing effect [18].
Definition 2.1. A 3-D array grammar is G = (V_n, V_t, P, S, #), where
V_n: set of nonterminals,
V_t: set of terminals,
S ∈ V_n: start symbol,
# ∉ V_n ∪ V_t: blank symbol,
P: set of isometric production rules alpha(x, y, z) → beta(x, y, z). During the derivation process, the locations of each nonterminal that should be applied (replaced) simultaneously (in parallel) are specified (by their (x, y, z) coordinates).

Definition 2.2. Parallel derivation. When a rule is applied, it is applied to all nonterminals of the array alpha simultaneously (under the specifications). Let ⟹ be a binary relation between two sentential arrays (arrays that are connected and derivable from the initial sentential array with a singleton S surrounded by an infinite number of blank symbols in the 3-D Cartesian space) alpha and beta. We say alpha ⟹ beta if alpha produces (generates, derives) beta. Let ⟹* be the transitive and reflexive closure of ⟹. Then the language (pattern) generated by G is denoted as follows:

L(G) = { R | S ⟹* R, R ∈ V_t^{++}, derived according to the coordinates specified and connected (according to the six-neighborhood) }.
The six direction vectors of 3-D space and the basic 3-D six neighborhood are shown in Fig. 2.
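To make Definitions 2.1 and 2.2 more concrete, the following small sketch (an illustration added here, not part of the original chapter) represents a sentential array as a set of occupied lattice points and applies one growth rule to every nonterminal S simultaneously, in the spirit of parallel derivation; the numbering of the six directions follows the usage in the examples below and is otherwise an assumption of this sketch.

# A sentential array is modelled as a dict mapping (x, y, z) -> symbol;
# absent cells are the blank symbol '#'.
SIX_NEIGHBORHOOD = {
    1: (1, 0, 0), 2: (0, 1, 0), 3: (-1, 0, 0),
    4: (0, -1, 0), 5: (0, 0, 1), 6: (0, 0, -1),
}

def parallel_step(array, direction):
    # Apply one growth rule to every nonterminal 'S' simultaneously:
    # each blank neighbor of S in the given direction becomes S.
    dx, dy, dz = SIX_NEIGHBORHOOD[direction]
    new_cells = {}
    for (x, y, z), sym in array.items():
        if sym == 'S':
            target = (x + dx, y + dy, z + dz)
            if target not in array:        # the cell is blank
                new_cells[target] = 'S'
    array.update(new_cells)
    return array

def terminate(array):
    # Rule rewriting every nonterminal S to the terminal '*' (a unit cube).
    return {pos: '*' for pos in array}

sentential = {(0, 0, 0): 'S'}              # the initial singleton S at the origin
for _ in range(3):
    parallel_step(sentential, 1)           # grow three cells along the +x axis
print(sorted(terminate(sentential)))       # four unit cubes along the x-axis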
Example 2.1.
G_u = (V_n, V_t, P, S, #), where V_n = {S}, V_t = {*} and P consists of the following rules:
(1) S # → S S (the blank neighbor of S along the +x direction becomes S)
(2) the blank neighbor of S along the +y direction becomes S (written vertically in the original)
(3) # S → S S (the blank neighbor of S along the −x direction becomes S)
(4) the blank neighbor of S along the −y direction becomes S (written vertically in the original)
(5) S a # → S a S [where a means the left symbol is above the right symbol] (along the z-axis)
(6) S b # → S b S [where b means the left symbol is below the right symbol] (along the z-axis)
(7) S → *

Fig. 2. 3-D space and six-neighborhood.
Without loss of generality, let us assume at the beginning that S is at (0,0,0) (surrounded by an infinite number of #'s). Notice that the neighborhood of (0,0,0) is {(0,0,0), (1,0,0), (0,1,0), (0,0,1), (−1,0,0), (0,−1,0), (0,0,−1)}.
In general, the neighborhood of (i, j, k) is {(i, j, k), (i ± 1, j, k), (i, j ± 1, k), (i, j, k ± 1)}.
Consider * as a unit cube. G_u works as a "universal 3-D array grammar" extended from the 2-D universal array grammar introduced in [24]. In conventional syntactic pattern recognition, each pattern is characterized by a grammar. When the number of classes under consideration is very large, the grammar becomes very big, involving many grammar symbols and production rules evolved from combining all classes of patterns, each represented by its individual grammar. Therefore parsing a given input pattern is very time consuming. This in turn makes classification, clustering and recognition very difficult, if not impossible. The 3-D universal array grammar introduced in this chapter can overcome such difficulty. Each 3-D object can be represented by a 1-D string (parsing sequence).
Patterns of similar properties or characteristics are represented by the same or similar parsing sequence, as illustrated by the following examples.
Example 2.2. Consider the objects in Fig. 1. The parsing sequence of Fig. 1(a) is: 2^m A 1^p A 5^n 7.
Since every sequence is terminated by rule 7, from now on we will omit '7' without loss of generality and, combining the two A's, make a more compact representation for Fig. 1(a) as: 2^m A(1^p 5^n). Similarly, the parsing sequence of Fig. 1(b) is: 2^m A(1^p 2^r 3^s 5^n 6^q).
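The compact strings above can be read as programs that trace out the object. The sketch below (added for illustration; the branch-point handling and the concrete arm lengths are choices made here) expands the representation 2^m A(1^p 5^n) of Fig. 1(a) into explicit derivation steps and counts the occupied cells, using the direction reading suggested by the example (rule 1 along x, rule 2 along y, rule 5 along z).

DIRS = {1: (1, 0, 0), 2: (0, 1, 0), 3: (-1, 0, 0),
        4: (0, -1, 0), 5: (0, 0, 1), 6: (0, 0, -1)}

def trace(steps):
    # steps: a list of ('move', rule, count), ('mark',) and ('return',) items;
    # 'mark' remembers the current position (a branch point such as A) and
    # 'return' jumps back to it.
    pos, mark = (0, 0, 0), None
    cells = {pos}
    for item in steps:
        if item[0] == 'mark':
            mark = pos
        elif item[0] == 'return':
            pos = mark
        else:
            _, rule, count = item
            dx, dy, dz = DIRS[rule]
            for _ in range(count):
                pos = (pos[0] + dx, pos[1] + dy, pos[2] + dz)
                cells.add(pos)
    return cells

# Fig. 1(a) with m = 4, p = 3, n = 2: a trunk of m cells along y up to A,
# then one arm of p cells along x and one arm of n cells along z from A.
fig1a = [('move', 2, 4), ('mark',), ('move', 1, 3), ('return',), ('move', 5, 2)]
print(len(trace(fig1a)))                   # 1 + 4 + 3 + 2 = 10 occupied cells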
3. Finite Representation of Infinite Class of Objects
The formulas shown in Section 2 actually illustrate the concept of finite representation of objects. For instance, in Example 2.2, the last formula represents an infinite class of wire-like line drawing objects with six branches (arms) of various lengths denoted by the variables m, n, p, q, r and s along the −y, z, x, −z, y and −x axes respectively. When m = p = r = s = n = q, i.e. 2^m A(1^m 2^m 3^m 5^m 6^m), it represents a proper infinite subclass of six-branch wire-like objects whose six arms are all of equal length. This idea can be used for describing many interesting real 3-D objects. Here are some more examples.
Example 3.1. A track hurdle is shown in Fig. 3, together with its digitized 3-D object as a 3-D array. Its pattern representation is: 2^m A 1^s {A, B} 5^{n+q} (s ≥ 0, n + q ≥ z ≥ n) 1^p B 4^m.
Notice that this is a finite representation, which actually characterizes an infinite class of objects sharing some common patterns (structural shapes) with various side lengths, e.g. they can have the same base width denoted by the variable p, but different heights denoted by the variable n, and vice versa. Also notice that all locations A, B, C, D are relative coordinates from (0, 0, 0) and can be easily computed, e.g. A is reached after stretching m units along the y-axis (rule 2), therefore A is (0, m, 0), etc. Figure 4 shows three different values of n, representing three different types of track hurdles. Also notice that this finite representation is better and simpler than a grammar, which is very difficult to find and will involve too many nonterminal symbols and rewriting rules for patterns sensitive to the "lengths" even in 2-D cases [3,21,25].
(a) a track hurdle
(b) the digitized array object of (a) Fig. 3. A track hurdle and its digitized 3-D array object version.
(a) 200 m dash low hurdle
(b) 110 m dash high hurdle
(c) 400 m dash middle hurdle
Fig. 4. Three types of track hurdles in different height values from Fig. 3 .
Example 3.2. Several illustrations of n-tooth rakes are shown in Figs. 5 and 6. The original three-tooth rake (x, y, z) Cartesian coordinates are shown as follows (for m = 9, p = 4, n = 3):
Notice that in contrast to the Cartesian coordinates, its string pattern representation, 2^9 A(1^4 3^4){A, B, C} 5^3,
(b) the 3-D array object of (a)
(a) three-tooth rake
Fig. 5. A three-tooth rake and its 3-D array object.
(a) a four-tooth rake
(b) rotated 90° clockwise along the z-y plane
(c) rotated 180° clockwise of (a)
Fig. 6. A four-tooth rake and its rotations.
is a better way to describe the object, and is easier to understand, manipulate and be compared with other objects. Its structural similarities and differences are clearly reflected from the pattern representation when compared with other objects including its variations and rotations as shown in Figs. 6(a), (b) and (c), whose string representations are as follows:
4. Parallelism in 3-D UAG
This finite representation not only works for wire-like line-drawing objects, but also for other interesting non-wire-like objects. In fact, its property and advantage of parallelism will be even more obvious for describing these non-wire-like objects.
Example 4.1. Figure 7 shows a solid 5 * 5 * 5 cube, whose derivation process is illustrated as follows: starting from the singleton S, rule 1 is applied four times to grow a row of five S's along the x-axis; rule 2 is then applied four times, in parallel to every S, to grow the row into a 5 * 5 plane of S's; rule 5 is applied four times, again in parallel, to grow the plane into a 5 * 5 * 5 block of S's; finally rule 7 rewrites every S to the terminal *, yielding the solid cube.

Fig. 7. A 5 * 5 * 5 solid cube.
Fig. 8. A 3 * 4 * 5 solid brick.
Note that if no positions (locations) are specified during the derivation process, by default, all locations wherever applicable are applied in parallel (simultaneously). Therefore, the derivation sequence is: 1 1 1 1 2 2 2 2 5 5 5 5 7, or 1^4 2^4 5^4 7. In general the derivation sequence 1^{n-1} 2^{n-1} 5^{n-1} 7 is for an n x n x n solid cube, where n ≥ 1. Notice that a solid cube is a special case of the following solid brick objects.
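The linear-time behaviour of the parallel derivation can be checked directly. The sketch below (an illustration added here, with all names chosen for this example) applies each growth rule to every occupied cell simultaneously, so an m * n * p solid block is produced in (m − 1) + (n − 1) + (p − 1) + 1 derivation steps rather than in time proportional to m * n * p.

DIRS = {1: (1, 0, 0), 2: (0, 1, 0), 5: (0, 0, 1)}

def derive_block(m, n, p):
    cells = {(0, 0, 0)}                    # the singleton start symbol S
    steps = 0
    for rule, count in ((1, m - 1), (2, n - 1), (5, p - 1)):
        dx, dy, dz = DIRS[rule]
        for _ in range(count):             # one parallel derivation step
            cells |= {(x + dx, y + dy, z + dz) for (x, y, z) in cells}
            steps += 1
    return cells, steps + 1                # +1 for the terminal rule 7

cells, steps = derive_block(5, 5, 5)       # the 5 * 5 * 5 cube of Example 4.1
print(len(cells), steps)                   # 125 cells in 4 + 4 + 4 + 1 = 13 steps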
Example 4.2. S ⟹ (via the derivation sequence 1^2 2^3 5^4 7) a 3 * 4 * 5 brick (solid).
In general the m * n * p solid brick sequence is 1^{m-1} 2^{n-1} 5^{p-1} 7, where m, n, p ≥ 1. Notice that in dealing with three-dimensional patterns by the conventional sequential methods, it normally takes O(m * n * p) or O(n^3) cubic time, whereas here it only takes O(m + n + p) or O(n) linear time. Here we also point out a concept which is very important and fundamental to the formation of geometric objects as shown in Fig. 9.
The trajectory of a moving dot forms a line. The trajectory of a moving line segment forms a plane (area). The trajectory of a moving plane forms a volume.

Fig. 9. Geometric object formation.
This also serves as a simulation of what one may perceive, that a line, area, or volume is from the very basic geometric unit of a dot(.). Such a concept can be
extended to more complicated objects such as a pyramid and stair-like objects as illustrated in Examples 4.4 and 4.5. Example 4.3.
S ⟹* an n * p * m hollow (up) brick, by first growing an n * p base (1^{n-1} 2^{p-1}) and then applying rule 5 only at the set of cells C = {(0 ≤ x ≤ n − 1, y = 0, z > 0), (0 ≤ x ≤ n − 1, y = p − 1, z > 0), (x = 0, 0 ≤ y ≤ p − 1, z > 0), (x = n − 1, 0 ≤ y ≤ p − 1, z > 0)}; or 1^{n-1} 2^{p-1} C 5^{m-1} 7, where C is the set of cells defined above within the brackets {}. An n x p x m hollow (up) brick is shown in Fig. 10. It is interesting to compare the figures in Examples 4.1-4.6 with the object figures in [4,8,16,17,19], which are typical 3-D illustrations used for analyzing 3-D objects.
Fig. 10. An upward hollow brick realized from a real 3-D object image in different gray levels.
Example 4.4. The pyramid is an interesting object, which has been widely used by many computer vision and pattern recognition researchers as challenging test data for image description, understanding, representation, and recognition [8,9,12]. Here we show how a pyramid can be structurally represented by a 3-D UAG. Without loss of generality, a 5 * 5 * 3 pyramid is shown in Fig. 11 and its parsing sequence (representation) is shown as follows:
S ⟹* a 5 * 5 base layer of S's (via 1^4 2^4); rule 5 is then applied at the cells C1 = {1 ≤ x ≤ 3, 1 ≤ y ≤ 3, z = 0} to grow the 3 * 3 middle layer, and at the cell C2 = (2, 2, 1) to grow the apex, after which rule 7 terminates the derivation, yielding the 5 * 5 * 3 pyramid.
Fig. 11. A 5 * 5 * 3 solid pyramid (top and side views).
Example 4.5. Figure 12 shows a stair, which can also be considered as an approximation of Fig. 12(b). Its string representation is 1^n 2^m 5^k (x, y ≥ k, z) 5^k (x, y ≥ 2k, z) 5^k (x, y ≥ 3k, z) 5^k (x, y ≥ 4k, z) 5^k.
Example 4.6. Compare the two stairs in Fig. 13, which cannot be properly distinguished by the method in [8], but can be clearly distinguished by the 3-D UAG from their respective representations as follows:
(a): 1^n 2^m 5^{2k} (x, y ≥ k, z) 5^k (x, y ≥ 2k, z) 5^{2k}
(b): 1^n 2^m 5^{2k} (x, y ≥ k, z) 5^{2k} (x, y ≥ 2k, z) 5^{2k}
Fig. 12. A stair (a) and its approximation array (b).
Fig. 13. Two similar but different objects (stairs (a) and (b)) that cannot be distinguished by Marill's method in [8].
5. 26-Neighborhood UAG and from Pixels to Object Features
So far, the 3-D UAG uses the six-neighborhood, which can handle changes of n * 90° only. In this section, this restriction is lifted by expanding the six-neighborhood to a 26-neighborhood. In this case, a 3-D UAG using the 26-neighborhood is defined in Example 5.1, using two-point normal form [3]. Example 5.1.
G_u = (V_n, V_t, P, S, #), where V_n = {S}, V_t = {*} as defined in Section 2, and the rewriting rules in P are of the form (S, SS, w), where w ∈ {1, 2, ..., 6, a, b, ..., n, p, q, ..., u}, and each of these 26 symbols denotes a neighbor direction defined as a vector in Fig. 14 (note that we exclude the letter "o" (oh) to avoid confusion with the origin location (0,0,0)); or (S, *, -) for a terminal rule.
Fig. 14. 3-D space with its center cell and 26-neighborhood, each neighbor denoted by an alphanumeral ranging from 1 to 6 and a to u (excluding "o").
Notice that there are 27 rules including the terminal rule. Again, since every array sentence must terminate by the terminal rule, in the parsing sequence, the last digit indicating the terminal rule can be omitted, without loss of generality.
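As a small check of this count (added here; the actual pairing of letters to direction vectors is the one drawn in Fig. 14 and is only guessed at in this sketch), the code below enumerates the 26 neighbor offsets of a cell and pairs them with the labels 1-6 and a-u with "o" excluded, confirming 26 growth rules plus one terminal rule.

from itertools import product
import string

# All 26 offsets of the 26-neighborhood: every non-zero combination of -1/0/+1.
offsets = [v for v in product((-1, 0, 1), repeat=3) if v != (0, 0, 0)]

# The six face neighbors keep the digits 1-6 of the six-neighborhood UAG;
# the remaining 20 neighbors get the letters a..u with "o" excluded.
letters = [c for c in string.ascii_lowercase[:21] if c != 'o']
face = [v for v in offsets if sum(abs(c) for c in v) == 1]
rest = [v for v in offsets if sum(abs(c) for c in v) > 1]
labels = {str(i + 1): v for i, v in enumerate(face)}
labels.update(dict(zip(letters, rest)))

print(len(offsets), len(labels) + 1)       # 26 neighbors; 26 growth rules + 1 terminal = 27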
Example 5.2. A standing-up coat rack is shown in Fig. 15. Its string representation is 2^m A(1^n 2^n 3^n 5^{n+q+r}) B(e^p f^p) C(i^p j^p), where A = (0, m, 0), B = (0, m, n) and C = (0, m, n + q). A more complicated example combining both wire-like and solid volume objects is illustrated in the following example.
Example 5.3. Two types of overhead projectors are shown in Fig. 16, with their back, side and bird's-eye views. Their string pattern representations are: (a): I m 2 n 5 P A ' + q B m ~ s C ( 1 u 2 v 5 ) (yxL, i, 2 ) 5 , i = 1, . . . , w - 1 (b): 1"2" SPAT+'B i ~ s l S C ( 1 u 2 v 5 ) (ys ,2 i, 2)5, i = 1 , . . . , w - 1 where A , B and C are computed and shown in Fig. 16.
Fig. 15. A standing up coat rack and its digitized 3-D array object.
Again this example demonstrates two structurally similar but distinguishable objects, which are reflected from their respective string pattern representations. In fact, its representation also shows a segmentation of the object into four major portions (feature extraction) shown by four different shadings in Fig. 16, where the key difference between the two objects is indicated by the darkest black region (neck of the overhead projector).

6. Approximating Distorted, Noisy and Curved Objects
Distorted and noisy objects can be approximated by straight line segments according to probabilistic distribution and thresholding, very much similar to those methods for 2-D line drawings [5,22]. Objects whose line drawings have arbitrary angles θ, rather than a multiple of 45°, can be approximated by x and y = x tan θ line segments along a particular plane. Example 6.1. In Fig. 17, (i) is a noisy line segment along the x-axis in the x-y plane, whose representation is 1^m a1d1 1^n d1a1 1^p. If m, n, p >> 1, then this representation can be approximated as 1^{m+n+p}. Figure 17(ii), with an arbitrary angle θ, can be approximated by a sequence of line segments in the x-y plane by 1^x 2^{x tan θ}. Curved objects can be approximated by line segments along the quantized planes tangent to the quantized 45°'s, as shown in Example 6.2. Example 6.2. Figure 18 shows a type of glass and the different angles of its views. Its string pattern representation is
1^t 2^t 3^t 4^t a^t b^t c^t d^t 5^u B(e^v f^v i^v j^v m^v n^v p^v q^v) C 5^w,
Fig. 16. Two types of overhead projectors I and II, and their different views.
where C is a circle described by x^2 + y^2 = r^2 at z = u + v, i.e. C = (±x, ±√(r^2 − x^2), u + v), where x ≤ r.
Example 6.3. More examples of curved objects are illustrated in Fig. 19. Their respective pattern representations are as follows:
1^t 2^t 3^t 4^t a^t b^t c^t d^t 5^u B(1^{r/2} 2^{r/2} 3^{r/2} 4^{r/2} a^{r/2} b^{r/2} c^{r/2} d^{r/2} e^{r/2} f^{r/2} i^{r/2} j^{r/2} m^{r/2} n^{r/2} p^{r/2} q^{r/2}) D 5^f,
Fig. 17. Noisy line segments, with arbitrary angle θ, and their digitized arrays.
Fig. 18. A glass and its 3-D arrays.
Fig. 19. More illustrations of curved objects and their 3-D arrays.
where C and D are two circles indicated in the figure, i.e. C = (±x, ±√(r^2 − x^2), u), where x ≤ r/2, and D = (±x, ±√(r^2 − x^2), u + f), where x ≤ r;
(b) 1^t 2^t 3^t 4^t a^t b^t c^t d^t 5^u B(1^x 5^y 2^x 5^y 3^x 5^y 4^x 5^y a^x 5^y b^x 5^y c^x 5^y d^x 5^y)^w,
where y = x tan θ and w = r/y. Notice that the three different types of glasses shown in Figs. 18 and 19 have structural similarities and differences, which are reflected from their respective string pattern representations via the 3-D UAG shown in Examples 6.2 and 6.3.
7. Discussions and Future Research

We have introduced a formal model known as "3-D universal array grammar" (3-D UAG) for three-dimensional object representation. It is parallel, and simple
to manipulate by computers, including orientations (along the x-, y- and z-axes), shift, enlargement, elongations and reductions. This model is basically extended from the 2-D universal array grammar (2-D UAG) [23]. But the difference here is not just the dimensionality. The types of production rules are different and it is parallel. The 2-D UAG in [23] uses regular (type 3) rules while here the 3-D UAG G_u uses "context-free" (type 2) rules. Please note that here by "context-free" we mean to borrow the terminology from Chomsky [3,25]. Because of its dimensions and the use of blanks (#) in the context, it is still more or less sensitive to the # symbols. Nevertheless it is interesting to see that the 3-D UAG G_u is more powerful (in terms of generative capability) than the 2-D UAG, in that the following "multi-branch wire-like" patterns shown in Fig. 20(i) (symbolizing a digitized Chinese character meaning "center" or "central") can be generated by G_u but not by any 2-D UAG even in two-dimensional space [3].
Fig. 20. (i) Multi-branch wire-like pattern symbolizing a digitized Chinese character “center”, and (ii) diagonal pattern symbolizing a Chinese character “human” or “man”.
Further, neither G_u nor any 2-D UAG can generate the "diagonal" patterns shown in Fig. 20(ii) (symbolizing a digitized Chinese character meaning "human" or "man"). This is because of the limitation of the 6-neighborhood. But with the 26-neighborhood introduced in Section 5 of this article, it can. Therefore there is a certain hierarchy in the patterns depending not only on their production rules but also on neighborhood definitions. It would be interesting to investigate such a three-dimensional pattern hierarchy, whose 2-D array counterpart has been investigated in [21]. It is also interesting to explore multi-dimensional arrays in other spaces such as hexagonal space using 60 and 120 degrees (rather than 90 degrees) as illustrated in 2-D space [1]. The idea introduced in this paper can not only generate many interesting 3-D objects, but can also be used for 3-D object learning, understanding, and description. For example, according to the sequence of rules (universal array grammar), Fig. 8 can be described and understood as a 3-D object with 12 sides, forming six perpendicular rectangle surface areas, i.e. a brick. When all sides are of equal
length, it is a cube, i.e. a cube is a brick (with all six surfaces perpendicular to each other) with all sides of equal length. Indeed, when one learns, understands, describes, memorizes, and recognizes a cube, these are the key characteristics all reflected by our representation: 1^{n-1} 2^{n-1} 5^{n-1}.
From the theory in The Society of Mind [9], the idea of which was reiterated in [26], this can be considered as a small agent that can recognize all sizes of cubes. Another small agent is able to recognize bricks. These two small agents are very much alike in nature, and probably reside in one's brain (memory) very close to each other. Translating into pattern recognition terms, their string pattern representations occupy nearby or neighboring addresses in the dictionary. There are also many other small agents, each recognizing an infinite subclass of objects sharing some common properties characterized by its representation, and so on. Altogether, we have a society that can recognize (describe, understand, memorize, and interpret) any object that has been taught through training via a 3-D UAG. For future research, more can be done, including: (1) 3-D object clustering, alignment, dictionary construction, and matching, (2) from pixels to 3-D object feature extraction, segmentation, scene analysis, understanding, description, representation, and recognition, and (3) thinning (skeletonization) of 3-D digitized arrays. For example, there are some more interesting applications to the real world, e.g. a satellite launching environment such as the one shown in Fig. 21 that can be described by the 3-D UAG.
Fig. 21. Some illustrations of 3-D objects from a shuttle launching station.
It is the author’s hope that this ground work can also pave the road for further studies of the 3-D formal model for object pattern recognition and to stimulate research in 3-D object clustering analysis involving noisy and distorted patterns.
Acknowledgement
Part of this work was done when the author was visiting the LIPN Labs of University of Paris VII and XIII. The author is grateful to Profs. M. Nivat and A. Saoudi for providing an excellent environment for research and for the financial support.

References
[1] K. Aizawa and A. Nakamura, Grammars on the hexagonal array, in P. S. P. Wang (ed.), Array Grammars, Patterns and Recognizers (World Scientific, 1989) 191-200.
[2] L. Baird and P. S. P. Wang, 3-D object recognition using gradient descent and the universal 3-D array grammar, SPIE Vol. 1607 Intelligent Robots and Computer Vision (1992) 711-719.
[3] C. Cook and P. S. P. Wang, A Chomsky hierarchy of isotonic array grammars and languages, Comput. Graph. Image Process. 8 (1978) 144-152.
[4] S. Edelman, H. Bulthoff and D. Weinshall, Stimulus Familiarity Determines Recognition Strategy for Novel 3-D Objects, MIT AI Lab. Memo 1138, Jul. 1989.
[5] K. S. Fu, Syntactic Pattern Recognition and Applications (Prentice-Hall, Englewood Cliffs, NJ, 1982).
[6] W. I. Grosky and P. S. P. Wang, The relation between uniformly structured tessellation automata and parallel array grammars, in Proc. IEEE ISUSAL 75, Tokyo, Japan (1975) 97-102.
[7] K. Inoue, I. Sakuramoto, M. Sakamoto and I. Itsanami, 2-D automata operating in parallel, in Proc. Int. Colloquium on Parallel Image Processing, Paris (1991) 239-262.
[8] T. Marill, Emulating the human interpretation of line-drawings as 3-D objects, Int. J. Comput. Vision 6, 2 (1991) 147-161. A preliminary version of this paper also appeared as a technical report: Recognizing Three-Dimensional Objects Without the Use of Models, MIT AI Lab. Memo 1157, Sept. 1989.
[9] M. L. Minsky, The Society of Mind (Heinemann, London, 1986).
[10] M. L. Minsky, Steps toward artificial intelligence, in Proc. IRE 49 (1961) 8-30.
[11] M. Nivat, A. Saoudi and V. R. Dare, Parallel generation of finite images, in P. S. P. Wang (ed.), Array Grammars, Patterns and Recognizers (World Scientific, 1989) 1-16.
[12] A. Rosenfeld, Picture Languages: Formal Models for Picture Recognition (Academic Press, New York, 1979).
[13] A. Rosenfeld, Preface, in P. S. P. Wang (ed.), Array Grammars, Patterns and Recognizers (World Scientific, 1989).
[14] A. Rosenfeld, Coordinate grammars revisited: generalized isometric grammars, in P. S. P. Wang (ed.), Array Grammars, Patterns and Recognizers (World Scientific, 1989) 157-166.
[15] A. Saoudi, M. Nivat and P. S. P. Wang (eds.), Parallel Image Processing (World Scientific, 1992).
[16] R. N. Shepard and J. Metzler, Mental rotation of 3-D objects, Science 171 (1971) 701-703.
[17] R. N. Shepard and J. Metzler, Mental rotation: Effects of dimensionality of objects and type of task, J. Exp. Psychol.: Human Perception and Performance 14 (1988) 3-11.
[18] R. Siromoney, Array languages and Lindenmayer systems - A survey, in G. Rozenberg and A. Salomaa (eds.), The Book of L (Springer Verlag, 1986).
[19] S. Ullman, An Approach to Object Recognition: Aligning Pictorial Descriptions, MIT AI Lab. Memo 931, Dec. 1986.
[20] P. S. P. Wang, Finite-turn repetitive checking automata and sequential/parallel matrix languages, IEEE Trans. Comput. 30 (1981) 366-370.
[21] P. S. P. Wang, Hierarchical structures and complexities of isometric patterns, IEEE Trans. Pattern Anal. Mach. Intell. 5, 1 (1983) 92-99.
[22] P. S. P. Wang, An application of array grammars to clustering analysis for syntactic patterns, Pattern Recogn. 17, 4 (1984) 441-451.
[23] P. S. P. Wang, On-line Chinese character recognition by array grammars, in Proc. 6th IGC Int. Conference on Electronic Image '88 (1988) 209-214.
[24] P. S. P. Wang (ed.), Array Grammars, Patterns and Recognizers (World Scientific, 1989).
[25] Y. Yamamoto, K. Morita and K. Sugata, Context-sensitivity of 2-D regular array grammars, in P. S. P. Wang (ed.), Array Grammars, Patterns and Recognizers (World Scientific, 1989) 17-41.
[26] P. Winston with S. Shellard (eds.), Artificial Intelligence at MIT - Expanding Frontiers (MIT Press, 1990).
PART 2 BASIC METHODS IN COMPUTER VISION
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 207-248 Eds. C. H. Chen, L. F. Pau and P. S. P. Wang @ 1998 World Scientific Publishing Company
CHAPTER 2.1
TEXTURE ANALYSIS
MIHRAN TUCERYAN Department of Computer and Information Science, Indiana University - Purdue University at Indianapolis, 723 W. Michigan St., Indianapolis, IN 46202-5132 and ANIL K. JAIN Computer Science Department, Michigan State University, East Lansing, MI 48824-1027, USA Internet:
[email protected] This chapter reviews and discusses various aspects of texture analysis. The concentration is on the various methods of extracting textural features from images. The geometric, random field, fractal, and signal processing models of texture are presented. The major classes of texture processing problems such as segmentation, classification, and shape from texture are discussed. The possible application areas of texture such as automated inspection, document processing, and remote sensing are summarized. A bibliography is provided at the end for further reading. Keywords: Texture, segmentation, classification, shape, signal processing, fractals, random fields, Gabor filters, wavelet transform, gray level dependency matrix.
1. Introduction

In many machine vision and image processing algorithms, simplifying assumptions are made about the uniformity of intensities in local image regions. However, images of real objects often do not exhibit regions of uniform intensities. For example, the image of a wooden surface is not uniform but contains variations of intensities which form certain repeated patterns called visual texture. The patterns can be the result of physical surface properties such as roughness or oriented strands which often have a tactile quality, or they could be the result of reflectance differences such as the color on a surface. We recognize texture when we see it but it is very difficult to define. This difficulty is demonstrated by the number of different texture definitions attempted by vision researchers. Coggins [1] has compiled a catalogue of texture definitions in the computer vision literature and we give some examples here. "We may regard texture as what constitutes a macroscopic region. Its structure is simply attributed to the repetitive patterns in which elements or primitives are arranged according to a placement rule." [2]
“A region in an image has a constant texture if a set of local statistics or other local properties of the picture function are constant, slowly varying, or approximately periodic.” [3] “The image texture we consider is nonfigurative and cellular.. . An image texture is described by the number and types of its (tonal) primitives and the spatial organization or layout of its (tonal) primitives.. . A fundamental characteristic of texture: it cannot be analyzed without a frame of reference of tonal primitive being stated or implied. For any smooth gray-tone surface, there exists a scale such that when the surface is examined, it has no texture. Then as resolution increases, it takes on a fine texture and then a coarse texture.” [4] “Texture is defined for our purposes as an attribute of a field having no components that appear enumerable. The phase relations between the components are thus not apparent. Nor should the field contain an obvious gradient. The intent of this definition is to direct attention of the observer to the global properties of the display -i.e. its overall “coarseness,” “bumpiness,” or “fineness.” Physically, nonenumerable (aperiodic) patterns are generated by stochastic as opposed to deterministic processes. Perceptually, however, the set of all patterns without obvious enumerable components will include many deterministic (and even periodic) textures.” [5] “Texture is an apparently paradoxical notion. On the one hand, it is commonly used in the early processing of visual information, especially for practical classification purposes. On the other hand, no one has succeeded in producing a commonly accepted definition of texture. The resolution of this paradox, we feel, will depend on a richer, more developed model for early visual information processing, a central aspect of which will be representational systems at many different levels of abstraction. These levels will most probably include actual intensities at the bottom and will progress through edge and orientation descriptors to surface, and perhaps volumetric descriptors. Given these multilevel structures, it seems clear that they should be included in the definition of, and in the computation of, texture descriptors.” [B] “The notion of texture appears to depend upon three ingredients: (i) some local ‘order’ is repeated over a region which is large in comparison to the order’s size, (ii) the order consists in the nonrandom arrangement of elementary parts, and (iii) the parts are roughly uniform entities having approximately the same dimensions everywhere within the textured region.” [7]
This collection of definitions demonstrates that the “definition” of texture is formulated by different people depending upon the particular application and that there is no generally agreed upon definition. Some are perceptually motivated, and others are driven completely by the application in which the definition will be used. Image texture, defined as a function of the spatial variation in pixel intensities (gray values), is useful in a variety of applications and has been a subject of intense study by many researchers. One immediate application of image texture is the recognition of image regions using texture properties. For example, in Fig. l(a), we
Fig. 1. (a) An image consisting of five different textured regions: cotton canvas (D77), straw matting (D55), raffia (D84), herringbone weave (D17), and pressed calf leather (D24) [8]. (b) The goal of texture classification is to label each textured region with the proper category label: the identities of the five texture regions present in (a). ( c ) The goal of texture segmentation is to separate the regions in the image which have different textures and identify the boundaries between them. The texture categories themselves need not be recognized. In this example, the five texture categories in (a) are identified as separate textures by the use of generic category labels (represented by the different fill patterns).
can identify the five different textures and their identities as cotton canvas, straw matting, raffia, herringbone weave, and pressed calf leather. Texture is the most important visual cue in identifying these types of homogeneous regions. This is called texture classification. The goal of texture classification then is to produce a classification map of the input image where each uniform textured region is identified with the texture class it belongs to, as shown in Fig. 1(b). We could also find the texture boundaries even if we could not classify these textured surfaces. This is then the second type of problem that texture analysis research attempts to solve: texture segmentation. The goal of texture segmentation is to obtain the boundary map shown in Fig. 1(c). Texture synthesis is often used for image compression applications. It is also important in computer graphics where the goal is to render object surfaces which are as realistic looking as possible. Figure 2 shows a set of synthetically generated texture images using Markov random field and fractal models [9]. The shape from texture problem is one instance of a general class of vision problems known as "shape from X". This was first formally pointed out in the perception literature by Gibson [10]. The goal is to extract three-dimensional shape information from various cues such as shading, stereo, and texture. The texture features (texture elements) are distorted due to the imaging process and the perspective projection, which provides information about surface orientation and shape. An example of shape from texture is given in Fig. 3.

2. Motivation

Texture analysis is an important and useful area of study in machine vision. Most natural surfaces exhibit texture and a successful vision system must be able to
Fig. 2. A set of example textures generated synthetically using only a small number of parameters. (a) Textures generated by discrete Markov random field models. (b) Four textures (in each of the four quadrants) generated by Gaussian Markov random field models. (c) Texture generated by fractal model.
Fig. 3. We can extract the orientation of the surface from the variations of texture (defined by the bricks) in this image.
deal with the textured world surrounding it. This section will review the importance of texture perception from two viewpoints: from the viewpoint of human vision or psychophysics, and from the viewpoint of practical machine vision applications.
2.1. Psychophysics

The detection of a tiger among the foliage is a perceptual task that carries life and death consequences for someone trying to survive in the forest. The success of the tiger in camouflaging itself is a failure of the visual system observing it. The failure is in not being able to separate figure from ground. Figure-ground separation is an issue which is of intense interest to psychophysicists. The figure-ground separation can be based on various cues such as brightness, form, color, texture, etc. In the example of the tiger in the forest, texture plays a major role. The camouflage is successful because the visual system of the observer is unable to discriminate (or segment) the two textures of the foliage and the tiger skin. What are the visual processes that allow one to separate figure from ground using the texture cue? This question is the basic motivation among psychologists for studying texture perception. Another reason why it is important to study the psychophysics of texture perception is that the performance of various texture algorithms is evaluated against the performance of the human visual system doing the same task. For example, consider the texture pair in Fig. 4(a), first described by Julesz [11]. The image consists of two regions each of which is made up of different texture tokens. Close scrutiny of the texture image will indicate this fact to the human observer. The immediate perception of the image, however, does not result in the perception of two different textured regions; instead only one uniformly textured region is perceived. Julesz says that such a texture pair is not "effortlessly discriminable" or "preattentively discriminable." Such synthetic textures help us form hypotheses about what image properties are important in human texture perception. In addition, this example raises the question of how to evaluate the performance of computer algorithms that analyze textured images. For example, suppose we have an algorithm that can discriminate the texture pair in Fig. 4(a). Is this algorithm "correct?" The answer, of course, depends on the goal of the algorithm. If it is a very special purpose algorithm that should detect such scrutably different regions, then it is performing correctly. On the other hand, if it is to be a computational model of how the human visual system processes texture, then it is performing incorrectly. Julesz has studied texture perception extensively in the context of texture discrimination [11,12,13]. The question he posed was "When is a texture pair discriminable, given that they had the same brightness, contrast, and color?" Julesz concentrated on the spatial statistics of the image gray levels that are inherent in the definition of texture by keeping other illumination-related properties the same. To discuss Julesz's pioneering work, we need to define the concepts of first- and second-order spatial statistics. (i) First-order statistics measure the likelihood of observing a gray value at a randomly-chosen location in the image. First-order statistics can be computed from the histogram of pixel intensities in the image. These depend only on individual pixel values and not on the interaction or co-occurrence of
neighboring pixel values. The average intensity in an image is an example of a first-order statistic. (ii) Second-order statistics are defined as the likelihood of observing a pair of gray values occurring at the endpoints of a dipole (or needle) of random length placed in the image at a random location and orientation. These are properties of pairs of pixel values.
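A small numerical illustration of these two notions (added here, not from the chapter; the image, the number of dipoles and the gray-level range are arbitrary choices) is given below: the first-order statistics are taken from the gray-level histogram, and the second-order statistics are estimated by dropping random dipoles into the image and recording the gray-level pair at their endpoints.

import numpy as np

def first_order_stats(image, levels=256):
    # First-order statistics: the gray-level histogram and its mean.
    hist = np.bincount(image.ravel(), minlength=levels) / image.size
    mean = float(np.sum(np.arange(levels) * hist))
    return hist, mean

def second_order_dipole(image, n_dipoles=20000, max_len=10, levels=256, seed=0):
    # Second-order statistics estimated with random dipoles (needles) of
    # random length and orientation; the result is a co-occurrence estimate.
    rng = np.random.default_rng(seed)
    h, w = image.shape
    cooc = np.zeros((levels, levels))
    for _ in range(n_dipoles):
        y, x = int(rng.integers(h)), int(rng.integers(w))
        length = int(rng.integers(1, max_len + 1))
        angle = rng.uniform(0.0, 2.0 * np.pi)
        y2 = int(round(y + length * np.sin(angle)))
        x2 = int(round(x + length * np.cos(angle)))
        if 0 <= y2 < h and 0 <= x2 < w:
            cooc[image[y, x], image[y2, x2]] += 1
    return cooc / max(cooc.sum(), 1.0)

img = (np.random.rand(128, 128) * 255).astype(np.uint8)
hist, mean = first_order_stats(img)
cooc = second_order_dipole(img)
print(round(mean, 1), cooc.shape)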
Fig. 4. Texture pairs with identical second-order statistics. The bottom halves of the images consist of texture tokens that are different from the ones in the top half. (a) Humans cannot perceive the two regions without careful scrutiny. (b) The two different regions are immediately discriminable by humans.
Julesz conjectured that two textures are not preattentively discriminable if their second-order statistics are identical. This is demonstrated by the example in Fig. 4(a). This image consists of a pair of textured regions whose second-order statistics are identical. The two textured regions are not preattentively discriminable. His later counter-examples t o this conjecture were the result of a careful construction of texture pairs that have identical second-order statistics (see Fig. 4(b)) [14,15,16]. Julesz proposed the “theory of textons” to explain the preattentive discrimination of texture pairs. Textons are visual event9 (such as collinearity, terminations, closure, etc.) whose presence is detected and used in texture discrimination. Terminations are endpoints of line segments or corners. Using his theory of textons, Julesz explained the examples in Fig. 4 as follows. Recall that both texture images in Fig. 4 have two regions that have identical second-order statistics. In Fig. 4(a), the number of terminations in both the upper and lower regions is the same (i.e. the texton information in the two regions is not different), therefore the visual system is unable to preattentively discriminate the two textures. In Fig. 4(b), on the other hand, the number of terminations in the upper half is three, whereas the number of terminations in the lower half is four. The difference in this texton makes the two textured regions discriminable. Caelli has also proposed the existence of perceptual analyzers by the visual system for detecting textons [17]. Beck et al. [18] have conducted experiments and argue that the perception of texture segmentation in certain types of patterns is primarily a function of spatial frequency analysis and not the result of higher level symbolic grouping processes.
Studies in psychophysiology have suggested that the brain performs a multi-channel, frequency and orientation analysis of the visual image formed on the retina. Campbell and Robson [19] performed psychophysical experiments using various grating patterns. They suggested that the visual system decomposes the image into filtered images of various frequencies and orientations. De Valois et al. [20] have studied the brain of the macaque monkey, which is assumed to be close to the human brain in its visual processing. They recorded the response of the simple cells in the visual cortex of the monkey to sinusoidal gratings of various frequencies and orientations and concluded that these cells are tuned to narrow ranges of frequency and orientation. These studies have motivated vision researchers to apply multi-channel filtering approaches to texture analysis.

2.2. Applications
Texture analysis methods have been utilized in a variety of application domains. In some of the mature domains (such as remote sensing) texture has already played a major role, while in other disciplines (such as surface inspection) new applications of texture are being found. We will briefly review the role of texture in automated inspection, medical image processing, document processing, and remote sensing. Images from two application domains are shown in Fig. 5. The role that texture plays in these examples varies depending upon the application. For example, in the SAR images of Figs. 5(b) and (c), texture is defined to be the local scene heterogeneity and this property is used for classification of land use categories such as water, agricultural areas, etc. In the ultrasound image of the heart in Fig. 5(a), texture is defined as the amount of randomness, which has a lower value in the vicinity of the border between the heart cavity and the inner wall than in the blood filled cavity. This fact can be used to perform segmentation and boundary detection using texture analysis methods.

2.2.1. Inspection

There have been a limited number of applications of texture processing to automated inspection problems. These applications include defect detection in images of textiles and automated inspection of carpet wear and automobile paints. In the detection of defects in texture images, most applications have been in the domain of textile inspection. Dewaele et al. [21] used signal processing methods to detect point defects and line defects in texture images. They use sparse convolution masks in which the bank of filters is adaptively selected depending upon the image to be analyzed. Texture features are computed from the filtered images. A Mahalanobis distance classifier is used to classify the defective areas. Chetverikov [22] applied a simple window differencing operator to the texture features obtained from simple filtering operations. This allows one to detect the boundaries of defects in the texture. Chen and Jain [23] used a structural approach to defect detection in textured images. They extract a skeletal structure from images, and by detecting
Fig. 5. Examples of images from various application domains in which texture analysis is important. (a) The ultrasound image of a heart. (b), (c) are example aerial images using SAR sensors.
anomalies in certain statistical features of these skeletons, defects in the texture are identified. Conners et al. [24] utilized texture analysis methods to detect defects in lumber wood automatically. The defect detection is performed by dividing the image into subwindows and classifying each subwindow into one of the defect categories such as knot, decay, mineral streak, etc. The features they use to perform this classification are based on tonal features such as mean, variance, skewness, and kurtosis of gray levels, along with texture features computed from gray level co-occurrence matrices in analyzing pictures of wood. The combination of tonal features along with textural features improves the correct classification rates over using either type of feature alone. In the area of quality control of textured images, Siew et al. [25] proposed a method for the assessment of carpet wear. They used simple texture features that are computed from second-order gray level dependency statistics and from first-order gray level difference statistics. They showed that the numerical texture features obtained from these techniques can characterize carpet wear successfully. Jain et al. [26] used texture features computed from a bank of Gabor filters to automatically classify the quality of painted metallic surfaces. A pair of automotive paint finish images is shown in Fig. 6, where the image in (a) has a uniform coating of paint, but the image in (b) has a "mottled" or "blotchy" appearance.
Fig. 6. Example images used in paint inspection. (a) A non-defective paint which has a smooth texture. (b) A defective paint which has a mottled look.
2.2.2. Medical Image Analysis

Image analysis techniques have played an important role in several medical applications. In general, the applications involve the automatic extraction of features from the image which are then used for a variety of classification tasks, such as distinguishing normal tissue from abnormal tissue. Depending upon the particular classification task, the extracted features capture morphological properties, color properties, or certain textural properties of the image. The textural properties computed are closely tied to the particular application domain. For example, Sutton and Hall [27] discuss the classification of pulmonary disease using texture features. Some diseases, such as interstitial fibrosis, affect the lungs in such a manner that the resulting changes in the X-ray images are texture changes as opposed to clearly delineated lesions. In such applications, texture analysis methods are ideally suited for these images. Sutton and Hall propose the use of three types of texture features to distinguish normal lungs from diseased lungs. These features are computed based on an isotropic contrast measure, a directional contrast measure, and a Fourier domain energy sampling. In their classification experiments, the best classification results were obtained using the directional contrast measure. Harms et al. [28] used image texture in combination with color features to diagnose leukemic malignancy in samples of stained blood cells. They extracted texture micro-edges and "textons" between these micro-edges. The textons were regions with almost uniform color. They extracted a number of texture features from the textons, including the total number of pixels in the textons which have a specific color, the mean texton radius and texton size for each color, and various texton shape features. In combination with color, the texture features significantly improved the correct classification rate of blood cell types compared to using only color features. Landeweerd and Gelsema [29] extracted various first-order statistics (such as the mean gray level in a region) as well as second-order statistics (such as gray level co-occurrence statistics).
bar code block. A similar method is used for locating text blocks in newspapers. A segmentation of the document image is obtained using three classes of textures: one class for the text regions, a second class for the uniform regions that form the background or images where intensities vary slowly, and a third class for the transition areas between the two types of regions (see Fig. 9). The text regions are characterized by their high frequency content.
Fig. 9. Text/graphics separation using texture information. (a) An image of a newspaper captured by a flatbed scanner. (b) The three-class segmentation obtained by the Gabor filter based texture segmentation algorithm. (c) The regions identified as text.
2.2.4. Remote Sensing

Texture analysis has been extensively used to classify remotely sensed images. Land use classification, where homogeneous regions with different types of terrains (such as wheat, bodies of water, urban regions, etc.) need to be identified, is an important application. Haralick et al. [41] used gray level co-occurrence features to analyze remotely sensed images. They computed gray level co-occurrence matrices for a distance of one with four directions (0°, 45°, 90°, and 135°). For a seven-class classification problem, they obtained approximately 80% classification accuracy using texture features. Rignot and Kwok [42] have analyzed SAR images using texture features computed from gray level co-occurrence matrices. However, they supplement these features with knowledge about the properties of SAR images. For example, image restoration algorithms were used to eliminate the specular noise present in SAR images in order to improve classification results. The use of various texture features for analyzing SAR images was studied by Schistad and Jain [43]. The SAR images shown in Figs. 5(b) and (c) were used to identify land use categories of water, agricultural areas, urban areas, and other areas. Fractal dimension, autoregressive Markov random field model, and gray level co-occurrence texture features were used in the classification. The classification errors ranged from 25% for the fractal based models to as low as 6% for the MRF features. Du [44] used texture features derived
from Gabor filters to segment SAR images. He successfully segmented the SAR images into categories of water, new forming ice, older ice, and multi-year ice. Lee and Philpot [45] also used spectral texture features to segment SAR images.
3. A Taxonomy of Texture Models

Identifying the perceived qualities of texture in an image is an important first step towards building mathematical models for texture. The intensity variations in an image which characterize texture are generally due to some underlying physical variation in the scene (such as pebbles on a beach or waves in water). Modelling this physical variation is very difficult, so texture is usually characterized by the two-dimensional variations in the intensities present in the image. This explains the fact that no precise, general definition of texture exists in the computer vision literature. In spite of this, there are a number of intuitive properties of texture which are generally assumed to be true.
• Texture is a property of areas; the texture of a point is undefined. So, texture is a contextual property and its definition must involve gray values in a spatial neighborhood. The size of this neighborhood depends upon the texture type, or the size of the primitives defining the texture.
• Texture involves the spatial distribution of gray levels. Thus, two-dimensional histograms or co-occurrence matrices are reasonable texture analysis tools.
• Texture in an image can be perceived at different scales or levels of resolution [10]. For example, consider the texture represented in a brick wall. At a coarse resolution, the texture is perceived as formed by the individual bricks in the wall; the interior details of the bricks are lost. At a higher resolution, when only a few bricks are in the field of view, the perceived texture shows the details within the brick.
• A region is perceived to have texture when the number of primitive objects in the region is large. If only a few primitive objects are present, then a group of countable objects is perceived instead of a textured image. In other words, a texture is perceived when significant individual "forms" are not present.
Image texture has a number of perceived qualities which play an important role in describing texture. Laws [47] identified the following properties as playing an important role in describing texture: uniformity, density, coarseness, roughness, regularity, linearity, directionality, direction, frequency, and phase. Some of these perceived qualities are not independent. For example, frequency is not independent of density, and the property of direction only applies to directional textures. The fact that the perception of texture has so many different dimensions is an important reason why there is no single method of texture representation which is adequate for a variety of textures.

3.1. Statistical Methods

One of the defining qualities of texture is the spatial distribution of gray values. The use of statistical features is therefore one of the early methods proposed in the
machine vision literature. In the following, we will use {I(x, y), 0 ≤ x ≤ N − 1, 0 ≤ y ≤ N − 1} to denote an N × N image with G gray levels. A large number of texture features have been proposed. But these features are not independent, as pointed out by Tomita and Tsuji [46]. The relationship between the various statistical texture measures and the input image is summarized in Fig. 10 [46]. Picard [48] has also related the gray level co-occurrence matrices to the Markov random field models.

3.1.1. Co-occurrence Matrices
Spatial gray level co-occurrence estimates image properties related to second-order statistics. Haralick [10] suggested the use of gray level co-occurrence matrices (GLCM), which have become one of the most well-known and widely used texture features. The G × G gray level co-occurrence matrix P_d for a displacement vector d = (dx, dy) is defined as follows. The entry (i, j) of P_d is the number of occurrences of the pair of gray levels i and j which are a distance d apart. Formally, it is given as

P_d(i, j) = | { ((r, s), (t, v)) : I(r, s) = i, I(t, v) = j } |    (3.1)

where (r, s), (t, v) ∈ N × N, (t, v) = (r + dx, s + dy), and | · | is the cardinality of a set.
As an example, consider the following 4 × 4 image containing three different gray values:

1 1 0 0
1 1 0 0
0 0 2 2
0 0 2 2

The 3 × 3 gray level co-occurrence matrix of this image for the displacement vector d = (1, 0) is

        | 4 0 2 |
P_d =   | 2 2 0 |
        | 0 0 2 |

Here the entry (0, 0) of P_d is 4 because there are four pixel pairs with gray levels (0, 0) that are offset by the amount (1, 0). P_d for other displacement vectors can be computed in the same manner.
Notice that the co-occurrence matrix so defined is not symmetric. But a symmetric co-occurrence matrix can be computed by the formula P = P_d + P_{−d}. The co-occurrence matrix reveals certain properties about the spatial distribution of the
gray levels in the texture image. For example, if most of the entries in the co-occurrence matrix are concentrated along the diagonal, then the texture is coarse with respect to the displacement vector d. Haralick has proposed a number of useful texture features that can be computed from the co-occurrence matrix. Table 1 lists some of these features. Here μ_x and μ_y are the means and σ_x and σ_y are the standard deviations of P_d(x) and P_d(y), respectively, where P_d(x) = Σ_j P_d(x, j) and P_d(y) = Σ_i P_d(i, y).
Table 1. Some texture features extracted from gray level co-occurrence matrices.

Texture Feature     Formula
Energy              Σ_i Σ_j P_d(i, j)²
Entropy             −Σ_i Σ_j P_d(i, j) log P_d(i, j)
Contrast            Σ_i Σ_j (i − j)² P_d(i, j)
Homogeneity         Σ_i Σ_j P_d(i, j) / (1 + |i − j|)
Correlation         Σ_i Σ_j (i − μ_x)(j − μ_y) P_d(i, j) / (σ_x σ_y)
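To make the construction of P_d and the features of Table 1 concrete, here is a minimal Python/NumPy sketch that reproduces the 4 × 4 example above; the function name and the treatment of d as a (row, column) offset are our own illustrative choices rather than part of the original formulation.

```python
import numpy as np

def glcm(image, d=(1, 0), levels=3):
    """Count co-occurrences of gray level pairs separated by displacement d."""
    P = np.zeros((levels, levels), dtype=int)
    rows, cols = image.shape
    dr, dc = d
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                P[image[r, c], image[r2, c2]] += 1
    return P

image = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [0, 0, 2, 2],
                  [0, 0, 2, 2]])

P = glcm(image, d=(1, 0))          # reproduces the matrix given in the text
p = P / P.sum()                    # normalize to joint probabilities
entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
i, j = np.indices(p.shape)
homogeneity = np.sum(p / (1.0 + np.abs(i - j)))
print(P, entropy, homogeneity)
```

Normalizing P_d to a joint probability distribution before evaluating the entropy and homogeneity expressions is the usual practice when these features are compared across images of different sizes.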
The co-occurrence matrix features suffer from a number of difficulties. There is no well established method of selecting the displacement vector d, and computing co-occurrence matrices for many different values of d is not feasible. For a given d, a large number of features can be computed from the co-occurrence matrix. This means that some sort of feature selection method must be used to select the most relevant features. The co-occurrence matrix-based texture features have also been primarily used in texture classification tasks and not in segmentation tasks.

3.1.2. Autocorrelation Features
An important property of many textures is the repetitive nature of the placement of texture elements in the image. The autocorrelation function of an image can be used to assess the amount of regularity as well as the fineness/coarseness of the texture present in the image. Formally, the autocorrelation function of an image I(x, y) is defined as

ρ(dx, dy) = [ Σ_x Σ_y I(x, y) I(x + dx, y + dy) ] / [ Σ_x Σ_y I²(x, y) ] .
The image boundaries must be handled with special care but we omit the details here. This function is related to the size of the texture primitive (i.e. the fineness of the texture). If the texture is coarse, then the autocorrelation function will drop off slowly; otherwise, it will drop off very rapidly. For regular textures, the autocorrelation function will exhibit peaks and valleys. The autocorrelation function is also related to the power spectrum of the Fourier transform (see Fig. 10). Consider the image function in the spatial domain I(x, y) and its Fourier transform F(u, v). The quantity |F(u, v)|² is defined as the power spectrum, where | · | is the modulus of a complex number. The example in Fig. 11 illustrates the effect of the directionality of a texture on the distribution of energy in the power spectrum. Early approaches using such spectral features would divide the frequency domain into rings (for frequency content) and wedges (for orientation content), as shown in Fig. 12. The frequency domain is thus divided into regions and the total energy in each of these regions is computed as a texture feature.
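As a concrete illustration of the relation between the autocorrelation function and the power spectrum noted above, the following sketch (NumPy; the periodic boundary handling implied by the FFT is an assumption made purely for brevity) computes the normalized autocorrelation of an image through its power spectrum.

```python
import numpy as np

def autocorrelation(image):
    """Normalized autocorrelation via the power spectrum (Wiener-Khinchin relation)."""
    I = image - image.mean()
    F = np.fft.fft2(I)                    # Fourier transform
    power = np.abs(F) ** 2                # power spectrum |F(u,v)|^2
    rho = np.real(np.fft.ifft2(power))    # inverse FFT gives the autocorrelation
    return rho / rho[0, 0]                # normalize so that rho(0,0) = 1

# A coarse texture (large blobs) decays more slowly than a fine one.
coarse = np.kron(np.random.rand(8, 8) > 0.5, np.ones((8, 8)))
fine = (np.random.rand(64, 64) > 0.5).astype(float)
print(autocorrelation(coarse)[0, 1], autocorrelation(fine)[0, 1])
```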
Fig. 10. The interrelation between the various second-order statistics and the input image [46]. © Reprinted by permission of Kluwer Academic Publishers.
Fig. 11. Texture features from the power spectrum. (a) A texture image, and (b) its power spectrum. The directional nature of this texture is reflected in the directional distribution of energy in the power spectrum.
Fig. 12. Texture features computed from the power spectrum of the image, with each frequency-domain point expressed in polar coordinates r = √(u² + v²) and θ = atan(v/u). (a) The energy computed in each shaded band is a texture feature indicating coarseness/fineness, and (b) the energy computed in each wedge is a texture feature indicating directionality.
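A minimal sketch of the ring and wedge energy features of Fig. 12 is given below (NumPy; the number of rings and wedges and the use of the discrete FFT power spectrum are illustrative assumptions).

```python
import numpy as np

def ring_wedge_features(image, n_rings=4, n_wedges=4):
    """Total power-spectrum energy inside frequency rings and orientation wedges."""
    F = np.fft.fftshift(np.fft.fft2(image - image.mean()))
    power = np.abs(F) ** 2
    rows, cols = image.shape
    v, u = np.indices((rows, cols))
    u = u - cols // 2                          # horizontal frequency coordinate
    v = v - rows // 2                          # vertical frequency coordinate
    r = np.sqrt(u ** 2 + v ** 2)               # radial frequency
    theta = np.mod(np.arctan2(v, u), np.pi)    # orientation folded into [0, pi)
    r_max = r.max()
    rings = [power[(r >= k * r_max / n_rings) & (r < (k + 1) * r_max / n_rings)].sum()
             for k in range(n_rings)]
    wedges = [power[(theta >= k * np.pi / n_wedges) & (theta < (k + 1) * np.pi / n_wedges)].sum()
              for k in range(n_wedges)]
    return rings, wedges
```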
3.2. Geometrical Methods
The class of texture analysis methods that falls under the heading of geometrical methods is characterized by the definition of texture as being composed of "texture elements" or primitives. The method of analysis usually depends upon the geometric properties of these texture elements. Once the texture elements are identified in the image, there are two major approaches to analyzing the texture. One computes statistical properties from the extracted texture elements and utilizes these as texture features. The other tries to extract the placement rule that describes the texture. The latter approach may involve geometric or syntactic methods of analyzing texture.

3.2.1. Voronoi Tessellation Features

Tuceryan and Jain [49] proposed the extraction of texture tokens by using the properties of the Voronoi tessellation of the given image. Voronoi tessellation has been proposed because of its desirable properties in defining local spatial neighborhoods and because the local spatial distributions of tokens are reflected in the shapes of the Voronoi polygons. First, texture tokens are extracted and then the tessellation is constructed. Tokens can be as simple as points of high gradient in the image or complex structures such as line segments or closed boundaries.
In computer vision, the Voronoi tessellation was first proposed by Ahuja as a model for defining "neighborhoods" [50]. Suppose that we are given a set S of three or more tokens (for simplicity, we will assume that a token is a point) in the Euclidean plane. Assume that these points are not all collinear, and that no four points are cocircular. Consider an arbitrary pair of points P and Q. The bisector of the line joining P and Q is the locus of points equidistant from both P and Q and divides the plane into two halves. The half plane H_Q^P (H_P^Q) is the locus of points closer to P (Q) than to Q (P). For any given point P, a set of such half planes is obtained for various choices of Q. Their intersection, ⋂_{Q∈S, Q≠P} H_Q^P, defines a polygonal region consisting of points closer to P than to any other point. Such a region is called the Voronoi polygon [51] associated with the point. The set of complete polygons is called the Voronoi diagram of S [52]. The Voronoi diagram together with the incomplete polygons in the convex hull defines a Voronoi tessellation of the entire plane. Two points are said to be Voronoi neighbors if the Voronoi polygons enclosing them share a common edge. The dual representation of the Voronoi tessellation is the Delaunay graph, which is obtained by connecting all the pairs of points which are Voronoi neighbors as defined above. An optimal algorithm to compute the Voronoi tessellation for a point pattern is described by Preparata and Shamos [53]. A simple 2-D dot pattern and its Voronoi tessellation are shown in Fig. 13.
Fig. 13. Voronoi tessellation: (a) An example dot pattern, and (b) its Voronoi tessellation.
The neighborhood of a token P is defined by the Voronoi polygon containing P. Many of the perceptually significant characteristics of a token's environment are manifest in the geometric properties of the Voronoi neighborhoods (see Fig. 13). The geometric properties of the Voronoi polygons are used as texture features. In order to apply geometrical methods to gray level images, we need to first extract tokens from the images. We use the following simple algorithm to extract tokens from input gray level textural images.
1. Apply a Laplacian-of-Gaussian (LoG or ∇²G) filter to the image. For computational efficiency, the ∇²G filter can be approximated with a difference
Fig. 14. Texture segmentation using the Voronoi tessellation. (a) An example texture pair from Brodatz’s album [92], (b) the peaks detected in the filtered image, and (c) the segmentation using the texture features obtained from Voronoi polygons [49]. The arrows indicate the border direction. The interior is on the right when looking in the direction of the arrow.
of Gaussians (DoG) filter. The size of the DoG filter is determined by the sizes of the two Gaussian filters. Tuceryan and Jain used σ₁ = 1 for the first Gaussian and σ₂ = 1.6σ₁ for the second. According to Marr, this is the ratio at which a DoG filter best approximates the corresponding ∇²G filter [54].
2. Select those pixels that lie on a local intensity maximum in the filtered image. A pixel in the filtered image is said to be on a local maximum if its magnitude is larger than that of six or more of its eight nearest neighbors. This results in a binary image. For example, applying steps 1 and 2 to the image in Fig. 14(a) yields the binary image in Fig. 14(b).
3. Perform a connected component analysis on the binary image using eight nearest neighbors. Each connected component defines a texture primitive (token).
The Voronoi tessellation of the resulting tokens is then constructed. Features of each Voronoi cell are extracted and tokens with similar features are grouped to construct uniform texture regions. Moments of area of the Voronoi polygons serve as a useful set of features that reflect both the spatial distribution and shapes of the tokens in the textured image. The (p + q)th order moments of area of a closed region R with respect to a token with coordinates (x₀, y₀) are defined as [55]:

m_pq = ∫∫_R (x − x₀)^p (y − y₀)^q dx dy ,    where p + q = 0, 1, 2, . . .
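As an illustration, the discrete analogue of these area moments for a cell given as a set of pixel coordinates can be computed as follows (a minimal NumPy sketch; treating the Voronoi cell as a pixel set rather than a polygon, and the function names, are our simplifications). The features f1, f2, and f3 of Table 2 below follow directly from m00, m10, and m01.

```python
import numpy as np

def area_moment(region_pixels, token_xy, p, q):
    """Discrete (p+q)th order moment of a region about the token location.

    region_pixels: array of (x, y) pixel coordinates belonging to the cell.
    token_xy:      (x0, y0) coordinates of the token.
    """
    x = region_pixels[:, 0] - token_xy[0]
    y = region_pixels[:, 1] - token_xy[1]
    return np.sum((x ** p) * (y ** q))

# Example: feature f1 (area) and the centroid offset used by f2 and f3.
cell = np.array([(x, y) for x in range(10) for y in range(6)], dtype=float)
token = (4.0, 2.0)
m00 = area_moment(cell, token, 0, 0)
xbar = area_moment(cell, token, 1, 0) / m00
ybar = area_moment(cell, token, 0, 1) / m00
f2 = np.hypot(xbar, ybar)          # magnitude of token-to-centroid vector
f3 = np.arctan2(ybar, xbar)        # its direction
```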
Table 2. Voronoi polygon features used by the texture segmentation algorithm [49]. Here, f2 gives the magnitude of the vector from the token to the polygon centroid, f3 gives its direction, f4 gives the overall elongation of the polygon (f4 = 0 for a circle), and f5 gives the orientation of its major axis. (x̄, ȳ) are the coordinates of the Voronoi polygon's centroid.
Computation
fl
moo
f2
VG-Tjp
f3
atan (g/%)
f5
atan
(
2mll m20 - m02
)
A description of the five features used is given in Table 2, where (x̄, ȳ) are the coordinates of the Voronoi polygon's centroid. The texture features based on Voronoi polygons have been used for segmentation of textured images. The segmentation algorithm is edge based, using a statistical comparison of the neighboring collections of tokens. A large dissimilarity among the texture features is evidence for a texture edge. This algorithm has successfully segmented gray level texture images as well as a number of synthetic textures with identical second-order statistics. Figure 14(a) shows an example texture pair and Fig. 14(c) shows the resulting segmentation.

3.2.2. Structural Methods
The structural models of texture assume that textures are composed of texture primitives. The texture is produced by the placement of these primitives according to certain placement rules. This class of algorithms, in general, is limited in power unless one is dealing with very regular textures. Structural texture analysis consists of two major steps: (a) extraction of the texture elements, and (b) inference of the placement rule. There are a number of ways to extract texture elements in images. It is useful to define what is meant by texture elements in this context. Usually texture elements consist of regions in the image with uniform gray levels. Voorhees and Poggio [56] argued that blobs are important in texture perception. They have proposed a method based on filtering the image with Laplacian-of-Gaussian (LoG) masks at different scales and combining this information to extract the blobs in the image. Blostein and Ahuja [57] perform similar processing in order to extract texture tokens in images by examining the response of the LoG filter at multiple scales. They integrate their multi-scale blob detection with surface shape computation in order to improve the results of both processes. Tomita and Tsuji [46] also suggest a method
of computing texture tokens by applying a medial axis transform to the connected components of a segmented image. They then compute a number of properties, such as intensity and shape, of these detected tokens. Zucker [58] has proposed a method in which he regards the observable textures (real textures) as distorted versions of ideal textures. The placement rule is defined for the ideal texture by a graph that is isomorphic to a regular or semiregular tessellation. These graphs are then transformed to generate the observable texture. Which of the regular tessellations is used as the placement rule is inferred from the observable texture. This is done by computing a two-dimensional histogram of the relative positions of the detected texture tokens. Another approach to modeling texture by structural means is described by Fu [59]. In this approach the texture image is regarded as texture primitives arranged according to a placement rule. The primitive can be as simple as a single pixel that can take a gray value, but it is usually a collection of pixels. The placement rule is defined by a tree grammar. A texture is then viewed as a string in the language defined by the grammar whose terminal symbols are the texture primitives. An advantage of this method is that it can be used for texture generation as well as texture analysis. The patterns generated by the tree grammars could also be regarded as ideal textures in Zucker's model.

3.3. Model Based Methods
Model based texture analysis methods are based on the construction of an image model that can be used not only to describe texture, but also to synthesize it. The model parameters capture the essential perceived qualities of texture.

3.3.1. Random Field Models
Markov random fields (MRFs) have been popular for modeling images. They are able to capture the local (spatial) contextual information in an image. These models assume that the intensity at each pixel in the image depends on the intensities of only the neighboring pixels. MRF models have been applied to various image processing applications such as texture synthesis [60], texture classification [61,62], image segmentation [63,64], image restoration [65], and image compression.
The image is usually represented by an M × N lattice denoted by L = {(i, j) | 1 ≤ i ≤ M, 1 ≤ j ≤ N}. I(i, j) is a random variable which represents the gray level at pixel (i, j) on lattice L. For mathematical convenience, the indexing of the lattice is simplified to I_t with t = (i − 1)N + j. Let Λ be the range set common to all random variables I_t and let Ω = {(x_1, x_2, . . . , x_MN) | x_t ∈ Λ} denote the set of all labellings of L. Note that Λ is specified according to the application. For instance, for an image with 256 different gray levels, Λ may be the set {0, 1, . . . , 255}. The random vector I = (I_1, I_2, . . . , I_MN) denotes a coloring of the lattice. A discrete Markov random field is a random field whose probability mass function has the properties of positivity, Markovianity, and homogeneity.
The neighbor set of a site t can be defined in different ways. The first-order neighbors of t are its four-connected neighbors and the second-order neighbors are its eight-connected neighbors. Within these neighborhoods, sets of neighbors which form cliques (single site, pairs, triples, and quadruples) are usually used in the definition of the conditional probabilities. A discrete Gibbs random field (GRF) assigns a probability mass function to the entire lattice:

P(X = x) = (1/Z) e^{−U(x)} ,    ∀x ∈ Ω    (3.4)

where U(x) is an energy function and Z is a normalizing constant called the partition function. The energy function is usually specified in terms of cliques formed over neighboring pixels. For the second-order neighbors the possible cliques are given in Fig. 15. The energy function is then expressed in terms of potential functions V_C(·) over the set of cliques Q:

U(x) = Σ_{C∈Q} V_C(x) .    (3.5)
Fig. 15. The clique types for the second-order neighborhood.
We have the property that, with respect to a neighborhood system, there exists a unique Gibbs random field for every Markov random field and there exists a unique Markov random field for every Gibbs random field [66]. The consequence of this theorem is that one can model the texture either globally, by specifying the total energy of the lattice, or locally, by specifying the local interactions of the neighboring pixels in terms of the conditional probabilities. There are a number of ways in which textures are modeled using Gibbs random fields. Among these are the Derin–Elliott model [67] and the auto-binomial model [66,60], which are defined by considering only the single pixel and pairwise pixel cliques in the second-order neighborhood of a site. In both models the conditional probabilities are given by expressions of the following form:

P(X_t = x_t | R_t) = e^{−W(x_t, R_t)^T θ} / Z_t    (3.6)
where Z_t = Σ_{g∈Λ} e^{−W(g, R_t)^T θ} is the normalization constant. The energy of the Gibbs random field is given by:

U(x) = (1/2) Σ_{t=1}^{MN} W(x_t, R_t)^T θ    (3.7)
where W(x_t, R_t) = [w_1(x_t) w_2(x_t) w_3(x_t) w_4(x_t)]^T and θ = [θ_1 θ_2 θ_3 θ_4]^T. The two models define the components w_r(x_t) of the w vector differently, as follows:

Derin–Elliott model:  w_r(x_t) = I(x_t, x_{t−r}) + I(x_t, x_{t+r}),  1 ≤ r ≤ 4.
Auto-binomial model:  w_r(x_t) = x_t (x_{t−r} + x_{t+r}),  1 ≤ r ≤ 4.
Here r is the index that defines the set of neighboring pixels of a site t, and I(a, b) is an indicator function defined as follows:

I(a, b) = −1  if a = b,   and   1  otherwise.
The vector θ is the set of parameters that defines and models the textural properties of the image. In texture synthesis problems, the parameter values are set to control the type of texture to be generated. In classification and segmentation problems, the parameters need to be estimated in order to process the texture images. Textures were synthesized using this method by Cross and Jain [60]. Model parameters were also estimated for a set of natural textures. The estimated parameters were used to generate synthetic textures and the results were compared to the original images. The models captured microtextures well, but they failed with regular and inhomogeneous textures.

3.3.2. Fractals

Many natural surfaces have a statistical quality of roughness and self-similarity at different scales. Fractals are very useful and have become popular for modelling these properties in image processing. Mandelbrot [68] proposed fractal geometry and was the first to note its prevalence in the natural world. We first define a deterministic fractal in order to introduce some of the fundamental concepts. Self-similarity across scales is a crucial concept in fractal geometry. A deterministic fractal is defined using this concept of self-similarity as follows. Given a bounded set A in a Euclidean n-space, the set A is said to be self-similar when A is the union of N distinct (non-overlapping) copies of itself, each of which has been scaled down by a ratio of r. The fractal dimension D is related to the number N and the ratio r as follows:

D = log N / log(1/r) .    (3.9)

The fractal dimension gives a measure of the roughness of a surface. Intuitively, the larger the fractal dimension, the rougher the texture is. Pentland [69] has argued
and given evidence that images of most natural surfaces can be modelled as spatially isotropic fractals. Most natural surfaces, and in particular textured surfaces, are not deterministic as described above but have a statistical variation. This makes the computation of the fractal dimension more difficult. There are a number of methods proposed for estimating the fractal dimension D. One method is the estimation of the box dimension, as follows [70]. Given a bounded set A in Euclidean n-space, consider boxes of size L_max on a side which cover the set A. A scaled down version of the set A by ratio r will result in N = 1/r^D similar sets. This new set can be covered by boxes of size L = r L_max. The number of such boxes is then related to the fractal dimension by

N(L) = 1/r^D = (L_max / L)^D .    (3.10)
The fractal dimension is then estimated from Eq. (3.10) by the following procedure. For a given L, divide the n-space into a grid of boxes of size L and count the number of boxes covering A. Repeat this procedure for different values of L. Then estimate the value of the fractal dimension D from the slope of the line

ln N(L) = −D ln L + D ln L_max .    (3.11)
This can be accomplished by computing the least squares linear fit to the data, namely, a plot of ln L versus −ln N(L). An improved method of estimating the fractal dimension was proposed by Voss [71]. Assume we are estimating the fractal dimension of an image surface A. Let P(m, L) be the probability that there are m points within a box of side length L centered at an arbitrary point on the surface A. Let M be the total number of points in the image. When one overlays the image with boxes of side length L, then (M/m) P(m, L) is the expected number of boxes with m points inside. The expected total number of boxes needed to cover the whole image is

N(L) = Σ_m (M/m) P(m, L) .    (3.12)
The expected value of N(L) is proportional to L^{−D} and thus can be used to estimate the fractal dimension D. Other methods have also been proposed for estimating the fractal dimension. For example, Super and Bovik [72] have proposed using Gabor filters and signal processing methods to estimate the fractal dimension in textured images. The fractal dimension is not sufficient to capture all textural properties. It has been shown [70] that there may be perceptually very different textures that have very similar fractal dimensions. Therefore, another measure, called lacunarity [68,71,70], has been suggested in order to capture the textural property that will
let one distinguish between such textures. Lacunarity is defined as

Λ = E[ (M / E(M) − 1)² ]    (3.13)

where M is the mass of the fractal set and E(M) is the expected value of the mass. This measures the discrepancy between the actual mass and the expected value of the mass. Lacunarity is small when the texture is fine and it is large when the texture is coarse. The mass of the fractal set is related to the length L by the power law

M(L) = K L^D .    (3.14)

Voss [71] suggested computing lacunarity from the probability distribution P(m, L) as follows. Let M(L) = Σ_{m=1}^{N} m P(m, L) and M²(L) = Σ_{m=1}^{N} m² P(m, L). Then the lacunarity Λ(L) is defined as:

Λ(L) = ( M²(L) − [M(L)]² ) / [M(L)]² .    (3.15)
This measure of the image is then used as a texture feature in order to perform texture segmentation or classification. Ohanian and Dubes [73] have studied the performance of various texture features. They studied the texture features with the performance criterion: "which features optimize the classification rate?" They compared four fractal features, 16 co-occurrence features, four Markov random field features, and Gabor features. They used Whitney's forward selection method for feature selection. The evaluation was done on four classes of images: Gauss Markov random field images, fractal images, leather images, and painted surfaces. The co-occurrence features generally outperformed the other features (88% correct classification), followed by the fractal features (84% classification). Using both fractal and co-occurrence features improved the classification rate to 91%. Their study did not compare the texture features in segmentation tasks. It also used the energy from the raw Gabor filtered images instead of using the empirical nonlinear transformation needed to obtain the texture features as suggested in [40] (see also Section 3.4.3).
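To make the box-counting estimate of Eqs. (3.10)–(3.11) concrete, the following minimal sketch (NumPy; the binarization of the set and the particular box sizes are illustrative assumptions) estimates D as the negative slope of a least squares line fit.

```python
import numpy as np

def box_counting_dimension(binary_image, box_sizes=(2, 4, 8, 16, 32)):
    """Estimate the fractal (box-counting) dimension of a binary point set."""
    counts = []
    rows, cols = binary_image.shape
    for L in box_sizes:
        n = 0
        for r in range(0, rows, L):
            for c in range(0, cols, L):
                if binary_image[r:r + L, c:c + L].any():   # box covers part of the set
                    n += 1
        counts.append(n)
    # ln N(L) = -D ln L + const, so -D is the slope of the least squares line.
    slope, _ = np.polyfit(np.log(box_sizes), np.log(counts), 1)
    return -slope

# A filled square region has dimension close to 2.
img = np.zeros((64, 64), dtype=bool)
img[16:48, 16:48] = True
print(box_counting_dimension(img))
```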
3.4. Signal Processing Methods

Psychophysical research has given evidence that the human brain does a frequency analysis of the image [19,74]. Texture is especially suited for this type of analysis because of its properties. This section will review the various techniques of texture analysis that rely on signal processing techniques. Most techniques try to compute certain features from filtered images which are then used in either classification or segmentation tasks.

3.4.1. Spatial Domain Filters
Spatial domain filters are the most direct way to capture image texture properties. Earlier attempts at defining such methods concentrated on measuring the edge
density per unit area. Fine textures tend to have a higher density of edges per unit area than coarser textures. The measurement of edgeness is usually computed by simple edge masks such as the Roberts operator or the Laplacian operator [10,47]. The two orthogonal masks for the Roberts operator and one digital realization of the Laplacian are given below.

Roberts operators:

    [  1   0 ]        [  0   1 ]
    [  0  -1 ]        [ -1   0 ]

Laplacian operator:

        [ -1  -1  -1 ]
    L = [ -1   8  -1 ]
        [ -1  -1  -1 ]
The edgeness measure can be computed over an image area by computing a magnitude from the responses of the Roberts masks or from the response of the Laplacian mask. Malik and Perona [75] proposed spatial filtering to model the preattentive texture perception in the human visual system. Their proposed model consists of three stages: (i) convolution of the image with a bank of even-symmetric filters followed by half-wave rectification, (ii) inhibition of spurious responses in a localized area, and (iii) detection of the boundaries between the different textures. The even-symmetric filters they used consist of differences of offset Gaussian (DOOG) functions. The half-wave rectification and inhibition (implemented as a leaders-take-all strategy) are methods of introducing a nonlinearity into the computation of texture features. A nonlinearity is needed in order to discriminate texture pairs with identical mean brightness and identical second-order statistics. The texture boundary detection is done by a straightforward edge detection method applied to the feature images obtained from stage (ii). This method works on a variety of texture examples and is able to discriminate natural as well as synthetic textures with carefully controlled properties. Unser and Eden [76] have also looked at texture features that are obtained from spatial filters and a nonlinear operator. Reed and Wechsler [77] review a number of spatial/spatial frequency domain filter techniques for segmenting textured images.
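The edgeness measure described at the beginning of this subsection is simple to compute; a minimal sketch is given below (SciPy; the window size and the use of the Roberts masks for the gradient magnitude are illustrative choices).

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

def edge_density(image, window=15):
    """Average gradient magnitude (Roberts masks) over a local window as an edgeness feature."""
    r1 = np.array([[1.0, 0.0], [0.0, -1.0]])
    r2 = np.array([[0.0, 1.0], [-1.0, 0.0]])
    g1 = convolve(image, r1, mode='reflect')
    g2 = convolve(image, r2, mode='reflect')
    magnitude = np.hypot(g1, g2)
    return uniform_filter(magnitude, size=window)   # per-pixel edgeness feature image
```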
If the region R is a local rectangular area and the moments are computed around each pixel in the image, then this is equivalent to filtering the image by a set of spatial masks. The resulting filtered images that correspond t o the moments are then used as texture features. The masks are obtained by defining a window of size W x W and a local coordinate system centered within the window. Let (2, j ) be the image coordinates at which the moments are computed. For pixel coordinates (m,n )
which fall within the W × W window centered at (i, j), the normalized coordinates (x_m, y_n) are given by:

x_m = (m − i) / (W/2) ,    y_n = (n − j) / (W/2) .    (3.17)

Then the moments within a window centered at pixel (i, j) are computed by the sum in Eq. (3.16) using the normalized coordinates:

m_pq(i, j) = Σ_{m=i−W/2}^{i+W/2} Σ_{n=j−W/2}^{j+W/2} x_m^p y_n^q I(m, n) .    (3.18)
The coefficients with which each pixel within the window is weighted in evaluating this sum are what define the mask coefficients. If R is a 3 × 3 region, then the resulting masks include (with x taken along the columns and y along the rows):

        [ 1  1  1 ]           [ -1  0  1 ]           [ -1  -1  -1 ]
M00 =   [ 1  1  1 ]    M10 =  [ -1  0  1 ]    M01 =  [  0   0   0 ]
        [ 1  1  1 ]           [ -1  0  1 ]           [  1   1   1 ]
The moment-based features have been used successfully in texture segmentation [78]. An example texture pair and the segmentation are shown in Fig. 16.
Fig. 16. The segmentation results using moment based texture features. (a) A texture pair consisting of reptile skin and herringbone pattern from the Brodatz album [92]. (b) The resulting segmentation.
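A minimal sketch of this moment-mask filtering is given below (SciPy; the 3 × 3 masks, the choice of moments up to order two, and the function names are illustrative assumptions).

```python
import numpy as np
from scipy.signal import correlate2d

# 3x3 moment masks over normalized local coordinates x, y in {-1, 0, 1}.
x = np.array([[-1, 0, 1]] * 3, dtype=float)   # x varies along the columns
y = x.T                                        # y varies along the rows
masks = {"m00": np.ones((3, 3)), "m10": x, "m01": y,
         "m11": x * y, "m20": x ** 2, "m02": y ** 2}

def moment_features(image):
    """One feature image per mask; each pixel gets the moment of its 3x3 window."""
    return {name: correlate2d(image, m, mode="same") for name, m in masks.items()}

features = moment_features(np.random.rand(32, 32))
```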
3.4.2. Fourier domain filtering
The frequency analysis of the textured image is best done in the Fourier domain. As the psychophysical results indicated, the human visual system analyzes the textured images by decomposing the image into its frequency and orientation components [19]. The multiple channels tuned to different frequencies are also referred
to as multi-resolution processing in the literature. The concept of multi-resolution processing is further refined and developed in the wavelet model described below. Along the lines of these psychophysical results, texture analysis systems have been developed that perform filtering in the Fourier domain to obtain feature images. The idea is similar to the features computed from the rings and wedges as described in Section 3.1.2, except that the phase information is kept. Coggins and Jain [79] used a set of frequency and orientation selective filters in a multichannel filtering approach. Each filter is either frequency selective or orientation selective. There are four orientation filters centered at 0°, 45°, 90°, and 135°. The number of frequency selective filters depends on the image size. For an image of size 128 × 128, filters with center frequencies at 1, 2, 4, 8, 16, 32, and 64 cycles/image were used. They were able to successfully segment and classify a variety of natural images as well as the synthetic texture pairs described by Julesz with identical second-order statistics (see Fig. 4).

3.4.3. Gabor and wavelet models

The Fourier transform is an analysis of the global frequency content in the signal. Many applications require the analysis to be localized in the spatial domain. This is usually handled by introducing spatial dependency into the Fourier analysis. The classical way of doing this is through what is called the window Fourier transform. The window Fourier transform (or short-time Fourier transform) of a one-dimensional signal f(x) is defined as:
F_w(u, t) = ∫_{−∞}^{∞} f(x) w(x − t) e^{−j2πux} dx .    (3.19)
When the window function w(x) is Gaussian, the transform becomes a Gabor transform. The limits on the resolution in the time and frequency domains of the window Fourier transform are determined by the time-bandwidth product, or the Heisenberg uncertainty inequality, given by:
Δt Δu ≥ 1/(4π) .    (3.20)
Once a window is chosen for the window Fourier transform, the time-frequency resolution is fixed over the entire time-frequency plane. To overcome the resolution limitation of the window Fourier transform, one lets Δt and Δu vary in the time-frequency domain. Intuitively, the time resolution must increase as the central frequency of the analyzing filter is increased. That is, the relative bandwidth is kept constant on a logarithmic scale. This is accomplished by using a window whose width changes as the frequency changes. Recall that when a function f(t) is scaled in time by a, which is expressed as f(at), the function is contracted if a > 1 and it is expanded when a < 1. Using this fact, the wavelet transform can be written as:

W_a(t) = (1/√a) ∫_{−∞}^{∞} f(x) h( (x − t)/a ) dx .    (3.21)
Here, the impulse responses of the filter bank are defined to be scaled versions of the same prototype function h(t). Now, setting in Eq. (3.21)
h(t) = w(t) e^{−j2πut}    (3.22)
we obtain the wavelet model for texture analysis. Usually the scaling factor will be based on the frequency of the filter. Daugman [80] proposed the use of Gabor filters in the modeling of the receptive fields of simple cells in the visual cortex of some mammals. The proposal to use the Gabor filters in texture analysis was made by Turner [81] and Clark and Bovik [82]. Later Farrokhnia and Jain used them successfully in segmentation and classification of textured images [40,83]. Gabor filters have some desirable optimality properties. Daugman [84] showed that for two-dimensional Gabor functions, the uncertainty relations Δx Δu ≥ 1/(4π) and Δy Δv ≥ 1/(4π) attain the minimum value. Here Δx and Δy are effective widths in the spatial domain and Δu and Δv are effective bandwidths in the frequency domain. A two-dimensional Gabor function consists of a sinusoidal plane wave of a certain frequency and orientation modulated by a Gaussian envelope. It is given by:

f(x, y) = exp{ −(1/2) [ x²/σ_x² + y²/σ_y² ] } cos(2π u₀ x + φ)    (3.23)

where u₀ and φ are the frequency and phase of the sinusoidal wave. The values σ_x and σ_y are the sizes of the Gaussian envelope in the x and y directions, respectively. The Gabor function at an arbitrary orientation θ₀ can be obtained from Eq. (3.23) by a rigid rotation of the x–y plane by θ₀. The Gabor filter is a frequency and orientation selective filter. This can be seen from the Fourier domain analysis of the function. When the phase φ is 0, the Fourier transform of the resulting even-symmetric Gabor function f(x, y) is given by
F(u, v) = A ( exp{ −(1/2) [ (u − u₀)²/σ_u² + v²/σ_v² ] } + exp{ −(1/2) [ (u + u₀)²/σ_u² + v²/σ_v² ] } )    (3.24)

where σ_u = 1/(2πσ_x), σ_v = 1/(2πσ_y), and A = 2π σ_x σ_y. This function is real-valued and has two lobes in the spatial frequency domain, one centered around u₀ and another centered around −u₀. For a Gabor filter of a particular orientation, the lobes in the frequency domain are also appropriately rotated.
Jain and Farrokhnia [40] used a version of the Gabor transform in which window sizes for computing the Gabor filters are selected according to the central frequencies of the filters. The texture features were obtained as follows:
(a) Use a bank of Gabor filters at multiple scales and orientations to obtain filtered images. Let the filtered image for the ith filter be r_i(x, y).
(b) Pass each filtered image through a sigmoidal nonlinearity. This nonlinearity ψ(t) has the form tanh(αt). The choice of the value of α is determined empirically.
(c) The texture feature for each pixel is computed as the absolute average deviation of the transformed values of the filtered images from the mean within a window W of size M × M. The filtered images have zero mean; therefore, the ith texture feature image e_i(x, y) is given by the equation:

e_i(x, y) = (1/M²) Σ_{(a,b)∈W_{xy}} | ψ(r_i(a, b)) |    (3.25)

The window size M is also determined automatically based on the central frequency of the filter. An example texture image and some intermediate results are shown in Fig. 17. Texture features using Gabor filters have been used successfully in texture segmentation and texture classification tasks. An example of the resulting segmentation is shown in Fig. 18. Further details of the segmentation algorithm are explained in Section 4.1.

4. Texture Analysis Problems
The various methods for modelling textures and extracting texture features can be applied in four broad categories of problems: texture segmentation, texture classification, texture synthesis, and shape from texture. We now review these four areas.
4.1. Texture Segmentation

Texture segmentation is a difficult problem because one usually does not know a priori what types of textures exist in an image, how many different textures there are, and which regions in the image have which textures. In fact, one does not need to know which specific textures exist in the image in order to do texture segmentation. All that is needed is a way to tell that two textures (usually in adjacent regions of the image) are different. The two general approaches to performing texture segmentation are analogous to methods for image segmentation: region-based approaches and boundary-based approaches.
In a region-based approach, one tries to identify regions of the image which have a uniform texture. Pixels or small local regions are merged based on the similarity of some texture property. The regions having different textures are then considered to be segmented regions. This method has the advantage that the boundaries of regions are always closed and therefore the regions with different textures are always well separated. It has the disadvantage, however, that in many region-based segmentation methods one has to specify the number of distinct textures present in the image in advance. In addition, thresholds on similarity values are needed.
The boundary-based approaches are based upon the detection of differences in texture in adjacent regions. Thus boundaries are detected where there are differences in texture. In this method, one does not need to know the number of textured regions in the image in advance. However, the boundaries may have gaps, and the regions with different textures are not guaranteed to be well separated.
Fig. 18. The results of integrating region-based and boundary-based processing using the multi-scale Gabor filtering method. (a) Original image consisting of five natural textures. (b) Seven category region-based segmentation results. (c) Edge-based processing and texture edges detected. (d) New segmentation after combining region-based and edge-based results.
Boundary-based segmentation of textured images has been used by Tuceryan and Jain [49], Voorhees and Poggio [56], and Eom and Kashyap [85]. In all cases, the edges (or texture boundaries) are detected by taking two adjacent windows and deciding whether the textures in the two windows belong to the same texture or to different textures. If it is decided that the two textures are different, the point is marked as a boundary pixel. Du Buf and Kardan [86] studied and compared the performance of various texture segmentation techniques and their ability to localize the boundaries. Tuceryan and Jain [49] use the texture features computed from the Voronoi polygons in order to compare the textures in the two windows. The comparison is done using a Kolmogorov-Smirnov test. A probabilistic relaxation labeling, which enforces border smoothness, is used to remove isolated edge pixels and fill boundary gaps. Voorhees and Poggio extract blobs and elongated structures from images (they suggest that these correspond to Julesz's textons). The texture properties are based on blob characteristics such as their sizes, orientations, etc. They then decide whether the two sides of a pixel have the same texture using a statistical test called the maximum frequency difference (MFD). The pixels where this statistic is sufficiently large are considered to be boundaries between different textures.
Jain and Farrokhnia [40] give an example of integrating a region-based and a boundary-based method to obtain a cleaner and more robust texture segmentation method. They use the texture features computed from the bank of Gabor filters to perform a region-based segmentation. This is accomplished by the following steps:
(a) Gabor features are calculated from the input image, yielding several feature images.
(b) A cluster analysis is performed in the Gabor feature space on a subset of randomly selected pixels in the input image (this is done in order to increase computational efficiency; about 6% of the total number of pixels in the image are selected). The number k of clusters is specified for doing the cluster analysis. This is set to a value larger than the true number of clusters and thus the image is oversegmented.
(c) Step (b) assigns a cluster label to the pixels (patterns) involved in the cluster analysis. These labelled patterns are used as the training set and all the pixels in the image are classified into one of the k clusters. A minimum distance classifier is used. This results in a complete segmentation of the image into uniformly textured regions.
(d) A connected component analysis is performed to identify each segmented region.
(e) A boundary-based segmentation is performed by applying the Canny edge detector to each feature image. The magnitude of the Canny edge detector response for each feature image is summed up for each pixel to obtain a total edge response. The edges are then detected based on this total magnitude.
(f) The edges so detected are then combined with the region-based segmentation results to obtain the final texture segmentation.
The integration of the boundary-based and region-based segmentation results improves the resulting segmentation in most cases. For an example of this improvement see Fig. 18.

4.2. Texture Classification
Texture classification involves deciding what texture category an observed image belongs to. In order to accomplish this, one needs to have a priori knowledge of the classes to be recognized. Once this knowledge is available and the texture features are extracted, one then uses classical pattern classification techniques in order to do the classification. Examples where texture classification has been applied as the appropriate texture processing method include the classification of regions in satellite images into categories of land use [41]. Texture classification was also used in automated paint inspection by Farrokhnia [83]. In the latter application, the categories were ratings of the quality of paints obtained from human experts. These quality rating categories were then used as the training samples for supervised classification of paint images using texture features obtained from multi-channel Gabor filters.
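As an illustration of this kind of multi-channel Gabor feature extraction and classification, the sketch below uses a small filter bank followed by a tanh nonlinearity, local smoothing, and a minimum distance (nearest-mean) classifier; the filter parameters, smoothing widths, and function names are illustrative simplifications of the procedure in Section 3.4.3 rather than the exact published method.

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def gabor_kernel(frequency, theta, sigma, size=31):
    """Even-symmetric Gabor kernel: Gaussian envelope times an oriented cosine wave."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2)) * np.cos(2 * np.pi * frequency * xr)

def gabor_features(image, frequencies=(0.1, 0.2, 0.4),
                   thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """One feature image per (frequency, orientation) channel."""
    feats = []
    for f in frequencies:
        for t in thetas:
            resp = convolve(image, gabor_kernel(f, t, sigma=0.5 / f), mode='reflect')
            # sigmoidal nonlinearity followed by local averaging of the magnitude
            feats.append(gaussian_filter(np.abs(np.tanh(0.25 * resp)), sigma=1.0 / f))
    return np.stack(feats, axis=-1)                    # H x W x n_channels

def nearest_mean_classify(features, class_means):
    """Assign each pixel to the texture class with the closest mean feature vector."""
    d = ((features[..., None, :] - class_means) ** 2).sum(axis=-1)
    return d.argmin(axis=-1)
```

Class mean feature vectors would be estimated from labelled training textures (e.g. the expert-rated paint categories mentioned above) before applying the classifier to new images.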
4.3. Texture Synthesis

Texture synthesis is a problem which is more popular in computer graphics. It is closely tied to some of the methods discussed above, so we give only a brief summary here. Many of the modelling methods are directly applicable to texture synthesis. The Markov random field models discussed in Section 3.3.1 can be used directly to generate textures by specifying the parameter vector θ and sampling from the probability distribution function [62,60]. The synthetic textures in Fig. 2(b) were generated using a Gaussian Markov random field model and the algorithm in [87].
Fractals have become popular in computer graphics for generating realistic looking textured images [88]. A number of different methods have been proposed for synthesizing textures using fractal models. These methods include the midpoint displacement method and the Fourier filtering method. The midpoint displacement method has become very popular because it is a simple and fast algorithm, yet it can be used to generate very realistic looking textures. Here we only give the general outline of the algorithm. A much more detailed discussion of the algorithm can be found in [88]. The algorithm starts with a square grid representing the image, with the four corners set to 0. It then displaces the heights at the midpoints of the four sides and the center point of the square region by random amounts and repeats the process recursively. Iteration n + 1 uses the grid consisting of the midpoints of the squares in the grid for iteration n. The height at a midpoint is first interpolated between the endpoints and a random value is added to this value. The amount added is chosen from a normal distribution with zero mean and variance σ_n² at iteration n. In order to keep the self-similar nature of the surface, the variance is changed as a function of the iteration number. The variance at iteration n is given by

σ_n² = r^{2nH}   where r = 1/√2 .    (4.1)
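A minimal one-dimensional sketch of the midpoint displacement recursion is given below (the two-dimensional surface version subdivides a square grid in the same spirit; the parameter names and the reading of Eq. (4.1) as a per-level standard deviation r^{nH} are our illustrative choices).

```python
import numpy as np

def midpoint_displacement_1d(n_iterations, H, sigma=1.0, r=2 ** -0.5, rng=None):
    """Generate a fractal (fBm-like) 1-D profile by recursive midpoint displacement."""
    rng = np.random.default_rng() if rng is None else rng
    heights = np.array([0.0, 0.0])                       # endpoints of the initial segment
    for n in range(1, n_iterations + 1):
        midpoints = (heights[:-1] + heights[1:]) / 2.0   # interpolate the midpoints
        midpoints += rng.normal(0.0, sigma * r ** (n * H), size=midpoints.size)
        new = np.empty(heights.size + midpoints.size)    # interleave old points
        new[0::2] = heights                              # and displaced midpoints
        new[1::2] = midpoints
        heights = new
    return heights

profile = midpoint_displacement_1d(n_iterations=8, H=0.7)   # 257 samples
```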
This results in a fractal surface with fractal dimension (3 − H). The heights of the fractal surface can be mapped onto intensity values to generate textured images. The example image in Fig. 2(c) was generated using this method. Other methods include mosaic models [89,90]. This class of models can in turn be divided into subclasses of cell structure models and coverage models. In cell structure models the textures are generated by tessellating the plane into cells (bounded polygons) and assigning each cell a gray level according to a set of probabilities. The type of tessellation determines what type of textures are generated. The possible tessellations include the triangular pattern, checkerboard patterns, the Poisson line model, the Delaunay model, and the occupancy model. In coverage models, the texture is obtained by a random arrangement of a set of geometric figures in the plane. The coverage models are also referred to as bombing models.

4.4. Shape from Texture
4.4. Shape from Texture

There are many cues in images that allow the viewer to make inferences about the three-dimensional shapes of objects and surfaces present in the image. Examples
of such cues include the variations of shading on the object surfaces or the relative configurations of boundaries and the types of junctions that allow one to infer three-dimensional shape from the line drawings of objects. The relation between the variations in texture properties and surface shape was first pointed out by Gibson [10]. Stevens observed that certain properties of texture are perceptually significant in the extraction of surface geometry [91]. There are three effects that surface geometry has on the appearance of texture in images: foreshortening and scaling of texture elements, and a change in their density. The foreshortening effect is due to the orientation of the surface on which the texture element lies. The scaling and density changes are due to the distance of the texture elements from the viewer. Stevens argued that texture density is not a useful measure for computing distance or orientation information because the density varies both with scaling and foreshortening. He concluded that the more perceptually stable property that allows one to extract surface geometry information is the direction in the image which is not foreshortened, called the characteristic dimension. Stevens suggested that one can compute relative depth information using the reciprocal of the scaling in the characteristic dimension. Using the relative depths, surface orientation can be estimated.

Bajcsy and Lieberman [92] used the gradient in texture element sizes to derive surface shape. They assumed a uniform texture element size on the three-dimensional surface in the scene. The relative distances are computed based on a gradient function in the image which was estimated from the texture element sizes. The estimation of the relative depth was done without using knowledge about the camera parameters and the original texture element sizes.

Witkin [93] used the distribution of edge orientations in the image to estimate the surface orientation. The surface orientation is represented by the slant (σ) and tilt (τ) angles. The slant is the angle between a normal to the surface and a normal to the image plane. The tilt is the angle between the surface normal's projection onto the image plane and a fixed coordinate axis in the image plane. He assumed an isotropic texture (uniform distribution of edge orientations) on the original surface. As a result of the projection process, the textures are foreshortened in the direction of steepest inclination (slant angle). Note that this idea is related to Stevens' argument because the direction of steepest inclination is perpendicular to the characteristic dimension. Witkin formulated the surface shape recovery by relating the slant and tilt angles to the distribution of observed edge directions in the image. Let β be the original edge orientation (the angle between the tangent and a fixed coordinate axis on the plane S containing the tangent). Let α* be the angle between the x-axis in the image plane and the projected tangent. Then α* is related to the slant and tilt angles by the expression

α* = arctan( tan β / cos σ ) + τ .  (4.2)
Here α* is an observable quantity in the image and (σ, τ) are the quantities to be computed. Witkin derived the expression for the conditional probabilities of the slant and tilt angles given the measured edge directions in the image and then used a maximum likelihood estimation method to compute (σ, τ). Let A* = { α*_1, ..., α*_n } be a set of observed edge directions in the image. Then the conditional probability is given by (4.3), where P(σ, τ) ∝ sin σ is the prior density of surface orientations. The maximum likelihood estimate of P(σ, τ | A*) gives the desired surface orientation.

Blostein and Ahuja [57] used the scaling effect to extract surface information. They integrated the process of texture element extraction with the surface geometry computation. Texture element extraction is performed at multiple scales and the subset that yields a good surface fit is selected. The surfaces are assumed planar for simplicity. Texture elements are defined to be circular regions of uniform intensity which are extracted by filtering the image with ∇²G and ∂/∂a(∇²G) operators and comparing the filter responses to those of an ideal disk (here a is the size of the Gaussian G). At the extremum points of the image filtered by ∇²G, the diameter (D) and contrast (C) of the best fitting disks are computed. The convolution is done at multiple scales. Only those disks whose computed diameters are close to the size of the Gaussian are retained. As a result, blob-like texture elements of different sizes are detected. The geometry of the projection is shown in Fig. 19. Let σ and τ be the slant and tilt of the surface. The image of a texture element has the foreshortened dimension F_i and the characteristic dimension U_i. The area A_i of the image texel is proportional to the product F_i U_i for compact shapes. The expression for the area A_i of the image of a texture element is given by:
A_i = A_c (1 - tan θ tan σ)³  (4.4)

where A_c is the area that would be measured for the texel at the center of the image. The angle θ is given by the expression

θ = arctan( (x cos τ + y sin τ)(r/f) ) .  (4.5)
Here, r is the physical width of the image, r/f is a measure of the field of view of the camera, and (x, y) denotes pixel coordinates in the image. A_i can be measured in the image. To find the surface orientation, an accumulator array over the parameters (A_c, σ, τ) is constructed. For each combination of parameter values, a possible planar fit is computed. The plane with the highest fit rating is selected as the surface orientation, and texture elements that support this fit are selected as
Fig. 19. The projective distortion of a texture element in the image.
Fig. 20. Examples of shape from texture computation using Blostein and Ahuja's algorithm [57]. (a) An image of a field of rocks and the computed slant and tilt of the plane. (b) An image of a sunflower field and the extracted slant and tilt values.
the true texture elements. Some example images and the computed slant and tilt values are shown in Fig. 20.
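A minimal sketch of the forward model behind this accumulator search is given below: for a hypothesised plane (A_c, σ, τ) and camera factor r/f, Eqs. (4.4)-(4.5) predict the texel area at each pixel, and such predictions would be compared with the measured disk areas. All numerical values are illustrative only.

```python
import math

# Hedged illustration of Eqs. (4.4)-(4.5): predicted texel area for one
# hypothesised plane (A_c, sigma, tau); parameter values are made up.

def predicted_texel_area(x, y, A_c, sigma, tau, r_over_f):
    theta = math.atan((x * math.cos(tau) + y * math.sin(tau)) * r_over_f)   # Eq. (4.5)
    return A_c * (1.0 - math.tan(theta) * math.tan(sigma)) ** 3            # Eq. (4.4)

# Texels on one side of a slanted plane project smaller than on the other:
for y in (-0.4, 0.0, 0.4):                       # normalised image coordinates
    print(y, predicted_texel_area(0.0, y, A_c=100.0, sigma=math.radians(50),
                                  tau=math.radians(90), r_over_f=0.5))
```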
5. Summary

This chapter has reviewed the basic concepts and various methods and techniques for processing textured images. Texture is a prevalent property of most
physical surfaces in the natural world. It also arises in many applications such as satellite imagery and printed documents. Many common low-level vision algorithms such as edge detection break down when applied to images that contain textured surfaces. It is therefore crucial that we have robust and efficient methods for processing textured images. Texture processing has been successfully applied to practical domains such as automated inspection and satellite imagery, and the promising applications surveyed here suggest that it will play an increasingly important role in a variety of other domains in the future.

Acknowledgment
The support of the National Science Foundation through grants IRI-8705256 and CDA-8806599 is gratefully acknowledged. We thank the Norwegian Computing Center for providing the SAR images shown in Fig. 5. We also thank our colleagues Dr. Richard C. Dubes and Dr. Patrick J. Flynn for the invaluable comments and feedback they provided during the preparation of this document.

References

[1] J. M. Coggins, A Framework for Texture Analysis Based on Spatial Filtering, Ph.D. Thesis, Computer Science Department, Michigan State University, East Lansing, MI, 1982.
[2] H. Tamura, S. Mori and Y. Yamawaki, Textural features corresponding to visual perception, IEEE Trans. Syst. Man Cybern. (1978) 460-473.
[3] J. Sklansky, Image segmentation and feature extraction, IEEE Trans. Syst. Man Cybern. (1978) 237-247.
[4] R. M. Haralick, Statistical and structural approaches to texture, Proc. IEEE 67 (1979) 786-804.
[5] W. Richards and A. Polit, Texture matching, Kybernetic 16 (1974) 155-162.
[6] S. W. Zucker and K. Kant, Multiple-level representations for texture discrimination, in Proc. IEEE Conf. on Pattern Recognition and Image Processing, Dallas, TX, 1981, 609-614.
[7] J. K. Hawkins, Textural properties for pattern recognition, in B. Lipkin and A. Rosenfeld (eds.), Picture Processing and Psychopictorics (Academic Press, New York, 1969).
[8] P. Brodatz, Textures: A Photographic Album for Artists and Designers (Dover Publications, New York, 1966).
[9] C. C. Chen, Markov Random Fields in Image Analysis, Ph.D. Thesis, Computer Science Department, Michigan State University, East Lansing, MI, 1988.
[10] J. J. Gibson, The Perception of the Visual World (Houghton Mifflin, Boston, MA, 1950).
[11] B. Julesz, E. N. Gilbert, L. A. Shepp and H. L. Frisch, Inability of humans to discriminate between visual textures that agree in second-order statistics - revisited, Perception 2 (1973) 391-405.
[12] B. Julesz, Visual pattern discrimination, IRE Trans. Inf. Theory 8 (1962) 84-92.
[13] B. Julesz, Experiments in the visual perception of texture, Sci. Am. 232 (1975) 34-43.
[14] B. Julesz, Nonlinear and cooperative processes in texture perception, in T. P. Werner and E. Reichardt (eds.), Theoretical Approaches in Neurobiology (MIT Press, Cambridge, MA, 1981) 93-108.
[15] B. Julesz, Textons, the elements of texture perception, and their interactions, Nature 290 (1981) 91-97.
[16] B. Julesz, A theory of preattentive texture discrimination based on first-order statistics of textons, Biol. Cybern. 41 (1981) 131-138.
[17] T. Caelli, Visual Perception (Pergamon Press, 1981).
[18] J. Beck, A. Sutter and R. Ivry, Spatial frequency channels and perceptual grouping in texture segregation, Comput. Vision Graph. Image Process. 37 (1987) 299-325.
[19] F. W. Campbell and J. G. Robson, Application of Fourier analysis to the visibility of gratings, J. Physiol. 197 (1968) 551-566.
[20] R. L. Devalois, D. G. Albrecht and L. G. Thorell, Spatial-frequency selectivity of cells in macaque visual cortex, Vision Res. 22 (1982) 545-559.
[21] P. Dewaele, P. Van Gool and A. Oosterlinck, Texture inspection with self-adaptive convolution filters, in Proc. 9th Int. Conf. on Pattern Recognition, Rome, Italy, Nov. 1988, 56-60.
[22] D. Chetverikov, Detecting defects in texture, in Proc. 9th Int. Conf. on Pattern Recognition, Rome, Italy, Nov. 1988, 61-63.
[23] J. Chen and A. K. Jain, A structural approach to identify defects in textured images, in Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics, Beijing, 1988, 29-32.
[24] R. W. Conners, C. W. McMillin, K. Lin and R. E. Vasquez-Espinosa, Identifying and locating surface defects in wood: Part of an automated lumber processing system, IEEE Trans. Pattern Anal. Mach. Intell. 5 (1983) 573-583.
[25] L. H. Siew, R. M. Hodgson and E. J. Wood, Texture measures for carpet wear assessment, IEEE Trans. Pattern Anal. Mach. Intell. 10 (1988) 92-105.
[26] A. K. Jain, F. Farrokhnia and D. H. Alman, Texture analysis of automotive finishes, in Proc. of SME Machine Vision Applications Conf., Detroit, MI, Nov. 1990, 1-16.
[27] R. Sutton and E. L. Hall, Texture measures for automatic classification of pulmonary disease, IEEE Trans. Comput. 21 (1972) 667-676.
[28] H. Harms, U. Gunzer and H. M. Aus, Combined local color and texture analysis of stained cells, Comput. Vision Graph. Image Process. 33 (1986) 364-376.
[29] G. H. Landeweerd and E. S. Gelsema, The use of nuclear texture parameters in the automatic analysis of leukocytes, Pattern Recogn. 10 (1978) 57-61.
[30] M. F. Insana, R. F. Wagner, B. S. Garra, D. G. Brown and T. H. Shawker, Analysis of ultrasound image texture via generalized Rician statistics, Opt. Engin. 25 (1986) 743-748.
[31] C. C. Chen, J. S. Daponte and M. D. Fox, Fractal feature analysis and classification in medical imaging, IEEE Trans. Medical Imaging 8 (1989) 133-142.
[32] A. Lundervold, Ultrasonic tissue characterization - A pattern recognition approach, Technical Report, Norwegian Computing Center, Oslo, Norway, 1992.
[33] D. Wang and S. N. Srihari, Classification of newspaper image blocks using texture analysis, Comput. Vision Graph. Image Process. 47 (1989) 327-352.
[34] F. M. Wahl, K. Y. Wong and R. G. Casey, Block segmentation and text extraction in mixed text/image documents, Comput. Graph. Image Process. 20 (1982) 375-390.
[35] J. A. Fletcher and R. Kasturi, A robust algorithm for text string separation from mixed text/graphics images, IEEE Trans. Pattern Anal. Mach. Intell. 10 (1988) 910-918.
[36] T. Taxt, P. J. Flynn and A. K. Jain, Segmentation of document images, IEEE Trans. Pattern Anal. Mach. Intell. 11 (1989) 1322-1329.
[37] A. K. Jain and S. K. Bhattacharjee, Text segmentation using Gabor filters for automatic document processing, Mach. Vision and Appl. 5 (1992) 169-184.
[38] A. K. Jain and S. K. Bhattacharjee, Address block location on envelopes using Gabor filters, in Proc. 11th Int. Conf. on Pattern Recognition, The Hague, Netherlands, Aug. 1992, Vol. B, 264-267.
[39] A. K. Jain, S. K. Bhattacharjee and Y. Chen, On texture in document images, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Champaign, IL, Jun. 1992, 677-680.
[40] A. K. Jain and F. Farrokhnia, Unsupervised texture segmentation using Gabor filters, Pattern Recogn. 24 (1991) 1167-1186.
[41] R. M. Haralick, K. Shanmugam and I. Dinstein, Textural features for image classification, IEEE Trans. Syst. Man Cybern. 3 (1973) 610-621.
[42] E. Rignot and R. Kwok, Extraction of textural features in SAR images: Statistical model and sensitivity, in Proc. Int. Geoscience and Remote Sensing Symp., Washington, DC, 1990, 1979-1982.
[43] A. H. Schistad and A. K. Jain, Texture analysis in the presence of speckle noise, in Proc. IEEE Geoscience and Remote Sensing Symp., Houston, TX, May 1992, 147-152.
[44] L. J. Du, Texture segmentation of SAR images using localized spatial filtering, in Proc. Int. Geoscience and Remote Sensing Symp., Washington, DC, 1990, 1983-1986.
[45] J. H. Lee and W. D. Philpot, A spectral-textural classifier for digital imagery, in Proc. Int. Geoscience and Remote Sensing Symp., Washington, DC, 1990, 2005-2008.
[46] F. Tomita and S. Tsuji, Computer Analysis of Visual Textures (Kluwer Academic Publishers, Boston, 1990).
[47] K. I. Laws, Textured Image Segmentation, Ph.D. thesis, University of Southern California, 1980.
[48] R. Picard, I. M. Elfadel and A. P. Pentland, Markov/Gibbs texture modeling: Aura matrices and temperature effects, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Maui, Hawaii, 1991, 371-377.
[49] M. Tuceryan and A. K. Jain, Texture segmentation using Voronoi polygons, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 211-216.
[50] N. Ahuja, Dot pattern processing using Voronoi neighborhoods, IEEE Trans. Pattern Anal. Mach. Intell. 4 (1982) 336-343.
[51] G. Voronoi, Nouvelles applications des paramètres continus à la théorie des formes quadratiques. Deuxième mémoire: Recherches sur les parallélloèdres primitifs, J. Reine Angew. Math. 134 (1908) 198-287.
[52] M. I. Shamos and D. Hoey, Closest-point problems, in 16th Annual Symposium on Foundations of Computer Science, 1975, 131-162.
[53] F. P. Preparata and M. I. Shamos, Computational Geometry (Springer-Verlag, New York, 1985).
[54] D. Marr, Vision (Freeman, San Francisco, 1982).
[55] M. K. Hu, Visual pattern recognition by moment invariants, IRE Trans. Inf. Theory 8 (1962) 179-187.
[56] H. Voorhees and T. Poggio, Detecting textons and texture boundaries in natural images, in Proc. First Int. Conf. on Computer Vision, London, 1987, 250-258.
[57] D. Blostein and N. Ahuja, Shape from texture: Integrating texture-element extraction and surface estimation, IEEE Trans. Pattern Anal. Mach. Intell. 11 (1989) 1233-1251.
[58] S. W. Zucker, Toward a model of texture, Comput. Graph. Image Process. 5 (1976) 190-202.
[59] K. S. Fu, Syntactic Pattern Recognition and Applications (Prentice-Hall, New Jersey, 1982).
[60] G. C. Cross and A. K. Jain, Markov random field texture models, IEEE Trans. Pattern Anal. Mach. Intell. 5 (1983) 25-39.
[61] R. Chellappa and S. Chatterjee, Classification of textures using Gaussian Markov random fields, IEEE Trans. Acoust. Speech Signal Process. 33 (1985) 959-963.
[62] A. Khotanzad and R. Kashyap, Feature selection for texture recognition based on image synthesis, IEEE Trans. Syst. Man Cybern. 17 (1987) 1087-1095.
[63] F. S. Cohen and D. B. Cooper, Simple parallel hierarchical and relaxation algorithms for segmenting noncausal Markovian random fields, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1987) 195-219.
[64] C. W. Therrien, An estimation-theoretic approach to terrain image segmentation, Comput. Vision Graph. Image Process. 22 (1983) 313-326.
[65] S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 721-741.
[66] J. Besag, Spatial interaction and the statistical analysis of lattice systems, J. Roy. Stat. Soc. B36 (1974) 344-348.
[67] H. Derin and H. Elliott, Modeling and segmentation of noisy and textured images using Gibbs random fields, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1987) 39-55.
[68] B. B. Mandelbrot, The Fractal Geometry of Nature (Freeman, San Francisco, 1983).
[69] A. Pentland, Fractal-based description of natural scenes, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1984) 661-674.
[70] J. M. Keller, S. Chen and R. M. Crownover, Texture description and segmentation through fractal geometry, Comput. Vision Graph. Image Process. 45 (1989) 150-166.
[71] R. Voss, Random fractals: Characterization and measurement, in R. Pynn and A. Skjeltorp (eds.), Scaling Phenomena in Disordered Systems (Plenum, New York, 1986).
[72] B. J. Super and A. C. Bovik, Localized measurement of image fractal dimension using Gabor filters, J. Visual Commun. Image Represent. 2 (1991) 114-128.
[73] P. P. Ohanian and R. C. Dubes, Performance evaluation for four classes of textural features, Pattern Recogn. 25, no. 8 (1992) 819-833.
[74] M. A. Georgeson, Spatial Fourier analysis and human vision, Chapter 2, in N. S. Sutherland (ed.), Tutorial Essays in Psychology, A Guide to Recent Advances, Vol. 2 (Lawrence Erlbaum Associates, Hillsdale, NJ, 1979).
[75] J. Malik and P. Perona, Preattentive texture discrimination with early vision mechanisms, J. Opt. Soc. Am. Series A 7 (1990) 923-932.
[76] M. Unser and M. Eden, Nonlinear operators for improving texture segmentation based on features extracted by spatial filtering, IEEE Trans. Syst. Man Cybern. 20 (1990) 804-815.
[77] T. R. Reed and H. Wechsler, Segmentation of textured images and Gestalt organization using spatial/spatial-frequency representations, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 1-12.
[78] M. Tuceryan, Moment based texture segmentation, in Proc. 11th Int. Conf. on Pattern Recognition, The Hague, Netherlands, Aug. 1992, Vol. III, 45-48.
[79] J. M. Coggins and A. K. Jain, A spatial filtering approach to texture analysis, Pattern Recogn. Lett. 3 (1985) 195-203.
[80] J. G. Daugman, Two-dimensional spectral analysis of cortical receptive field profiles, Vision Res. 20 (1980) 847-856.
[81] M. R. Turner, Texture discrimination by Gabor functions, Biol. Cybern. 55 (1986) 71-82.
[82] M. Clark and A. C. Bovik, Texture segmentation using Gabor modulation/demodulation, Pattern Recogn. Lett. 6 (1987) 261-267.
[83] F. Farrokhnia, Multi-channel Filtering Techniques for Texture Segmentation and Surface Quality Inspection, Ph.D. thesis, Computer Science Department, Michigan State University, 1990.
[84] J. G. Daugman, Uncertainty relation for resolution in space, spatial-frequency, and orientation optimized by two-dimensional visual cortical filters, J. Opt. Soc. Am. 2 (1985) 1160-1169.
[85] Kie-Bum Eom and R. L. Kashyap, Texture and intensity edge detection with random field models, in Proc. Workshop on Computer Vision, Miami Beach, FL, 1987, 29-34.
[86] J. M. Du Buf, H. M. Kardan and M. Spann, Texture feature performance for image segmentation, Pattern Recogn. 23 (1990) 291-309.
[87] R. Chellappa, S. Chatterjee and R. Bagdazian, Texture synthesis and compression using Gaussian-Markov random field models, IEEE Trans. Syst. Man Cybern. 15 (1985) 298-303.
[88] H. O. Peitgen and D. Saupe, The Science of Fractal Images (Springer-Verlag, New York, 1988).
[89] N. Ahuja and A. Rosenfeld, Mosaic models for textures, IEEE Trans. Pattern Anal. Mach. Intell. 3 (1981) 1-11.
[90] N. Ahuja, Texture, in Encyclopedia of Artificial Intelligence (Wiley, 1987) 1101-1115.
[91] K. A. Stevens, Surface perception from local analysis of texture and contour, MIT Technical Report, Artificial Intelligence Laboratory, no. AI-TR 512, 1980.
[92] R. Bajcsy and L. Lieberman, Texture gradient as a depth cue, Comput. Graph. Image Process. 5 (1976) 52-67.
[93] A. P. Witkin, Recovering surface shape and orientation from texture, Artif. Intell. 17 (1981) 17-45.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 249-282. Eds. C. H. Chen, L. F. Pau and P. S. P. Wang. © 1998 World Scientific Publishing Company
CHAPTER 2.2

MODEL-BASED TEXTURE SEGMENTATION AND CLASSIFICATION
R. CHELLAPPA
Department of Electrical Engineering, Center for Automation Research and Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA

R. L. KASHYAP
School of Electrical Engineering, Purdue University, W. Lafayette, IN 47907, USA

and
B. S. MANJUNATH
Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106, USA

Over the last ten years, several model-based methods have been proposed for segmentation and classification of textured images. Models based on random field representations and psychophysical/neurophysiological studies have been dominant. In this chapter, we present examples drawn from both approaches. Related issues on implementation of the various optimal/suboptimal algorithms are also addressed.
Keywords: Texture segmentation, texture classification, artificial neural networks, Markov random fields, fractional differencing model, preattentive segmentation.
1. Introduction

Automatic segmentation and classification of textured images has several applications in landsat terrain classification [1], biomedical applications [2] and aerial image understanding [3]. Previous approaches to segmentation have been based on correlation [4], Fourier transform features [5], Laws features and their extensions [6,7], fractal models [8], and features from the co-occurrence matrix [9]. Recently, more emphasis has been given to methods using random field models such as the 2-D non-symmetric half plane models [10] and non-causal Gauss Markov random field models and their variations [11-18]. Both supervised and unsupervised methods have been developed. Although significant progress has been made using these methods, several problems remain, as the methods are sensitive to illumination and resolution changes and transformations such as rotation. Also, these methods do not explain the role of preattentive segmentation as applied to textures. Preattentive
segmentation refers to the ability of humans to perceive textures without any sustained attention. Central to solving this problem are the issues of what features need to be computed and what kind of processing of these features is required for texture discrimination. Some of the early work in this field can be attributed to Julesz [19] for his theory of textons as basic textural elements. The spatial filtering approach has been used by many researchers for detecting texture boundaries not clearly explained by the texton theory [20]. Recently an elegant computational model for preattentive texture discrimination has been proposed by Malik and Perona [21]. Grossberg and Mingolla's Boundary Contour System (BCS) [22] is one of the first attempts to model the early processing stages in the visual cortex.

Texture classification refers to the problem of identifying the particular class label of the input texture and can operate on the output of the segmentation algorithm. Thus, standard pattern classification techniques may be applied by assuming that there is only one texture in the image, the image being constructed from a single segmented region. Features for texture classification have been derived from a variety of approaches such as co-occurrence matrices [1,23], textural features [5,24,25], run-length statistics [5], difference statistics [5], decorrelation methods [26], the Fourier power spectrum [5], structural features [9,27], region based random fields [28-30], parametric Gaussian random field models [31-38], fractals and fractional models [36,39,40], etc. A major advantage of the features based on parametric Gaussian non-causal random field models is that they are information preserving in the sense that the features in conjunction with the discrete random field model can be used to synthesize an image which closely resembles the original. The chief disadvantage of the above model and all other related classification methods is that they are not rotation invariant, i.e. if we train the classifier with a set of texture data and test the classifier with a rotated version of the same image, the correct classification rate goes down. We give an approach for achieving rotational invariance in Section 4.

We illustrate the different approaches to texture segmentation and classification mentioned above using several deterministic and stochastic algorithms. The first method we describe in Section 2 stems from the idea of using Markov random field (MRF) models for texture in an image. We assign two random variables for the observed pixel, one characterizing the underlying intensity and the other for labeling the texture corresponding to the pixel location. We use the Gauss Markov Random Field (GMRF) model for the conditional density of the intensity field given the label field. Prior information about the texture label field is introduced using a discrete Markov distribution. The segmentation can then be formulated as an optimization problem involving minimization of a Gibbs energy function. Exhaustive search for the optimum solution is not possible because of the large dimensionality of the search space. For example, even for the very simple case of segmenting a 128 x 128 image into two classes, there are 2^(2^14) possible label configurations. Derin and Elliott [13] have investigated the use of dynamic programming for obtaining the Maximum a posteriori (MAP) estimate while Cohen and Cooper [11] give a deterministic relaxation algorithm for the same problem. The optimal MAP
&. 2 Model-Based Texture Segmentation and Classification 251
solution can be obtained by using stochastic relaxation algorithms like simulated annealing [41]. However, the computational burden involved, because of the theoretical requirements on the initial temperature and the impractical cooling schedules, outweighs their advantages in many cases. Recently there has been considerable interest in using neural networks for solving computationally hard problems. Fast approximate solutions can be obtained by using a deterministic relaxation algorithm like the iterated conditional mode rule [42]. The energy function corresponding to this optimality criterion can be mapped into a Hopfield type network in a straightforward manner and it can be shown that the network converges to an equilibrium state, which in general will be a local optimum. The solutions obtained using this method are sensitive to the initial configuration and in many cases starting with a maximum likelihood estimate is preferred. The second optimality criterion we discuss minimizes the expected percentage of classification error per pixel. This is equivalent to finding the pixel labels that maximize the marginal posterior probability given the intensity data [43]. Since calculating the marginal posterior probability is very difficult, Marroquin [44] suggested the Maximum Posterior Marginal (MPM) algorithm (see Section 4) that asymptotically computes the posterior marginal. Here we use this method to find the texture label that maximizes the marginal posterior probability for each pixel.

In Section 3 we discuss a simple biologically motivated approach to detect texture boundaries within a more general context of boundary detection [45]. The input image is first processed through a bank of orientation selective bandpass filters at various spatial frequencies. The convolution of the image with these filters yields a representation which is localized in space as well as in frequency. A special class of this decomposition is the wavelet transformation where the filter profiles are all self-similar. Wavelets are families of basis functions obtained through dilations and translations of a basic wavelet and such a decomposition provides a compact data structure for representing information. Following the wavelet decomposition we introduce local feature interactions. Three distinct types of interactions are considered: competition between spatial neighbors in each orientation channel, competition between orientations at each spatial location, and interscale interactions. Interscale interactions are used in localizing line ends and play an important role in boundary detection. The second stage of interactions groups similar features in the neighborhood. This cooperative processing helps in the boundary completion process. The receptive fields of the cells in this stage have the same orientation selectivity as their inputs and have a larger receptive field, and the filter profiles are modeled by oriented Gaussians.

In Section 4, we discuss direct pattern classification strategies for classifying textures, assuming that there is only one texture in the image or in the segment of the image. The strategy is to fit varieties of parametric random field models, extract features from them, and use these features for classification using both standard algorithms and new procedures. In Section 5, a multi-level classification
method based on fractional differencing models with a fractal scaling parameter is presented. This algorithm can handle arbitrary 3-D rotated textures. Since the fractal scale is known to be a rotation and scaling invariant parameter, the accuracy of classification from the procedure will not be affected by 3-D rotation of the test texture. In the first level of classification, the textures are classified by the first-order Fractional Differencing model with a fractal scale parameter, and in the second level, classification is completed with the additional frequency parameters of the second-order Fractional Differencing periodic model.

2. Texture Segmentation via Optimization and Artificial Neural Networks

The inherent parallelism of neural networks provides an interesting architecture for implementing many computer vision algorithms [46]. Some examples are image restoration [47], stereopsis [48] and computing optical flow [49-51]. Networks for solving combinatorially hard problems like the Traveling Salesman problem have received much attention in the neural network literature [52]. In all these cases the networks are designed to minimize an energy function defined by the network architecture. The parameters of the network are obtained in terms of the cost function which is to be minimized, and it can be shown [52] that for networks having symmetric interconnections, the equilibrium states correspond to the local minima of the energy function. For practical purposes, networks with few interconnections are preferred because of the large number of processing units required in any image processing application. In this context MRF models for images play a useful role. They are typically characterized by local dependencies and symmetric interconnections which can be expressed in terms of energy functions using the Gibbs-Markov equivalence. The artificial neural net (ANN) approach suggested here stems from the idea of using MRF models for textures in an image.

2.1. Markov Random Fields and the Image Model

In modeling images consisting of more than one texture we have to consider two random processes, one for the texture intensity distribution and the second for the label distribution. Various models have been proposed in the literature for textured images. In this section we discuss one such model based on Markov random fields. In most image processing applications the input image is a rectangular array of pixels taking values in the range 0-255. Let Ω denote such a set of grid points on an M x M lattice, i.e. Ω = { (i, j), 1 ≤ i, j ≤ M }. Let { Y_s, s ∈ Ω } be a random process defined on this grid.
Definition. The process { Y_s } is said to be strictly Markov if

P(Y_s | Y_r, r ≠ s) = P(Y_s | Y_r, r is a neighbor of s) .  (2.1)
The neighborhood set of site s can be arbitrarily defined. However in many image processing applications it is natural to consider neighbors which are also spatial
neighbors of the site. The Markov process can further be classified as causal or non-causal depending on the relationship of these neighbors with respect to the site. The use of MRF in image processing applications has a long history (see, e.g. [53]) and MRFs have been used in applications such as image restoration, segmentation, etc. Cross and Jain [54] provide a detailed discussion on the application of MRF in modeling textured images. In the following we use { L_s, s ∈ Ω } to denote the label process and { Y_s, s ∈ Ω } the zero mean intensity process.
Intensity Process. We model the intensity process { Y_s } by a Gaussian Markov random field (GMRF). Depending on the neighborhood set one can construct a hierarchy of GMRF models as shown in Fig. 1. The numbers indicate the order of the GMRF model relative to the center location x. Note that this defines a symmetric neighborhood set. We have used the fourth order model for the intensity process.
Fig. 1. Structure of the GMRF model. The numbers indicate the order of the model relative to x [54].
Let N_s denote the symmetric fourth order neighborhood of a site s, and let N* be the set of one-sided shift vectors corresponding to the fourth order neighborhood system. Then

N_s = { r : r = s + τ, τ ∈ N* } ,  (2.2)

where s + τ is defined componentwise: for s = (i, j) and τ = (x, y), s + τ = (i + x, j + y).
R. Chellappa, R. L. Kashyap & B. S. Manjunath
Assuming that all the neighbors of s also have the same label as that of s, the conditional density of the intensity at the pixel s is:
Equation (2.3) is a Gibbs distribution function, V ( . )is often referred to as a Gibbs measure and Z(Z1yr,r E N,) is called the partition function. In (2.4), cq and 0' are the GMRF model parameters of the I-th texture class. A stationary GMRF model implies that the parameters satisfy O;,, = @-;, = @:.-T = 0;. There are several ways of estimating the GMRF parameters and a comparison of different schemes can be found in [53]. We have used the least squares method in our experiments. We view the image intensity array as composed of a set of overlapping Ic x Ic windows W,, centered at each pixel s E 0. In each of these windows we assume that the texture label L , is homogeneous (all the pixels in the window belong to the same texture) and model the intensity distribution in the window by a fourth order stationary GMRF. Let Yz denote the 2-D vector representing the zero mean intensity array in the window W,. Using the Gibbs formulation and assuming a free boundary model, the joint probability density in the window W, can be written as:
where Z l ( l ) is the partition function and
Label Process. The texture labels are assumed t o obey a first or second order discrete Markov model with a single parameter p, which measures the amount of clustering between adjacent pixels. If N , denotes the appropriate neighborhood for the label field, then we can write the distribution function for the texture label at site s conditioned on the labels of the neighboring sites as: P(L,(L, , where
22
T
E
Ns)=
is a normalizing constant and
e-u2(Ls 2 2
I L-1
2.2 Model-Based Texture Segmentation and Classification 255 In (2.6), p determines the degree of clustering, and S ( i Using the Bayes rule, we can write
-
j ) is the Kronecker delta.
Since YB is known, the denominator in (2.7) is just a constant. The numerator is a product of two exponential functions and can be expressed as (2.8) where Zp is the partition function and Up(.) is the posterior energy corresponding to (2.7). From (2.5) and (2.6) we write
Up(Ls I Y : , L,, r E
is) = w(L,) + U l ( Y , *1 L,) + U2(L, 1 L,,
rE
fi,) .
(2.9)
Note that the second term in (2.9) relates the observed pixel intensities to the texture labels and the last term specifies the label distribution. The bias term w(L,) = log Z1(L,) is dependent on the texture class and it can be explicitly evaluated for the GMRF model considered here using the toroidal assumption (the computations become very cumbersome if toroidal assumptions are not made). An alternate approach is to estimate the bias from the histogram of the data as suggested by Geman and Graffigne [15]. Finally, the posterior distribution of the texture labels for the entire image given the intensity array is (2.10) Maximizing (2.10) gives the optimal Bayesian estimate. Though it is possible in principle to compute the right-hand side of (2.10) and find the global optimum, the computational burden involved is so enormous that it is practically impossible to do so. However we note that the stochastic relaxation algorithms discussed in Section 2.3 require only the computation of (2.8) to obtain the optimal solution. The deterministic relaxation algorithm given in the next section also uses these values, but in this case the solution is only an approximation to the MAP estimate. 2.2. A Neural Network for Texture Classification We describe the network architecture used for segmentation and the implementation of deterministic relaxation algorithms. The energy function which the network minimizes is obtained from the image model discussed in the previous section. For convenience of notationlet U l ( i , j , I ) = U 1 ( Y i , L s = I ) w(Z) where s = ( i , j ) denotes a pixel site and Ul( . ) and w(I) are as defined in (2.9). The network consists of K layers, each layer arranged as an M x M array, where K is the number of texture classes in the image and M is the dimension of the image. The elements (neurons) in the network are assumed to be binary and are indexed by ( i ,j , I ) where
+
256
R. Chellappa, R. L. Kashyap & B. S. Manjunath
( i , j ) = s refers to their position in the image and 2 refers to the layer. The ( 2 , j , 2)-th neuron is said to be ON if its output Kjl is 1, indicating that the corresponding site s = ( i , j ) in the image has the texture label 1. Let Tijl;i~j~ be p the connection strength between the neurons (i, j , 2) and (i’, j’, 2’) and Iiji be the input bias current. Then a general form for the energy of the network is [52] M
E=
M
K
M
-~~~~~
c M
K
M
M
K
c T 2 3 i , 2 ~ 3 ~ i-~JExx123iK31. K3i~~3~i~ (2.11) 2
2=13=1 1=1 2’=13’=1l’=l
2=1 3=11=1
We note that a solution for the MAP estimate can be obtained by minimizing (2.10). Here we approximate the posterior energy by
U ( L ) Y *= ) c(U(Y,’lLs)
+ W L S+ U2(LS)}
(2.12)
S
and the corresponding Gibbs energy to be minimized can be written as
E
1
=
M
M
;ccc c
K
K
M
M
2=1
3=1
5 ~ y ~ u l ( z , ~ , 2 ) -K 3 1 2=13=1 I=1
1=1
K’3‘1K31
(2.13)
(2?,3’)Efi,3
where f i i j is the neighborhood of site ( i , j ) (same as the f i s in Section 2). In (2.13), it is implicitly assumed that each pixel site has a unique label, i.e. only one neuron is active in each column of the network. This constraint can be implemented in different ways. For the deterministic relaxation algorithm described below, a simple method is t o use a winner-takes-all circuit for each column so that the neuron receiving the maximum input is turned on and the others are turned off. Alternately a penalty term can be introduced in (2.13) to represent the constraint as in [52]. From (2.11) and (2.13) we can identify the parameters for the network, (2.14)
and the bias current
I.. = - UI(i,j,Z). 2.7 1
(2.15)
2.2.1. Deterministic relaxation The above equations (2.14) and (2.15) relate the parameters of the network to that of the image model. The connection matrix for the above network is symmetric and there is no self feedback, i.e. Tijl;ijl= 0, Vi,j,1. Let uijl be the potential of neuron (i,j,Z). (Note that 1 is the layer number corresponding to texture class I), then M
M
K
(2.16)
2.2 Model-Based Texture Segmentation and Classajication 257 In order t o minimize (2.13), we use the following updating rule: (2.17)
This updating scheme ensures that a t each stage the energy decreases. Since the energy is bounded, the convergence of the above system is assured but the stable state will in general be a local optimum. This network model is a version of the Iterated Conditional Mode algorithm (ICM) of Besag [42]. This algorithm maximizes the conditional probability P ( L , = Z(Y:,L,j1 s' E fi,) during each iteration. It is a local deterministic relaxation algorithm that is very easy to implement. We observe that in general any algorithm based on MRF models can be easily mapped onto neural networks with local interconnections. The main advantage of this deterministic relaxation algorithm is its simplicity. Often the solutions are reasonably good and the algorithm usually converges within 20-30 iterations. In the next section we study two stochastic schemes which asymptotically converge t o the global optimum of the respective criterion functions. 2.3. Stochastic Algorithms for Texture Segmentation
We look at two optimal solutions corresponding to different decision rules for determining the labels. The first one uses simulated annealing to obtain the optimum MAP estimate of the label configuration. The second algorithm minimizes the expected misclassification per pixel. The parallel network implementation of these algorithms is discussed in Section 2.3.3. 2.3.1. Searching f o r MAP solution
The MAP rule [15] searches for the configuration L that maximizes the posterior probability distribution. This is equivalent to maximizing P ( Y * I L) P(L) as P ( Y * )is independent of the labels and Y *is known. The right-hand side of (2.10) is a Gibbs distribution. To maximize (2.10) we use simulated annealing [41], a combinatorial optimization method which is based on sampling from varying Gibbs distribution functions e - + J p ( L , I Y:,L?,7-€N3) ZTk
In order to maximize
e - w L I Y')
z
1
Tk being the time varying parameter, is referred to as the temperature. We used the following cooling schedule Tk =
10
1
+ log,
lc .
(2.18)
258
R. Chellappa, R. L. Kashyap €9 B. 5’. Manjunath
where lc is the iteration number. When the temperature is high, the bond between adjacent pixels is loose, and the distribution tends to behave like a uniform distribution over the possible texture labels. As Tk decreases, the distribution concentrates on the lower values of the energy function which correspond to points with higher probability. The process is bound to converge to a uniform distribution over the label configuration that corresponds to the MAP solution. Since the number of texture labels is finite, convergence of this algorithm follows from [41]. In our experiment, we realized that starting the iterations with To = 2 did not guarantee convergence to the MAP solution. Since starting at a much higher temperature will slow the convergence of the algorithm significantly, we use an alternative approach, viz., cycling the temperature [43]. We follow the annealing schedule till Tk reaches a lower bound then we reheat the system and start a new cooling process. By using only a few cycles, we obtained results better than those with a single cooling cycle. Parallel implementation of simulated annealing on the network is discussed in Section 2.3.3. The results we present in Section 2.4 were obtained with two cycles. 2.3.2. Maximizing the posterior marginal distribution
The choice of the objective function for optimal segmentation can significantly affect its result. The choice should be made depending on the purpose of the classification. In many implementations the most reasonable objective function is the one that minimizes the expected percentage misclassification per pixel. The solution to the above objective function is also the one that maximizes the marginal posterior distribution of L,, given the observation Y * ,for each pixel s.
P{L,
= I,
I Y *= y*}
cc
c
P ( Y * = y* I L = 1) P ( L = 1)
lJL,=l,
The summation above extends over all possible label configurations keeping the label at site s constant. This concept was thoroughly investigated in [44]. Marroquin [55] discusses this formulation in the context of image restoration, and illustrates the performance on images with few gray levels. The possibility of using this objective function for texture segmentation is also mentioned. In [42] the same objective function is mentioned in the context of image estimation. To find the optimal solution we use the stochastic algorithm suggested in [44]. The algorithm samples out of the posterior distribution of the texture labels given the intensity. Unlike the stochastic relaxation algorithm, samples are taken with a fixed temperature T = 1. The Markov chain associated with the sampling algorithm converges with probability one to the posterior distribution. We define new random variables g$t for each pixel (s E a):
where Lt is the class of the s pixel, at time t , in the state vector of the Markov chain associated with the Gibbs sampler. The ergodic property of the Markov chain
2.2 Model-Based Texture Segmentation and Classification 259 [56] is used to calculate the expectations for these random variables using time averaging:
where N is the number of iterations performed. To obtain the optimal class for each pixel, we simply chose the class that occurred more often than the others. The MPM algorithm was implemented using the Gibbs sampler [41]. A much wider set of sampling algorithms such as Metropolis can be used for this purpose. The algorithms can be implemented sequentially or in parallel, with a deterministic or stochastic decision rule for the order of visiting the pixels. In order to avoid dependence on the initial state of the Markov chain, we can ignore the first few iterations. In the experiments conducted we obtained good results after five hundred iterations. The algorithm does not suffer from the drawbacks of simulated annealing. For instance we do not have to start the iterations with a high temperature to avoid local minima and the performance is not severely affected by enlarging the state space. 2.3.3. Network implementation of the sampling algorithms All the stochastic algorithms described in the Gibbs formulation are based on sampling from a probability distribution. The probability distribution is constant in the MPM algorithm [44] and is time varying in the case of annealing. The need for parallel implementation is due to the heavy computational load associated with their use. We now describe how these stochastic algorithms can be implemented on the network discussed in Section 2.2. The only modification required for the simulated annealing rule is that the neurons in the network fire according to a time dependent probabilistic rule. Using the same notation as in section 3, the probability that neuron (i, j,I ) will fire during iteration k is
where uijl is as defined in (2.16) and T k follows the cooling schedule (2.18). The MPM algorithm uses the above selection rule with T k = 1. In addition, each neuron in the network has a counter which is incremented every time the neuron fires. When the iterations are terminated the neuron in each column of the network having the maximum count is selected to represent the label for the corresponding pixel site in the image. 2.4. Experimental Results
The segmentation results using the above algorithms are given on two examples. The parameters ~1 and 01 corresponding to the fourth order GMRF for each texture
260
R. Chellappa, R. L. Kashyap & B. S. M a n j u n a t h
class were pre-computed from 64 x 64 images of the textures. The local mean (in an 11 x 11 window) was first subtracted to obtain the zero mean texture and the least square estimates [53] of the parameters were then computed from the interior of the image. The parameter values for the different textures used in our experiments is given in Table 1. Table 1. GMRF texture parameters.
$1
e2
e3 e4 e5 06
e7 08
e9 $10
2
calf
grass
pigskin
sand
wool
wood
0.5689 0.2135 -0.1287 -0.0574 -0.1403 -0.0063 -0.0052 -0.0153 0.0467 0.0190 217.08
0.5667 0.3780 -0.2047 -0.1920 -0.1368 -0.0387 0.0158 0.0075 0.0505 0.0496 474.72
0.3795 0.4528 -0.1117 -0.1548 -0.0566 -0.0494 -0.0037 0.0098 0.0086 0.0233
0.5341 0.4135 -0.1831 -0.2050 -0.1229 -0.0432 0.0120 0.0111 0.0362 0.0442
0.4341 0.2182 -0.0980 -0.0006 -0.0836 0.0592 -0.0302 -0.0407 0.0406 -0.0001
79.33
91.44
126.22
0.5508 0.2498 -0.1164 -0.1405 -0.0517 0.0139 -0.0085 -0.0058 -0.0008 0.0091 14.44
The first step in the segmentation process involves computing the Gibbs energies U_1(Y*_s | L_s) in (2.5). This is done for each texture class and the results are stored. For computational convenience these U_1(·) values are normalized by dividing by k², where k is the size of the window. To ignore the boundary effects, we set U_1 = 0 at the boundaries. We have experimented with different window sizes; larger windows result in more homogeneous texture patches but the boundaries between the textures are distorted. The results reported here are based on windows of size 11 x 11 pixels. We obtained w(l_s) by trial and error. The choice of β plays an important role in the segmentation process and its value depends on the magnitude of the energy function U_1(·). Various values of β ranging from 0.2-3.0 were used in the experiments. In the deterministic algorithm it is preferable to start with a small β and increase it gradually. Large values of β usually degrade the performance. We also observed that slowly increasing β during the iterations improves the results for the stochastic algorithms. It should be noted that using a larger value of β for the deterministic algorithm (compared to those used in the stochastic algorithms) does not improve the performance. The nature of the segmentation results depends on the order of the label model. It is preferable to choose the first order model for the stochastic algorithms if we know a priori that the boundaries are either horizontal or vertical. However, for the deterministic rule and the learning scheme the second order model results in more homogeneous classification. The MPM algorithm requires the statistics obtained from the invariant measure of the Markov chain corresponding to the sampling algorithm. Hence it is preferable
to ignore the first few hundred trials before starting to gather the statistics. The performance of the deterministic relaxation rule of Section 2.2 also depends on the initial state and we have looked into two different initial conditions. The first one starts with a label configuration L such that L_s = l if U_1(Y*_s | l) = min_{l'} { U_1(Y*_s | l') }. This corresponds to maximizing the probability P(Y* | L) [12]. The second choice for the initial configuration is a randomly generated label set. Results for both cases are provided and we observe that the random choice often leads to better results.
Example 1. This is a 256 x 256 image (Fig. 2(a)) having six textures: calf, grass, wool, wood, pigskin and sand. This is a difficult problem in the sense that three of the textures (wool, pigskin and sand) have almost identical characteristics and are not easily distinguishable even by the human eye. The ICM result obtained with the maximum likelihood estimate (MLE) as the initial condition is in Fig. 2(b). The MAP solution using simulated annealing is shown in Fig. 2(c). As mentioned before, cycling of temperature improves the performance of simulated annealing. The segmentation result was obtained by starting with an initial temperature T_0 = 2.0 and cooling according to the schedule (2.18) for 300 iterations. Then the system was reset to T_0 = 1.5 and the process was repeated for 300 more iterations. In the case of the MPM rule the first 500 iterations were ignored and Fig. 2(d) shows the result obtained using the last two hundred iterations. As in the previous example the best results were obtained by the simulated annealing and MPM algorithms. For the MPM case there were no misclassifications within homogeneous regions but the boundaries were not accurate and in fact, as indicated in Table 2, simulated annealing has the lowest percentage error in classification.

Table 2. Percentage misclassification for Example 1 (six class problem).
Algorithm                                  Percentage Error
Maximum Likelihood Estimate                22.17
Neural network (MLE as initial state)      16.25
Neural network (Random initial state)      14.74
Simulated annealing (MAP)                   6.72
MPM algorithm                               7.05
3. Preattentive Segmentation

In this section we discuss a simple biologically motivated approach to detect texture boundaries within a more general context of boundary detection [45]. Previous approaches to this problem are discussed in [21,22]. In [21], Malik and Perona propose a three stage model involving convolution with even symmetric filters followed by half wave rectification, local inhibition, and texture boundary detection using odd symmetric filters. The BCS processes the intensity data and performs
262
R. Chellappa, R. L. Kashyap & B. S. M a n j u n a t h
Fig. 2. Texture segmentation results for a six class problem. (a) original image. Segmentation results using ICM, MAP and MPM are given in (b)-(d), respectively.
preattentive segmentation of the scene. The first stage of the BCS consists of oriented contrast filters at various scales and orientations and extracts the contrast information from the scene. The outputs of the filters are then fed to a two-stage competitive network whose main goal is to generate end-cuts. Subsequent long range cooperative interactions and a positive feedback to the competitive stage help in boundary completion. The boundary detection takes place independently in different spatial channels. The input image is first processed through a bank of orientation selective bandpass filters at various spatial frequencies. The convolution of the image with these filters yields a representation which is localized in space as well as in frequency. We then introduce three distinct types of local feature interactions for consideration: competition between spatial neighbors in each orientation channel, competition between orientations at each spatial location, and interscale interactions. Interscale interactions are used in localizing line ends and play an important role in boundary
2.2 Model-Based Texture Segmentation and Classification 263 detection. The second stage of interactions groups similar features in the neighborhood. This cooperative processing helps in the boundary completion process. The final step involves identifying image boundaries.
3.1. Gabor Functions and Wavelets

Gabor functions are Gaussians modulated by complex sinusoids. Consider a wavelet transform where the basic wavelet is a Gabor function of the form

g_λ(x, y, θ) = exp( -(λ²x′² + y′²) + iπx′ )
x′ = x cos θ + y sin θ
y′ = -x sin θ + y cos θ
where X is the spatial aspect ratio and 0 is the preferred orientation. To simplify the notation, we drop the subscript X and unless otherwise stated assume that X = 1. For practical applications, discretization of the parameters is necessary. The discretized parameters must cover the entire frequency spectrum of interest. Let the orientation range [0,7r]be discretized into N intervals and the scale parameter a be sampled exponentially as 02. This results in the wavelet family
where 81, = k.rr/N. The Gabor wavelet transform is then defined by
3.2. Local Spatial Interactions

Following feature extraction using Gabor wavelets, we now consider local competitive and cooperative processing of these features. Competitive interactions help in noise suppression, and in reducing the effects of illumination. These interactions are modeled by non-linear lateral inhibition between features. Two types of such interactions are considered. The first type includes competition between spatial neighbors within each orientation and scale. The second type involves competition between different orientations at each spatial position. For simplicity the transfer function g(x) of all feature detectors is assumed to be the same. The following notation is used in explaining the interactions: The output of a cell at position s = (x, y) in the ith spatial frequency channel with a preferred orientation θ is denoted by y_i(s, θ), with I_i(s, θ) being the excitatory input to that cell from the previous processing stage. For example, I_i(s, θ) could be the energy in the filter output corresponding to feature (s, θ) in the ith frequency channel. For convenience we will drop the subscript i indicating the frequency channel whenever there is no ambiguity. Let N_s be the local spatial neighborhood of s. The
competitive dynamics is represented by:
where (a, b, c) are positive constants. In our experiments we have used a sigmoid non-linearity of the form h(x) = 1/(1 + exp(−βx)). The dynamics of (3.4) can be visualized as follows: At each location within a single frequency channel, the corresponding cell receives an excitatory input from a similarly oriented feature detector (of the same spatial frequency). Further it also receives inhibitory signals from the neighboring cells within the same channel. We assume that all these interactions are symmetric (b_{s,s′} = b_{s′,s} and c_{θ,θ′} = c_{θ′,θ}). The competitive dynamics of the above system can be shown to be stable. The Lyapunov function for the system [52] can be written as
Under the assumptions that the interactive synapses are symmetric and that g(·) is monotone non-decreasing, the time derivative of E is negative and the system represented by (3.4) always converges.

3.3. Local Scale Interactions

We now suggest a simple mechanism to model the end-inhibition property of hypercomplex cells. Hypercomplex cells in the visual cortex differ from simple and complex cells in that they respond to small lines and line endings [57]. For this the hypercomplex cell receptive field must have inhibitory end zones along the preferred orientation. Such a profile can be generated either by modifying the profile of the simple cell itself or through interscale interactions, discussed below. The fact that both simple and complex cells often exhibit this end-stopping behavior further suggests that both these mechanisms are utilized in the visual cortex. If Q_ij(x, y, θ) denotes the output of an end-inhibited cell at position (x, y) receiving inputs from two frequency channels i and j (α_i < α_j) with preferred orientation θ, then

    Q_ij(x, y, θ) = h(||W_i(x, y, θ) − γ W_j(x, y, θ)||)    (3.7)

where γ is a normalizing factor. The logic behind this is simple. At line ends, cells with shorter receptive fields will have a stronger response than those with larger fields, and consequently will be able to excite the hypercomplex cells. At other points along the line, both small and large receptive field cells are equally
excited and in the process the response of the hypercomplex cells is inhibited. It appears that such scale interactions to generate end inhibition do exist in the visual cortex. Bolz and Gilbert [58] observe that connections between layers 6 and 4 in the cat striate cortex play a role in generating end inhibition. The cells in layer 4 are of hypercomplex type exhibiting end inhibition. Layer 6 cells have large receptive fields and require long bars (or lines) to activate them. In addition, cells in both layers show orientation selectivity. Inactivating layer 6 cells resulted in the loss of end-inhibition property of layer 4 cells, while preserving other properties such as orientation selectivity. Thus, in the absence of layer 6 activity, cells in layer 4 could be excited by short bars and their response did not decrease as the bar lengths increased, suggesting that layer 6 cells have an inhibitory effect on the cells of layer 4.

3.4. Grouping and Boundary Detection
The final stage involves grouping similar orientations. The grouping process receives inputs both from the competitive stage (3.4) and from the end detectors (hypercomplex cells) described in Section 3.3. Note that the orientation of the activating end-detector is orthogonal to the actual orientation of the grouping process. This incorporates the observation made in [59,60] that hypercomplex cells are responsible for detecting illusory contours. Abrupt line endings signal an occluding boundary almost orthogonal to the edge orientation, and this is represented by these end-inhibited cells providing input to the grouping process nearly orthogonal in their orientation preference. If Z_i(s, θ) represents the output of this process, then
    Z_i(s, θ) = h( ∫ d_i(s − s′, θ)[y_i(s′, θ) + Q_ij(s′, θ′)] ds′ ).    (3.8)
d_i(s, θ) represents the receptive field of Z_i(s, θ) and in our experiments we have used
    d(s = (x, y), θ) = exp(−(2σ²)⁻¹[λ²(x cos θ + y sin θ)² + (−x sin θ + y cos θ)²])    (3.9)
where θ is the preferred orientation, θ′ is the corresponding orthogonal direction, and λ is the aspect ratio of the Gaussian. The Z cells thus integrate the information from similarly oriented cells within each frequency channel and from hypercomplex cells of appropriate orientation, and thus help in grouping the features and in boundary completion. Since the various frequency channels are sampled, the effective standard deviation of the Gaussian is σ/α_i, where α_i is the scale parameter for channel i. To summarize, this approach consists of three distinct steps: (a) feature detection using Gabor wavelets, (b) local interactions between features and (c) scale interactions to generate end inhibition. The output Z(·) from different frequency channels is now used to detect edges and texture boundaries.
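As an illustration of step (a), the following sketch (Python/NumPy with SciPy for the convolution; the function names, kernel size and the use of the filter-output magnitude as the feature energy are our own assumptions, not details taken from the original implementation) builds a bank of Gabor filters of the form (3.1) for orientations θ_k = kπ/N and a set of scales α_i, and convolves an image with them to obtain feature maps playing the role of W_i(x, y, θ_k).

    import numpy as np
    from scipy.signal import fftconvolve

    def gabor_kernel(alpha, theta, aspect=1.0, size=33):
        # Gabor function of (3.1): exp(-(aspect^2 x'^2 + y'^2) + i*pi*x'),
        # evaluated on a grid whose coordinates are scaled by the channel scale alpha.
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
        xs, ys = alpha * x, alpha * y
        xp = xs * np.cos(theta) + ys * np.sin(theta)    # rotated coordinates
        yp = -xs * np.sin(theta) + ys * np.cos(theta)
        return np.exp(-((aspect * xp) ** 2 + yp ** 2) + 1j * np.pi * xp)

    def gabor_features(image, alphas, n_orient=4, aspect=0.5):
        # feats[(i, k)] = |image convolved with g at scale alpha_i and orientation theta_k|
        feats = {}
        for i, alpha in enumerate(alphas):
            for k in range(n_orient):
                theta = k * np.pi / n_orient
                g = gabor_kernel(alpha, theta, aspect)
                feats[(i, k)] = np.abs(fftconvolve(image, g, mode="same"))
        return feats

    # example: feats = gabor_features(img, alphas=[1/np.sqrt(2), 1/2, 1/4], n_orient=4)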
3.5. Experiments

The performance of our approach is illustrated on several images. The following parameter values were used in our experiments described here: β = 4.0 in the
transfer function g(·). The strengths of the inhibitory synapses in (3.4) are b_{s,s′} = 1/||N_s|| and c = 1/N, where ||N_s|| is the cardinality of the neighborhood set and N is the number of discrete orientations used. Unless otherwise stated, N = 4 and N_s consists of the four nearest neighbors of s. The aspect ratio of the Gaussian in both the Gabor wavelets (3.1), and in the receptive field of Z cells (3.8) is set to 0.5. If more than one channel is mentioned then the result shown is a superposition of the boundaries detected in the individual channels. Regarding implementing the dynamics of competition, we used a simple gradient descent on the corresponding energy function (3.6) instead of solving the set of differential equations. The equilibrium points in general for these two methods will be different, but gradient descent on E in (3.6) will be much faster (typically it takes less than 50 iterations to converge on a 256 x 256 image).
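Since the dynamical equations (3.4) and the energy (3.6) are not reproduced here, the competitive stage can only be sketched from its verbal description: each unit is excited by its own feature detector and inhibited by its four spatial neighbors in the same orientation channel and by the other orientations at the same location, through a sigmoid transfer function. The rough Python sketch below iterates such an update; the constant a, the wrap-around boundary handling and the fixed iteration count are illustrative choices, not the authors' implementation.

    import numpy as np

    def sigmoid(x, beta=4.0):
        # transfer function of the form 1 / (1 + exp(-beta * x))
        return 1.0 / (1.0 + np.exp(-beta * x))

    def competitive_stage(I, a=1.0, b=0.25, c=0.25, n_iter=50):
        # I has shape (n_orient, H, W): excitatory inputs I(s, theta) for one channel
        y = sigmoid(a * I)
        for _ in range(n_iter):
            # inhibition from the 4 nearest spatial neighbours, same orientation
            nb = (np.roll(y, 1, axis=1) + np.roll(y, -1, axis=1) +
                  np.roll(y, 1, axis=2) + np.roll(y, -1, axis=2))
            # inhibition from the other orientations at the same location
            other = y.sum(axis=0, keepdims=True) - y
            y = sigmoid(a * I - b * nb - c * other)
        return y

Here b = 1/||N_s|| = 1/4 and c = 1/N with N = 4 follow the parameter choices quoted above.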
Example 2 (Intensity edges). Figure 3 shows two examples of edge detection using the energy measures. Figures 3(a) and (c) show the original 256 x 256 images. The edges shown in Figure 3(b) are detected in channels α_i = {1/√2, 1/2} and in (d) they correspond to the channel α_i = 1/√2. In both cases σ is set to 1.

Example 3. Figure 4 shows the boundaries detected in an aerial image consisting of four textures, grass, water, wood and raffia. The wood texture is present at two regions at different orientations. The parameter values used are α_i = {1/2, 1/(2√2), 1/4} and σ = 5.0.

Example 4. Figure 5 shows the results on a synthetic texture which is often used in psychophysical experiments. The boundary between Ls and Ts is not easily perceived whereas that between straight and oriented Ts clearly stands out. This boundary can be easily detected in almost all frequency channels and the parameter values used are the same as in the previous example.

Example 5. This example illustrates the importance of end inhibition in texture boundary detection. Figure 6 shows another commonly used texture consisting of randomly oriented Ls and +s. Unlike the previous example, orientation information cannot be used for segmentation. The line segments forming the Ls and +s have the same length (seven pixels). The two regions differ in the distribution of corners, line-ends and intersections. As we discussed in Sec. 3.3, scale interactions play an important role in detecting these features. None of the scales by themselves contain enough information to segment the two regions, but using these interscale interactions the boundary between the Ls and +s can be detected (Fig. 6(b)). The boundary shown is for the case of using the interactions between scales corresponding to {1/2, 1/4} with σ = 16.

Example 6 (Illusory contours). The usefulness of scale interactions in detecting line endings and their subsequent grouping to detect illusory contours is illustrated in Figure 7. For the line (Fig. 7(d)) and sine wave (Fig. 7(e)) contours the results shown are for α_i = {1/2, 1/4}, σ = 8. For the circle (Fig. 7(f)) α_i = {1/√2, 1/2} and σ = 2.
Fig. 3. (a) and (c) show two 256 x 256 images and the corresponding edges detected are shown in (b) and (d). In (b) the edges are from two channels α_i = {1/√2, 1/2} and in (d) α_i = 1/√2. For both examples σ = 1.
4. Rotational Invariant Texture Classification
We discuss direct pattern classification strategies for classifying textures, assuming that there is only one texture in the image. The strategy is to fit varieties of parametric random field models, extract features from them, and use these features for classification using both standard algorithms and new procedures.
4.1. Rotation Invariant Non-Causal AR Model

The model used here is a modified version of a second-order non-causal autoregressive (NCAR) model, where the nearest eight neighbors are interpolated on a circle according to the following formula [61]:
Fig. 6. Texture consisting of randomly oriented Ls and +s. The boundary shown in (b) is detected using the output of the scale interactions with σ = 16. The scales used in this example are α_i = {1/2, 1/4}, and figures (c) and (d) show the result of convolution and (e) shows the output after the interactions.
The parameters α and β can be estimated by a least-squares technique and the estimates α̂, β̂ can be used as discriminating features. β̂ can be interpreted as a measure of the roughness of the texture. Classification experiments performed using only these two features showed that there is room for improvement. Textures like wood have a strong degree of directionality, not captured by α̂ or β̂. A feature which measures the degree of directionality can be obtained by fitting to the image two different simultaneous autoregressive
Fig. 7. Some examples of illusory contours formed by line terminations (a), (b), and (c), and the corresponding detected contours (d), (e) and (f).
(SAR) models [31,62,61], having the following form:

    y(s) = Σ_{r∈N} θ_r y(s + r) + √ρ w(s)

where θ_r = θ_{−r} and w(s) is a (0,1) independent and identically distributed (IID) sequence. N is a neighbor set excluding (0,0). Let us first choose the neighbor set N to consist of the four nearest neighbors, namely N_a = {(0,1), (0,−1), (1,0), (−1,0)}. Let θ̂_{(0,1)} and θ̂_{(1,0)} be the ML estimates of θ_{(0,1)} and θ_{(1,0)}. Next, let us fit another SAR model with neighbor set N_b having the four nearest diagonal members, namely
N_b = {(1,1), (1,−1), (−1,1), (−1,−1)}. Let θ̂_{(1,1)} and θ̂_{(1,−1)} be the ML estimates of the corresponding parameters. Consider the feature ζ defined as
ζ measures the extent of variation in the orthogonal directions. For a texture having strong directionality like wood or straw, ζ will be very large. For a directionless texture like sand, it will be very small. Thus, our feature set is (α̂, β̂, ζ̂). A supervised recognition approach is used for the classification of textures. The inputs to the system are the digitized images from one of the m texture classes. The images are separated into test and training sets. The class of textures in the training set is known a priori. In the feature selection stage, α̂, β̂ and ζ̂ are extracted from the processed images. The class parameterization phase computes the sample mean and standard deviation of each category training feature. The classifier is a distance classifier which measures a weighted distance between the features of the test image denoted by X(t) = [α̂(t), β̂(t), ζ̂(t)] and the mean feature of each of the m classes. The texture is then classified to the class C_{i*} for which this distance is minimum, i.e.
    i* = min_i d(X(t), i),   i = 1, …, m,
where
and x̄(i) and [σ²](i) correspond to the sample mean and variance of the class (i) feature, obtained from the training set, respectively.

4.2. Experiments
Twelve different textures, namely, calf leather (D24), wool (D19), beach sand (D29), pigskin (D92), plastic bubbles (D112), herringbone weave (D17), raffia (D84), wood grain (D68), grass (D9), straw (D15), brick wall (D95) and bark of tree (D12) were chosen from the photo album by Brodatz [63]. This selection includes both macrotextures (e.g. brick wall) and microtextures (e.g. sand). Seven rotated 512 x 512, 8-bit (0-255) digitized images with relative angles of rotation of 0, 30, 60, 90, 120, 150, and 200 degrees are taken from each class of texture. Each 512 x 512 image was first reduced to a 128 x 128 one by averaging every 4 x 4 window into a single pixel. Each 128 x 128 image is then segmented into four 64 x 64 images. Thus the database has 28 images of size 64 x 64 from each texture. One 64 x 64 digitized window of 0 degree orientation of each texture is shown in Fig. 8. Figure 9 shows a 64 x 64 sample of raffia texture for all of the seven orientations.
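The reduction and sub-sampling just described are simple to reproduce; the sketch below (Python/NumPy; the gray-scale normalization mentioned below is approximated here by a plain zero-mean, unit-variance normalization, which is an assumption on our part) turns one 512 x 512 texture image into the four normalized 64 x 64 samples used in the database.

    import numpy as np

    def prepare_samples(image512):
        # 512x512 -> 128x128 by averaging each 4x4 window into a single pixel
        img = np.asarray(image512, dtype=float)
        small = img.reshape(128, 4, 128, 4).mean(axis=(1, 3))
        samples = []
        for i in (0, 64):
            for j in (0, 64):
                s = small[i:i + 64, j:j + 64]                # one 64x64 sub-image
                samples.append((s - s.mean()) / s.std())     # zero mean, unit variance
        return samples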
Fig. 8. A 64 x 64, 0 degree digitized sample of each texture of the database. From left to right, first row: calf leather, wool, sand; second row: pigskin, plastic bubbles, herringbone weave; third row: raffia, wood, grass; fourth row: straw, brick wall, bark of tree.
Fig. 9. A 64 x 64 digitized sample of each of the seven orientations of raffia texture. From top and from left to right: 0, 30, 60, 90, 120, 150, and 200 degrees.
To remove the variability in the image caused by illumination or quantization schemes, all the 64 x 64 images were first subjected to a gray scale normalization procedure and then normalized so that each image has zero empirical mean and unit empirical variance. To illustrate the discriminating power of each individual feature, a range plot is presented for each of them in Fig. 10. The classes are first ordered for each feature according to the mean value of the respective features. Then the actual range of values that each feature takes for each category is plotted. Underneath
Fig. 10. (a) Range plots of α̂. (b) Range plots of β̂. x: mean value; o: range extrema.
each range plot the distance corresponding to two empirical standard deviations of the feature for that texture is also given. These range plots indicate how packed each respective feature is. The amount of vertical overlap between category range plots is an indication of the classification power of the corresponding feature. The less such an overlap, the better the feature. On the average, the α̂ values of each texture class overlap with the α̂ values of four other classes. Raffia texture is an exception with a very distinct α̂. Herringbone and leather textures have very distinct β̂ features while the β̂ of the rest of the categories overlap with an average of three other classes. Note that the mean value of β̂ for highly circularly nonsymmetric textures like herringbone is much higher than the mean value for plastic bubbles which has more of a circular symmetry property. Several experiments were carried out [64] and only one is described here. Recall that we have 28 images with 7 orientations for each texture. The classifier for each texture is trained on samples from the images of three orientations (i.e. 12 images) and the classifier is tested using all other images (i.e. 16 images of that texture and 28 of all other textures). Thus the classifier has not "seen" the orientations it encounters in the test phase. The results are presented in Table 3. A total of ten experiments were carried out and the average classification accuracy obtained is 89 percent.
Table 3. Classification results for 12 classes in the database using the (α̂, β̂, ζ̂) feature vector. In each experiment the available 28 samples from each class are divided into 12 training and 16 test samples.

Angle of rotation (degrees)
Training samples      Testing samples            Classification accuracy rate
0, 30, 60             90, 120, 150, 200          87%
30, 60, 90            0, 120, 150, 200           88%
60, 90, 120           0, 30, 150, 200            88%
90, 120, 150          0, 30, 60, 200             89%
120, 150, 200         0, 30, 60, 90              86%
0, 60, 120            30, 90, 150, 200           91%
30, 90, 150           0, 60, 120, 200            90%
0, 90, 200            30, 60, 120, 150           90%
0, 150, 200           30, 60, 90, 120            91%
30, 150, 200          0, 60, 90, 120             90%

AVERAGE                                          89%
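A minimal sketch of the weighted minimum-distance rule described above is given below (Python/NumPy). The routine that actually extracts the (α̂, β̂, ζ̂) features from an image is assumed to exist and is not shown, and the specific weighted distance used here, squared deviation divided by the per-class feature variance, is one plausible reading of the description rather than a formula taken from the text.

    import numpy as np

    def train(features_by_class):
        # features_by_class: dict class name -> array of shape (n_samples, 3)
        # holding (alpha, beta, zeta) feature vectors from the training images
        return {c: (f.mean(axis=0), f.var(axis=0)) for c, f in features_by_class.items()}

    def classify(x, model):
        # assign x to the class minimizing sum_j (x_j - mean_j)^2 / var_j
        best, best_d = None, np.inf
        for c, (mu, var) in model.items():
            d = np.sum((np.asarray(x) - mu) ** 2 / var)
            if d < best_d:
                best, best_d = c, d
        return best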
5. Classification Using Fractional Models
A multi-level classification method which can handle arbitrary 3-D rotated samples of textures is developed based on fractional differencing models with a fractal scaling parameter. In the first level of classification, the textures are classified by the first-order Fractional Differencing model with a fractal scale parameter, and in the second level, classification is completed with the additional frequency
parameters of the second-order Fractional Differencing periodic model. This multilevel classification scheme has several advantages over the conventional approaches [31-39].
5.1. Fractional Difference Models (FDM)

The Fractional Difference model in one dimension is the discrete version of the continuous fractional Brownian motion process (FBM) introduced by Mandelbrot and Van Ness [65]. The FBM differs from the GMRF models introduced earlier in two respects, namely (i) its correlation function decays with lag much slower than that in the parametric GMRF models, and (ii) it has considerable power at low frequencies and can account for large periodicities unlike the GMRF models which have little power at low frequencies. The FDM possesses both these properties possessed by FBM. In many images, widely separated image pixels seem to display relatively high degrees of correlation. We will first generalize the first-order FDM given in [65] for two dimensions as follows [66]:

    y(m1, m2) = [(1 − z1⁻¹)(1 − z2⁻¹)]^(−c/2) ξ(m1, m2),   m1, m2 = 0, 1, …, N − 1    (5.1)
where z1, z2 are the unit lead variables in the two dimensions, c is the fractional parameter, 0 < c < 1, and ξ(m1, m2) is a two-dimensional independent, identically distributed sequence of random variables with zero mean and finite variance ρ. The above model is stationary even though it has a zero on the unit circle. By taking the factor (1 − z1⁻¹)^(−c/2)(1 − z2⁻¹)^(−c/2) to the left-hand side and expanding it in an infinite power series, one can interpret the above model as an infinite order (non-Markov) autoregressive model in two dimensions. The above model has only two "tunable" parameters, c and ρ. Sometimes they are not enough to provide the desired level of classification. Then we go to the second-order FDM given below [66]:

    y(m1, m2) = [(1 − 2z1⁻¹ cos ω1 + z1⁻²)(1 − 2z2⁻¹ cos ω2 + z2⁻²)]^(−c/2) ξ(m1, m2).    (5.2)
(cos ω1) and (cos ω2) are the two additional parameters. In addition, we can choose different scaling parameters c1, c2 in the two directions instead of one parameter c. The corresponding DFTs of these functions are

    Y(k1, k2) = [(1 − e^(−j2πk1/N))(1 − e^(−j2πk2/N))]^(−c/2) W(k1, k2),

and

    Y(k1, k2) = [(1 − 2 cos ω1 e^(−j2πk1/N) + e^(−j4πk1/N))(1 − 2 cos ω2 e^(−j2πk2/N) + e^(−j4πk2/N))]^(−c/2) W(k1, k2),

where z_i is the delay operator associated with m_i, ξ(m1, m2) is an IID Gaussian sequence, and W(k1, k2) is the corresponding DFT. A key property of both these models is that the structure of their DFT given above and the associated parameters are unaltered even if the images are rotated,
tilted and slanted. The details are in [67]. For any given image, the parameters c and ρ in the first-order model and the parameters c, ω1, ω2, ρ in the second-order model can be estimated [67,68].

5.2. Multi-level 3-D Rotation Invariant Classification Scheme

For this classification scheme, the images are separated into test and training sets. The class of textures and the number of classes in the training set is assumed to be known a priori. In the first level, the different 3-D rotated texture images are classified into M different classes depending on their estimated values of the fractal scale. Actual classification is achieved by applying a distance classifier d(c, i), which measures a weighted distance between the extracted feature of the test image denoted by ĉ and the mean feature of each of the M classes. Then the texture is classified to class A_i for which such a distance is minimum. That is,
    i* = min_i [ĉ − c̄_i]²,   i = 1, …, M
where c̄_i corresponds to the sample mean of feature c in class A_i. The class A_i can consist of several different texture classes since several different textures share the same fractal scale (the roughness of the surface). This means that the fractal scale alone is not enough to distinguish the different textures. Thus, we need an additional classification scheme to distinguish textures contained in the same class A_i. In the second level, the textures which were already classified to the same class in the first level are split into different subclasses, based on the values of the pattern features ω1, ω2 in the second-order fractional differencing periodic function (5.2).
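A rough sketch of this two-level scheme follows (Python/NumPy). The per-class statistics are assumed to be precomputed, as in Tables 4 and 5; the feature extraction for ĉ, ω̂1, ω̂2 is not shown, and the variance-weighted distance at the second level is our own assumption about the form of the subclass comparison.

    import numpy as np

    def two_level_classify(c_hat, w_hat, level1, level2):
        # level1: list of (class_id, class mean of c) pairs
        # level2: dict class_id -> dict texture name -> (mean, variance) of (omega1, omega2)
        class_id = min(level1, key=lambda kc: (c_hat - kc[1]) ** 2)[0]   # level 1
        best, best_d = None, np.inf
        for name, (mu, var) in level2[class_id].items():                 # level 2
            d = np.sum((np.asarray(w_hat) - np.asarray(mu)) ** 2 / np.asarray(var))
            if d < best_d:
                best, best_d = name, d
        return best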
Here x̄(k) and [σ²](k) denote the sample mean and variance of the subclass (k) features, respectively. It should be noticed that since we have at most several subclasses within a first-level class, we need to compare only a small number of subclasses to complete the classification, instead of checking the feature distances of all other texture classes.

5.3. Experiments
For these experiments, nine different classes of texture were taken from Brodatz's standard texture album for the training set. These are, namely, grass [D9], tree bark [D12], straw [D15], herringbone weave [D17], woolen cloth [D19], calf leather [D24], beach sand [D29], water [D37], and raffia [D84].
Table 4. The sample mean and variance of parameters c, ω1, ω2: 16 sample images of size 64 x 64 are taken for each texture class, and the parameter values are extracted from the first- and second-order fractional differencing models.

Textures             c: mean   c: var    ω1: mean  ω1: var   ω2: mean  ω2: var
grass                1.209     0.057     0.744     0.078     0.636     0.082
tree bark            1.530     0.073     0.691     0.199     0.601     0.324
straw                0.923     0.053     0.387     0.068     1.209     0.070
herringbone weave    1.003     0.072     1.263     0.114     1.175     0.167
woolen cloth         0.809     0.024     0.852     0.095     0.793     0.098
calf leather         1.064     0.044     1.175     0.114     0.935     0.122
beach sand           1.195     0.038     0.665     0.107     0.571     0.129
water                1.074     0.055     0.083     0.064     0.972     0.132
raffia               1.547     0.062     1.042     0.153     0.988     0.165
For the actual training, 16 sample images of size 64 x 64 were taken for each different texture pattern, and the sample mean and variance of the parameters c, ω1, and ω2 were obtained for each texture class, based on the first- and second-order fractional differencing models (Table 4). As we can see from Table 4, the fractal scale c itself is not enough to classify the different textures, because some of the textures have similar values of c, even though they are different texture patterns. Based on these sample mean and variance values of the parameter c, the nine textures are grouped into five classes as indicated in Table 5, which also indicates the corresponding sample mean and variance of each class. Notice that the herringbone weave texture belongs to classes 2 and 3, because of its high value of variance. The second level gives the recognized texture from each class.
2-D rotated texture case. In this experiment, the test input images were taken from the 2-D raffia textures rotated by various angles θ. Then, each 64 x 64 texture was classified by the proposed multi-level classification scheme. For the first level, the fractal scale parameter c was extracted based on the first-order Fractional Differencing model (5.1), and the parameters ω1 and ω2 were extracted from the second-order Fractional Differencing periodic model (5.2). Actual classification of the test images was done in each level by comparing weighted distances between the extracted features and the database. The classification results are presented in Table 6, which shows the parameter values extracted from each rotated texture pattern and demonstrates the perfect result of classification based on these values.

Rotated and projected texture case. In this experiment, six 64 x 64 test input images were taken from the straw textures rotated and projected orthographically from various tilted and slanted texture surfaces. Like in the previous experiment, for the first level, the fractal scale parameter c was extracted based on the first-order Fractional Differencing model (5.1), and the parameters ω1 and ω2 were
Table 5. Database of the first level of classification. c̄_i and σ_i² are the sample mean and the variance of class i, respectively.

Class   Textures                                   c̄_i     σ_i²
1       woolen cloth                               0.809   0.024
2       straw, herringbone weave                   0.963   0.063
3       herringbone weave, calf leather, water     1.047   0.055
4       grass, beach sand                          1.202   0.045
5       tree bark, raffia                          1.539   0.067
Table 6. Classification results from the 2-D rotated texture images. (Result indicates the result class after applying the two-level classification method.)

Angle (degrees)   ĉ       ω̂1      ω̂2      Result
20                1.523   1.132   1.098   raffia
40                1.517   1.144   1.102   raffia
60                1.535   1.138   1.119   raffia
80                1.537   1.142   1.120   raffia
100               1.532   1.139   1.118   raffia
120               1.529   1.138   1.120   raffia
140               1.527   1.135   1.097   raffia
160               1.533   1.140   1.113   raffia
180               1.525   1.133   1.099   raffia
Table 7. Classification results from the rotated and orthographically projected straw texture images. (Result indicates the result class after applying the two-level classification method.)

Angles                              ĉ       ω̂1      ω̂2      Result
θ = 0 deg,  τ = 0 deg,  σ = 15 deg   0.914   0.365   1.189   straw
θ = 45 deg, τ = 0 deg,  σ = 30 deg   0.932   0.371   1.224   straw
θ = 90 deg, τ = 0 deg,  σ = 45 deg   0.928   0.373   1.218   straw
θ = 0 deg,  τ = 45 deg, σ = 15 deg   0.918   0.368   1.156   straw
θ = 45 deg, τ = 45 deg, σ = 30 deg   0.922   0.375   1.191   straw
θ = 90 deg, τ = 45 deg, σ = 45 deg   0.927   0.377   1.202   straw
extracted from the second-order Fractional Differencing periodic model (5.2). The classification results from this experiment are presented in Table 7, which shows the parameter values extracted from each rotated and projected texture pattern and demonstrates the perfect result of classification based on these values.

6. Summary
In this chapter we presented a number of techniques for texture segmentation and classification. Although significant progress has been made over the last thirty years, completely automated, unsupervised texture segmentation and classification
algorithms that are invariant to transformations such as rotation, scaling, illumination, etc., remain elusive.
References

[1] R. M. Haralick, Statistical and structural approaches to textures, Proc. IEEE 67 (1979) 786-804.
[2] G. H. Landerweerd and E. S. Gelsema, The use of nuclear texture parameters in the automatic analysis of leukocytes, Pattern Recogn. 10 (1978) 57-61.
[3] M. Nagao and T. Matsuyama, A Structural Analysis of Complex Aerial Photographs (Plenum Press, New York, 1980).
[4] P. C. Chen and T. Pavlidis, Segmentation by texture using correlation, IEEE Trans. Pattern Anal. Mach. Intell. 5 (1983) 64-69.
[5] J. Weszka, C. R. Dyer and A. Rosenfeld, A comparative study of texture measures for terrain classification, IEEE Trans. Syst. Man Cybern. 6 (1976) 269-285.
[6] K. Laws, Textured image segmentation, Ph.D. Thesis, University of Southern California, 1978.
[7] A. Ikonomopoulos and M. Unser, A directional filtering approach to texture discrimination, in Proc. 7th Int. Conf. on Pattern Recognition, Montreal, Canada, Jul. 1984, 87-89.
[8] A. P. Pentland, Fractal-based descriptions of natural scenes, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 661-674.
[9] S. W. Zucker and D. Terzopoulos, Finding structure in co-occurrence matrices for texture analysis, in Azriel Rosenfeld (ed.), Image Modeling (Academic Press, New York, 1981) 423-445.
[10] C. W. Therrien, An estimation-theoretic approach to terrain image segmentation, Comput. Vision Graph. Image Process. 22 (1983) 313-326.
[11] F. S. Cohen and D. B. Cooper, Simple parallel hierarchical and relaxation algorithms for segmenting noncausal Markovian fields, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1987) 195-219.
[12] S. Chatterjee and R. Chellappa, Maximum likelihood texture segmentation using Gaussian Markov random field models, in Proc. Computer Vision and Pattern Recognition Conf., San Francisco, CA, Jun. 1985.
[13] H. Derin and H. Elliott, Modeling and segmentation of noisy and textured images using Gibbs random fields, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1987) 39-55.
[14] P. B. Chou and C. M. Brown, Multi-model segmentation using Markov random fields, in Proc. Int. Joint Conf. on Artificial Intelligence, Seattle, WA, 1987, 663-670.
[15] S. Geman and C. Graffigne, Markov random fields image models and their application to computer vision, in A. M. Gleason (ed.), Proc. Int. Congress of Mathematicians 1986, Providence, RI, 1987.
[16] Z. Fan and F. S. Cohen, Textured image segmentation as a multiple hypothesis test, IEEE Trans. Circuits and Syst. 35 (1988) 691-702.
[17] B. S. Manjunath, T. Simchony and R. Chellappa, Stochastic and deterministic networks for texture segmentation, IEEE Trans. Acoust. Speech Signal Process. 38 (1990) 1039-1049.
[18] B. S. Manjunath and R. Chellappa, A note on unsupervised texture segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 472-483.
[19] B. Julesz, Visual pattern discrimination, IRE Trans. Inf. Theory 8 (1962) 84-92.
[20] J. R. Bergen and E. H. Adelson, Early vision and texture perception, Nature 333 (1988) 363-364.
[21] J. Malik and P. Perona, Preattentive texture discrimination with early vision mechanisms, J. Opt. Soc. Am. A 7 (1990) 923-932.
[22] S. Grossberg and E. Mingolla, Neural dynamics of surface perception: Boundary webs, illuminants, and shape-from-shading, Comput. Vision Graph. Image Process. 37 (1987) 116-165.
[23] L. S. Davis, M. Clearman and J. K. Aggarwal, An empirical evaluation of generalized co-occurrence matrices, IEEE Trans. Pattern Anal. Mach. Intell. 3 (1981) 214-221.
[24] D. Chetverikov, Experiments in the rotation-invariant texture discrimination using anisotropy features, in Proc. 6th Int. Conf. on Pattern Recognition, Munich, Germany, Oct. 1982, 1071-1073.
[25] L. S. Davis, Polograms: A new tool for image texture analysis, Pattern Recogn. 13 (1981) 219-223.
[26] O. D. Faugeras and W. K. Pratt, Decorrelation methods of texture feature extraction, IEEE Trans. Pattern Anal. Mach. Intell. 2 (1980) 323-332.
[27] F. Vilnrotter, Structural analysis of natural textures, Ph.D. Thesis, University of Southern California, 1981.
[28] B. J. Schachter, A. Rosenfeld and L. S. Davis, Random mosaic models for textures, IEEE Trans. Syst. Man Cybern. 9 (1978) 694-702.
[29] N. Ahuja and A. Rosenfeld, Mosaic models for textures, IEEE Trans. Pattern Anal. Mach. Intell. 3 (1981) 1-11.
[30] J. W. Modestino, R. W. Fries and A. L. Vickers, Texture discrimination based upon an assumed stochastic texture model, IEEE Trans. Pattern Anal. Mach. Intell. 3 (1981) 557-580.
[31] R. L. Kashyap, R. Chellappa and A. Khotanzad, Texture classification using features derived from random field models, Pattern Recogn. Lett. 1 (1982) 43-50.
[32] P. M. Lapsa, New models and techniques for synthesis, estimation and segmentation of random fields, Ph.D. Thesis, Purdue University, 1982.
[33] R. Chellappa and S. Chatterjee, Classification of textures using Gaussian-Markov random fields, IEEE Trans. Acoust. Speech Signal Process. 33 (1985) 959-963.
[34] A. Khotanzad and R. L. Kashyap, Feature selection for texture recognition based on image synthesis, IEEE Trans. Syst. Man Cybern. 17 (1987) 1087-1095.
[35] P. DeSouza, Texture recognition via autoregression, Pattern Recogn. 15 (1982) 471-475.
[36] R. L. Kashyap and K.-B. Eom, Texture boundary detection based on the long correlation model, IEEE Trans. Pattern Anal. Mach. Intell. 11 (1989) 58-67.
[37] J. Zhang and J. W. Modestino, Markov random fields with applications to texture classification and discrimination, Conf. on Information Sciences and Systems, Princeton, NJ, 1986.
[38] S. Chatterjee, Classification of natural texture using Gaussian Markov random field models, in R. Chellappa and A. K. Jain (eds.), Markov Random Fields: Theory and Application (Academic Press, 1992).
[39] C. Chen, J. S. Daponte and M. D. Fox, Fractal feature analysis and classification in medical imaging, IEEE Trans. Medical Imaging 8 (1989) 133-142.
[40] R. L. Kashyap and Y. Choe, Multilevel 3-D rotation invariant classification, in Proc. 11th Int. Conf. on Pattern Recognition, The Hague, Sept. 1992.
[41] S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 721-741.
[42] J. Besag, On the statistical analysis of dirty pictures, J. Roy. Statist. Soc. B48 (1986) 259-302.
[43] U. Grenander, Lectures in Pattern Theory (Springer-Verlag, New York, 1981).
[44] J. L. Marroquin, Probabilistic solution of inverse problems, Ph.D. Thesis, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1985.
[45] B. S. Manjunath and R. Chellappa, A unified approach to boundary perception: Edges, textures and illusory contours, IEEE Trans. Neural Networks 4 (1992).
[46] T. Poggio, V. Torre and C. Koch, Computational vision and regularization theory, Nature 317 (1985) 314-319.
[47] Y. T. Zhou, R. Chellappa, A. Vaid and B. K. Jenkins, Image restoration using a neural network, IEEE Trans. Acoust. Speech Signal Process. 36 (1988) 1141-1151.
[48] Y. T. Zhou and R. Chellappa, Stereo matching using a neural network, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, New York, NY, Apr. 1988, 940-943.
[49] C. Koch, J. Luo, C. Mead and J. Hutchinson, Computing motion using resistive networks, in D. Z. Anderson (ed.), Proc. Neural Information Processing Systems, Denver, CO, 1987.
[50] Y. T. Zhou and R. Chellappa, Computation of optical flow using a neural network, in Proc. IEEE Int. Conf. on Neural Networks, San Diego, CA, Jul. 1988, 71-78.
[51] H. Bulthoff, J. Little and T. Poggio, A parallel algorithm for real-time computation of optical flow, Nature 337 (1989) 549-553.
[52] J. J. Hopfield and D. W. Tank, Neural computation of decisions in optimization problems, Biol. Cybern. 52 (1985) 141-152.
[53] R. Chellappa, Two-dimensional discrete Gaussian Markov random field models for image processing, in L. N. Kanal and A. Rosenfeld (eds.), Progress in Pattern Recognition 2 (Elsevier Science Publishers, North Holland, 1985) 79-112.
[54] G. R. Cross and A. K. Jain, Markov random field texture models, IEEE Trans. Pattern Anal. Mach. Intell. 5 (1983) 25-39.
[55] J. Marroquin, S. Mitter and T. Poggio, Probabilistic solution of ill-posed problems in computer vision, in Proc. Image Understanding Workshop, Miami Beach, FL, Dec. 1985, 293-309.
[56] B. Gidas, Non-stationary Markov chains and convergence of the annealing algorithm, J. Stat. Phys. 39 (1985) 73-131.
[57] D. H. Hubel and T. N. Wiesel, Functional architecture of macaque monkey visual cortex, in Proc. Royal Soc. of London, Ser. B 198 (1977) 1-59.
[58] J. Bolz and C. D. Gilbert, Generation of end-inhibition in the visual cortex via interlaminar connections, Nature 320 (1986) 362-365.
[59] E. Peterhans and R. von der Heydt, Mechanisms of contour perception in monkey visual cortex. II. Contour bridging gaps, J. Neuroscience 9 (1989) 1749-1763.
[60] R. von der Heydt and E. Peterhans, Mechanisms of contour perception in monkey visual cortex. I. Lines of pattern discontinuity, J. Neuroscience 9 (1989) 1731-1748.
[61] R. L. Kashyap and R. Chellappa, Estimation and choice of neighbors in spatial interaction models, IEEE Trans. Inf. Theory 29 (1983) 60-72.
[62] R. L. Kashyap and K.-B. Eom, Robust image models and their applications, in P. Hawkes (ed.), Advances in Electronics and Electron Physics, Vol. 70 (Academic Press, 1988) 79-158.
[63] P. Brodatz, Texture: A Photographic Album for Artists and Designers (Dover, New York, 1956).
[64] R. L. Kashyap and A. Khotanzad, A model-based method for rotation invariant texture classification, IEEE Trans. Pattern Anal. Mach. Intell. 8 (1986) 472-481.
[65] B. B. Mandelbrot and J. W. Van Ness, Fractional Brownian motions, fractional noises and applications, SIAM Rev. 10 (1968) 422-437.
[66] R. L. Kashyap and P. M. Lapsa, Synthesis and estimation of random fields using long correlation models, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1991) 800-808.
[67] Y. Choe and R. L. Kashyap, 3-D shape from a shaded and textural surface image, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 907-918.
[68] R. L. Kashyap and K.-B. Eom, Estimation in long-memory time series model, J. Time Series Analysis 9 (1988) 35-41.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 283-312. Eds. C. H. Chen, L. F. Pau and P. S. P. Wang. © 1998 World Scientific Publishing Company
CHAPTER 2.3

COLOR IN COMPUTER VISION: RECENT PROGRESS
GLENN HEALEY
Electrical and Computer Engineering, University of California, Irvine, CA 92697, USA
E-mail:
[email protected]

QUANG-TUAN LUONG
Artificial Intelligence Center, SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, USA
E-mail:
[email protected]

The use of color in computer vision has received growing attention. This chapter introduces the basic principles underlying the physics and perception of color and reviews the state-of-the-art in color vision algorithms. Parts of this chapter have been condensed from [58] while new material has been included which provides a critical review of recent work. In particular, research in the areas of color constancy and color segmentation is reviewed in detail. The first section reviews physical models for color image formation as well as models for human color perception. Reflection models characterize the relationship between a surface, the illumination environment, and the resulting color image. Physically motivated linear models are used to approximate functions of wavelength using a small number of parameters. Reflection models and linear models are introduced in Section 1 and play an important role in several of the color constancy and color segmentation algorithms presented in Sections 2 and 3. For completeness, we also present a concise summary of the trichromatic theory which models human color perception. A discussion is given of color matching experiments and the CIE color representation system. These models are important for a wide range of applications including the consistent representation of color on different devices. Section 1 concludes with a description of the most widely used color spaces and their properties. The second section considers progress on computational approaches to color constancy. Human vision exhibits color constancy as the ability to perceive stable surface colors for a fixed object under a wide range of illumination conditions and scene configurations. A similar ability is required if computer vision systems are to recognize objects in uncontrolled environments. We begin by reviewing the properties and limitations of the early retinex approach to color constancy. We describe in detail the families of linear model algorithms and highlight algorithms which followed. Section 2 concludes with a subsection on recent indexing methods which integrate color constancy with the higher level recognition process. Section 3 addresses the use of color for image segmentation and stresses the role of image models. We start by presenting classical statistical approaches to segmentation which have been generalized to include color. The more recent emphasis on the use of physical models for segmentation has led to new classes of algorithms which enable the
accurate segmentation of effects such as shadows, highlights, shading, and interreflection. Such effects are often a source of error for algorithms based on classical statistical models. Finally, we describe a color texture model which has been used successfully as the basis of an algorithm for segmenting images of natural outdoor scenes.

Keywords: Color, computer vision, modeling, reflectance, color constancy, multispectral, segmentation, recognition, physics-based vision, intrinsic properties, features.
1. Modeling

This section introduces physical and perceptual models for color. Much of the work based on the physics of color image formation is covered in more detail in an edited collection of papers published in 1992 [38]. More details on perceptual color models are given in the reference books by Wyszecki and Stiles [89] and Judd and Wyszecki [47]. The first book focuses on a few topics at great depth and contains a large amount of reference data. The second book is more recent and covers more topics with a practical approach. An interesting book on the physics and chemistry of color formation is [76].

1.1. Physical Color Models

1.1.1. Sensing the color signal

The reflected spectral radiance of a surface point r in the direction of an observer can be expressed as
where (i, e, g) are the photometric angles of incidence, observation, and phase (Fig. 1), E(λ, r) is the spectral power distribution of the incident illumination at point r, and ρ(i, e, g, λ, r) is the spectral reflectance of the surface at point r in the direction of the observer. The reflected radiance I(λ, r) over the visible wavelengths (400 nm-700 nm) is an example of a color signal. In general, the term color signal refers to any spectral power distribution of light.
Fig. 1. Photometric angles.
The reflected spectral radiance function I(λ, r) is not represented explicitly by a color imaging system. Instead, the system stores scalars
where (x, y) is the image location corresponding to scene point r and f_j(λ) is the spectral sensitivity of the jth sensor. The image s_j(x, y) formed using a single f_j(λ) is called a color band. A color camera system uses a set of three color filters with different spectral sensitivities to capture three color bands which represent the red, green, and blue components of the scene. Various technical solutions include:

- Black-and-white camera with colored optical filters such as the Kodak Wratten 25, 58, and 47B.(a)
- Single CCD color camera. A striped or mosaic color filter pattern is affixed to a single CCD so that different CCD cells have different effective spectral sensitivities. This is an inexpensive approach, but sacrifices spatial resolution in each color band.
- Three CCD color camera. A high precision prism makes three copies of the incoming image which are independently filtered and sensed by three separate CCDs. There is no loss of spatial resolution.

Some of the challenges of obtaining accurate color images are presented in [65]. The properties of CCD cameras are reviewed in [37].

1.1.2. Reflection models

Color reflection models play the important role of describing the relationship between the properties of an object and its image. These models are exploited by several of the color constancy and segmentation algorithms presented later in this chapter. Reflection can be modeled as two distinct physical processes (Fig. 2). Surface (interface) reflection describes light which is reflected directly from a surface. Body scattering refers to light which penetrates some depth into a material and emerges following scattering by inhomogeneities. General physical models predict that spectral reflectance is a complicated function of wavelength and geometry [30]. Consequently, an important goal of computer vision is the selection of models which are both accurate and tractable. The dichromatic reflection model [78] characterizes the reflected radiance from any point on an inhomogeneous dielectric material as the sum of a surface reflection

(a) The transmission curves are in the first edition of the book by Wyszecki and Stiles [89]. They can also be obtained from Kodak.
Fig. 2. Surface and body components of reflection.
term and a body reflection term according to

    I(Θ, λ) = m_s(Θ)c_s(λ) + m_b(Θ)c_b(λ)    (1.4)
where Θ denotes the photometric angles. The relationship in (1.4) assumes that both the surface and body reflection components factor into the product of a term which depends only on geometry and a term which depends only on wavelength. The spectral composition terms c_s(λ) and c_b(λ) depend on the spectral properties of the illumination and the spectral properties of the reflectance functions
where ρ_s(λ) and ρ_b(λ) describe the dependence of the surface and body reflectance on wavelength and E(λ) is the spectral power distribution of the illumination. Healey [30] used the Torrance-Sparrow model [85] for surface reflection and the Reichman extension [74] of the Kubelka-Munk model [47] for body scattering to show that the dichromatic model is accurate for a wide range of inhomogeneous dielectric materials. He further showed that the reflected radiance for metals can be modeled using
    I(Θ, λ) = m(Θ)c(λ)    (1.7)
over a wide range of geometries. A common extension to the dichromatic reflection model for inhomogeneous dielectric materials is the neutral interface reflection (NIR) model. The NIR model assumes that ρ_s(λ) has a constant value independent of wavelength. This implies that the surface reflection (highlight) from inhomogeneous dielectrics has the same
spectral composition as the illumination. The NIR model has been verified experimentally by Tominaga and Wandell [83] and Lee et al. [57].

1.1.3. Linear models

Linear models represent functions of wavelength from 400 nm to 700 nm as an additive combination of a small number of basis functions. The functions that are typically represented are the spectral reflectance and the illuminant spectral distribution. These models allow representation of the spectral functions with a small number of parameters. The use of linear models by color constancy algorithms will be discussed in detail in Section 2. Spectral reflectance data is available for analysis from many sources. Spectral reflectance functions of some materials are given in the book by Wyszecki and Stiles [89]. These materials include different building materials (brick, shingles, sheet metals, rocks, and enamel paints) and some natural objects. Other data is taken from the work of Krinov [52] who measured the spectral reflectance for samples of 370 natural materials, such as trees, shrubs, grasses, mosses, crops, soils, roads, water surfaces, and snow. Spectral reflectance data also exists for the Munsell color chips. The first study of finite dimensional linear models for reflectance functions was performed by Cohen [9]. The first analysis for the illuminant is due to Judd et al. [46] (the tables are also in [89]). Both studies used a characteristic vector statistical analysis. The resulting first three basis functions for representing the illuminant I1(λ), I2(λ), and I3(λ) and the first three basis functions for representing the reflectance R1(λ), R2(λ), and R3(λ) are presented in Fig. 3. More recent experimental
Fig. 3. Basis functions for the representation of illuminants (Judd) and reflectances (Cohen).
Fig. 4. Basis functions compatible with the NIR model for (A) body and (B) surface reflectance.
work is due to Maloney [59] and Parkkinen and co-workers [45,70,71]. The conclusion is that generally at least three basis functions are necessary to represent accurately spectral reflectance functions. A set of three basis functions compatible with the NIR model is used by D'Zmura and Lennie [16] and is also examined by Wandell [86]. These functions are the first three basis functions of a Fourier analysis (Fig. 4): R1(λ) is constant, R2(λ) is a red/green function, R3(λ) is a blue/yellow function.

1.2. Perceptual Color Models

1.2.1. Colorimetry: the trichromatic theory

The human vision system uses three classes of color photoreceptors called cones and their spectral sensitivity curves can be found in many references including [89]. The three classes of cones respond respectively to the short (blue), medium (green), and long (red) wavelengths giving a decomposition of the visible spectrum. Similar to Eq. (1.2) for an electronic imaging system, the human vision system can be modeled as storing three scalars s1, s2, and s3 at each retina location to represent an incoming color signal I(λ). Two color signals I(λ) and I′(λ) will appear identical if they give rise to the same values of s1, s2, and s3. Color signals which appear identical but for which I(λ) ≠ I′(λ) are called metamers. Physically identical color signals (I(λ) = I′(λ)) are called isomers. The trichromatic theory summarizes experimental studies of human color vision. The theory predicts that over a wide range of conditions, most colors can be matched by a single additive mixture of three fixed primary colors. The primary colors can be broadly chosen provided that one cannot be obtained as an additive mixture of the other two. The theory further predicts that proportionality and additivity of color matches hold over a wide range of conditions.
Consider a set of primary colors with spectral distributions P1(λ), P2(λ), and P3(λ). A monochromatic stimulus is a spectral distribution with energy at a single wavelength. According to the trichromatic theory, a unit energy monochromatic stimulus at wavelength λ can be matched by an additive combination RP1(λ) + GP2(λ) + BP3(λ) for some values R, G, and B. A value of R, G, or B may be negative if some amount of a primary must be added to the monochromatic stimulus to obtain a match with an additive combination of the other two primaries. As we allow the wavelength λ of the unit energy monochromatic stimulus to vary, the values of R, G, and B required for a match vary. The resulting functions R(λ), G(λ), and B(λ) are called the color matching functions for these primaries. It follows from the additivity of color matches that a general stimulus I(λ) will be matched by an additive combination R_I P1(λ) + G_I P2(λ) + B_I P3(λ) where

    R_I = ∫ R(λ)I(λ)dλ,   G_I = ∫ G(λ)I(λ)dλ,   B_I = ∫ B(λ)I(λ)dλ.    (1.8)
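In practice (1.8) is evaluated by numerical integration once the color matching functions and the stimulus are tabulated on a common wavelength grid. A minimal sketch (Python/NumPy; the arrays cmf_r, cmf_g, cmf_b and spectrum are assumed to be sampled at the wavelengths in lam, e.g. every 5 nm from 400 nm to 700 nm):

    import numpy as np

    def tristimulus(lam, spectrum, cmf_r, cmf_g, cmf_b):
        # (R_I, G_I, B_I) by trapezoidal integration of R(l)I(l), G(l)I(l), B(l)I(l)
        R = np.trapz(cmf_r * spectrum, lam)
        G = np.trapz(cmf_g * spectrum, lam)
        B = np.trapz(cmf_b * spectrum, lam)
        return R, G, B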
The scalars R_I, G_I, and B_I are called the tristimulus values of I(λ) with respect to the primaries P1(λ), P2(λ), and P3(λ). For any set of tristimulus values R, G, B, the corresponding chromaticity coordinates r, g, b are defined by
    r = R/(R + G + B),   g = G/(R + G + B),   b = B/(R + G + B).

Since r + g + b = 1, the chromaticity coordinates are typically specified using only two of the coordinates, say r and g in this case. Chromaticity coordinates specify the color quality of a stimulus independent of its absolute intensity. Thus, stimuli I(λ) and KI(λ) with significantly different brightnesses will have the same chromaticity coordinates. A common problem is to represent a stimulus I(λ) using different sets of primaries. This problem arises, for example, when attempting to preserve the appearance of a color on two different color monitors. Let P1(λ), P2(λ), and P3(λ) be a first set of primaries and let P1′(λ), P2′(λ), and P3′(λ) be a second set of primaries. Since each primary in one set can be matched by an additive combination of primaries in the other set, we can write
    P_j′(λ) ≅ a_{1j}P1(λ) + a_{2j}P2(λ) + a_{3j}P3(λ),   j = 1, 2, 3    (1.10)

where ≅ denotes a perceptual match and not necessarily a physical match. It follows from (1.10) that for any color the tristimulus values R, G, and B corresponding to the primaries P1(λ), P2(λ), and P3(λ) are related to the tristimulus values R′, G′, and B′ corresponding to the primaries P1′(λ), P2′(λ), and P3′(λ) by

    [R]   [a11 a12 a13] [R′]
    [G] = [a21 a22 a23] [G′]    (1.11)
    [B]   [a31 a32 a33] [B′]
G. Healey & Q.-T. Luong
Thus, although the set of possible color signals lies in a n infinite dimensional space, from the standpoint of human color perception three-dimensional representations and transformations may be used. 1.2.2. The CIE colorimetric system In 1931, the CIE (Commission Internationale d’Eclairage) defined a standard set of color matching functions X ( X ) ,Y(X),and Z(X) corresponding to a set of hypothetical primaries which lie outside the space of physically realizable colors. The CIE color matching functions are plotted in Fig. 5. These standard color matching functions have the advantage that they are nonnegative so that the resulting tristimulus values X , Y , and Z will always be nonnegative. The function Y(A)has the additional property that it is the same as the relative luminous efficiency function. This means that the tristimulus value Y is a measure of perceived brightness and two stimuli will match in brightness, but not necessarily color, if they have the same Y tristimulus value. The X Y Z system also has the property that a white stimulus with equal energy at all wavelengths will have equal tristimulus values X = Y = 2. The CIE chromaticity ( x ,y) diagram is shown in Fig. 6. The horseshoe shaped curve in Fig. 6 bounds the set of physically achievable chromaticities known as the color locus. The outer boundary of the color locus corresponds t o the chromaticities of monochromatic stimuli (pure colors) except for the purple line near the bottom which connects the blue and red ends of the spectrum. The coordinates of the standard illuminants are plotted in Fig. 6 . The left side of Fig. 7 shows how the dominant wavelength and excitation purity, psychophysical quantities corresponding respectively to hue and saturation, can be obtained from the chromaticity diagram. For stimuli I ( X ) and I’(X) with respective chromaticities (x,y) and (x’,y’), the chromaticity corresponding to an additive combination a I ( X ) bI’(X) will lie on the line
+
Fig. 5. CIE color matching functions.
Fig. 6. The CIE chromaticity (x, y) diagram.
Fig. 7. Left: The dominant wavelength of color A has chromaticity B; the purity is EA/EB. Right: Colors that can be obtained by a mixture of I, J, and K.
in the chromaticity diagram connecting (x, y) and (x′, y′). The right side of Fig. 7 shows more generally that the set of chromaticities resulting from the mixture of three colors lies inside a triangle in the chromaticity diagram with vertices corresponding to the chromaticities of the three colors. Thus, the chromaticity gamut which can be obtained using an RGB display system can be determined from the chromaticity coordinates of the primaries.
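For example, deciding whether a chromaticity can be reproduced by a given RGB display reduces to testing whether the point lies inside the triangle spanned by the primaries' chromaticities. A small sketch (plain Python; the primaries are passed in as (x, y) pairs, and the barycentric-coordinate test is a standard point-in-triangle test rather than something specific to this chapter):

    def in_gamut(xy, primaries):
        # True if chromaticity xy lies inside the triangle of the three primaries
        (x1, y1), (x2, y2), (x3, y3) = primaries
        x, y = xy
        det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
        l1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
        l2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
        l3 = 1.0 - l1 - l2
        return min(l1, l2, l3) >= 0.0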
1.3. Color Spaces

Many color coordinate systems have been proposed. Most of the following formulas are in Pratt [73].
RGB: In the RGB space, a color is specified by its tristimulus values with respect to given red, green, and blue primary colors. The RGB space is frequently used with color display monitors, but the associated set of primaries often varies from device to device. A procedure for calibrating color monitors is described in [11]. The CIE defined a standard set of RGB primaries using the monochromatic stimuli R: 700 nm, G: 546.1 nm, and B: 435.8 nm. In terms of these primaries, the tristimulus values RGB can be converted to the CIE tristimulus values XYZ using

[X]   [0.73467  0.27376  0.16658] [R]
[Y] = [0.26533  0.71741  0.00886] [G]    (1.12)
[Z]   [0.00000  0.00883  0.82456] [B]
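As an illustration, the following minimal Python sketch (assuming NumPy; the input RGB triple is only an example) applies the matrix of (1.12) to convert CIE RGB tristimulus values to XYZ and then computes the chromaticity coordinates (x, y) discussed earlier.

```python
import numpy as np

# Matrix from Eq. (1.12): CIE RGB tristimulus values to CIE XYZ.
RGB_TO_XYZ = np.array([
    [0.73467, 0.27376, 0.16658],
    [0.26533, 0.71741, 0.00886],
    [0.00000, 0.00883, 0.82456],
])

def rgb_to_xyz(rgb):
    """Convert CIE RGB tristimulus values to XYZ tristimulus values."""
    return RGB_TO_XYZ @ np.asarray(rgb, dtype=float)

def chromaticity(xyz):
    """Return the chromaticity coordinates (x, y) of an XYZ stimulus."""
    x, y, z = xyz
    s = x + y + z
    return x / s, y / s

xyz = rgb_to_xyz([0.4, 0.3, 0.2])     # example tristimulus values
print(xyz, chromaticity(xyz))
```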
CMY: The CMY system is used for color hard-copy devices such as printers and copiers. In this system, the cyan, magenta, and yellow primaries are the complements of red, green, and blue, respectively. These are known as subtractive primaries because depositing an ink or paint on a surface subtracts some of the wavelengths from incident light. For example, depositing cyan ink on white paper causes the paper to absorb red wavelengths (the complement of cyan) and reflect blue and green wavelengths. If RGB values are normalized to the range 0-1, then RGB can be converted to CMY using

[C]   [1]   [R]
[M] = [1] - [G]    (1.13)
[Y]   [1]   [B]
Additional information on color printing can be found in [43]. Work on calibrated color reproduction with application to the printing of digital images is given in [80].

IHS: The intensity, hue, and saturation color space is an intuitive representation which is related to how humans describe colors. Intensity (I) is associated with brightness and is defined by

I = R + G + B.    (1.14)
If we fix intensity I , then we obtain a constant intensity plane P in the three dimensional RGB space. The point in P for which R = G = B is called the white point or gray point. All points on a line in P emanating from the gray point in one direction have the same hue and correspond to adding some amount of white to a pure color. Red and pink, for example, have the same hue. The
hue (H) of a color (R, G, B) is thus defined as an angle in the constant intensity plane containing (R, G, B). Hue is undefined for any gray point (R = G = B). Otherwise, if G ≥ B, then hue is given by
h = arccos[ ((R − G) + (R − B)) / (2 √((R − G)² + (R − B)(G − B))) ]    (1.15)

and, in general,

H = h,          if G ≥ B and not (R = G = B)
H = 2π − h,     if G ≤ B and not (R = G = B)
H undefined,    if (R = G = B).    (1.16)
The geometric derivation for H is given in many places including [27]. For a given hue, saturation describes the purity or amount of white added to a color. Pure colors such as red have full saturation, while colors such as pink (red plus white) are less saturated. Gray points have zero saturation. Saturation (S) is defined by

S = 1 − 3 min(R, G, B) / (R + G + B).    (1.17)
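A minimal Python sketch of the RGB-to-IHS conversion defined by (1.14)-(1.17) follows; it assumes nonnegative RGB values, works in radians, and uses the 2π − h reflection of (1.16), returning None for the undefined gray-point hue.

```python
import math

def rgb_to_ihs(r, g, b):
    """Convert nonnegative RGB values to (I, H, S) using Eqs. (1.14)-(1.17)."""
    intensity = r + g + b                       # Eq. (1.14)
    if r == g == b:                             # gray point: hue undefined, zero saturation
        return intensity, None, 0.0
    num = (r - g) + (r - b)
    den = 2.0 * math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    ratio = max(-1.0, min(1.0, num / den))      # guard against rounding outside [-1, 1]
    h = math.acos(ratio)                        # Eq. (1.15), in radians
    if b > g:                                   # Eq. (1.16): reflect when B exceeds G
        h = 2.0 * math.pi - h
    s = 1.0 - 3.0 * min(r, g, b) / intensity    # Eq. (1.17)
    return intensity, h, s

print(rgb_to_ihs(0.9, 0.2, 0.3))    # a reddish example pixel
```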
The quantities I, H, S have an approximate correlation with luminance, dominant wavelength, and excitation purity. The transformation from RGB to IHS is nonlinear and has drawbacks including singularities, instability, and nonuniformity as detailed by Kender [48].

YIQ: The YIQ color space defines the coordinates which are encoded in the NTSC color television signal. The Y (luminance) component is a measure of brightness and carries the information used for black-and-white television. The I (in-phase) and Q (quadrature) components carry the chromatic information. The YIQ system takes advantage of the fact that the human visual system is more sensitive to luminance changes than to chromatic changes. Consequently, in the NTSC signal more bandwidth is devoted to the Y component than to the I and Q components. The RGB to YIQ transformation is linear and invertible and is defined by

[Y]   [0.299   0.587   0.114] [R]
[I] = [0.596  −0.275  −0.321] [G]    (1.18)
[Q]   [0.212  −0.523   0.311] [B]
Perceptually Uniform Spaces: One property shared by all of the color spaces presented in this section is that a fixed Euclidean distance does not correspond to a fixed perceptual distance. To address this problem, the CIE has proposed two separate uniform color spaces in which the Euclidean metric approximates perceived color differences. Each of these representations is based on computing nonlinear functions of the tristimulus
values. The CIE L*u*v* space is used to represent color lights as might be used in color monitors. The CIE L*a*b* space is used to represent colorant mixtures as might be used in the formation of dyes. Uniform color spaces are discussed in detail in [89]. Perceptually uniform spaces are not particularly useful for computer vision because the magnitude of measurable differences achievable with a color camera is not directly related to perceptual differences. Properties of several color coordinate systems are summarized in Fig. 8.
color system | transformation             | normalization       | uniformity
RGB          | -                          | no                  | no
rgb          | nonlinear, non-one-to-one  | yes                 | no
XYZ          | linear                     | no                  | no
xyz          | nonlinear, non-one-to-one  | yes                 | no
CMY          | linear                     | no                  | no
IHS          | nonlinear                  | yes (2 coordinates) | no
YIQ          | linear                     | no                  | no
L*u*v*       | nonlinear                  | no                  | yes
L*a*b*       | nonlinear                  | no                  | yes

Fig. 8. Properties of classical color spaces.
2. Color Constancy

Color constancy refers to the ability of humans to perceive surface colors which are relatively stable under large variations of illumination and scene composition. Such an ability demonstrates that color vision requires significant processing beyond the measurement of the physical quantities described in Section 1. Without elaborating, psychophysical experiments reveal strong spatial effects. From a computational standpoint, color constancy is an underdetermined problem and can be posed as the computation of spectral reflectance or another stable surface descriptor from sensor measurements. Color constancy is an important problem because a measured color signal does not by itself indicate anything very reliable about the world. On the other hand, spectral reflectance is an intrinsic property of a surface which can be used for recognition. The color constancy problem has been studied extensively [38]. However, the work that has been done in an algorithmic framework has the characteristics that:
• much of the work is based on restrictive hypotheses
• many of the algorithms have been demonstrated only on simplified scenes.
For a review of color constancy algorithms with particular attention to retinex algorithms, the reader is referred to Forsyth [19]. Also of interest is a chapter by Hurlbert [44].
2.1. The Retinex Algorithm

Land has published numerous papers on the retinex algorithm which have been both influential and criticized. Nevertheless, it is interesting to consider his experiments which illustrate the issues of color constancy and the sophistication of color vision in complex scenes [53-55]. Land's retinex algorithm was based on three principles:
• The color perceived at a point does not depend only on the color signal at this point.
• The color perceived at a point depends only on the combination of three descriptors.
• The descriptors can be computed independently in the three color bands.
The retinex algorithm is based on the coefficient rule or von Kries model of color constancy. Under this model of color constancy, surface descriptors are obtained by scaling the measurements in each color band independently. The assumption underlying this model is that if s1(x, y), s2(x, y), and s3(x, y) are the sensor measurements (Eq. (1.2)) for a surface under illuminant 1 and s'1(x, y), s'2(x, y), and s'3(x, y) are the corresponding sensor measurements under illuminant 2, then the sensor measurements are related by a diagonal matrix:

[s'1(x, y)]   [m11   0    0 ] [s1(x, y)]
[s'2(x, y)] = [ 0   m22   0 ] [s2(x, y)]    (2.1)
[s'3(x, y)]   [ 0    0   m33] [s3(x, y)]
It follows from (1.1) and (1.2) that this diagonal model is an approximation. Equation (2.1) holds exactly, however, for the case of narrowband sensors f_j(λ). The computational problem addressed by retinex is to recover an approximation to surface reflectance by discarding the effects of the illuminant. From (1.1) we have that

I(λ, r) = ρ(λ, r) E(λ, r).    (2.2)
A basic assumption of retinex is that there exists an asymmetry between the reflectance ρ and the illumination E which enables the solution of (2.2): ρ consists of uniform patches with abrupt changes whereas E varies smoothly over the scene. This is called the Mondrian World. The two conditions required for the retinex algorithm to work are:
• Hypothesis 1 (Mondrian world): the scene is a flat Mondrian world and the illumination varies slowly and smoothly.
• Hypothesis 2 (Gray world): the mean value of the scene reflectances in each color band is the same.
The first hypothesis enables the separation, at each spatial location, of the reflectance and illumination components of the sensor measurements. The second hypothesis guarantees that a spectral normalization will give a triplet of color constant surface descriptors. Although many related specifications of the retinex algorithm exist, Brainard and Wandell [4] have shown that for a representative version in the limiting case the measurements in each band are simply scaled by the mean of the sensor measurements in that band. One variation involves weighting the measurements so that spatially distant measurements have less influence in determining the scale factor at a point. Several studies, e.g. [60], have demonstrated that the descriptors computed by the retinex algorithm are relatively stable in the presence of illumination changes. Brainard [4] has shown, however, that the retinex descriptors for a fixed surface can be quite unstable with respect to changes in the color of surfaces surrounding the fixed surface. This instability is often related to the violation of Hypothesis 2 above. The dependence of the descriptors on scene composition renders the retinex algorithm an inadequate model of human color vision and severely limits its usefulness for computer vision.
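As a concrete illustration of this limiting case, the sketch below (Python with NumPy; the image array is a synthetic placeholder) normalizes each band of an RGB image by its mean, which is the gray-world form of per-band scaling discussed above.

```python
import numpy as np

def gray_world_descriptors(image):
    """Scale each color band by its mean, the limiting-case retinex / gray-world
    normalization: the result is invariant to a diagonal (von Kries)
    illumination change as in Eq. (2.1)."""
    image = np.asarray(image, dtype=float)           # shape (rows, cols, 3)
    band_means = image.reshape(-1, 3).mean(axis=0)   # mean of each band
    return image / band_means                        # per-band scaling

# Example: the same scene under a second illuminant modeled as a diagonal scaling.
scene = np.random.rand(64, 64, 3)                    # synthetic stand-in for a Mondrian image
scene_2 = scene * np.array([1.4, 1.0, 0.7])          # diagonal illumination change
d1, d2 = gray_world_descriptors(scene), gray_world_descriptors(scene_2)
print(np.allclose(d1, d2))                           # True: descriptors are unchanged
```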
2.2. Algorithms Based on Linear Models

Many approaches to color constancy are based on the use of a finite dimensional linear model for spectral reflectance as described in Section 1.1.3. Using such a model, the spectral reflectance ρ(λ, r) at each location r is approximated by

ρ(λ, r) = Σ_{j=1..n} a_j(r) R_j(λ)    (2.3)
where the R_j(λ) are a set of n fixed basis functions. Several empirical studies [9,59,70] have shown that at least three basis functions are required to approximate accurately naturally occurring spectral reflectance functions. Similarly, the illuminant spectra can be expressed as a linear combination of basis functions using

E(λ) = Σ_{i=1..m} ε_i I_i(λ).    (2.4)
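The sketch below (Python with NumPy; the basis functions and the measured spectrum are hypothetical placeholders) shows how the coefficients of such a finite dimensional linear model can be obtained by least squares from a sampled reflectance or illuminant curve.

```python
import numpy as np

def fit_linear_model(spectrum, basis):
    """Least-squares coefficients a such that spectrum ≈ basis @ a.

    spectrum : (num_wavelengths,) sampled reflectance or illuminant curve
    basis    : (num_wavelengths, n) fixed basis functions sampled at the same
               wavelengths (e.g. principal components of measured reflectances)
    """
    coeffs, *_ = np.linalg.lstsq(basis, spectrum, rcond=None)
    return coeffs, basis @ coeffs       # coefficients and the reconstruction

# Hypothetical example: approximate a smooth reflectance with n = 3 basis functions.
wavelengths = np.linspace(400, 700, 31)
basis = np.stack([np.ones_like(wavelengths),
                  (wavelengths - 550) / 150,
                  ((wavelengths - 550) / 150) ** 2], axis=1)
measured = 0.4 + 0.2 * np.sin((wavelengths - 400) / 300 * np.pi)
a, approx = fit_linear_model(measured, basis)
print(a, np.abs(measured - approx).max())
```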
The derivation of color constancy algorithms from linear models allows explicit characterization of the set of scenes for which an algorithm will exhibit color constancy. This ability to specify the domain of applicability of an algorithm is a primary advantage of this approach.

2.2.1. Known average reflectance algorithms
The idea of the algorithms of Buchsbaum [7] and Gershon et al. [26] is to estimate the illuminant color using the assumption (similar to the Gray world hypothesis)
that some average property of the scene is known. The Buchsbaum algorithm assumes that the mean value of the reflectance over the scene is a gray value. Gershon has slightly improved this algorithm by assuming that the average reflectance is the mean taken over Krinov's [52] data of reflectances of natural surfaces. These algorithms assume that the spectral reflectance is approximated by a three parameter (n = 3) linear model as in (2.3) and that the illuminant is approximated by a three parameter (m = 3) linear model as in (2.4). The average spectral reflectance is assumed to be a known function ρ(λ) and the illumination E(λ) is assumed to be spatially constant over the scene. The first step in the algorithm is to compute the average sensor response vector (s̄_1, s̄_2, s̄_3) over the image. In Gershon's case, the average is computed using segmented regions so that each region in the image is counted once independent of size. Substituting the linear illumination model into Eqs. (1.1) and (1.2) we have

s̄_i = ∫ ρ(λ) [ε_1 I_1(λ) + ε_2 I_2(λ) + ε_3 I_3(λ)] f_i(λ) dλ,   i = 1, 2, 3    (2.5)

giving a system of three linear equations which can be solved for the unknowns ε_1, ε_2, ε_3 which specify the illuminant E(λ). Using this estimate of the illuminant leads to three linear equations at each image location
which can be solved for the unknowns a_1(r), a_2(r), and a_3(r) which specify the spectral reflectance. The primary limitation of these algorithms is that for many applications the average reflectance for a scene is not known.
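A minimal numerical sketch of this idea follows (Python with NumPy; the basis functions, sensor sensitivities, and assumed average reflectance are hypothetical placeholders): the average sensor response of (2.5) is written as a 3 x 3 linear system in the illuminant coefficients, which is then solved. Given the estimated ε, the per-pixel reflectance coefficients a_j(r) can be recovered by solving the analogous 3 x 3 system built from the estimated illuminant.

```python
import numpy as np

def estimate_illuminant(avg_response, avg_reflectance, illum_basis, sensors, dlam):
    """Solve Eq. (2.5) for the illuminant coefficients (eps_1, eps_2, eps_3).

    avg_response    : (3,) average sensor vector (s1_bar, s2_bar, s3_bar)
    avg_reflectance : (L,) assumed known average reflectance, sampled in wavelength
    illum_basis     : (L, 3) illuminant basis functions I_i(lambda)
    sensors         : (L, 3) sensor sensitivities f_k(lambda)
    dlam            : wavelength sampling step used to approximate the integrals
    """
    # A[k, i] = integral of rho_bar(lam) * I_i(lam) * f_k(lam) d(lam)
    A = np.einsum('l,li,lk->ki', avg_reflectance, illum_basis, sensors) * dlam
    eps = np.linalg.solve(A, avg_response)
    return eps, illum_basis @ eps        # coefficients and the estimated E(lambda)
```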
2.2.2. Dimensionality based algorithms

Dimensionality based algorithms do not require any assumptions about the average properties of surfaces in a scene. Instead, the primary assumptions concern the general structure of the sets of reflectance functions and illuminants. If an n parameter linear model for spectral reflectance is used, then the sensor measurements are related to the components of the spectral reflectance by a linear
transformation M

[s_1(r), ..., s_N(r)]^T = M [a_1(r), ..., a_n(r)]^T    (2.11)
where the elements of M are given by

M_{kj} = ∫ E(λ) R_j(λ) f_k(λ) dλ.    (2.12)

Note that for n = 3, (2.11) is equivalent to (2.8)-(2.10). It follows from (2.11) that the sensor measurement vector for any surface is a linear combination of the n column vectors of M. Thus, the set of sensor vectors falls in an n-dimensional subspace of the space of possible sensor measurements. This subspace depends only on the illuminant, whereas the positions of the responses in this subspace depend only on the reflectances. If n is less than the number of color bands, then this subspace is of lower dimension than the sensor measurement space and can be determined using standard techniques. Figure 9 illustrates the idea for three classes of sensors with n = 2. The reflectance functions are specified by two parameters which the matrix M maps to a plane in the sensor space. This plane can be recovered if there are enough distinct surfaces in the scene. Knowledge of this plane and the assumption of a three-dimensional linear model for the illuminant allows recovery of the illumination parameters ε_1, ε_2, and ε_3 giving the matrix M. The reflectance parameters a_1(r) and a_2(r) can then be computed at each image location using the pseudoinverse of M. Although the Maloney-Wandell algorithm represents an important theoretical advance, the restriction for trichromatic (N = 3) color constancy that reflectance functions be modeled by two parameters prevents the procedure from being useful for most applications. In fact, experiments [18] have demonstrated that in many situations it is better to do nothing than to apply the Maloney-Wandell algorithm. Forsyth [20] has extended the Maloney-Wandell algorithm by developing a procedure MWEXT which recovers more parameters in the linear illumination model
Fig. 9. The idea behind the Maloney-Wandell method.
than the original Maloney-Wandell formulation. As with the original Maloney-Wandell algorithm, however, MWEXT requires a similar limiting restriction on the dimension of reflectance functions.

2.2.3. Gamut mapping algorithms
Following the work of Maloney-Wandell, several researchers have developed trichromatic color constancy algorithms by introducing additional constraints which allow spectral reflectance models with more than two degrees of freedom. Forsyth [20] developed an algorithm CRULE which is a form of coefficient rule in that each band is scaled separately to achieve color constancy. This algorithm works for arbitrary reflectance functions provided that they are viewed by narrowband sensors. For conventional color sensors, CRULE requires corresponding constraints on the spectral properties of the illumination. The goal of CRULE is to transform the sensor values measured under an unknown illuminant to the values which would be measured under a known canonical illuminant. The algorithm begins by representing the set of sensor values which can be observed for all reflectances under the canonical illuminant. Given a set of sensor measurements for a scene, there exists a set of triplets of coefficients which transform these sensor measurements into possible corresponding measurements under the canonical illuminant. The coefficients chosen for normalization are selected from the possible set using a maximum volume heuristic. Forsyth shows using real Mondrian images that CRULE performs significantly better than retinex when the composition of the Mondrian is changed. Finlayson [17] presents an insightful review of Forsyth's work and shows how CRULE can be extended to incorporate prior constraints on illumination color. By working in a two-dimensional normalized color space, Finlayson further suggests how the performance of CRULE can be improved in the presence of curved surfaces and highlights.

2.2.4. Local algorithms
Each of the previous approaches to color constancy has depended on the use of information obtained over several surfaces with different reflectance functions. Brainard et al. [5] and Ho et al. [42] have examined conditions under which color constancy is possible locally by considering only a single point in an image. The underlying idea is that if reflectance is represented by a linear model as in (2.3) and illumination is represented by a linear model as in (2.4), then the color signal received at the sensor is described by

ρ(λ, r) E(λ, r) = Σ_{j=1..n} Σ_{i=1..m} a_j(r) ε_i(r) R_j(λ) I_i(λ).    (2.13)
If the product functions R_j(λ) I_i(λ) are linearly independent, then it becomes possible to separate the color signal into illumination and reflectance components. In both [5] and [42], three parameter models are used for both reflectance (n = 3)
and illumination (m = 3). Brainard showed that it is possible to design the sensor spectral sensitivities so that the vector direction of the sensor responses does not depend on the illumination. The practical difficulty of this approach is the expense of manufacturing sensors with specific spectral sensitivities. Ho developed a method for directly separating the color signal into illumination and spectral reflectance components. This method requires that an entire color signal composed of a dense set of spectral measurements is available at a point. Such measurements are not available in color images, but might be obtained from a small number of spectral measurements using chromatic aberration [23].

2.2.5. Multiple view algorithms

D'Zmura [13] showed that additional information for color constancy can be obtained by viewing a scene under more than one illuminant. He established the basic result that given three-band color images of a scene with three or more surfaces under two or more illumination conditions, it is possible to recover three parameters of a linear surface reflectance model (n = 3) and three parameters of a linear illumination model (m = 3). The method, of course, requires some method for determining the correspondence between surfaces in images obtained under different illuminants. D'Zmura and Iverson [14,15] have thoroughly examined under what conditions this theory allows recovery of n-dimensional reflectance and m-dimensional illumination descriptors for arbitrary numbers of color bands, illuminants, and surfaces. Finlayson et al. [18] developed a related algorithm which combines views of surfaces under different illuminants with a constrained illumination model to map sensor measurements to corresponding measurements under a canonical illuminant.

2.2.6. Interreflection algorithms
Interreflection occurs when light reflected from a surface illuminates another surface. Thus, interreflection induces changes in the measured sensor values for a surface compared to directly illuminated regions of the same surface. Funt et al. [21] have shown that the constraints generated by interreflection regions can be used for color constancy. For a trichromatic system, Funt shows that these constraints allow recovery of a three-dimensional model for reflectance (n = 3) and a three-dimensional model for illumination ( m = 3) by the multistage solution of a set of nonlinear equations. The recovery method has worked well on simulated data. The use of the algorithm is restricted to scenes where interreflection is observed and requires a method for the reliable segmentation of interreflection regions.
2.3. Algorithms using Highlights
As presented in Section 1.1.2, the NIR reflection model predicts that the spectral distribution of the surface reflection (highlight) component for inhomogeneous
dielectrics is the same as the spectral distribution of the illumination. The accuracy of this model has been verified experimentally for a range of materials [57,83]. Using this model, recovery of the highlight color provides the illumination color which significantly simplifies the color constancy problem. The primary drawback of this approach is the requirement that highlights be visible in a scene. This is not always the case and fails, for example, for the Mondrian world.

2.3.1. Estimating the illuminant

D'Zmura [16] and Lee [56] reported similar methods for recovering the illumination chromaticity using the NIR extension to the dichromatic model. The dichromatic model predicts that the locus of chromaticities for an object with body and surface reflection components will lie on a line in chromaticity space connecting the chromaticity of the surface component and the chromaticity of the body component. According to the NIR model, the chromaticity of the surface component is equivalent to the chromaticity of the illuminant. Both D'Zmura and Lee observed that if two or more surfaces share the same illuminant, then the intersection point of the linear loci of chromaticities for these surfaces gives the chromaticity of the illuminant. Tominaga and Wandell [83] experimentally examined this approach using measured color signals for sets of surface points. In the corresponding higher dimensional color space, color measurements for a surface lie on a plane. For two surfaces sharing a common illuminant, the associated planes will intersect along a line which defines the spectral distribution of the illuminant. Tominaga and Wandell showed using several objects that this approach can be used to recover the illumination spectral distribution. Klinker et al. [49] developed a general method for separating the body and surface reflection contributions at each pixel in a color image. This approach permits the illumination chromaticity to be recovered from the color image of a single surface.

2.3.2. Recovering spectral reflectance

While the methods in Section 2.3.1 use highlights to recover spectral properties of the illuminant, the goal of color constancy is more generally to recover stable descriptors of a surface. A useful intrinsic description of a surface is a representation of the spectral composition of the body reflectance ρ_B(λ) defined by (1.6). Tominaga and Wandell [84] extended their previous technique [83] to develop a procedure for estimating ρ_B(λ) using sets of measured color signals for two objects under a common illuminant. The procedure uses the physical constraints that reflectance functions are nonnegative and the surface and body reflectance components are each nonnegative. They showed that these constraints yield a set of possible body reflectance functions. Healey [31] developed an algorithm for approximating the body reflectance ρ_B(λ) from a color image using polynomial basis functions for the visible spectrum. The method requires only a single surface and has been demonstrated for plastic objects.
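The following sketch (Python with NumPy; the chromaticity samples are synthetic placeholders) illustrates the D'Zmura/Lee idea: fit a line to the chromaticities observed on each of two surfaces and estimate the illuminant chromaticity as the intersection of the two lines.

```python
import numpy as np

def fit_line(points):
    """Fit a 2-D line through a set of chromaticity points.
    Returns a point on the line (the mean) and a unit direction vector."""
    pts = np.asarray(points, dtype=float)
    mean = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - mean)
    return mean, vt[0]                       # principal direction

def intersect_lines(p1, d1, p2, d2):
    """Intersect the lines p1 + t*d1 and p2 + s*d2 (assumed non-parallel)."""
    A = np.column_stack([d1, -d2])
    t, _ = np.linalg.solve(A, p2 - p1)
    return p1 + t * d1

# Synthetic dichromatic loci: each surface's chromaticities lie between its
# body chromaticity and the (unknown) illuminant chromaticity.
illum = np.array([0.31, 0.32])
body1, body2 = np.array([0.55, 0.35]), np.array([0.25, 0.45])
w = np.linspace(0.0, 1.0, 20)[:, None]
surface1 = (1 - w) * body1 + w * illum
surface2 = (1 - w) * body2 + w * illum
est = intersect_lines(*fit_line(surface1), *fit_line(surface2))
print(est)    # close to the true illuminant chromaticity (0.31, 0.32)
```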
2.4. Indexing Methods
Although most of the color constancy algorithms which have been presented use information measured over several surfaces, the goal of each of these algorithms is to assign illumination-invariant descriptors to each surface individually. The previous discussion has shown that achieving this goal typically requires at least one of several possible undesirable assumptions. Another approach is to consider the higher level problem of recognizing objects under unknown illumination conditions. This approach provides the additional constraint that the surfaces of an object are often characterized by several different colors in a specified geometric arrangement. Several illumination-invariant indexing approaches to recognition have been developed which use descriptors based on the distribution and spatial arrangement of colors on an object.

2.4.1. Color indexing
Swain [81] developed an object recognition method called color indexing based on the comparison of color histograms. The color histogram representation is attractive for recognition because it is invariant to translation and rotation in an image. Color indexing computes the similarity of an image color histogram I and a model color histogram M according to the histogram intersection defined by

∩(I, M) = Σ_j min(I_j, M_j).    (2.14)
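A minimal sketch of this comparison (Python with NumPy; the histograms here are hypothetical arrays of bin counts) computes the intersection of (2.14) and, as described in the following paragraph, normalizes it by the size of the model histogram.

```python
import numpy as np

def histogram_intersection(image_hist, model_hist):
    """Normalized histogram intersection used in color indexing.
    Both arguments are arrays of bin counts over the same color-space bins."""
    image_hist = np.asarray(image_hist, dtype=float)
    model_hist = np.asarray(model_hist, dtype=float)
    match = np.minimum(image_hist, model_hist).sum()   # Eq. (2.14)
    return match / model_hist.sum()                    # value in [0, 1]

# Example with two small 8-bin histograms.
print(histogram_intersection([3, 0, 5, 2, 0, 1, 0, 4], [2, 1, 6, 2, 0, 0, 1, 3]))
```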
The intersection value is normalized by the number of pixels in the model histogram to generate an intersection value between 0 and 1. Swain has shown that color indexing can be used for the accurate recognition of objects from a large database in the presence of some occlusion and geometric variation. Color indexing recognizes objects by directly comparing model histograms to observed image histograms. Since illumination changes alter the observed image colors, color indexing typically performs poorly for varying illumination [22,39]. To improve performance, Swain suggested that a color constancy procedure be applied to an input image before computing histogram intersections. As we have observed, however, the capability of available color constancy algorithms is still quite limited. This suggests that it might be useful to consider integrating the color constancy processing with the distribution comparison process.

2.4.2. Color constant color indexing
Funt and Finlayson [22] extended color indexing by proposing a set of features which are less sensitive to illumination changes than the sensor measurements themselves. Under the diagonal model of illumination change defined by (2.1), the components s'_i of the sensor vector under illuminant E'(λ) will be related to the components s_i of the sensor vector under illuminant E(λ) by

s'_i(x_1, y_1) = m_ii s_i(x_1, y_1)    (2.15)
s'_i(x_2, y_2) = m_ii s_i(x_2, y_2)    (2.16)
where (x_1, y_1) and (x_2, y_2) are image locations corresponding to surfaces illuminated by E(λ) and E'(λ) in the two images. Using this model, the ratios

s_i(x_1, y_1) / s_i(x_2, y_2)    (2.17)

do not depend on the illumination. In color constant color indexing, Funt and Finlayson represent objects using a histogram of color ratios taken for adjacent pixels. The use of adjacent pixels ensures that the illumination is the same for the pixel locations involved in the ratio. The histogram intersection computation is then applied to the color ratio histograms. Several studies [22,39] have shown that color constant color indexing significantly outperforms color indexing in scenes with varying illumination. Although color ratio histograms are more stable than the original color histograms with respect to illumination changes, color constant color indexing has important limitations. The diagonal model of illumination change only holds exactly in a restricted set of situations including, for example, the case of narrowband sensors. In addition, the use of ratios can be quite sensitive to noise in low intensity regions and adjacent pixel ratios provide limited information in homogeneous regions. These limitations are demonstrated in [39].

2.4.3. Color distribution invariants

For a trichromatic color imaging system, the use of a three parameter linear model for spectral reflectance leads under general conditions to a linear relationship between a sensor vector S(x, y) = (s_1(x, y), s_2(x, y), s_3(x, y))^T obtained under illuminant E(λ) and a sensor vector S'(x, y) = (s'_1(x, y), s'_2(x, y), s'_3(x, y))^T obtained under illuminant E'(λ)

S'(x, y) = M S(x, y)    (2.18)

where M is a 3 x 3 matrix. The corresponding color histograms H' and H are related by a linear coordinate transformation
H’(MS) = H ( S ) .
(2.19)
Efficient methods have been developed for computing vectors of invariants of distributions which do not depend on the transformation M [82]. These invariant vectors, therefore, do not depend on the illumination and can be used for efficient illumination-invariant recognition. Since this illumination change model is more general than the diagonal model of (2.1) and the use of ratios is eliminated, illumination invariants can be used for recognition under more general conditions than color constant color indexing. Healey and Slater [39] have shown that a vector of
six illumination invariants significantly outperforms both color indexing and color constant color indexing in the presence of illumination changes. Distribution invariants have many applications. Slater [79] developed a system based on local distribution invariants which has been demonstrated for the recognition of small surface regions in cluttered three-dimensional scenes. Distribution invariants have also been demonstrated as a tool for image database annotation [72] and as a filter for the content-based retrieval of multispectral satellite images [35]. Recent work [40] has shown that the relationship in (2.19) also holds for filtered color images. Thus, similar invariants can be used for illumination-invariant recognition following the extraction of specific spatial properties using a linear filter. Filtered distribution invariants have been used to distinguish objects having color distributions which are nearly identical [40].

2.4.4. Color texture indexing

Since histograms ignore spatial information in an image, an alternative approach to recognition uses an explicit representation for color texture. Healey and Wang [41] have carefully analyzed the response of a multiband correlation color texture model [51] to illumination changes. For three band color images, this model represents a texture using six functions which characterize spatial correlation within each band and between each pair of bands. For a three-dimensional linear model for surface reflectance (n = 3), the matrix of six correlation functions undergoes a linear transformation in response to an illumination change. This relationship leads to an efficient metric which can be used for the illumination-invariant comparison of multiband correlation functions. This metric has been demonstrated for the illumination-invariant recognition of a large set of color textures in the presence of significant changes in illumination [41]. An extension of this method based on the use of moments of correlation functions has been demonstrated for recognition invariant to illumination, rotation, and scale [87].
3. Color Segmentation

Image segmentation is an important application which benefits from the use of color information. The goal of segmentation is to partition an image into regions which correspond to surfaces or meaningful regions in a scene. An accurately segmented image is a useful intermediate representation for systems which describe scenes or recognize objects. Segmentation algorithms are based on image models which characterize the expected distribution and spatial structure of pixel measurements for surfaces or regions in the scene [33,75]. Typically, increasingly accurate image models lead to increased segmentation algorithm functionality at the expense of increased algorithm complexity. Early approaches to segmenting color images used traditional statistical image models in conjunction with algorithms which were straightforward generalizations of grayscale techniques. The additional information in color images often enabled
these approaches to improve performance over the associated grayscale procedures. Algorithms based on traditional models, however, were often confounded by optical phenomena such as shadows, highlights, shading, and interreflection. Color image models which account for these effects emerged with the expanding emphasis on physics-based modeling in computer vision [36,38]. The use of these models has led to segmentation algorithms which can be applied to color images of complex scenes.
3.1. Classical Approaches

Most local segmentation methods for grayscale images have been generalized and applied to color images. In applications for which a scene contains objects of known color, pixels can be classified as instances of these objects to generate a segmentation. This approach is used frequently in remote sensing [3], biomedical imaging [63], and industrial inspection [34]. Region growing is a classical local approach to segmentation which recursively merges small regions to arrive at a segmented image. The use of color features during region growing has been used for the segmentation of color aerial images [61]. Edge detectors are used to segment images by first locating the boundaries of significant regions. Several approaches have been introduced for color edge detection, e.g. [29,62]. Most studies have concluded, however, that the large majority of edges detected in a color image will also be detected in the corresponding grayscale image. More global approaches to segmentation have also been applied to color images. Region-splitting algorithms recursively split the regions of an image until each region is uniform according to an underlying image model. For each current region in a color image, Ohlander's algorithm [66] first computes nine single-dimensional histograms. The features used for histogramming are the sensor RGB values, the corresponding IHS values, and the color television YIQ coordinates. The algorithm determines upper and lower thresholds for the most prominent peak in any of the histograms. Pixels falling within these thresholds for the selected feature are removed from the current region to form smaller regions. The segmentation is complete when no prominent histogram peaks can be found for any of the current regions. Since the nine color features used by Ohlander have a significant degree of redundancy, Ohta et al. [67] examined the possibility of using fewer color features. Using regions generated by Ohlander's algorithm for several images, Ohta used a Karhunen-Loeve transform to determine color features which capture the most information. The result of the study was that the features I1 = (R + G + B)/3, I2 = R − B, I3 = (2G − R − B)/2 are the most effective in this order. The image model underlying the Ohlander [66] and Ohta [67] segmentation methods characterizes regions as having localized distributions in color space. This assumption is often appropriate for flat surfaces under uniform illumination. The segmentation result shown for a cylinder in [67], however, shows clearly that this simple model can lead to undesirable results for curved surfaces.
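As a rough illustration of this histogram-based splitting idea, the sketch below (Python with NumPy; the peak detection is deliberately simplified and the pixel array is a hypothetical placeholder) computes Ohta's three features and separates the pixels of a region around the most prominent peak of one feature histogram.

```python
import numpy as np

def ohta_features(rgb):
    """Ohta's color features I1, I2, I3 computed from an (N, 3) array of RGB pixels."""
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    return np.stack([(r + g + b) / 3.0, r - b, (2.0 * g - r - b) / 2.0], axis=1)

def split_on_peak(feature, bins=64, window=4):
    """Simplified Ohlander-style split: threshold around the most prominent
    histogram peak of a single feature and return a boolean pixel mask."""
    hist, edges = np.histogram(feature, bins=bins)
    peak = np.argmax(hist)                              # most prominent peak (crudely)
    lo = edges[max(peak - window, 0)]
    hi = edges[min(peak + window + 1, bins)]
    return (feature >= lo) & (feature <= hi)

pixels = np.random.rand(1000, 3)             # stand-in for the pixels of one region
feats = ohta_features(pixels)
mask = split_on_peak(feats[:, 0])            # split on I1 = (R + G + B)/3
print(mask.sum(), "pixels fall inside the peak window")
```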
Fig. 10. Two color clusters which are difficult to separate using individual color bands.
One disadvantage of the Ohlander algorithm is that the color representation is based on multiple one-dimensional histograms. Since color measurements are inherently multidimensional, image distributions corresponding to distinct surfaces which are easily separable in multiple dimensions may be difficult to separate using individual color bands (Fig. 10). Consequently, several approaches to color segmentation using multidimensional clustering have been proposed [1,10,28,77]. These techniques endeavor to divide the color space into volumes which correspond to the color distributions of regions in the scene. Usually, no prior knowledge about the contents of the scene is used. Techniques differ in terms of the color space used and the statistical model assumed for individual clusters.
3.2. Using Physical Models

Classical approaches to color image segmentation are based on statistical models for distributions in color space. These methods frequently fail when image distributions are not accurately approximated by the assumed statistical models. Inaccuracies in statistical models often occur for systematic structure in color images due to shadows, highlights, shading, and interreflection. Research in physics-based vision, however, has led to increasingly accurate models for color image distributions which account for much of this structure [24,30,49,64,78]. Many of these models have been exploited by segmentation algorithms. Shadows are a source of confusion for many segmentation algorithms and often cause single surfaces to be split at shadow boundaries. Gershon et al. [25] developed a model for the spectral properties of shadows using a color reflection model. Gershon defined ideal shadows as shadow regions having the same spectral distribution of illumination up to a multiplicative factor as directly illuminated regions. This enables the computation of a pull factor which measures the extent to which a shadow is ideal. Gershon showed that image features related to the pull factor can be used to distinguish material changes from shadow boundaries.
Fig. 11. The T-cluster in the dichromatic plane (Klinker-Shafer-Kanade).
Highlights and shading often cause errors for segmentation algorithms which use statistical models. Using the dichromatic reflection model [78] and a color camera model, Klinker et al. [49] showed that the pixel distribution for a dielectric object with shading and highlights will form a skewed-T shaped cluster in color space (Fig. 11). The skewed-T lies in a plane in color space known as the dichromatic plane. This cluster has also been variously referred to as a dogleg [24] or rank-2 field [6]. Further analysis of the structure of these clusters was performed by Novak [64]. In [50], Klinker et al. used this color space model for dielectrics as the basis of a clustering algorithm for segmentation. The algorithm uses the dimensionality of the image data in color space to generate hypotheses about the possible physical causes of a set of measurements in an image window. Neighboring windows are merged using the hypotheses according to physical compatibility. This is followed by additional clustering based on higher level hypotheses to complete the segmentation. The approach also allows separation of the highlight reflection component from the body reflection component at each pixel. The algorithm has been demonstrated for the segmentation of scenes containing plastic objects in the presence of highlights and shading. Healey [32] combined a reflection model for metals and dielectric materials [30] with a statistical distribution model. The statistical model is derived in normalized color space to reduce the dependence of model parameters on surface geometry. This combined model was used for the derivation of a Bayesian segmentation algorithm which combines region statistics with detected edges for the segmentation of images with highlights and shading [32]. The procedure was demonstrated on images of plastics, metals, and painted surfaces. Interreflections are another optical process which can cause difficulty for traditional segmentation algorithms. Bajcsy, Lee, and Leonardis [2] have used the NIR extension to the dichromatic reflection model to examine the shape of distributions in the IHS color space for shadows, highlights, shading, and interreflection. A segmentation algorithm has been devised from this model which isolates surfaces by analyzing histograms of hue [2]. The algorithm depends on the availability of a surface of known reflectance in the scene to allow correction of the image to its
appearance under a white canonical illuminant. After this correction, highlights will cause a reduction in the saturation of regions and interreflections will cause changes in both hue and saturation. These constraints have been used for the segmentation of simple scenes containing objects with saturated colors.
3.3. Using Texture Models

The segmentation algorithms presented in Sections 3.1 and 3.2 employ models which characterize the distribution of color pixel values over a region. These models, however, fail to capture spatial structure in a color image. Spatial structure is particularly important for images of natural outdoor scenes and can be used to discriminate regions with similar color distributions. Markov random field (MRF) models have been used extensively for modeling grayscale images [8]. These models represent the spatial interaction observed in stochastic textures. The specification of MRF models permits maximum likelihood techniques to be used for parameter estimation and image segmentation. Early attempts at applying MRF models to color image segmentation [12,88] considered the individual bands separately except at region boundaries. Many natural color textures have significant structure which cannot be captured by considering the color bands separately [68]. In [69], Panjwani and Healey introduced a Markov random field model for color images which captures spatial interaction both within and between color bands. Thus, for example, this model captures the interaction of the red measurement at a pixel with neighboring red, green, and blue pixel measurements. At the same time, an efficient method for estimating the parameters of the model was established. Figure 12 depicts the interaction of a pixel R(x, y) in the red band with neighbors in both the red and blue bands. Panjwani and Healey [69] used the color texture model to develop an unsupervised segmentation algorithm. The method uses a stepwise optimal
Fig. 12. Modeling spatial interaction in color images (panels: red band and blue band).
agglomerative clustering procedure which maximizes the likelihood of the segmented image. The algorithm has been demonstrated on several color images of natural scenes. These experiments illustrate not only the importance of representing spatial structure in color images, but also the significance of representing both within and between color band interactions.
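A rough sketch of the within- and between-band interaction idea is given below (Python with NumPy; the neighborhood, the least-squares fitting, and the synthetic patch are illustrative simplifications, not a reimplementation of the model in [69]): each red pixel is predicted as a linear combination of neighboring red, green, and blue values, and the fitted coefficients summarize the spatial and cross-band structure of a region.

```python
import numpy as np

def crossband_coefficients(image):
    """Fit a linear predictor of the red value at (x, y) from the four
    horizontal/vertical neighbors in all three bands (12 coefficients),
    a simplified stand-in for a color Markov/autoregressive texture model."""
    img = np.asarray(image, dtype=float)
    rows, cols, _ = img.shape
    targets, predictors = [], []
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            targets.append(img[y, x, 0])                 # red value to predict
            neigh = [img[y - 1, x], img[y + 1, x],       # each neighbor carries
                     img[y, x - 1], img[y, x + 1]]       # all three bands
            predictors.append(np.concatenate(neigh))
    coeffs, *_ = np.linalg.lstsq(np.array(predictors), np.array(targets), rcond=None)
    return coeffs                                        # 12-dimensional texture descriptor

texture = np.random.rand(32, 32, 3)                      # synthetic color texture patch
print(crossband_coefficients(texture).shape)             # (12,)
```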
Acknowledgements

We thank David Slater for his suggestions. Glenn Healey has been supported in part by the Office of Naval Research under grant N00014-93-1-0540. Quang-Tuan Luong has been supported in part by ARPA grant DACA 76-92-C-008.
References

[1] M. Ali, W. Martin and J. K. Aggarwal, Color-based computer analysis of aerial photographs, Comp. Graphics and Image Proc. 9 (1979) 282-293.
[2] R. Bajcsy, S. W. Lee and A. Leonardis, Detection of diffuse and specular interface reflections and inter-reflections by color image segmentation, Int. J. Comp. Vision 17 (1996).
[3] T. Bell, Remote sensing, IEEE Spectrum (1995) 24-31.
[4] D. Brainard and B. Wandell, Analysis of the retinex theory of color vision, J. Opt. Soc. Am. A 3, 10 (1986) 1651-1661.
[5] D. Brainard, B. Wandell and W. Cowan, Black light: How sensors filter spectral variation of the illuminant, IEEE Trans. Biomed. Eng. 36, 1 (1989) 140-149.
[6] M. H. Brill, Image segmentation by object color: a unifying framework and connection to color constancy, J. Opt. Soc. Am. A 7, 10 (1990) 2041-2047.
[7] G. Buchsbaum, A spatial processor model for object colour perception, J. Franklin Inst. 310 (1980) 1-26.
[8] R. Chellappa and A. K. Jain (eds.), Markov Random Fields, Theory and Applications (Academic Press, San Diego, 1993).
[9] J. Cohen, Dependency of the spectral reflectance curves of the Munsell color chips, Psychon. Sci. 1 (1964) 369-370.
[10] G. Coleman and H. Andrews, Image segmentation by clustering, Proc. IEEE 67, 5 (1979) 773-785.
[11] W. B. Cowan, An inexpensive method for the CIE calibration of color monitors, Comput. Graph. 11, 3 (1983) 314-321.
[12] M. Daily, Color image segmentation using Markov random fields, in Proc. IEEE Conf. Comp. Vision Patt. Rec., 1989, 304-312.
[13] M. D'Zmura, Color constancy: surface color from changing illumination, J. Opt. Soc. Am. A 9, 3 (1992) 490-493.
[14] M. D'Zmura and G. Iverson, Color constancy. I. Basic theory of two-stage linear recovery of spectral descriptions for lights and surfaces, J. Opt. Soc. Am. A 10, 10 (1993) 2148-2165.
[15] M. D'Zmura and G. Iverson, Color constancy. II. Results for two-stage linear recovery of spectral descriptions for lights and surfaces, J. Opt. Soc. Am. A 10 (1993) 2166-2180.
[16] M. D'Zmura and P. Lennie, Mechanisms of color constancy, J. Opt. Soc. Am. A 3, 10 (1986) 1662-1672.
[17] G. Finlayson, Color in perspective, IEEE Trans. Pattern Anal. Mach. Intell. 18, 10 (1996) 1034-1038.
[18] G. Finlayson, B. Funt and K. Barnard, Color constancy under varying illumination, in Proc. Int. Conf. on Comp. Vision, 1995, 720-725.
[19] D. Forsyth, Colour Constancy and its Applications in Machine Vision, PhD Thesis, University of Oxford, 1988.
[20] D. Forsyth, A novel algorithm for color constancy, Int. J. Comp. Vision 5, 1 (1990) 5-36.
[21] B. Funt, M. Drew and J. Ho, Color constancy from mutual reflection, Int. J. Comp. Vision 6, 1 (1991) 5-24.
[22] B. Funt and G. Finlayson, Color constant color indexing, IEEE Trans. Pattern Anal. Mach. Intell. 17 (1995) 522-529.
[23] B. Funt and J. Ho, Color from black and white, Int. J. Comp. Vision 3 (1989) 109-117.
[24] R. Gershon, A. Jepson and J. Tsotsos, Highlight identification using chromatic information, in Proc. Int. Conf. on Comp. Vision, 1987, 161-170.
[25] R. Gershon, A. Jepson and J. Tsotsos, Ambient illumination and the determination of material changes, J. Opt. Soc. Am. A 3, 10 (1986) 1700-1707.
[26] R. Gershon, A. Jepson and J. Tsotsos, From [r, g, b] to surface reflectance: Computing color constant descriptors in images, Perception 1988, 755-758.
[27] R. C. Gonzalez and R. E. Woods, Digital Image Processing (Addison-Wesley, Reading, MA, 1992).
[28] R. Haralick and G. Kelly, Pattern recognition with measurement space and spatial clustering for multiple images, Proc. IEEE 57 (1969) 654-665.
[29] G. Healey, Color discrimination by computer, IEEE Trans. Syst. Man Cybern. 19, 6 (1989) 1613-1617.
[30] G. Healey, Using color for geometry insensitive segmentation, J. Opt. Soc. Am. A 6 (1989) 920-937.
[31] G. Healey, Estimating spectral reflectance using highlights, Image and Vision Comp. 9, 5 (1991) 333-337.
[32] G. Healey, Segmenting images using normalized color, IEEE Trans. Syst. Man Cybern. 22, 1 (1992) 64-73.
[33] G. Healey, Modeling color images for machine vision, in Advances in Image Processing and Machine Vision (Springer-Verlag, 1996).
[34] G. Healey and B. Dom, Pattern classification algorithms for real-time image segmentation, in Proc. Int. Conf. Pattern Rec., Atlantic City, 1990.
[35] G. Healey and A. Jain, Retrieving multispectral satellite images using physics-based invariant representations, IEEE Trans. Pattern Anal. Mach. Intell. 18, 8 (1996) 842-848.
[36] G. Healey and R. Jain, Physics-based machine vision: Introduction, J. Opt. Soc. Am. A 11, 11 (1994).
[37] G. Healey and R. Kondepudy, Radiometric CCD camera calibration and noise estimation, IEEE Trans. Pattern Anal. Mach. Intell. 16, 3 (1994) 267-276.
[38] G. Healey, S. Shafer and L. Wolff (eds.), Physics-Based Vision: Principles and Practice, COLOR (Jones and Bartlett, Boston, 1992).
[39] G. Healey and D. Slater, Global color constancy: recognition of objects by use of illumination-invariant properties of color distributions, J. Opt. Soc. Am. A 11, 11 (1994) 3003-3010.
[40] G. Healey and D. Slater, Computing illumination-invariant descriptors of spatially filtered color image regions, IEEE Trans. on Image Proc., to appear.
[41] G. Healey and L. Wang, Illumination-invariant recognition of texture in color images, J. Opt. Soc. Am. A 12, 9 (1995) 1877-1883.
2.3 Color in Computer Vision: Recent Progress 311 [42] H. Ho, B. Funt and M. Drew, Separating a color signal into illumination and surface reflectance components: Theory and applications, ZEEE Trans. Pattern Anal. Mach. Zntell. 12, 10 (1990) 966-977. [43] R. W. G. Hunt, The Reproduction of Color 3rd edn. (John Wiley, 1975). [44] A. Hurlbert, The Computation of Color, P h D Thesis, MIT - A1 Lab, 1989. [45] T. Jaaskelainen, J. Parkkinen and E. Oja, Color discrimination by optical pattern recognition, in Proc. Znt. Conf. Pattern Rec., 1986, 766-768. [46] D. Judd, D. MacAdam and G. Wyszecki, Spectral distribution of typical daylight as a function of correlated color temperature, J. Opt. SOC.Am. 54 (1964) 1031-1040. [47] D. Judd and G. Wyszecki, Color in Business, Science and Industry (John Wiley and Sons, 1975). (481 J. Kender, Saturation, hue, and normalized color: calculation, digitization effects, and use, Master’s Thesis, Dept of CS. Carnegie-Mellon University, 1976. [49] G. J. Klinker, S. A. Shafer and T. Kanade, The measurement of highlights in color images, Znt. J. Comp. Vision 2, 1 (1988) 7-32. [50] G. J. Klinker, S. A. Shafer and T. Kanade, A physical approach to color image understanding, Znt. J. Comp. Vision4, 1 (1990) 7-38. [51] R. Kondepudy and G. Healey, Use of invariants for recognition of three-dimensional color textures, J. Opt. SOC.Am. A 11, 11 (1994) 3037-3049. [52] E. L. Krinov, Spectral reflectance properties of natural formations, Technical translation TT-439, Technical report, National Research Council of Canada, 1947. [53] E. H. Land, Color vision and the natural image, in Proc. Natl. Acad. Sci. 45 (1959) 115-129, 636-645. [54] E. H. Land, The retinex theory of color vision, Scient. Am. 237 (1977) 108-128. [55] E. H. Land and J. J. McCann, Lightness and retinex theory, J. Opt. SOC.Am. 61 (1971) 1-11. 1561 H.-C. Lee, Method for computing the scene-illuminant chromaticity from specular highlights, J. Opt. SOC.Am. A 3, 10 (1986) 1694-1699. [57] H.-C. Lee, E. J. Breneman and C. P. Schulte, Modeling light reflection for computer color vision, IEEE Trans. Pattern Anal. Mach. Zntell. 12,4 (1988) 402-409. [58] Q.-T. Luong, Color in computer vision, in C. H. Chen, L. F. Pau and P. S. P. Wang (eds.), Handbook of Pattern Recognition and Computer Vision (World Scientific, 1993) 311-368. [59] L. T. Maloney, Evaluation of linear models of surface spectral reflectance with small numbers of parameters, J. Opt. SOC.Am. A 3, 10 (1986) 1673-1683. [60] J. J. McCann, S. P. McKee and T. H. Taylor, Quantitative studies in retinex theory, Vision Res. 16 (1976) 445-458. [61] M. Nagao, T. Matsuyama and Y. Ikeda, Region extraction and shape analysis in aerial photographs, Comp. Graphics and Image Proc. 10 (1979) 195-223. [62] R. Nevatia, A color edge detector and its use in scene segmentation, ZEEE Trans. Syst. Man Cybern. 7, 11 (1977) 82G826. [63] Y. Noguchi, Y. Tonjin and T. Sugishita, A method for segmenting a clump of cells into cellular characteristic parts using multispectral information, in Proc. Znt. Conf. Pattern Rec., 1978, 872-874. [64] C. Novak and S. Shafer, Method for estimating scene parameters from color histograms, J. Opt. SOC.Am. A 11, 11 (1994) 3020-3036. [65] C. Novak, S. Shafer and R. Willson, Obtaining accurate color images for machine vision research, in SPIE Proc. Vol. 1250 on Perceiving, Measuring, and Using Color, 1990 54-68, Santa Clara. Also appears in Physics-Based Vision: Principles and Practice, COLOR (Jones and Bartlett, Boston, 1992).
[66] R. Ohlander, K. Price and D. R. Reddy, Picture segmentation using a recursive region splitting method, Comp. Graphics and Image Proc. 8 (1978) 313-333.
[67] Y.-I. Ohta, T. Kanade and T. Sakai, Color information for region segmentation, Comp. Graphics and Image Proc. 13 (1980) 222-241.
[68] D. Panjwani and G. Healey, Selecting neighbors in random field models for color images, in Proc. IEEE Inter. Conf. Image Proc., Austin, 1994.
[69] D. Panjwani and G. Healey, Markov random field models for unsupervised segmentation of textured color images, IEEE Trans. Pattern Anal. Mach. Intell. 17, 10 (1995) 939-954.
[70] J. Parkkinen, J. Hallikainen and T. Jaaskelainen, Characteristic spectra of Munsell colors, J. Opt. Soc. Am. A 6 (1989) 318-322.
[71] J. Parkkinen, T. Jaaskelainen and M. Kuittinen, Spectral representation of color images, in Proc. Int. Conf. Pattern Rec., 1988, 933-935.
[72] R. W. Picard and T. P. Minka, Vision texture for annotation, Multimedia Syst. 3 (1995) 3-14.
[73] W. Pratt, Digital Image Processing (John Wiley and Sons, 1978).
[74] J. Reichman, Determination of absorption and scattering coefficients for nonhomogeneous media. 1: Theory, Applied Optics 12, 8 (1973) 1811-1815.
[75] A. Rosenfeld and L. Davis, Image segmentation and image models, Proc. IEEE 67, 5 (1979) 253-261.
[76] H. Rossotti, Colour: Why the World isn't Grey (Princeton University Press, 1983).
[77] A. Sarabi and J. K. Aggarwal, Segmentation of chromatic images, Pattern Recogn. 13 (1981) 417-427.
[78] S. Shafer, Using color to separate reflection components, Color Research and Application 10, 4 (1985) 210-218.
[79] D. Slater and G. Healey, The illumination-invariant recognition of 3D objects using local color invariants, IEEE Trans. Pattern Anal. Machine Intell. 18, 2 (1996) 206-210.
[80] M. C. Stone, W. B. Cowan and J. C. Beatty, Color gamut mapping and the printing of digital color images, ACM Trans. on Graphics 7, 4 (1988) 249-292.
[81] M. Swain and D. Ballard, Color indexing, Int. J. Comp. Vision 7 (1991) 11-32.
[82] G. Taubin and D. Cooper, Object recognition based on moment (or algebraic) invariants, in J. Mundy and A. Zisserman (eds.), Geometric Invariance in Computer Vision (MIT Press, Cambridge, Mass, 1992) 375-397.
[83] S. Tominaga and B. Wandell, Standard surface-reflectance model and illuminant estimation, J. Opt. Soc. Am. A 6, 4 (1989) 576-584.
[84] S. Tominaga and B. Wandell, Component estimation of surface spectral reflectance, J. Opt. Soc. Am. A 7, 2 (1990) 312-317.
[85] K. Torrance and E. Sparrow, Theory for off-specular reflection from roughened surfaces, J. Opt. Soc. Am. 57 (1967) 1105-1114.
[86] B. Wandell, The synthesis and analysis of color images, IEEE Trans. Pattern Anal. Mach. Intell. 9, 1 (1987) 2-13.
[87] L. Wang and G. Healey, Illumination and geometry invariant recognition of texture in color images, in Proc. IEEE Conf. Comp. Vision Patt. Rec., 1996, 419-424.
[88] W. A. Wright, A Markov random field approach to data fusion and colour segmentation, Image and Vision Computing 7, 2 (1989) 144-150.
[89] G. Wyszecki and W. S. Stiles, Color Science: Concepts and Methods, Quantitative Data and Formulas (John Wiley and Sons, 1967/1982).
CHAPTER 2.4

PROJECTIVE GEOMETRY AND COMPUTER VISION
ROGER MOHR
Gravir - Inria, 655 Av. de l'Europe, 38330 Montbonnot, France
This chapter surveys the contributions of projective geometry to computer vision. Projective geometry deals elegantly with the general case of perspective projection and therefore provides interesting understanding of the geometric aspect of image formation. It also provides useful tools like perspective invariants. First the major definitions and results of this geometry are presented. Applications are then provided for several domains of 3-D computer vision, including location of the viewer for uncalibrated cameras, properties of epipolar lines in stereovision and object recognition.
Keywords: 3-D vision, geometry, stereovision, invariant.
1. Introduction

The objectives of this chapter are to provide the reader with the geometric background which will enable him to understand the imaging system and to derive tools for working in 3-D vision. Here we consider the image formation system to be a pure perspective projection, i.e. the camera model we adopt is the pin-hole model. This is a good approximation of existing image acquisition systems. Sometimes this model has to be corrected by using radial corrections [1], and the photogrammetrists use even more sophisticated corrections [2]. It has to be mentioned that these last models mainly correct the image in order to simulate the pure pin-hole model. Therefore the properties of central projection have to be studied. This will be done in Section 2 for general geometrical considerations, and in Section 3 for the camera calibration problem. Applications to 3-D vision problems will then follow in Sections 4, 5 and 6. Larger developments of what is presented here can be found in the book [3] and for the particular case of invariants in [4].

1.1. Some Historical Considerations
The ancient Greek school of mathematicians already knew some mathematical properties of the projections. Using Thales' theorem (600 BC) they were able to prove that the cross ratio of four points on a line remains invariant after a perspective
projection (see Fig. 2). In fact the cross ratio is the key invariant in projective geometry, as we will see in Section 6. It is not clear when this result was established, but Apollonius of Perga (200 BC) was already using it. The Italian painters of the Renaissance in the 16th century mainly studied the properties of geometry in order to reproduce correctly on their pictures the projection of the three-dimensional world they were observing. They made large use of vanishing points, and derived some geometric construction techniques for their practical use, as for instance how to split a projected square into four equal subsquares, or how to find the projection of the corner of a parallelogram when the projections of two side surfaces are known.

Photography was discovered in France in 1839 (Niépce, Daguerre, Arago). At the same period of time, people studied independently how to make measurements using perspective drawings of scenes. The meeting of the two techniques led naturally towards photogrammetry. The mathematician Lambert headed a committee stemming from the Académie des Sciences de Paris for studying how three-dimensional measures could be made using two photographs (1859). From there Europe became a place of active development, particularly in Austria and Germany. Meydenbauer is considered as the one who got the first successful results and opened up the wide area of photogrammetry applications.

During the same century, the mathematicians developed a new kind of geometry which was able to deal with points at infinity and perspective projections, i.e. projective geometry. This study was almost completed at the beginning of the 20th century and it is now no longer considered as an investigation domain for mathematicians. However the 20th century mathematical contribution was to present it in a very clear algebraic way which can now be read in the excellent book by Semple and Kneebone [5].

1.2. What the Chapter is About

All of us have experienced, when looking at the image of a straight road, the feeling that the parallel borders meet at "the end of the road" (see Fig. 1). Walking outside in the night, we can also observe that the moon follows us, i.e. being far enough, its position in our coordinate frame is not modified by our limited translation. These are some of the properties of the strange world of projective geometry that we are going to address. Other differences are that the lengths and even the ratios of lengths are not preserved. Our goal will be to explore what remains invariant under perspective projections, and what are the properties that can be computed in order to provide quantitative information for computer vision. As we shall see, the underlying mathematics are easy enough to do symbolic computation, from which qualitative results can also be derived.
Fig. 1. Infinity points may be very present in perspective projection.
So after a first section on basic results and notations, we will show how they can be used to solve some 3-D vision problems. First we will discuss the geometry of a stereovision system. In Section 5, 3-D positioning from images is investigated, and finally a short presentation of how projective invariants can be used for object recognition is given in Section 6.

2. A Few Results from Projective Geometry

2.1. Preliminary Definitions

We provide here a short introduction to projective geometry definitions and vocabulary. The reader is referred to [5] for a gentle introduction or to [6] for advanced vision-oriented considerations on projective geometry. A new book [3] covers parts of what is presented here and presents many other geometric considerations on vision. We consider the (n+1)-dimensional space $\mathbb{R}^{n+1} \setminus \{(0, \ldots, 0)\}$ with the equivalence relation:
$$(x_1, \ldots, x_{n+1}) \sim (x'_1, \ldots, x'_{n+1}) \quad \text{iff} \quad \exists \lambda \neq 0 \ \text{such that} \ (x'_1, \ldots, x'_{n+1}) = \lambda (x_1, \ldots, x_{n+1}). \qquad (2.1)$$
The quotient space obtained from this equivalence relation is the projective space $P^n$. Thus the (n+1)-tuples of coordinates $(x_1, \ldots, x_{n+1})$ and $(x'_1, \ldots, x'_{n+1})$ represent the same point in the projective space. The usual n-dimensional affine space $\mathbb{R}^n$ is mapped into $P^n$ by the correspondence $\Psi$:
$$\Psi : (x_1, \ldots, x_n) \mapsto (x_1, \ldots, x_n, 1). \qquad (2.2)$$
Notice that $\Psi$ is a one-to-one mapping and that only the points represented by $(x_1, \ldots, x_n, 0)$ are not reached. $\Psi$ provides us with an understanding for the points
$(x_1, \ldots, x_n, 1)$, which can be viewed as the usual points in the Euclidean space; it also provides us with an intuitive understanding for the remaining points, if we consider $(y_1, \ldots, y_n, 0)$ as the limit of $(y_1, \ldots, y_n, \lambda)$ while $\lambda \to 0$, i.e. the limit of $(y_1/\lambda, \ldots, y_n/\lambda, 1)$. This is the limit of a point in $\mathbb{R}^n$ going to infinity in the direction $(y_1, \ldots, y_n)$. Therefore we will consider in the remainder $(y_1, \ldots, y_n, 0)$ as the point at infinity in this direction. A hyperplane H in $P^n$ is defined by the (n+1)-tuple of its homogeneous coefficients $H = (a_1, \ldots, a_{n+1})$. It defines the set of points whose coordinates satisfy
$$\sum_{i=1}^{n+1} a_i x_i = H X^t = 0.$$
A particular case is the hyperplane $x_{n+1} = 0$: this is the hyperplane containing all the points at infinity. A collineation or projective transformation is any mapping from $P^n$ into $P^n$ defined by a regular $(n+1) \times (n+1)$ matrix A such that the image of $(x_1, \ldots, x_{n+1})$ is defined in the usual way by:
$$\begin{pmatrix} y_1 \\ \vdots \\ y_{n+1} \end{pmatrix} = A \begin{pmatrix} x_1 \\ \vdots \\ x_{n+1} \end{pmatrix}.$$
Notice that as the column vector is defined up to a scaling factor, so is the matrix A. Collineations map hyperplanes onto hyperplanes and therefore lines onto lines. If B is the coefficient vector defining a projective hyperplane, i.e. the set of points X such that $B^t \cdot X = 0$, the image of this hyperplane is defined by $(A^{-1})^t \cdot B$, where $M^t$ denotes the transpose of M. A basis of the projective space $P^n$ is given by n+2 points, with no n+1 of them lying in the same hyperplane. The canonical basis usually chosen is $(1, 0, \ldots, 0), \ldots, (0, 0, \ldots, 0, 1)$ augmented with the "unity point" $(1, 1, \ldots, 1)$. For the regular 3-D projective space, these four first points are respectively the point at infinity on the x-axis, on the y-axis, on the z-axis and the origin. It has to be mentioned that a collineation has $(n+1) \times (n+1) - 1$ degrees of freedom. Knowing the image of each point of the basis provides us with n+1 equations up to a scaling factor, i.e. only n independent equations. So for the n+2 points in the basis, this provides us with $n^2 + 2n$ equations, exactly the number of unknowns for the collineation matrix. For a proof of the uniqueness of the solution, see [5]. Standard affine transformations map easily into projective transformations. For instance in the 3-D space the translation by the vector (a, b, c) is extended in $P^3$ by the collineation defined by the matrix
$$\begin{pmatrix} 1 & 0 & 0 & a \\ 0 & 1 & 0 & b \\ 0 & 0 & 1 & c \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$
In the general case an affine transformation in $\mathbb{R}^3$ is defined by a translation $t = (a, b, c)$ and a linear $3 \times 3$ matrix M in the vectorial 3-D space. The associated collineation is then defined by its matrix:
$$\begin{pmatrix} M & t^t \\ 0\ 0\ 0 & 1 \end{pmatrix}.$$
Notice in such a case that each point at infinity is mapped onto a point at infinity. Conversely, all the collineations mapping points at infinity onto points at infinity are affine transformations.

2.2. The Basic Projective Invariant
The cross ratio is the basic invariant in projective geometry: all other projective invariants can be derived from it [7].

Theorem 1. Let A, B, C, D be four collinear points; their cross ratio, defined as
$$[A, B; C, D] = \frac{\overline{AC}}{\overline{AD}} \cdot \frac{\overline{BD}}{\overline{BC}}, \qquad (2.3)$$
is invariant under any collineation.

The notation $\overline{AC}$ stands for the algebraic measure of the segment AC. This result was already established by the ancient Greek mathematicians. It can be extended to the projective space by using the following computation rules which deal with infinity:
$$\frac{\infty}{\infty} = 1, \qquad \frac{a}{\infty} = 0, \qquad \frac{\infty}{a} = \infty.$$
The cross ratio [A, B; C, D] does not rely on the choice of the unity vector taken on the line; in fact, changing the origin and the unity vector is just an affine transformation on this line. It is often easier to take barycentric coordinates, that is, considering each point as a row vector of dimension n+1 we can write:
$$C = A + \lambda_C B, \quad D = A + \lambda_D B, \quad B = \lim_{\lambda \to \infty} (A + \lambda B) : \lambda_A = 0, \ \lambda_B = \infty, \qquad (2.4)$$
and the cross ratio (2.3) is then rewritten:
$$[A, B; C, D] = \frac{\lambda_C - 0}{\lambda_D - 0} \cdot \frac{\lambda_D - \infty}{\lambda_C - \infty} = \frac{\lambda_C}{\lambda_D}. \qquad (2.5)$$
This theorem has an immediate application for locating a point on a line. Knowing three points, the position of the fourth one is uniquely defined by the cross ratio of these four points. So three points are a projective basis for the projective line.
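As a small illustration of Theorem 1, the following Python sketch (the function names are ours and purely illustrative) computes the cross ratio of four collinear points from signed abscissae along the line, and recovers the abscissa of the fourth point from the first three and a given cross ratio value:

    import numpy as np

    def cross_ratio(a, b, c, d):
        # [A, B; C, D] = (AC * BD) / (AD * BC), using signed abscissae on the line.
        return ((c - a) * (d - b)) / ((d - a) * (c - b))

    def fourth_point(a, b, c, k):
        # Solve [A, B; C, D] = k for the abscissa d of the fourth point:
        # k (d - a)(c - b) = (c - a)(d - b)  =>  d = (k (c - b) a - (c - a) b) / (k (c - b) - (c - a))
        num = k * (c - b) * a - (c - a) * b
        den = k * (c - b) - (c - a)
        return num / den

    if __name__ == "__main__":
        a, b, c, d = 0.0, 1.0, 2.0, 5.0
        k = cross_ratio(a, b, c, d)        # -> 1.6
        print(k, fourth_point(a, b, c, k)) # recovers d = 5.0

Any affine parametrisation of the line can be used here, since changing the origin and unit on the line does not change the cross ratio.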
Fig. 2. Cross ratio of a pencil of lines.
Let us now consider a pencil of four lines $L_i$, $i = 1, \ldots, 4$ (see Fig. 2). Let A, B, C, D be the intersections of this pencil with a first line L and A', B', C', D' be the intersections with a second line L'. The central projection mapping A onto A', B onto B', etc. is a projective mapping and therefore the cross ratios are the same:
$$[A, B; C, D] = [A', B'; C', D'].$$
Therefore the cross ratio of a pencil of lines can be defined as
$$[L_1, L_2; L_3, L_4] = [A, B; C, D],$$
where A, B, C, D are the intersections of the pencil with any line not passing through its center.
A more geometric proof can be established considering only the magnitudes of segment lengths and using standard Euclidean relations on triangles; it is proved that (see [5]):
$$\frac{\overline{AC}}{\overline{AD}} \cdot \frac{\overline{BD}}{\overline{BC}} = \frac{\sin(OA, OC)}{\sin(OA, OD)} \cdot \frac{\sin(OB, OD)}{\sin(OB, OC)}. \qquad (2.6)$$
Such a ratio is obviously not related to the secant line. Computing the cross ratio of a pencil of lines in the way suggested by this theorem is tedious. It is generally preferred to compute it using the following theorem.
Theorem 2. Let O be the origin of a pencil of lines $L_1, L_2, L_3, L_4$. Let $A_i$ be points on $L_i$, $A_i \neq O$. Then
$$[L_1, L_2; L_3, L_4] = \frac{|OA_1A_3| \, |OA_2A_4|}{|OA_1A_4| \, |OA_2A_3|},$$
where $|OA_iA_j|$ stands for the determinant of the $3 \times 3$ matrix whose columns are the homogeneous coordinates of the points O, $A_i$, $A_j$.
Indeed, writing $O = (a, b, 1)$, $A_i = (x, y, 1)$ and $A_j = (u, v, 1)$,
$$|OA_iA_j| = \begin{vmatrix} a & x & u \\ b & y & v \\ 1 & 1 & 1 \end{vmatrix} = \begin{vmatrix} a & x-a & u-a \\ b & y-b & v-b \\ 1 & 0 & 0 \end{vmatrix} = \overrightarrow{OA_i} \times \overrightarrow{OA_j},$$
which, using (2.6), leads to the final result.
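In practice the theorem reduces the computation of a pencil's cross ratio to four 3 x 3 determinants of homogeneous coordinates. A minimal numpy sketch, with our own function names, assuming the formula of Theorem 2 as stated above:

    import numpy as np

    def det3(o, ai, aj):
        # |O Ai Aj|: determinant of the matrix whose columns are the
        # homogeneous coordinates of O, Ai and Aj.
        return np.linalg.det(np.column_stack([o, ai, aj]))

    def pencil_cross_ratio(o, a1, a2, a3, a4):
        # Cross ratio of the four lines O-A1, O-A2, O-A3, O-A4 (Theorem 2).
        return (det3(o, a1, a3) * det3(o, a2, a4)) / \
               (det3(o, a1, a4) * det3(o, a2, a3))

    if __name__ == "__main__":
        o = np.array([0.0, 0.0, 1.0])
        # one point on each line of the pencil, all taken on the line x = 1
        a = [np.array([1.0, t, 1.0]) for t in (0.0, 1.0, 2.0, 5.0)]
        print(pencil_cross_ratio(o, *a))   # -> 1.6

Because every point enters once in the numerator and once in the denominator, the result does not depend on the arbitrary scale of the homogeneous coordinates.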
2.3. Projective Coordinates

We already mentioned that if three points A, B, C lie on a line L, each point D of L is uniquely defined by the cross ratio [A, B; C, D]. So this cross ratio is the projective coordinate of D with respect to the basis (A, B, C). In a projective plane $P^2$, any four points A, B, C, D (no three of them collinear) define a projective coordinate system (see Fig. 3). Given a point P of $P^2$, let $(x_1, x_2, x_3)$ be a triplet of real numbers defined up to a scaling factor and such that
$$\frac{x_1}{x_2} = [CA, CB; CD, CP], \qquad (2.7)$$
$$\frac{x_2}{x_3} = [AB, AC; AD, AP]. \qquad (2.8)$$
Fig. 3. Projective coordinates in the plane.
$(x_1, x_2, x_3)$ are called the projective coordinates of P in the coordinate system (A, B, C, D). Naturally we also have $x_3/x_1 = [BC, BA; BD, BP]$. In a projective plane with four known points we can thus uniquely reference any point of the plane by its projective coordinates. In fact only the two cross ratios $k_1 = x_1/x_2$ and $k_2 = x_2/x_3$ are necessary to uniquely define a point, as long as this point does not lie on the line AC.
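The two cross ratios (2.7) and (2.8) can be evaluated directly from homogeneous coordinates with the determinant formula of Theorem 2. The following sketch (function names and the numerical example are ours) returns the pair (k1, k2) used as projective coordinates of P with respect to the basis (A, B, C, D):

    import numpy as np

    def pencil_cr(o, p1, p2, p3, p4):
        # cross ratio of the pencil of lines O-P1, O-P2, O-P3, O-P4 (Theorem 2)
        d = lambda x, y: np.linalg.det(np.column_stack([o, x, y]))
        return (d(p1, p3) * d(p2, p4)) / (d(p1, p4) * d(p2, p3))

    def projective_coordinates(a, b, c, d, p):
        # k1 = [CA, CB; CD, CP] and k2 = [AB, AC; AD, AP], Eqs. (2.7)-(2.8)
        k1 = pencil_cr(c, a, b, d, p)
        k2 = pencil_cr(a, b, c, d, p)
        return k1, k2

    if __name__ == "__main__":
        A = np.array([0.0, 0.0, 1.0]); B = np.array([1.0, 0.0, 1.0])
        C = np.array([0.0, 1.0, 1.0]); D = np.array([1.0, 1.0, 1.0])
        P = np.array([0.3, 0.6, 1.0])
        print(projective_coordinates(A, B, C, D, P))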
2.4. Cross Ratio and Conics
We consider here the projective plane $P^2$. The general equation for a conic is the second degree homogeneous polynomial
$$a_{11}x_1^2 + a_{22}x_2^2 + a_{33}x_3^2 + a_{12}x_1x_2 + a_{13}x_1x_3 + a_{23}x_2x_3 = 0 \qquad (2.9)$$
which can be rewritten
$$a'_{11}x_1x_1 + a'_{12}x_1x_2 + a'_{13}x_1x_3 + a'_{21}x_2x_1 + a'_{22}x_2x_2 + a'_{23}x_2x_3 + a'_{31}x_3x_1 + a'_{32}x_3x_2 + a'_{33}x_3x_3 = 0. \qquad (2.10)$$
Equation (2.10) can be written as a matrix product
$$X A X^t = 0$$
with A being a $3 \times 3$ symmetrical homogeneous matrix; its number of degrees of freedom is therefore 5.
Theorem 3. [Chasles] In the projective plane, let A, B, C, D be four points, no three of them being collinear. The locus of the centers of the pencils of lines passing through these four points and having a given cross ratio is a conic. Conversely, if M lies on a conic passing through A, B, C and D, the cross ratio of the four lines MA, MB, MC, MD is independent of the point M on the conic. Figure 4 illustrates this result: if P and Q are two points on a conic, we have
$$[PA, PB; PC, PD] = [QA, QB; QC, QD].$$
This conic can easily be derived from the data. Let $L_{AB}$ be the first degree polynomial equation of the line going through A and B:
$$L_{AB} = (x_A - x)(y_A - y_B) - (y_A - y)(x_A - x_B).$$
Notice then that
$$C_\lambda = \lambda L_{AB} L_{CD} + (1 - \lambda) L_{AC} L_{BD} \qquad (2.11)$$
defines a second degree polynomial in x and y which has A, B, C and D as roots. $C_\lambda$ defines exactly the family of conics passing through these four points. A direct computation proves that $C_k$ is Chasles' conic for the cross ratio k.
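Equation (2.11) turns the construction of this family of conics into a few line evaluations. The sketch below (names ours, plain Python) builds $C_\lambda$ from the four base points and checks that all four are roots, whatever the value of lambda:

    def line_poly(pa, pb):
        # First degree polynomial L_AB of the line through pa and pb:
        # L_AB(x, y) = (x_A - x)(y_A - y_B) - (y_A - y)(x_A - x_B)
        (xa, ya), (xb, yb) = pa, pb
        return lambda x, y: (xa - x) * (ya - yb) - (ya - y) * (xa - xb)

    def conic_pencil(a, b, c, d, lam):
        # C_lambda = lam * L_AB * L_CD + (1 - lam) * L_AC * L_BD, Eq. (2.11)
        lab, lcd = line_poly(a, b), line_poly(c, d)
        lac, lbd = line_poly(a, c), line_poly(b, d)
        return lambda x, y: lam * lab(x, y) * lcd(x, y) + (1 - lam) * lac(x, y) * lbd(x, y)

    if __name__ == "__main__":
        A, B, C, D = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)
        conic = conic_pencil(A, B, C, D, lam=0.4)
        # the four base points are roots of C_lambda for any lambda
        print([abs(conic(*p)) < 1e-12 for p in (A, B, C, D)])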
Fig. 4. Invariance of cross ratio on a conic.
3. Camera Calibration

"Camera calibration" is the process of computing the projective mapping (i.e. a $3 \times 4$ homogeneous matrix) from the 3-D space onto the image. Once the camera parameters are known, the problem of 3-D reconstruction becomes much easier (see for instance [8,9,10] for recent contributions). Notice that each 3-D point with coordinates $X = (X_1, X_2, X_3, 1)$ projects in the image onto a point $x = (\lambda x_1, \lambda x_2, \lambda)$. The projection matrix $M = (m_{ij})$ is defined up to a scaling factor and this leads to the following two equations:
$$x_1 = \frac{m_{11}X_1 + m_{12}X_2 + m_{13}X_3 + m_{14}}{m_{31}X_1 + m_{32}X_2 + m_{33}X_3 + m_{34}}, \qquad x_2 = \frac{m_{21}X_1 + m_{22}X_2 + m_{23}X_3 + m_{24}}{m_{31}X_1 + m_{32}X_2 + m_{33}X_3 + m_{34}}. \qquad (3.1)$$
So if M is known, the image point provides us with the view line defined by the two linear equations (3.1) in the space coordinates. Conversely, when a calibration point is known together with its image, it provides us with two linear equations relating the camera parameters. There are 12 unknowns up to a scaling factor, and therefore the number of degrees of freedom is 11. So at least 11 independent equations are needed, i.e. at least six calibration points. Notice that finding M is a linear problem and therefore it can easily be approached with standard least squares methods, using redundant data in order to correct for measurement noise.
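A minimal numpy sketch of this linear least squares estimation (our own function names; no data normalisation, outlier handling or radial correction is included): each calibration point contributes the two equations (3.1) in the twelve entries of M, and the unit-norm solution of the stacked homogeneous system is taken as the right singular vector associated with the smallest singular value.

    import numpy as np

    def calibrate_dlt(X, x):
        # X: (N, 3) 3-D calibration points, x: (N, 2) image points, N >= 6.
        # Returns the 3x4 projection matrix M, defined up to scale.
        rows = []
        for (X1, X2, X3), (u, v) in zip(X, x):
            P = [X1, X2, X3, 1.0]
            rows.append(P + [0.0] * 4 + [-u * p for p in P])
            rows.append([0.0] * 4 + P + [-v * p for p in P])
        A = np.asarray(rows)             # (2N, 12) homogeneous system A m = 0
        _, _, Vt = np.linalg.svd(A)
        return Vt[-1].reshape(3, 4)      # singular vector of the smallest singular value

    if __name__ == "__main__":
        M_true = np.array([[800., 0., 320., 10.],
                           [0., 800., 240., 20.],
                           [0., 0., 1., 1.]])
        X = np.random.rand(10, 3) * 2 + np.array([0.0, 0.0, 3.0])
        Xh = np.hstack([X, np.ones((10, 1))])
        proj = Xh @ M_true.T
        x = proj[:, :2] / proj[:, 2:]
        M = calibrate_dlt(X, x)
        print(np.allclose(M / M[2, 3], M_true / M_true[2, 3]))   # True for exact synthetic data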
Fig. 5. The intrinsic or interior parameters.
It is often proposed to use methods that separate what are called the intrinsic or interior parameters, which depend only on the camera, from the parameters which depend only on position; the latter are therefore called extrinsic or exterior parameters. The first terminology comes from the computer vision community (see for instance [11] or [12]), but the second is more than 50 years old and comes from the photogrammetrists. Figure 5 depicts the standard reference frame for a pure perspective imaging system. The image plane has its own reference frame $(\vec{u}, \vec{v})$. The image plane is at a distance f from the principal point O; f is called the principal axis distance, but is also sometimes improperly called focal length. O is the origin of the camera reference frame $(\vec{x}, \vec{y}, \vec{z})$, which is oriented as in the figure. If the 3-D world reference frame coincides with the camera frame and if the camera coordinate axes $(\vec{x}, \vec{y})$ are parallel to the axes of the image frame, the perspective projection matrix P can be written as:
$$P = \begin{pmatrix} k_u & 0 & u_0 & 0 \\ 0 & k_v & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}.$$
$k_u$ and $k_v$ are the scaling factors; they depend on the value of f and on the image scaling factor along each image axis. $u_0$ and $v_0$ are the coordinates of the projection of the optical center onto the image plane. It has to be noted that if the image frame is not orthogonal, as it is in Fig. 5, the total number of these camera parameters is then five: $u_0$, $v_0$, $k_u$, $k_v$ and the angle $\alpha$. So if we add the six degrees of freedom for the three-dimensional positioning of this camera frame in a world frame, we reach the total number of 11 degrees of freedom for the camera, which is to be related to the 11 degrees of freedom of a projective transformation from $P^3$ onto $P^2$.
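For illustration only, the sketch below builds the 3 x 4 matrix P in the form reconstructed above and projects a camera-frame point; the numerical values of $k_u$, $k_v$, $u_0$, $v_0$ are ours:

    import numpy as np

    def intrinsic_projection(ku, kv, u0, v0):
        # 3x4 perspective projection matrix P for a camera whose frame coincides
        # with the world frame and whose image axes are orthogonal.
        return np.array([[ku, 0.0, u0, 0.0],
                         [0.0, kv, v0, 0.0],
                         [0.0, 0.0, 1.0, 0.0]])

    def project(P, X):
        # X: 3-D point in the camera frame; returns pixel coordinates.
        x = P @ np.append(X, 1.0)
        return x[:2] / x[2]

    if __name__ == "__main__":
        P = intrinsic_projection(ku=800.0, kv=780.0, u0=320.0, v0=240.0)
        print(project(P, np.array([0.1, -0.05, 2.0])))   # -> [360., 220.5]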
The general projection matrix P' can be written as the product of a Euclidean transformation with the projection matrix P, the Euclidean transformation mapping the world coordinate frame into the camera coordinate frame:
$$P' = \begin{pmatrix} k_u & 0 & u_0 & 0 \\ 0 & k_v & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} R & T \\ 0\ 0\ 0 & 1 \end{pmatrix}. \qquad (3.2)$$
Given an estimation for P', the computation of the estimates for P and for the rotation R and translation T satisfying (3.2) is a less easy and less stable problem. References [11] or [12] provide solutions, and other new methods have been developed using for instance the properties of vanishing points [13]. It has to be pointed out that, even in the case of a moving camera, such a decomposition is unnecessary. For instance, if the camera is calibrated in two positions with the resulting projection matrices P' and P'', the matrix associated to the motion to be estimated is just a rotation and translation matrix containing an orthogonal rotation matrix R and a translation T such that
$$P'' = P' \times \begin{pmatrix} R & T \\ 0\ 0\ 0 & 1 \end{pmatrix}.$$
4. Application to Stereovision

As was presented in the previous section, a point in an image corresponds to a line in space, and this line is completely determined if the system is calibrated. Let us now consider the case of two cameras observing the same scene. If we are able to find the projections in the two images of a 3-D point M, then its position in space is simply defined by the intersection of the two lines associated with each image. Usually such an intersection is computed using least squares methods, as errors are always introduced in all the steps of the process: calibration, image acquisition, image point determination. Stereovision thus implies several steps:
- camera calibration,
- determination of point correspondences from one image to the other,
- 3-D reconstruction.
The reader is referred to books concerned with the subject, like the one describing the pioneering work of Marr and Grimson [14] or the book written by Ayache where he describes the use of three cameras [15].
4.1. The Epipolar Geometry

We only address here the problem of the geometry of such a stereoscopic system. Figure 6 displays the configuration. The 3-D point M observed by the two cameras defines a plane which intersects the image planes respectively along the lines l and l'. Notice that when M moves, all these lines l pass through the intersection of the
Fig. 6. Epipolar lines for two images.
image plane P and the line OO'. This point e is called the epipole of image 1 with respect to image 2. Similarly, e' is the epipole of image 2 with respect to image 1. l and l' are the corresponding epipolar lines for these two images. Let m and m' be the projections of M on each image. It has to be noticed that the line Om is projected on image 2 as the line l', so each possible point m' corresponding to the projection m of M has to be on the epipolar line l' associated to l. Such a geometrical constraint nicely reduces the search for corresponding matches in the two images. Consider now the two pencils of epipolar lines when M moves through the space. Let us consider four points M providing four distinct corresponding epipolar lines l and l'. As these pencils are obtained by the intersection of the pencil of planes passing through OO', their cross ratios are the same; in fact this cross ratio is by definition the cross ratio of the pencil of planes. Therefore the pencils of corresponding epipolar lines are in a projective correspondence. In conclusion, if three of the corresponding epipolar lines are known, the epipolar line corresponding to a fourth line is deduced in a straightforward manner from the cross ratio of the four lines in the pencil. The problem of epipolar geometry has seven degrees of freedom: 2 x 2 for the coordinates of the epipoles e and e', and three for the three epipolar lines in the second image corresponding to three arbitrary epipolar lines going through e in the first image. So, the correspondence of seven points between the two images is enough for defining the epipolar geometry. How it can be done effectively was recently established by Maybank and Faugeras [16].
4.2. A Linear Computation of the Epipolar Geometry [3]
In the case of eight point matches between the two images, the computation of the epipolar geometry becomes much simpler. This elegant construction is inspired by [17]. Let $m = (x, y, z)$ be a point in the first image and let $e = (u, v, w)$ be the epipole with respect to image 2. The three homogeneous coordinates (a, b, c) of the epipolar line l going through e and m are $m \times e$, where $\times$ denotes the cross product: obviously $l \cdot m^t = l \cdot e^t = 0$. The mapping $m = (x, y, z)^t \mapsto m \times e = (a, b, c)^t$ is linear and can be represented by a matrix C of rank 2:
$$\begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} yw - zv \\ zu - xw \\ xv - yu \end{pmatrix} = \begin{pmatrix} 0 & w & -v \\ -w & 0 & u \\ v & -u & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = Cm. \qquad (4.1)$$
The mapping of each epipolar line l from image 1 to its corresponding epipolar line l' in image 2 is a collineation defined in the dual space of lines in $P^2$. Let A be one such collineation: $l'^t = A l^t$. A is defined by the correspondence of three distinct epipolar lines. The first two correspondences provide four constraints, as the number of degrees of freedom of a line is 2. As the third line in correspondence belongs to the pencil defined by the first two, the third correspondence only adds one more constraint. So A only has five constraints for eight degrees of freedom. Let E = AC. Using (4.1) we get
$$l'^t = A l^t = A C m = E m.$$
As A has rank 3 and C has rank 2, E has rank 2. As the kernel of C is obviously $\lambda e \sim e$, the epipole is the kernel of E. Let m' be the corresponding point of m. Using (4.1), the epipolar constraint $m' \cdot l'^t = 0$ can be rewritten $m' E m = 0$. So each matching between the two images provides a constraint on E, and as E is defined up to a scaling factor, eight independent constraints allow us to linearly compute E and therefore to get the epipolar geometry. Notice that E is defined by seven degrees of freedom: C has two (the epipole position) and A has five. But allowing a redundant set of constraints provides a unique solution which can be linearly computed.
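A compact numpy sketch of this linear estimation (function names ours; no coordinate normalisation or noise handling is shown): each match (m, m') contributes one linear constraint m'Em = 0 on the nine entries of E, the unit-norm solution is taken from the SVD of the stacked system, the rank-2 property is then enforced, and the epipole is read off as the kernel of E.

    import numpy as np

    def estimate_E(m, mp):
        # m, mp: (N, 3) homogeneous image points in images 1 and 2, N >= 8,
        # with mp[i] E m[i] = 0 for every match.
        A = np.array([np.outer(p2, p1).ravel() for p1, p2 in zip(m, mp)])   # (N, 9)
        _, _, Vt = np.linalg.svd(A)
        E = Vt[-1].reshape(3, 3)
        U, s, Vt2 = np.linalg.svd(E)                 # enforce rank 2
        E = U @ np.diag([s[0], s[1], 0.0]) @ Vt2
        epipole = Vt2[-1]                            # kernel of E: E @ epipole = 0 (homogeneous)
        return E, epipole

With noisy matches one would use many more than eight correspondences and normalise the image coordinates before building the system; neither refinement is shown here.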
4.3. Bringing Epipolar Lines in Parallel: Image Rectification
A visually interesting case occurs when two image planes are parallel: the epipoles are at infinity and the epipolar lines are therefore parallel. Many people working in computer vision use image planes which are the same for the two images, which is an even stronger constraint than parallelism. This makes computation a bit simpler, but adds an unnecessary technical constraint to the vision system. As these constraints are hardly satisfied with the needed precision, this has to be avoided for real applications where the goal is precision in reconstruction.
If parallel epipolar lines are wanted for easy human visual matching, one would prefer the following method:
- calibration of the stereovision system,
- computation of the epipoles,
- computation of the image transformation which provides parallel epipolar lines,
- reconstruction of all the image features after such a transformation.
The transformation computation is easy (see [15] for another explanation and illustration). Let $e = (a, b, 1)^t$ be the epipole of the first image. We have to find a projective mapping which sends e to infinity along the image x axis, i.e. a $3 \times 3$ homogeneous matrix A such that
$$\begin{pmatrix} \lambda \\ 0 \\ 0 \end{pmatrix} = A \begin{pmatrix} a \\ b \\ 1 \end{pmatrix}.$$
The images of four points completely define A. As only one is presently set, we can add some more constraints, for instance leave three of the four corners of the image invariant. Such a choice prevents the features from moving too far out of the image border lines. The same process can then be applied to the second image. However, people prefer to have the epipolar lines not only parallel to the x axis, but also such that corresponding epipolar lines have the same coordinates. Therefore we choose three epipolar lines in the first rectified image, each having the equation $y = y_i$. On the corresponding epipolar lines on image 2, three points can be chosen and their images specified with coordinates $(x_i, y_i)$, where the $x_i$ are arbitrarily chosen. This allows enough freedom to have the rectified second image with reasonable coordinates. Figure 7 displays such rectified images. It is interesting to notice that such rectifications were already done optically with old photogrammetric material a hundred years ago: as a perspective projection is just the general case of a projective transformation, this was done by choosing interactively a new projection of the previous image.
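The rectifying mapping A is an ordinary plane collineation fixed by four point correspondences, one of which sends the epipole to the point at infinity (1, 0, 0)^t. The sketch below (names and numerical values are ours; the three anchor points are arbitrary non-degenerate choices rather than the image corners mentioned above) solves for A linearly using a cross-product form of the constraints, which remains valid when a target point is at infinity:

    import numpy as np

    def collineation_4pt(src, dst):
        # src, dst: four homogeneous 2-D points (3-vectors) with dst[i] ~ A @ src[i].
        # Works even when some target points are at infinity (as for the epipole).
        rows = []
        for p, q in zip(src, dst):
            u, v, t = q
            rows.append(np.concatenate([np.zeros(3), -t * p, v * p]))
            rows.append(np.concatenate([t * p, np.zeros(3), -u * p]))
        _, _, Vt = np.linalg.svd(np.asarray(rows))
        return Vt[-1].reshape(3, 3)

    if __name__ == "__main__":
        e = np.array([700.0, 300.0, 1.0])                     # illustrative epipole of image 1
        anchors = [np.array([0.0, 0.0, 1.0]),                 # three reference points kept fixed,
                   np.array([511.0, 100.0, 1.0]),             # chosen to avoid a degenerate
                   np.array([100.0, 511.0, 1.0])]             # configuration with the target at infinity
        src = anchors + [e]
        dst = anchors + [np.array([1.0, 0.0, 0.0])]           # send e to infinity along the x axis
        A = collineation_4pt(src, dst)
        print(A @ e)                                          # approximately proportional to (1, 0, 0)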
4.4. The Transfer Problem

Having located a 3-D point in two images, the transfer problem is to determine how it can be located in a third one. In order to solve it, we need some knowledge of the three images, and here we assume that we have the matches of several features within the three images. Computing the epipolar geometry of the imaging systems provides a direct solution. Let $p_i$ be the location of the considered point in image i. $p_1$ and $p_2$ are already located. So $p_3$ has to be on the epipolar line corresponding to $p_1$ in image 3, and on the epipolar line corresponding to $p_2$. Thus it lies at their intersection. This fact is widely used in trinocular stereovision [15]. The fact that the epipolar geometry can be computed with at least seven point matches was already mentioned in the previous subsection. However a simple case
Fig. 7. Rectification of two stereo images. The corresponding epipolar lines are now horizontal lines with equal coordinates (courtesy of N. Ayache).
can be considered here when only six such matches are known in the three images, and four of these points correspond to coplanar 3-D points. In what remains of this section we suppose that we shall never encounter degenerate cases (for instance two lines coinciding instead of the general case of two different lines). Let A, B, C, D be the four coplanar 3-D points and F, G the two remaining reference points. $O_i$, i = 1, 2, 3, are the principal points of each of the three imaging systems we consider. a', a'', a''' are the projections of A in images 1, 2, 3. The intersection F' of the view line $O_1F$ with the plane ABCD is defined by its projective coordinates measured in image 1, taking the projections a', b', c', d' as reference frame (see Fig. 8). Now consider image 2. Using a'', b'', c'', d'' and the projective coordinates of F', we can locate its projection in the second image. As we also have the image f'' of F, we have therefore in image 2 the projections of two points from the line $O_1F$, i.e. we have the epipolar line associated with F (see Fig. 8). If we proceed similarly with G, the intersection of these two epipolar lines provides the epipole $e_{12}$ of image 2 with respect to image 1. Of course the process is symmetrical and allows us to find the epipole $e_{21}$ of image 1 with respect to image 2. Three epipolar lines are needed to complete the epipolar correspondence: using the
Fig. 8. Reconstruction of the epipolar geometry.
reference point matches we have plenty of them: $e_{21}a_1$ with $e_{12}a_2$, $e_{21}b_1$ with $e_{12}b_2$, and so on. Now consider the third image. From the epipolar geometry the position of each point matched in the first two images is straightforward. Using the previous construction, the epipolar geometry between images 1 and 3 and between images 2 and 3 is constructed, and the epipolar lines corresponding to the locations of the point in images 1 and 2 intersect in only one possible position.

5. Application to 3-D Positioning

5.1. Relative Positioning

Let us consider first the simple case of four points $P_i$ lying on a line and viewed in an image where we can compute the cross ratio. We know that the fourth point is uniquely defined from the position of the first three and from the computed cross ratio. So, taking the first point as the origin, the position of the fourth point can be expressed using as parameters the positions of the second and the third one, and using the cross ratio. The resulting expression is
$$x_4 = \frac{x_2 x_3}{x_3 - k(x_3 - x_2)}, \qquad k = [P_1, P_2; P_3, P_4]. \qquad (5.1)$$
This simple example shows how relative positioning is possible. A similar construction can be done in the plane using the projective coordinates defined by the cross ratios (2.7) and (2.8). In the simpler case when the four points are the vertices of a parallelogram, the relation simplifies as we can easily choose two sides as reference axes (see Fig. 9).
Fig. 9. Relative positioning using a parallelogram.
In such a case the position of P in the frame $(A, \overrightarrow{AB}, \overrightarrow{AC})$ is easily deduced:
$$x = \frac{k_1}{k_1 + k_2}, \qquad y = \frac{1}{k_1 + k_2}, \qquad k_1 = [AB, AC; AD, AP], \ k_2 = [BA, BD; BC, BP]. \qquad (5.2)$$
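A numpy sketch of this relative positioning (function names ours; we assume the parallelogram is labelled so that AB and AC are two sides and D is the corner opposite A, which is our reading of Fig. 9): the two cross ratios of (5.2), as reconstructed above, are measured directly in the image with the determinant formula of Theorem 2, and the affine coordinates of P follow.

    import numpy as np

    def pencil_cr(o, p1, p2, p3, p4):
        # cross ratio of the pencil of lines O-P1, O-P2, O-P3, O-P4 (Theorem 2)
        d = lambda x, y: np.linalg.det(np.column_stack([o, x, y]))
        return (d(p1, p3) * d(p2, p4)) / (d(p1, p4) * d(p2, p3))

    def position_in_parallelogram(a, b, c, d, p):
        # a, b, c, d, p: homogeneous image projections of A, B, C, D and P,
        # with D the parallelogram corner opposite A and P coplanar with ABCD.
        k1 = pencil_cr(a, b, c, d, p)    # [AB, AC; AD, AP]
        k2 = pencil_cr(b, a, d, c, p)    # [BA, BD; BC, BP]
        return k1 / (k1 + k2), 1.0 / (k1 + k2)   # coordinates of P in (A, AB, AC)

    if __name__ == "__main__":
        H = np.array([[1.1, 0.2, 5.0], [0.1, 0.9, 3.0], [1e-3, 2e-3, 1.0]])  # some plane-to-image homography
        world = {"A": (0, 0), "B": (1, 0), "C": (0, 1), "D": (1, 1), "P": (0.3, 0.7)}
        img = {k: H @ np.array([u, v, 1.0]) for k, (u, v) in world.items()}
        print(position_in_parallelogram(img["A"], img["B"], img["C"], img["D"], img["P"]))
        # -> approximately (0.3, 0.7)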
As no 3-D position can be deduced from a single image [18], extra assumptions have to be added: the alignment in space of four points for (5.1), or coplanarity for (2.7) and (2.8).

5.2. Where is the Camera?
We consider here the problem of finding the location of the principal point (sometimes called optical center) of the viewing system. We first consider the case of seeing six points in the scene, with four of them coplanar. First we derive the view line associated with an image point relative to a reference frame in the scene. Let m be the projection of a point M on an image where the planar configuration ABCD is projected as abcd (see Fig. 10). As we mentioned (Section 2.3), we can compute the projective coordinates of f with respect to the basis a, b, c, d. From the definition, these coordinates are the same for F', where the view line OF intersects the plane ABCD. So the view line goes through F and F' and is defined. Proceeding in a similar way, we can compute the view line passing through E, and therefore the principal point O is the intersection of these two lines.
Fig. 10. The back projection of the image point m.
Having the 3-D position of O, we easily deduce the view line associated with each point m in the image. Such a computation can also be done using non-coplanar points, but the demonstration is a bit tedious and the reader is referred to [19] for the details. The gist of the technique can however be conveyed using a planar configuration: we suppose here that the image is restricted to a line and that we are taking such a picture in a planar world. We observe five reference points A, B, C, D, E with their images a, b, c, d, e. Measuring the cross ratio [a, b; c, d], we deduce from Chasles' theorem that the principal point lies on a conic passing through A, B, C and D and completely defined by [a, b; c, d] (see Fig. 11). We can do the same with A, B, C, E, and the two conics intersect in four points, three of them already known: A, B, C. So the remaining intersection is the desired position and is computed algebraically from formula (2.11).
Fig. 11. The camera location lies at the two conic intersections.
5.3. Choosing Points as References in the Scene
The techniques presented may lead to 3-D position estimation of points in the scene. However such an estimation needs at least two images when no constraints on the scene are given. We describe here a simple experiment computing 3-D locations from two views.
Fig. 12. Contour image of a scene.
Let us consider the scene described in Fig. 12. It displays contours of an image taken at a distance of approximately 1 m with a regular Pulnix camera. Contours were fitted with straight lines and corner point coordinates were computed as intersections of these lines. The same process was applied to a second image. Taking as reference points six points from the background rectangles, the view line associated with each contour was computed using the technique presented in the previous section. Matches between the images were performed by hand, as matching was not the primary concern of the present study. The intersection of the view lines associated with two corresponding image points was then computed using least squares, and this provided us with the 3-D coordinates of the corresponding point in the scene. Table 1 describes the results for the cube vertices. The exact location has no real meaning; it corresponds to the reference frame of the chosen points, whose locations were measured with a standard ruler. Much more interesting are the edge lengths computed from these coordinates. As the exact size of the cube is 50 mm, the computed results are accurate within 4% of that value, and this without camera modeling and subpixel edge extraction.
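For completeness, a small numpy sketch of the last step of such an experiment (names ours): each matched image point gives a 3-D view line, represented here by a point and a direction, and the reconstructed point is taken as the least squares intersection, i.e. the point minimising the sum of squared distances to the lines.

    import numpy as np

    def closest_point_to_lines(points, dirs):
        # points[i], dirs[i]: a point on line i and its direction.
        # Returns the 3-D point minimising the sum of squared distances to all lines.
        A = np.zeros((3, 3))
        b = np.zeros(3)
        for p, d in zip(points, dirs):
            d = d / np.linalg.norm(d)
            M = np.eye(3) - np.outer(d, d)   # projector orthogonal to the line direction
            A += M
            b += M @ p
        return np.linalg.solve(A, b)

    if __name__ == "__main__":
        # two view lines that intersect exactly (illustrative numbers)
        P = [np.array([0.0, 0.0, 0.0]), np.array([10.0, 0.0, 0.0])]
        D = [np.array([1.0, 1.0, 1.0]), np.array([-1.0, 1.0, 1.0])]
        print(closest_point_to_lines(P, D))   # -> (5, 5, 5)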
Table 1. Experimental results for 3-D reconstruction of the cube.

    Point     x      y      z          Edge    Length
    0        78.9   140    48.5        0-1     50.5
    1        79.1   141    -2.0        0-2     49.1
    2        81.3   189    47.5        0-6     48.9
    3        82.0   188    -1.5        1-3     47.1
    4        33.2   195    48.5        2-3     49.5
    5        34.4   194    -1.5        2-4     48.9
    6        30.3   145    49.0        3-5     48.0
                                       4-5     49.8
                                       4-6     50.1
6. Recognition Using Projective Invariants

In order to classify patterns, standard pattern recognition techniques use numerical measures which are invariant under the experimental conditions, like for instance the movement of the observing camera (see [20] or Chapter 1.2 "Statistical Pattern Recognition" in this book). Therefore such invariants can be applied directly in classification methods. More recently researchers developed indexing for selecting a subset of possible candidate models using hashing techniques based on geometric invariants [21,22]. The interesting point with the geometrical approach is that partial information in the image is sufficient to recover points, straight lines, conics, etc. and therefore to compute the invariants. This is not the case for standard global invariant measures like moments. We will first explore some results of invariant theory and derive from there some invariants in the second subsection.
6.1. Results on Invariant Theory

Only a simple introduction to this theory can be provided in this chapter. The reader is referred to standard textbooks on the subject, like the second part of [5], or to the more vision application-oriented ones [4]. Let G be a group which acts on a set E, and $\circ$ the composition operator of G. For instance G can be the Euclidean transformations in the plane and E the set of circles in this plane. $\circ$ is in this case the composition of such transformations. G acts on E means that
$$\forall g, g' \in G, \ \forall x \in E: \quad (g \circ g')(x) = g(g'(x)).$$
Finding an invariant for E means computing a measure m that is constant for all $x \in E$: $\forall x, x' \in E, \ m(x) = m(x')$. Of course if $m(\cdot)$ is such a function, so is $f(m(\cdot))$. We are only interested in such independent invariant functions.
Let us consider the example of E being the set of circles with radius r. E is generated from a single circle by applying all the Euclidean transformations. There is an obvious invariant here: the radius r. The area a is also an invariant, but it is not independent of r: $a = \pi r^2$. On the other hand if we consider the set of all points in the plane under the Euclidean transformations, there are no invariants. Before stating the basic result, we need a few more notations. Let $D_G$ be the degree of freedom of G (i.e. more formally its dimension). Let $D_x$ be the degree of freedom of the subgroup which leaves an element $x \in E$ invariant (such a subgroup is called the isotropic subgroup of x), and let $D_E$ be the degree of freedom of E; then the number I of independent invariants is:
$$I = D_E - (D_G - \min_{x \in E} D_x). \qquad (6.1)$$
In the previous example, the degree of freedom for the planar Euclidean transformations is 3, a circle is defined by three parameters, and the subgroup which leaves a circle invariant is the one-dimensional subgroup of rotations centered at the circle center. So we get one independent invariant. Notice that we are only dealing with groups here. So we are not addressing the problem of the projection of the 3-D space into a 2-D image. But this result is applicable to planar shapes projected onto an image: the group is then the group of 2-D collineations. In fact there is no invariant in the case of 2-D projection of 3-D data [18] without additional conditions such as coplanarity.

6.2. Computing the Invariant Using the Cross Ratio
Formula (6.1) provides us only with the number of possible independent invariants. Here we explore how they can be computed easily in the projective case using the cross ratio. Recall that the number of degrees of freedom of a collineation in $P^2$ is 8, as a collineation in the projective plane is defined by a $3 \times 3$ homogeneous matrix.

6.2.1. Invariant for two conics
Each conic has five degrees of freedom; this provides us with ten parameters for two conics. As there is no collineation which leaves two conics invariant in the general case, two invariants have to be discovered. The two conics intersect in four points, and as stated in Section 2, four points on a conic define a cross ratio. The two cross ratios (one for each conic) are obviously independent. From (6.1) it is then possible to conclude that all other invariants can be obtained as a combination of these two measures. Such invariants are very useful as conics can be found by conic approximation to different shapes [23].

6.2.2. Two points and two lines
Let A and B be two points and $\alpha$ and $\beta$ be two lines intersecting in O. This provides us with an eight degrees of freedom configuration (see Fig. 13). There is
Fig. 13. The two points and two lines configuration.
an obvious invariant cross ratio: the line $\gamma$ defined by A and B intersects $\alpha$ and $\beta$ in C and D, and this defines four points on a line; four points on a line provide us with a cross ratio [A, B; C, D]. Formula (6.1) indicates therefore that there should be a collineation subgroup leaving this configuration invariant and with degree of freedom at least equal to 1. In fact if A and B are at infinity, it is easy to see that the only collineations which leave this configuration invariant are the uniform scaling transformations (similitudes) with origin at O. This obviously is a one-dimensional subgroup and so there is only one independent invariant: [A, B; C, D] is a solution.

6.2.3. Implementation

Experiments were conducted on this kind of invariants at the University of Oxford [24]. Figure 14 shows the relevant features used. In fact these authors computed these invariants using the general algebraic tools from invariant theory. These algebraic invariants are nicely related to the ones presented here [25]. Using such invariants may however lead to small combinatorial problems. For instance five points in a plane have two invariants, the two cross ratios provided by (2.7) and (2.8). However, having five points we have to choose which four points are going to be used as the basis, and in which order they have to be considered. This leads us to 5! = 120 possibilities. The solution to this combinatorial problem is the use of symmetrical polynomials. For instance we know that if k is the cross ratio of four points, then all possible cross ratios obtained by taking these four points in different orders are:
$$k, \quad \frac{1}{k}, \quad 1-k, \quad \frac{1}{1-k}, \quad \frac{k-1}{k} \quad \text{or} \quad \frac{k}{k-1}. \qquad (6.2)$$
Fig. 14. Recognition using conic fitting and conic invariants (courtesy of D. Forsyth et al.).
So for getting an invariant which does not depend on the order of these points, we have to look for a symmetrical polynomial with six variables, one for each value of (6.2). One of the simplest symmetrical polynomials which is not constant is
$$\sum_{i=1}^{6} k_i^2 = \frac{2k^6 - 6k^5 + 9k^4 - 8k^3 + 9k^2 - 6k + 2}{k^2 (k-1)^2}.$$
Finding such symmetrical polynomials avoids the combinatorics of sorting the features in a feature set. There still remains the combinatorial problem of collecting the right set of features, for instance the right set of five points. No general answer exists for this problem.
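A short Python sketch of this order-free invariant (the function name is ours): the rational function above is evaluated on a cross ratio k, and by construction it returns the same value for the six equivalent orderings listed in (6.2).

    def cross_ratio_sym(k):
        # Order-free function of a cross ratio: the sum of the squares of the six
        # values of (6.2); undefined for the degenerate values k = 0 and k = 1.
        return (2*k**6 - 6*k**5 + 9*k**4 - 8*k**3 + 9*k**2 - 6*k + 2) / (k**2 * (k - 1)**2)

    if __name__ == "__main__":
        k = 1.6
        # all six equivalent cross ratios give the same invariant value
        for v in (k, 1/k, 1 - k, 1/(1 - k), (k - 1)/k, k/(k - 1)):
            print(cross_ratio_sym(v))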
7. Discussion

This chapter provided a short introduction to the geometry of the image formation system in the case of a pure perspective projection. For real images, optics and electronics introduce differences from this ideal model, and these can reach some pixels when using standard CCD cameras. For this reason methods are proposed in calibration for correcting the image, bringing the geometry back to the original perspective projection [26,1]. Projective geometry offers the right tool for dealing with such perspective projections. Its basic invariant is the cross ratio. Computing cross ratios can be done easily with image points, lines or conics. Such sets of features provide invariants which can be used in two ways: by indexing models for finding the possible models which can be associated with the invariants measured in a scene, or by finding the relative 3-D location of an object with reference to another object. The latter case is called relative positioning and has proved to be more flexible and robust than standard 3-D positioning in a camera reference frame.
New kinds of geometric invariants are presently under investigation. In [27] the reader may find studies of differential invariants, i.e. invariants obtained on a curve using derivatives of different orders. Cross ratios of areas can also be computed [28]. However, few experiments have been reported on the stability of the values computed in the different cases. Such practical evaluations still remain to be done.
Acknowledgements

The Esprit program "Basic Research" and the French national project "Orasis" provided financial and intellectual support to many parts of the work reported here. E. Arbogast, P. Gros, L. Morin, and L. Quan are kindly acknowledged for their insightful discussions and contributions. Figure 7 is displayed with courtesy of N. Ayache, and Fig. 14 with courtesy of A. Zisserman, J. Mundy, D. Forsyth, and Ch. Rothwell. I would like to thank them all for their cooperation.
References
[1] R. Y. Tsai, A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses, IEEE Trans. Robotics and Automation 3, 4 (1987) 323-344.
[2] K. W. Wong, Mathematical formulation and digital analysis in close-range photogrammetry, Photogrammetric Eng. Remote Sensing 41 (1975) 1355-1373.
[3] O. D. Faugeras, 3-D Computer Vision (MIT Press, Cambridge, MA, 1992).
[4] J. Mundy and A. Zisserman (eds.), Applications of Invariance in Computer Vision (MIT Press, Cambridge, MA, 1992).
[5] J. G. Semple and G. T. Kneebone, Algebraic Projective Geometry (Oxford Science Publication, 1952).
[6] S. J. Maybank, The projective geometry of ambiguous surfaces, Technical Report 1623, Long Range Laboratory, GEC, Wembley, Middlesex, UK, Jul. 1990.
[7] N. Efimov, Advanced Geometry (MIR, Moscow, 1978).
[8] R. Horaud, B. Conio, O. Leboulleux and B. Lacolle, An analytic solution for the perspective 4-point problem, Comput. Vision Graph. Image Process. 47 (1989) 33-44.
[9] J. S. C. Yuan, A general photogrammetric solution for determining object position and orientation, IEEE Trans. Robotics and Automation 5, 2 (1989) 129-142.
[10] Y. Liu, T. S. Huang and O. D. Faugeras, Determination of camera location from 2-D to 3-D line and point, IEEE Trans. Pattern Anal. Mach. Intell. 12, 1 (1990) 28-37.
[11] R. K. Lenz and R. Y. Tsai, Techniques for calibration of the scale factor and image center for high accuracy 3-D machine vision metrology, in Proc. IEEE Int. Conf. on Robotics and Automation, Raleigh, USA, 1987, 68-75.
[12] O. D. Faugeras and G. Toscani, Camera calibration for 3-D computer vision, in Proc. Int. Workshop on Machine Vision and Machine Intelligence, Tokyo, Japan, 1987.
[13] B. Caprile and V. Torre, Using vanishing points for camera calibration, Int. J. Comput. Vision 4 (1990) 127-140.
[14] W. E. L. Grimson, From Images to Surfaces. A Computational Study of the Human Early Visual System (MIT Press, Cambridge, MA, 1981).
[15] N. Ayache, Stereovision and Sensor Fusion (MIT Press, Cambridge, MA, 1990).
[16] S. Maybank, O. Faugeras and Q. T. Luong, Camera self-calibration: Theory and experiments, in Proc. Second European Conf. on Computer Vision, Santa Margherita, May 1992.
[17] O. Faugeras, What can be seen in three dimensions with an uncalibrated stereo rig? in Proc. Second European Conf. on Computer Vision, Santa Margherita, May 1992.
[18] J. B. Burns, R. Weiss and E. M. Riseman, View variation of point-set and line-segment features, in Proc. DARPA-ESPRIT Workshop on Applications of Invariants in Computer Vision, Reykjavik, Iceland, Mar. 1991, 55-108.
[19] R. Mohr, L. Morin and E. Grosso, Relative positioning with poorly calibrated cameras, in Proc. DARPA-ESPRIT Workshop on Applications of Invariants in Computer Vision, Reykjavik, Iceland, Mar. 1991, 7-45.
[20] J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles (Addison-Wesley, 1974).
[21] P. C. Wayner, Efficiently using invariant theory for model-based matching, in Proc. Conf. on Computer Vision and Pattern Recognition, Maui, Hawaii, Jun. 1991, 473-478.
[22] H. L. Wolfson, Model-based object recognition by geometric hashing, in O. Faugeras (ed.), Proc. 1st European Conf. on Computer Vision, Antibes, France (Springer-Verlag, 1990) 526-536.
[23] D. Forsyth, J. L. Mundy, A. Zisserman and C. M. Brown, Projectively invariant representation using implicit algebraic curves, in O. Faugeras (ed.), Proc. 1st European Conf. on Computer Vision, Antibes, France (Springer-Verlag, Apr. 1990) 427-436.
[24] D. Forsyth, J. L. Mundy, A. Zisserman and C. Rothwell, Invariant descriptors for 3-D object recognition and pose, in Proc. DARPA-ESPRIT Workshop on Applications of Invariants in Computer Vision, Reykjavik, Iceland, Mar. 1991, 171-208.
[25] L. Quan, P. Gros and R. Mohr, Invariants of a pair of conics revisited, in P. Mowforth (ed.), Proc. British Machine Vision Conf., Glasgow, Scotland (Springer-Verlag, 1991) 71-77.
[26] C. C. Slama (ed.), Manual of Photogrammetry, fourth ed. (American Society of Photogrammetry and Remote Sensing, Falls Church, VA, 1980).
[27] A. Zisserman and J. Mundy (eds.), Proc. DARPA-ESPRIT Workshop on Applications of Invariants in Computer Vision, Reykjavik, Iceland, Mar. 1991.
[28] E. B. Barrett, P. M. Payton, N. N. Haag and M. H. Brill, General methods for determining projective invariants in imagery, Comput. Vision Graph. Image Process.: Image Understanding 53, 1 (1991) 46-65.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 339-385
Eds. C. H. Chen, L. F. Pau and P. S. P. Wang
© 1998 World Scientific Publishing Company
CHAPTER 2.5
3-D MOTION ANALYSIS FROM IMAGE SEQUENCES USING POINT CORRESPONDENCES
JOHN J. WENG
Department of Computer Science, Michigan State University, East Lansing, Michigan 48824, USA
and
THOMAS S. HUANG
Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
The objective is to analyze the motion between a rigid scene and the camera. The temporal correspondences between consecutive images are established by a procedure of image matching. Such temporal correspondences are then used for the estimation of the three-dimensional (3-D) interframe motions, as well as the 3-D structure of the scene. Long term motion that covers many image frames is modeled using object dynamics, and the model parameters are determined from the interframe motions. Thus, smooth 3-D motion can be predicted using the model parameters.

Keywords: Image matching, optical flow, 3-D motion estimation, structure from motion, motion modeling, motion prediction.
1. Introduction
The projection of a dynamic 3-D scene onto an image plane contains rich dynamic and geometric information about the scene. The projections at different time instants can be recorded by a sequence of images. To extract the information from the image sequence, several basic subtasks are identified: image matching, interframe motion estimation, motion modeling and prediction.
1.1. Image Matching

The first subtask is to establish the correspondences between images. Its objective is to identify image elements in different images that correspond to the same element of the sensed scene. The matching elements, or tokens, can vary significantly from one approach to another. Existing techniques for image matching roughly fall into two categories: continuous and discrete.
(1) Continuous approaches. Although the objective of the approaches in this category is to determine the image velocity field instead of performing explicit matching between images, the computed velocity field amounts to image matching. Each velocity vector approximates the correspondence between two points in different images. Ideally, one needs the projection of the 3-D velocity on the image plane. However, since such a projection is not directly available from visual sensors, an optical flow field (the field representing the apparent motion of the brightness pattern) is used to approximate the actual image plane velocity field. The techniques in this category typically need the assumption that the interframe motion is small and the intensity function is smooth and well-behaved [1-5].

(2) Discrete approaches. The techniques in this category treat the images as samples of the scene taken at discrete times, and select discrete features as tokens that are to be matched. Points with high intensity variation are often used as the matching tokens [6,7]. Other features used for matching include closed contours of zero crossings of Laplacian-of-Gaussian images to compute the velocity field [8], edges for stereo matching [9-12], lines for stereo matching [13], correlation of intensity patterns [14,15], local phase information [16-18], or some aspects of higher level scene structure [19-21]. Discrete approaches allow either small motion or relatively large motion.

The image matching algorithm presented here belongs to the discrete approach, but it takes advantage of implicit matching, which is common in continuous approaches. It associates multiple attributes with the images to obtain an overdetermined system of matching constraints. This accommodates, to varying degrees, image noise and slight variations in image intensity that result from changes in viewing position, lighting, shading, reflection, etc. More importantly, the displacement vectors are determined in this overdetermined system without resorting to a smoothness constraint. A multi-resolution multi-grid computational structure is employed to deal with relatively large image disparities caused by large interframe motions. The approach is capable of dealing with uniform non-textured object surfaces that are often present in real world images. We also address the problem of discontinuities in the field of displacement, and occlusion. The algorithm computes the displacement field and occlusion maps along a dense pixel grid, based on two perspective images of a scene. This algorithm has been tested on images of real world scenes which contain significant occlusions and depth discontinuities [22].

1.2. 3-D Motion Estimation

3-D motion estimation has also been investigated under two types of approach: discrete and continuous. In the discrete approaches, the motion is treated as a displacement from one time instant to the next. Therefore, the time separation between the two time instants can be either long or short. The parameters of interframe motion are called two-view motion parameters. The result of image matching, used as input for 3-D motion estimation, is given as the displacement vectors between the
corresponding image points. In the continuous approaches, the interframe motion is approximated by motion velocity and, therefore, in order for such an approximation to be reasonable, the interframe motion must be very small. The 3-D motion is formulated as velocity. The result of image matching, which is needed as input for motion estimation, is given as optical flow. Since the discrete approach more accurately models what actually happens than the continuous approach, we present a discrete approach in this chapter. The possibility of recovering the 3-D motion and structure of a scene from its monocular views has been known to photogrammetrists for quite long. This subject attracted investigations in the computer vision area around the early 80's. A few iterative algorithms were proposed [23-27]. One drawback of these iterative algorithms is that the solution is not guaranteed because the iterative search may be trapped at local extrema. Two linear algorithms were developed independently by Longuet-Higgins [28], and Tsai and Huang [29]. The linear algorithms guarantee a unique solution if certain nondegeneracy conditions are met. Yen and Huang [30] reported a vector-geometric approach to this problem. Because these algorithms were designed primarily for noise-free cases, a high sensitivity to noise has been reported [29,31]. In the framework of the continuous approach, closed-form solutions for motion velocity from optical flow have been presented by Zhuang et al. [32,33] and Waxman et al. [34]. Since then, improvements have been made in reducing the sensitivity to noise while still keeping the algorithm linear. The post-consideration of the constraint in E through a constrained matrix fitting was independently reported by Faugeras, Lustman and Toscani [35], and Weng, Huang and Ahuja [36]. The latter algorithm is almost the same as the one presented in this chapter (Section 3). It eliminates the need to compute three false solutions and also uses other measures to improve the stability of the solution. While the above linear algorithms require the solution of only linear systems and therefore no iteration is needed, further improvement of the solution requires a globally optimal nonlinear solution and thus a nonlinear algorithm. The optimal solution is presented in Section 3.4.
1.3. Motion Modeling and Prediction

The trajectory of a moving object can be used to understand the motion pattern and predict future motion. Section 4 presents a framework for motion modeling, understanding and prediction. Based on dynamics, a locally constant angular momentum (LCAM) model is introduced [42]. The model is local in the sense that it is applied to a limited number of image frames at a time. Specifically, the model constrains the motion, over a local frame subsequence, to be a superposition of precession and translation. Thus, the instantaneous rotation axis of the object is allowed to change with time. The trajectory of the rotation center is approximated
by a vector polynomial. The parameters of the model evolve in time so that they can adapt to long term changes in motion characteristics. Based on the assumption that the motion is smooth, object position and motion in the near future can be predicted, and short missing subsequences can be recovered.

2. Image Matching

Two images, I and I', are two functions: $i: U \to B$ and $i': U \to B$, where U is a subset of 2-D space, and B is a subset of 1-D space for monochrome images (3-D space for color images). U defines the image plane. Functions i and i' map each point in the image plane to an intensity value. The occlusion map O for image I consists of those image points in image I whose corresponding points are not visible in image I'. Similarly, we define the occlusion map O' for image I'. An image matching from I to I' is a mapping $\kappa: U \to U$ such that for any $u \in U - O$ (the symbol "-" denotes set subtraction), u and $\kappa(u)$ are the projections of the same point of the scene onto images I and I', respectively. Notice that the mapping from an occluded point $u \in O$ is arbitrary. Similarly we define $\kappa'$ as the matching from I' to I. The displacement field is defined by $d = \kappa - e$, where e is an identity mapping, $e(u) = u$ for all u in 2-D space. Therefore, the matched image point for u is $u' = \kappa(u) = u + d(u)$. We will use the term "displacement field" to refer to the result of image matching.
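As a small illustration of these definitions (the array layout and names are our own convention, not the authors'), the sketch below applies a dense displacement field d to the pixel grid, producing the matched positions u' = u + d(u), and masks out the pixels flagged in the occlusion map O, whose mapping is arbitrary:

    import numpy as np

    def matched_points(d, occlusion):
        # d: (H, W, 2) displacement field (du, dv per pixel); occlusion: (H, W) boolean map O.
        # Returns an (H, W, 2) array of matched positions u' = u + d(u); occluded
        # pixels are set to NaN since their mapping is arbitrary.
        H, W = occlusion.shape
        v, u = np.mgrid[0:H, 0:W]                    # row and column indices
        matches = np.stack([u + d[..., 0], v + d[..., 1]], axis=-1).astype(float)
        matches[occlusion] = np.nan
        return matches

    if __name__ == "__main__":
        d = np.zeros((4, 5, 2)); d[..., 0] = 1.0                 # every pixel shifted one column right
        occ = np.zeros((4, 5), dtype=bool); occ[:, -1] = True    # last column occluded
        print(matched_points(d, occ)[0])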
2.1. Image Attributes

Image attributes are defined for image matching: the matching points should have similar attributes. Some simple image attributes are defined in the following. Image intensity is a simple image attribute. Under certain conditions, e.g. from matte surfaces illuminated by extended light sources, the image intensity value of a scene is in fact quite stable under motion. However, if matching is based on intensity only, a point can be matched to any point with the same or similar intensity. Although intensity may vary, certain relationships among intensity values of nearby points may be relatively stable. These relationships provide some structural information about the scene. A candidate for such image structural information is a sharp transition of intensity, or edge. To get a continuous measure of edgeness, we define edgeness as the magnitude of the gradient of intensity, namely, $e = \|\nabla i\|$. However, the edgeness and intensity are generally not sufficient to reliably determine the correct match. The features that relate to the shape of the local intensity surface are useful in distinguishing otherwise similar looking points. For example, different points on the border of a region may have the same intensity and edgeness values, but the local border shape may vary from point to point on the border. A point at a geometrical corner may be clearly distinguished from others. The cornerness or the curvature of a region border thus can be used as a matching
criterion. We define the cornerness in the following way that does not use computationally expensive polynomial fitting but achieves very good performance on real world images. As we mentioned earlier, we define positive and negative cornerness separately. Roughly speaking, the cornerness at a point u measures the change in the direction of the gradient at two nearby points, weighted by the gradient at the point. These two points, u + r_a and u + r_b, are located on a circle centered at u. The radius of the circle is determined by the level of image resolution. We choose r_a and r_b such that the directional derivative along the circle reaches the minimum and the maximum values, respectively. Let a = ∇i(u + r_a), b = ∇i(u + r_b), and angle(a, b) be the angle from a to b measured in radians counter-clockwise, ranging from −π to π. The closer the angle is to π/2, the higher the positive cornerness measure should be. In addition, the measure should be weighted by the magnitude of the gradient at the point u since the direction of the gradient in a uniform region is very unreliable. Mathematically, the positive cornerness and negative cornerness are defined, respectively, by

p(u) = e(u)(1 − |1 − 2 angle(a, b)/π|)   if 0 ≤ angle(a, b) ≤ π ,  and  p(u) = 0  otherwise ,     (2.1)

and

n(u) = e(u)(1 − |1 + 2 angle(a, b)/π|)   if −π ≤ angle(a, b) ≤ 0 ,  and  n(u) = 0  otherwise ,     (2.2)

where the column vectors a and b are the intensity gradients at u + r_a and u + r_b, respectively, and r_a and r_b are such that ||r_a|| = ||r_b|| = r and the directional derivative of i along the circle attains its extremes there:

r_a = arg min_{||r|| = r} ∇i(u + r) · r⊥ / r ,     (2.3)

r_b = arg max_{||r|| = r} ∇i(u + r) · r⊥ / r ,     (2.4)

where the superscript “⊥” denotes the corresponding perpendicular vector: if r = (r_x, r_y)ᵗ, then r⊥ = (−r_y, r_x)ᵗ. If the r_a and r_b that correspond to the minimum in (2.3) and the maximum in (2.4), respectively, are not unique, we choose those that minimize p(u) in (2.1) and (2.2), in addition to satisfying (2.3) and (2.4). The cornerness is defined only if the involved derivatives exist. The value of r is a parameter of the cornerness, and is directly related to image resolution. In the discrete version, r is equal to the pixel size.
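A minimal discrete sketch of the positive and negative cornerness measures (2.1)–(2.2) follows. It assumes 8-bit grayscale input, uses simple central-difference gradients, and samples the circle of radius r at a fixed set of angles rather than solving (2.3)–(2.4) exactly; it is an illustration, not the authors' implementation.

```python
import numpy as np

def cornerness(img, r=1.0, n_samples=16):
    """Positive/negative cornerness p(u), n(u) of Eqs. (2.1)-(2.2) (discrete sketch)."""
    img = img.astype(np.float64)
    gy, gx = np.gradient(img)                 # intensity gradient
    edge = np.hypot(gx, gy)                   # edgeness e = ||grad i||
    H, W = img.shape
    p = np.zeros((H, W)); n = np.zeros((H, W))
    thetas = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    for v in range(1, H - 1):
        for u in range(1, W - 1):
            best_min = best_max = None
            a = b = None
            for t in thetas:
                ru, rv = r * np.cos(t), r * np.sin(t)
                uu, vv = int(round(u + ru)), int(round(v + rv))
                if not (0 <= uu < W and 0 <= vv < H):
                    continue
                g = np.array([gx[vv, uu], gy[vv, uu]])
                tangent = np.array([-rv, ru]) / r        # r_perp / ||r||
                dd = g @ tangent                         # directional derivative along circle
                if best_min is None or dd < best_min:
                    best_min, a = dd, g
                if best_max is None or dd > best_max:
                    best_max, b = dd, g
            if a is None or b is None:
                continue
            # signed angle from a to b, counter-clockwise, in (-pi, pi]
            ang = np.arctan2(a[0] * b[1] - a[1] * b[0], a @ b)
            e = edge[v, u]
            if 0.0 <= ang <= np.pi:
                p[v, u] = e * (1.0 - abs(1.0 - 2.0 * ang / np.pi))
            if -np.pi <= ang <= 0.0:
                n[v, u] = e * (1.0 - abs(1.0 + 2.0 * ang / np.pi))
    return p, n
```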
The framework described below does not depend very much on the type of image attributes used. Different image attributes can be used according to the actual applications. The attributes defined in this section are planar rigid motion invariant (PRMI) in the sense that if the image is rigidly moved in the image plane, the attributes defined at the two corresponding points (before and after the 2-D motion) have the same value.

2.2. Smoothness

Smoothness constraints impose some similarity on the displacement vectors over a neighborhood. In addition to considering the smoothness of the overall displacement vectors, we separately consider the smoothness of the orientation of these vectors. The reason for emphasizing orientation smoothness is that (1) the orientation of the displacement vectors projected from a coarse level is generally more reliable than their magnitude, and (2) at a fine level, the local attribute gradient perpendicular to the displacement vector can easily lead the displacement vector in a wrong direction if orientational smoothness is not emphasized. Clearly, the smoothness constraint should be enforced only over points whose displacements are related, e.g. over adjacent points from the same surface. To selectively apply the smoothness constraint to two points, we use the similarity of intensities and the similarity of the available displacement vector estimates at the two points. We represent the displacement vector field in the vicinity of a point u₀ by a vector d̄(u₀) which is intended to approximate the displacement field within the region that u₀ belongs to. In the implementation, d̄(u₀) is computed as
d̄(u₀) = ∫∫_{0 < ||u − u₀|| < r} w(i(u) − i(u₀), d(u) − d(u₀)) d(u) du     (2.5)
where 0 < ||u − u₀|| < r denotes a region around u₀, and w(·, ·) is a weight. In the digital implementation, the integration is replaced by a summation over u₀'s eight neighboring pixels. The weight is a function of the intensity difference i(u) − i(u₀) and the displacement vector difference d(u) − d(u₀). The objective that d̄(u₀) represents the neighboring displacement vectors of the region of u₀ suggests the following requirements on the weight.

(1) The weight is large if the intensity difference is small. We assume that a small intensity difference is observed when two neighboring points u and u₀ belong to the same region, and therefore their displacement vectors should be similar.

(2) If u and u₀ have similar intensity but the corresponding displacement vectors are different, the weight should remain large. This case occurs when the displacement field is projected from a coarse level to the finer level. Two adjacent points with the same intensity may take quite different initial displacement vectors if they belong to different grid points at the coarse level.
(3) If u and u₀ have different intensities and their displacement vectors are very different, the weight should be extremely small to suppress the influence of u on d̄(u₀).

Let q_i = |i(u) − i(u₀)| and q_d = d(u) − d(u₀). A definition of weight that satisfies the above criteria is as follows:

w(q_i, q_d) = c / (q_i + ||q_d||² + ε)     (2.6)

where ε is a small positive number to reduce the effects of noise in intensity and prevent the denominator from becoming 0, and c is a normalization constant which makes the integration of the weights equal to 1:

∫∫_{0 < ||u − u₀|| < r} w(i(u) − i(u₀), d(u) − d(u₀)) du = 1 .

To ensure that requirement (2) is met, a small scale factor can be applied to the term ||q_d||², or alternatively it can be set to zero, which gives the simpler form

w(q_i, q_d) = c / (q_i + ε) .     (2.7)
When the displacement field is computed with a computed occlusion map, the weights in (2.6) and (2.7) should be modified if, in (2.5), u₀ is not an occluded point but u is. In this case, the weight corresponding to the occluded point u should be zero, w = 0, since d̄(u₀) should not take the meaningless d(u) into account. If u₀ is an occluded point, the weight need not be set to zero no matter whether u is an occluded point or not, since the displacement vector at an occluded point may conveniently take the value of such a d̄(u₀). Thus, the weight is automatically determined based on the intensity difference and the displacement difference. The smoothness constraint imposes similarity between d(u₀) and d̄(u₀). The larger the difference in intensity, the more easily the fields for two adjacent regions can differ. If two regions get different displacements after some iterations, the quadratic term ||q_d||² results in a very small weight to reduce their interactions. On the other hand, the displacement vectors in the same region will be similar since the corresponding weight is large. Because the intensity difference is usually much larger than the magnitude of the displacement difference, q_i is not squared in (2.6) (unlike q_d); otherwise the weight would be too sensitive to small changes in intensity. The weights thus implicitly take into account discontinuities and occlusions. The registered value d̄(u₀) allows us to perform matching using uniform numerical optimization despite the presence of discontinuities and occlusions. This is discussed below.
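The sketch below evaluates the weighted neighborhood average d̄(u₀) of (2.5) over the eight neighbors using the weight of (2.6) as given above; the parameter value for ε, the array layout, and the occlusion handling are illustrative assumptions.

```python
import numpy as np

def smoothed_displacement(i_img, d, u0, v0, eps=1.0, occl=None):
    """Weighted average d_bar(u0) over the eight neighbors, Eqs. (2.5)-(2.6) (sketch)."""
    H, W = i_img.shape
    weights, vals = [], []
    for dv in (-1, 0, 1):
        for du in (-1, 0, 1):
            if du == 0 and dv == 0:
                continue
            u, v = u0 + du, v0 + dv
            if not (0 <= u < W and 0 <= v < H):
                continue
            if occl is not None and occl[v, u] and not occl[v0, u0]:
                continue                          # occluded neighbor gets weight zero
            qi = abs(float(i_img[v, u]) - float(i_img[v0, u0]))
            qd = d[v, u] - d[v0, u0]
            weights.append(1.0 / (qi + qd @ qd + eps))   # un-normalized weight (2.6)
            vals.append(d[v, u])
    if not weights:
        return d[v0, u0].copy()
    weights = np.asarray(weights)
    weights /= weights.sum()                      # c chosen so the weights sum to 1
    return (weights[:, None] * np.asarray(vals)).sum(axis=0)
```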
2.3. Matching Based on Image Attributes

Given a displacement vector field, some measure of similarity, or residual error, between the attributes of estimated corresponding points can be defined. The
residual of intensity is defined by:
r_i(u, d) = i′(u + d) − i(u) .

Similarly, we define the residual of edgeness r_e(u, d), that of positive cornerness r_p(u, d), and that of negative cornerness r_n(u, d). The residual of orientation smoothness is defined by

r_o(u, d) = ||cross{d(u), d̄(u)}|| / ||d̄(u)|| ,

where cross{(a, b), (c, d)} = ad − bc. The residual of displacement smoothness is defined by

r_d(u, d) = ||d(u) − d̄(u)|| .
These residuals account for changes of a wide variety of factors. Under the conditions we discussed above, the similarity of attributes approximately holds. So, we determine the displacement vector d such that the weighted sum of squares of the residuals is minimized:

min_d Σ_u { r_i²(u, d) + λ_e r_e²(u, d) + λ_p r_p²(u, d) + λ_n r_n²(u, d) + λ_o r_o²(u, d) + λ_d r_d²(u, d) }

where λ_e, λ_p, λ_n, λ_o and λ_d are weighting parameters that are dynamically adjusted at different levels. Let

r ≜ (r_i, r_e, r_p, r_n, r_o, r_d)ᵗ .

With the previous estimate of the displacement vector d (initially d is a zero vector at the highest level), we need to find an increment δd. Expanding r(u, d + δd) at δd = 0, we have (suppressing the variable u for conciseness)

r(d + δd) = r(d) + J δd + o(||δd||)     (2.8)

where J = ∂r/∂d is the matrix of partial derivatives of the residuals with respect to the two components of d.
Here ∂i′/∂u denotes the partial derivative of i′(u, v) with respect to u, evaluated at the point u + d (with (d_u, d_v)ᵗ = d), and so on. Define a diagonal matrix which specifies the weights on the different residuals:

Λ = diag(1, λ_e, λ_p, λ_n, λ_o, λ_d) .     (2.9)

We need to solve for δd such that the weighted sum of squared residuals is minimized. Neglecting higher order terms and minimizing ||Λ(r + J δd)||², from (2.8) we get the formula for updating d:

δd = −(JᵗΛ²J)⁻¹ JᵗΛ² r(u) .
For each grid point along which the displacement vectors are to be computed at a resolution level (see Fig. 1), the displacement vector d is replaced by d + δd. An iteration consists of such an updating for every grid point. At each resolution level, a fixed number of iterations (e.g. 20) are performed before the displacement field along the grid is projected to the next finer level. The final displacement field is obtained at the original image resolution.
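The per-point update can be written compactly. The sketch below assumes that the residual vector r(d) and its 6×2 Jacobian J (for instance from finite differences of the attribute residuals with respect to the two displacement components) are supplied as callables, and applies the weighted Gauss–Newton step derived above; the function and argument names are mine.

```python
import numpy as np

def update_displacement(d, residual, jacobian, lambdas, n_iters=20):
    """Iterate d <- d + delta_d with delta_d = -(J^t L^2 J)^{-1} J^t L^2 r (sketch).

    residual(d) -> r, a length-6 vector (r_i, r_e, r_p, r_n, r_o, r_d);
    jacobian(d) -> J, a 6 x 2 matrix of partial derivatives w.r.t. d;
    lambdas     -> the weights (lambda_e, ..., lambda_d) of Eq. (2.9).
    """
    L2 = np.diag(np.concatenate(([1.0], np.asarray(lambdas, dtype=float))) ** 2)
    d = np.asarray(d, dtype=np.float64).copy()
    for _ in range(n_iters):
        r = residual(d)
        J = jacobian(d)
        A = J.T @ L2 @ J
        d += -np.linalg.solve(A, J.T @ L2 @ r)
    return d
```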
2.4. Multi-Resolution and Multi-Grid
To find matches over a large disparity requires that we know the approximate locations of the matches, since otherwise multiple matches may be found. One solution to this problem is blurring the image to filter out high spatial frequency components. However, blurred intensity images have very few features left, and their locations are unreliable. Therefore, instead of blurring the image first and then measuring edgeness and cornerness, we blur the original edgeness and cornerness images (called attribute images here). Since the cornerness measure has a sign, nearby positive and negative corners may be blurred to give almost zero values, which is the same as the result of blurring an area without corners. We therefore separate positive and negative corners into two attribute images. Blurring is done for the positive and negative images separately. Such blurred edgeness and cornerness images are not directly related to the blurred intensity images. They are related to the strength and frequency of occurrence of the corresponding features, or to the texture content of the original images. While texture is lost in intensity images at coarse levels, the blurred edgeness and cornerness images retain a representation of texture, which is used for coarse matching. The intra-regional smoothness constraint at coarse levels applies to blurred uniform texture regions (with averaged intensity). When the computation proceeds to finer levels, the sharper edgeness and cornerness measures lead to more accurate matching. Therefore, in general the algorithm applies to both textured and non-textured surfaces. At a coarse resolution, the displacement field only needs to be computed along a coarse grid, since the displacement computed at a coarse resolution is not accurate, and a low sample rate suffices. A coarse grid also helps to speed up the propagation of results within uniform regions. In the approach described in this paper,
the coarse displacement field is projected to the next finer level (copied to the four corresponding grid points), where it is refined, according to the above discussion. Such a projection-and-refinement procedure continues down to finer levels successively until we get the final results at the original resolution. The computational structure and data flow used in this process are shown in Fig. 1.
Fig. 1. The computational structure and the data flow.
The partial derivatives that form the entries of J are computed by a finite difference method in the implementation. Let s denote the distance between two adjacent points on a grid, along which the finite difference of the attributes is computed, assuming a unit spacing between adjacent pixels. Then s should vary with the resolution. In addition, s should also vary with successive iterations within a resolution level. A large spacing is necessary for a rough displacement estimate when iterations start at each level. As iterations progress, the accuracy of the displacement field increases and s should be reduced to measure local structure more accurately. The mask used to compute the finite differences is shown in Fig. 2, where the spacing s at level l is equal to 2^l for the first half of the iterations at level l, and is reduced by a factor of 2 for the second half, except for l = 0. At the original resolution (l = 0), the spacing is always equal to 1, since no smaller spacing is available on the pixel grid.
Fig. 2. Mask for computing derivatives.
2.5. Preprocessing, Normalization, and Recursive Blurring

The matching algorithm has to cope with images of a wide variety of scenes. The purposes of preprocessing are (1) to normalize the images so that the algorithm can use a set of standard parameters for different scenes and (2) to filter out noise in the images. The pair of intensity images to be matched is first normalized by a linear function so that the minimum and the maximum intensities are equal to 0 and 255, respectively. (This range from 0 to 255 is to adapt to the representation with 8 bits/pixel, but it can be changed as needed.) Then it is filtered with a small (3 × 3) low pass filter to suppress gray level noise. Similarly, the edgeness and the cornerness also need to be normalized. The following considerations motivate the normalization. First, small gradients are more susceptible to intensity noise and are not reliable. Second, strong gradients may overwhelm moderate gradients in the edgeness measurement. Third, different scenes have different ranges of gradient magnitude and the algorithm should treat them in a systematic way. Therefore, we slightly modify the definition of edgeness given in Section 2.1. Edgeness is the magnitude of the gradient, normalized and transformed by a function f shown in Fig. 3:

e(u) = f(||∇i(u)||) .     (2.10)
The function f maps the magnitude of the gradient onto the range [0, 255]. It has two transition points z₀ and z₁. From z = 0 to z = z₀, f(z) ≈ 0 to suppress noise. From z = z₀ to z = z₁, f(z) increases from near 0 to almost 255 gradually and smoothly. The smooth transition interval [z₀, z₁] allows continuous variation of edgeness for gradients with a moderate magnitude. For z > z₁, f(z) ≈ 255, to limit strong edges and relatively enhance the moderate edges. The values of the two transition points z₀ and z₁ are determined automatically through an analysis of the histogram of gradient magnitudes such that the fractions of the pixels in the edgeness images that have values below f(z₀) and above f(z₁) are maintained at predetermined levels.
Fig. 3. Two normalization functions for the edgeness.
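The exact shape of f is given only graphically in Fig. 3. The sketch below therefore assumes a clipped smoothstep ramp between the two transition points and picks z₀ and z₁ from the gradient-magnitude histogram so that chosen fractions of pixels fall below f(z₀) and above f(z₁), as described in the text; the fraction values are illustrative.

```python
import numpy as np

def normalize_edgeness(grad_mag, low_frac=0.5, high_frac=0.05):
    """Map gradient magnitudes onto [0, 255] with a smooth ramp between z0 and z1 (sketch)."""
    z0 = np.quantile(grad_mag, low_frac)          # fraction of pixels suppressed near 0
    z1 = np.quantile(grad_mag, 1.0 - high_frac)   # fraction of pixels saturated near 255
    t = np.clip((grad_mag - z0) / max(z1 - z0, 1e-12), 0.0, 1.0)
    t = t * t * (3.0 - 2.0 * t)                   # assumed smoothstep transition
    return 255.0 * t
```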
The edgeness e(u) used in the definition of cornerness, (2.1) and (2.2), should also use the modified definition (2.10). Note that such modified edgeness and cornerness are still PRMI attributes. The preprocessing and normalization steps enable the algorithm to perform consistently for a wide variety of images using a set of standard parameters, which are selected based on a moderate number of image examples. In the implementation, the parameters can be determined through trials. A set of parameters, e.g. those in (2.6) or (2.7) and (2.9), is determined for each level of resolution. At coarse levels, the edgeness, cornerness and smoothness have relatively large weights. Their weights are reduced gradually down to finer levels, since the smoothness constraint should be reduced at finer levels where details of the displacement are obtained, and cornerness and edgeness measurements at finer levels are more susceptible to noise than significantly blurred measurements.

The original images are first preprocessed by the methods discussed above. Then, as shown in Fig. 1, four pairs of attribute images are generated (intensity, edgeness, positive cornerness and negative cornerness). The attribute images are extended in four directions to provide context for the points that are near the image border. The extension is made by replicating the border row or column. We use recursive blurring to speed up computation. Only integer summations and a few integer divisions are needed to perform such a simple blurring. The blurring for level l + 1 is done using the corresponding attribute image at level l: for each pixel at level l + 1, the value is equal to the sum of the values of four pixels at level l divided by k (k = 4 for the intensity, k = 3 for the edgeness and k = 2 for the cornerness). The locations of these four pixels are such that each is centered at a quadrant of a square of a × a, where a is equal to 2^l at level l. Therefore, the blurred intensity image at level l is equal to the average over all pixels in a square of size a × a. To enhance sparse edges and corners, k is smaller than 4 for the edgeness and the cornerness. So, the results can be larger than 255. If this occurs, the resulting value is limited to 255. This multilevel recursive normalization is useful for the algorithm to adapt to different scenes.
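A sketch of one recursive blurring step from level l to level l + 1 is shown below; it sums four pixels per output pixel, divides by k (4, 3 or 2 depending on the attribute), and clips at 255. Treating the four source pixels as a 2×2 block of the level-l image is a simplifying assumption of this sketch.

```python
import numpy as np

def blur_to_next_level(attr, k):
    """One recursive blurring step: combine 2x2 blocks of the level-l attribute image.

    k = 4 for intensity, k = 3 for edgeness, k = 2 for cornerness; values above
    255 are clipped, as described in the text.
    """
    H, W = attr.shape
    H2, W2 = H // 2, W // 2
    a = attr[:2 * H2, :2 * W2].astype(np.float64)
    s = a[0::2, 0::2] + a[0::2, 1::2] + a[1::2, 0::2] + a[1::2, 1::2]
    return np.minimum(s / k, 255.0)
```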
2.6. Occlusion

To correctly match two images, those scene regions which are occluded in one or the other image must be identified. Occlusion occurs when a part of the scene visible in one image is occluded in the other by the scene itself, or a part of the scene near the image boundary moves out of the field of view in the other image. If the occluded regions are not detected, they may be incorrectly matched to nearby regions, interfering with the correct matching of these regions. To identify the occluded regions, we define two occlusion maps: occlusion map 1 showing parts of image 1 not visible in image 2, and similarly occlusion map 2 for image 2 (see Fig. 4, where black areas denote the occluded regions). We first determine the displacement field from image 2 to image 1, without occlusion information. The objective of this matching process is to compute occlusion map 1. This matching may “jam” the occluded parts of image 2 (e.g. the right-most section in Fig. 4) into parts of image 1 (e.g. the right-most section in Fig. 4). This generally will not affect the computation of occlusion map 1, since the occluded regions of image 1 may only occur on the opposite side across the “jammed” region (in Fig. 4, e.g., the occluded region of image 1 is to the right of a “jammed” region). Those regions in image 1 that have not been matched (in Fig. 4, no arrows are pointing to them) are occluded in image 2 and are therefore marked in occlusion map 1 (black in Fig. 4). These unmatched patches may also be located at the center of the images, if they are occluded by other parts of the scene. Once occlusion map 1 is obtained, we then compute the displacement field from image 1 to image 2, except for the occluded regions of image 1. The results of this step determine occlusion map 2 (see Fig. 4).
Fig. 4. A 1-D illustration of determining occlusion maps (see text). Images are represented by lines as one-dimensional images. The displacement fields shown just illustrate the correspondences between two 1-D images, and are not the actual displacement fields.
From the definition of κ and κ′, it is clear that κ and κ′ are one-to-one correspondences from U − O to U − O′, and from U − O′ to U − O, respectively. Therefore, the occlusion map O can be determined by O = U − κ′(U − O′), and similarly O′ = U − κ(U − O). However, this procedure is recursive: once one occlusion map is determined, the other can also be determined. The procedure outlined in Fig. 4 uses a preliminary κ′ that is computed to determine O without information about O′. Since regions in O and O′ are generally far apart, this preliminary κ′ may be good enough to determine O.
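In practice, the occlusion map of image 1 can be found by marking the pixels of image 1 that no (rounded) match from image 2 lands on; a minimal sketch assuming integer rounding of the displacement field follows.

```python
import numpy as np

def occlusion_map_1(d21):
    """Mark pixels of image 1 that no pixel of image 2 maps to (sketch).

    d21[v, u] is the displacement from pixel (u, v) of image 2 to image 1,
    so its match in image 1 is (u + du, v + dv), rounded to the pixel grid.
    """
    H, W, _ = d21.shape
    hit = np.zeros((H, W), dtype=bool)
    for v in range(H):
        for u in range(W):
            du, dv = d21[v, u]
            uu, vv = int(round(u + du)), int(round(v + dv))
            if 0 <= uu < W and 0 <= vv < H:
                hit[vv, uu] = True
    return ~hit          # True where image 1 is occluded in image 2
```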
2.7. Outline of the Algorithm

The following summarizes the steps of the procedure that computes the displacement field from one image to the other (see Fig. 1):

(1) Filter the two images using a 3 × 3 low pass filter to remove noise and normalize the pair of images as described in Section 2.5.
(2) Compute the image attributes: intensity, edgeness, positive cornerness and negative cornerness, as described in Sections 2.1 and 2.5.
(3) Set the level to the highest, e.g. l = 6, and set the displacement field on the grid of level 6 to zero.
(4) Blur the attribute images to level l as described in Section 2.5. The scale of the blurring filter at level l is 2^l.
(5) Compute the displacement field along the grid. Perform a number of iterations as discussed in Section 2.3.
(6) If l = 0, the procedure returns with the resulting displacement field. Otherwise go to (7).
(7) Project the displacement field on the grid of level l to the grid of level l − 1 (replicating the vector at each grid point to the four corresponding grid points of level l − 1); decrement l by one and go to (4).

Suppose we need to determine the displacement field from image 1 to image 2. In order to obtain occlusion map 1, first compute the displacement field from image 2 to image 1 using the above procedure, without occlusion information (assuming image 2 has no region occluded in image 1). The displacement field computed is used to determine occlusion map 1 for image 1. In the implementation, the occlusion maps are filtered by 3 × 3 median filters to remove single-pixel-wide occlusion and noise. Then, the displacement field from image 1 to image 2 is computed using occlusion map 1 by calling the above procedure starting from step (4). In step (6), if a point in image 1 is marked in occlusion map 1, it is not visible in image 2, and so the displacement vector from this point cannot be determined. We just copy the vector d to this occluded point. The final computed displacement field assigns a displacement vector to every pixel in image 1.

In summary, the matching algorithm computes a displacement vector for every pixel in the source image. This vector points to the matching point in the target
image. Since subpixel precision is used, the two components of the displacement vector use a real value representation. Mathematically, this is equivalent to computing point correspondences: for each point in the source image, the algorithm determines the corresponding point in the target image so that these two points are projections of the same point in the scene.
3. Motion Estimation

This section first presents a linear algorithm that exploits redundancy in the available data to improve the accuracy of the solution. Then, the optimization is discussed. We first define a mapping [·]× from a 3-D vector to a 3 × 3 matrix:

[(x₁, x₂, x₃)ᵗ]× = [  0   −x₃    x₂ ]
                   [  x₃    0   −x₁ ]
                   [ −x₂   x₁     0 ]     (3.1)

Using this mapping, we can express the cross product of two vectors by the multiplication of a 3 × 3 matrix and a column vector:

x × y = [x]× y .     (3.2)
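A two-line check of the mapping (3.1)–(3.2), written here as a small helper that is reused in the sketches below:

```python
import numpy as np

def skew(x):
    """[x]_x of Eq. (3.1): the 3x3 matrix with skew(x) @ y = x cross y."""
    x1, x2, x3 = x
    return np.array([[0.0, -x3,  x2],
                     [ x3, 0.0, -x1],
                     [-x2,  x1, 0.0]])

# sanity check: skew(x) @ y equals np.cross(x, y) for any x, y
x, y = np.array([1.0, 2.0, 3.0]), np.array([-1.0, 0.5, 2.0])
assert np.allclose(skew(x) @ y, np.cross(x, y))
```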
3.1. Problem Statement
Let the coordinate system be fixed on the camera with the origin coinciding with the projection center of the camera, and the z-axis coinciding with the optical axis and pointing toward the scene (Fig. 5). Since we are only interested in the ratio of image coordinates to the focal length, and one can always measure the image coordinates in units of focal length, we assume, without loss of generality, that the focal length is unity. We call such a camera model the normalized camera model. Thus, in the normalized camera model, the image plane is located at z = 1. Visible objects are always located in front of the camera, i.e. z > 0. Notice that 0 < z < 1 may occur since the camera model is normalized. Consider a point P on the object which is visible at two time instants. The following notation is used for the spatial vectors and the image vectors:

x = (x, y, z)ᵗ    spatial vector of P at time t₁ ;
x′ = (x′, y′, z′)ᵗ    spatial vector of P at time t₂ ;
X = (u, v, 1)ᵗ = (x/z, y/z, 1)ᵗ    image vector of P at time t₁ ;
X′ = (u′, v′, 1)ᵗ = (x′/z′, y′/z′, 1)ᵗ    image vector of P at time t₂ ;

where (u, v) and (u′, v′) are the image coordinates of the point. Therefore, the spatial vector and the image vector are related by

x = zX ,    x′ = z′X′ .
Figure 5 shows the geometry and the camera model of the setup. From the figure we can see that the image of a point determines nothing but the projection line, the line that passes through the point and the projection center. The direction of this projection line is all that we need, and the position of the image plane is immaterial. That is why we can normalize the focal length to unity. It is obvious that the model in Fig. 5 is not meant to describe the optical path in a conventional camera. Rather, it is a simple geometrical model that is mathematically equivalent to an ideal pin-hole camera. A conventional camera can be calibrated so that every point in the actual image plane can be transformed to a point in the image plane of this normalized model.
Fig. 5. The geometry and the camera model of the setup.
Let R and T be the rotation matrix and the translation vector, respectively. The spatial points at the two time instants are related by

x′ = R x + T ,

or, for image vectors,

z′ X′ = z R X + T .     (3.3)

If ||T|| ≠ 0, from (3.3) we get

(z′/||T||) X′ = (z/||T||) R X + T̄ ,     (3.4)

where

T̄ = T/||T|| .

Given n corresponding image vector pairs at two time instants, Xᵢ and X′ᵢ, i = 1, 2, . . . , n, the algorithm solves for the rotation matrix R. If the translation vector T does not vanish, the algorithm also solves for the translational direction represented by the unit vector T̄ and the relative depths zᵢ/||T|| and z′ᵢ/||T|| for the object points xᵢ and
x′ᵢ, respectively. The magnitude of the translation vector ||T||, and the absolute depths of the object points, zᵢ and z′ᵢ, cannot be determined by monocular vision. This can be seen from (3.4), which still holds when ||T||, zᵢ and z′ᵢ are multiplied by any positive constant. In other words, multiplying the depths and ||T|| by the same scale factor does not change the images.
3.2. Algorithm

We shall first state the algorithm, and then justify each of the steps.
Step (i). Solving for E. Let Xᵢ = (uᵢ, vᵢ, 1)ᵗ, X′ᵢ = (u′ᵢ, v′ᵢ, 1)ᵗ, i = 1, 2, . . . , n, be the corresponding image vectors of n (n ≥ 8) points, and

A = [ u₁u′₁  u₁v′₁  u₁  v₁u′₁  v₁v′₁  v₁  u′₁  v′₁  1 ]
    [ u₂u′₂  u₂v′₂  u₂  v₂u′₂  v₂v′₂  v₂  u′₂  v′₂  1 ]
    [   ···                                           ]
    [ uₙu′ₙ  uₙv′ₙ  uₙ  vₙu′ₙ  vₙv′ₙ  vₙ  u′ₙ  v′ₙ  1 ]     (3.5)

and

h = (h₁, h₂, h₃, h₄, h₅, h₆, h₇, h₈, h₉)ᵗ .     (3.6)

We solve for the unit vector h in

min_h ||Ah|| ,  subject to: ||h|| = 1 .     (3.7)

The solution of h is a unit eigenvector of AᵗA associated with the smallest eigenvalue. (Alternatively, the above problem can be transformed to a linear least squares problem by setting a nonvanishing component of h to one and moving the corresponding column to the right hand side.) The matrix E is determined by

E = √2 [ h₁  h₄  h₇ ]
       [ h₂  h₅  h₈ ]
       [ h₃  h₆  h₉ ] .     (3.8)
Step (ii). Determining a unit vector T_s, with T̄ = ±T_s. Solve for the unit vector T_s in

min_{T_s} ||Eᵗ T_s|| ,  subject to: ||T_s|| = 1 .     (3.9)

The solution of T_s is a unit eigenvector of EEᵗ associated with the smallest eigenvalue. If

Σᵢ (T_s × X′ᵢ) · (EXᵢ) < 0 ,     (3.10)
then T_s ← −T_s. The summation in (3.10) is over several values of i to suppress noise (usually three or four values of i will suffice).
Step (iii). Determining the rotation matrix R. Without noise, it follows that

E = [T_s]× R     (3.11)

or

Rᵗ[−T_s]× = Eᵗ .     (3.12)

In the presence of noise, we find the rotation matrix R in

min_R ||Rᵗ[−T_s]× − Eᵗ|| ,  subject to: R is a rotation matrix.     (3.13)

Alternatively, we can find R directly. Let

W = [W₁ W₂ W₃] = [E₁ × T_s + E₂ × E₃   E₂ × T_s + E₃ × E₁   E₃ × T_s + E₁ × E₂] .     (3.14)

Without noise, R = W. In the presence of noise, we find the rotation matrix R such that

min_R ||R − W|| ,  subject to: R is a rotation matrix.     (3.15)

We can use either (3.13) or (3.15) to compute R. They both have the form

min_R ||RC − D|| ,  subject to: R is a rotation matrix,     (3.16)

where C = [C₁ C₂ C₃] and D = [D₁ D₂ D₃]. The solution of (3.16) is as follows. Represent R by a unit quaternion q, related to R by Eq. (3.19), and define the 4 × 4 matrix B of (3.17) and (3.18) from the columns of C and D; then ||RC − D||² = qᵗBq (see (3.30) below), and the optimal q is a unit eigenvector of B associated with the smallest eigenvalue.
Step (iv). Checking T ≠ 0. If T ≠ 0, determine the sign of T̄. Let α be a small threshold (α = 0 without noise). If

||X′ᵢ × RXᵢ|| < α ||X′ᵢ|| ||Xᵢ||   for all 1 ≤ i ≤ n ,

then report T = 0. Otherwise determine the sign of T̄ as follows. If

Σᵢ (T_s × X′ᵢ) · (X′ᵢ × RXᵢ) > 0 ,     (3.20)

then T̄ = T_s. Otherwise T̄ = −T_s. Similar to (3.10), the summation in (3.20) is over several values of i.

Step (v). If T ≠ 0, estimate relative depths. For each i, 1 ≤ i ≤ n, find the relative depths

z̄ᵢ = zᵢ/||T|| ,   z̄′ᵢ = z′ᵢ/||T||     (3.21)

to minimize

|| [X′ᵢ   −RXᵢ] (z̄′ᵢ, z̄ᵢ)ᵗ − T̄ ||     (3.22)

using a standard least squares method for linear equations. A simple method to correct the structure based on the rigidity constraint is as follows. The corrected relative 3-D position (scaled by ||T||⁻¹) of point i at time t₂ equals x̂′ᵢ = (R(z̄ᵢXᵢ) + T̄ + z̄′ᵢX′ᵢ)/2. Its relative 3-D position (scaled by ||T||⁻¹) at time t₁ equals x̂ᵢ = R⁻¹(x̂′ᵢ − T̄).
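A compact sketch of steps (i)–(iii) follows. It builds A as in (3.5), takes h as the eigenvector of AᵗA with the smallest eigenvalue, reshapes √2·h into E with the column ordering assumed in (3.8), finds T_s from EEᵗ, fixes its sign with (3.10), and forms W from (3.14). Instead of the quaternion fit of (3.16)–(3.19) it projects W onto the nearest rotation matrix with an SVD, which is a substitution of mine, not the authors' procedure.

```python
import numpy as np

def skew(x):
    x1, x2, x3 = x
    return np.array([[0.0, -x3, x2], [x3, 0.0, -x1], [-x2, x1, 0.0]])

def linear_motion(X, Xp):
    """Steps (i)-(iii): estimate E, T_s (with sign) and R from n >= 8 image vector pairs.

    X, Xp: (n, 3) arrays of image vectors (u, v, 1) and (u', v', 1).
    """
    u, v = X[:, 0], X[:, 1]
    up, vp = Xp[:, 0], Xp[:, 1]
    ones = np.ones_like(u)
    A = np.column_stack([u*up, u*vp, u, v*up, v*vp, v, up, vp, ones])   # (3.5)

    # Step (i): h = unit eigenvector of A^t A with the smallest eigenvalue.
    _, vecs = np.linalg.eigh(A.T @ A)
    h = vecs[:, 0]
    E = np.sqrt(2.0) * h.reshape(3, 3).T        # column ordering assumed in (3.8)

    # Step (ii): T_s = unit eigenvector of E E^t with the smallest eigenvalue.
    _, vecs = np.linalg.eigh(E @ E.T)
    Ts = vecs[:, 0]
    if sum(np.cross(Ts, Xp[i]) @ (E @ X[i]) for i in range(min(4, len(X)))) < 0:
        Ts = -Ts                                 # sign test (3.10)

    # Step (iii): W of (3.14), then projection onto the rotation group (SVD substitute).
    E1, E2, E3 = E[:, 0], E[:, 1], E[:, 2]
    W = np.column_stack([np.cross(E1, Ts) + np.cross(E2, E3),
                         np.cross(E2, Ts) + np.cross(E3, E1),
                         np.cross(E3, Ts) + np.cross(E1, E2)])
    U, _, Vt = np.linalg.svd(W)
    R = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt
    return E, Ts, R
```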
3.3. Justification of the Algorithm

We now justify each step of the algorithm.

For Step (i). Let T_s be a unit vector that is aligned with T, i.e.

T_s × T = 0 .     (3.23)
Pre-crossing both sides of (3.4) by T_s, we get, using (3.1) and (3.2),

(z′/||T||) T_s × X′ = (z/||T||) [T_s]× R X .     (3.24)

Pre-multiplying both sides of (3.24) by X′ᵗ (inner product between vectors), we get

X′ᵗ [T_s]× R X = 0     (3.25)

since X′ᵗ(T_s × X′) = 0 and z > 0. Geometrically, (3.25) means that the three vectors X′, T_s and RX are coplanar, which can be seen from (3.3). Define E to be

E = [T_s]× R = [T_s × R₁   T_s × R₂   T_s × R₃] = [E₁ E₂ E₃]     (3.26)
where R = [R₁ R₂ R₃]. From the definition of T_s, the sign of E is arbitrary since the sign of T_s is arbitrary (as long as the sign of T_s and that of E match such that (3.26) holds). Using (3.26), the definition of E, we rewrite (3.25) as

X′ᵗ E X = 0 .     (3.27)

Our objective is to find E from the image vectors X and X′. Each point correspondence gives one equation (3.27) which is linear and homogeneous in the elements of E. n point correspondences give n such equations. Let Ē denote the 9-dimensional vector of the elements of E, ordered to match the columns of A in (3.5). Given n point correspondences, we rewrite (3.27) as linear equations in the elements of E and get

A Ē = 0 ,     (3.28)

where the coefficient matrix A is given in (3.5). In the presence of noise, we use (3.7). The solution of h in (3.7) is then equal to Ē up to a scale factor provided rank(A) = 8. The rank of the n × 9 matrix A cannot be larger than 8 since Ē is a non-zero solution of (3.28). Longuet-Higgins [37] gives a necessary and sufficient condition for the rank of A to fall below 8. Assuming the relative motion is due to motion of the camera, the condition is that the feature points do not lie on any quadratic surface that passes through the projection center of the camera at the two time instants. To satisfy this condition, at least eight points are required. More points are needed to combat noise. Since the sign of E is arbitrary, we need only to find the Euclidean norm of E to fully determine E (equivalently Ē) from h. Let T_s = (s₁, s₂, s₃)ᵗ. Noticing T_s is a unit vector and using (3.26), we get

||E||² = trace{EEᵗ} = trace{[T_s]× R ([T_s]× R)ᵗ} = trace{[T_s]× ([T_s]×)ᵗ} = ||[T_s]×||² = 2(s₁² + s₂² + s₃²) = 2 .

So, Ē = √2 h. This gives (3.8).
For Step (ii). We determine T_s. From (3.26), T_s is orthogonal to all three columns of E. We get EᵗT_s = 0. With noise, we use (3.9). It is easy to prove that the rank of E is always equal to 2. In fact, let Q₂ and Q₃ be such that Q = [T_s Q₂ Q₃] is an orthonormal 3 × 3 matrix. S = RᵗQ is then also orthonormal. Post-multiplying the two sides of the first equation of (3.26) by S, we get

E S = [T_s]× R S = [T_s]× Q = [0   T_s × Q₂   T_s × Q₃] .
We see that the second and the third columns of ES are orthonormal, according to the definition of Q. Thus, rank{E} = rank{ES} = 2. Since rank{E} = 2, the unit vector T_s is uniquely determined up to a sign by (3.9). To determine the sign of T_s such that (3.26) holds, we rewrite (3.24) using E = [T_s]× R:

(z′/||T||) T_s × X′ = (z/||T||) E X .     (3.29)

Since z > 0 and z′ > 0 for all the visible points, from (3.29) we know the two vectors T_s × X′ᵢ and EXᵢ have the same directions. If the sign of T_s is wrong, they have the opposite directions. Thus, if (3.10) holds, the sign of T_s should be changed.

For Step (iii). In steps (i) and (ii) we found E and T_s that satisfy (3.11). R can be determined directly by (3.14). We now prove that W in (3.14) is equal to R without noise:

R = [R₁ R₂ R₃] = [E₁ × T_s + E₂ × E₃   E₂ × T_s + E₃ × E₁   E₃ × T_s + E₁ × E₂] .
Using the identity (a × b) × c = (a · c)b − (b · c)a and (3.26), we get

E₁ × T_s + E₂ × E₃ = (T_s × R₁) × T_s + (T_s × R₂) × (T_s × R₃)
                   = R₁ − (R₁ · T_s)T_s + (T_s · R₁)T_s = R₁ .

This proves that the first column of R is correct. Similarly we can prove that the remaining columns of R are correct. In the presence of noise, however, the estimated E has errors, and so does the matrix determined by (3.14). In particular, W in (3.14) does not give a rotation matrix in general. For the same reason, generally, one cannot find a unit vector T_s and a rotation matrix R so that [T_s]× R = E if E has errors. This can be understood by considering the degrees of freedom in a correct E (3 for rotation and 2 for a unit T_s), which is smaller than the degrees of freedom, 8, in a unit h in (3.7). In other words, in solving for h in (3.7), we neglect the constraint in h. This is necessary to be able to derive a linear algorithm. The alternative steps, (3.13) and (3.15), re-consider such a constraint through matrix fitting. To solve the problem of (3.16), we represent the rotation matrix R in terms of a unit quaternion q. R and q are related by Eq. (3.19). We have [38]
||RC − D||² = qᵗ B q     (3.30)
where B is defined in (3.17) and (3.18). The problem of (3.16) is then reduced to the minimization of a quadratic form. The solution of the unit vector q in (3.30) is a unit eigenvector of B associated with the smallest eigenvalue. Note that R is uniquely determined in (3.12), since the rank of [−T_s]× is two and the positions of any two non-collinear vectors completely determine a rotation: if RX₁ = Y₁, RX₂ = Y₂, and X₁ × X₂ ≠ 0, then we have the third equation:
R(X₁ × X₂) = Y₁ × Y₂ ,  and  [X₁   X₂   X₁ × X₂]  has full rank.
For Step (iv). Pre-crossing both sides of (3.3) by X′, we get

0 = z X′ × R X + X′ × T .     (3.31)

If T = 0, for any point X′ we have (note z > 0)

X′ × R X = 0 .     (3.32)

If T ≠ 0, X′ × T ≠ 0 holds for all the points X′ (except at most one). Therefore, (3.32) cannot hold for all points by virtue of (3.31). In the algorithm, we normalize the image vectors in (3.32) and give a tolerance threshold α in the presence of noise.

From (3.31), if T̄ = T_s then T_s × X′ and X′ × RX have the same directions. Otherwise they have opposite directions since T̄ = −T_s. We use the sign of the inner product of the two vectors in (3.20) to determine the sign of T̄.
For Step (v). The equations for the least-squares solution (3.21) follow directly from (3.4). The idea for correcting structure based on rigidity is as follows. Moving the recovered 3-D points at time t₁ using the estimated rotation and translation, their new positions should coincide with the recovered positions at time t₂, if the data is noise free. However, in the presence of noise, the positions do not coincide. Here we adopt a simplistic way of removing this discrepancy: the midpoint between these two positions of a point at time t₂ is chosen as the corrected solution for the position of the point at time t₂. Moving the midpoint back gives the corrected 3-D position of the point at time t₁.
In summary, we have proved that if rank(A) = 8, the solution of R and T̄ is unique, and we have derived the closed-form solution. Given eight or more point correspondences, the algorithm first solves for the essential parameter matrix E. Then the motion parameters are obtained from E. Finally the spatial structure is derived from the motion parameters. All the steps of the algorithm make use of the redundancy in the data to combat noise. As a result of determining the signs in (3.10) and (3.20), the computation of three false solutions [28,29] is avoided. These steps for determining signs are stable in the presence of noise, since the decisions are made based on the signs of the inner products of two vectors which are in the same or opposite direction without noise. Summations over several points in (3.10) and (3.20) suppress the effects of the cases where two noise-corrupted small vectors are used, whose inner products are close to zero and whose signs are unreliable. If T ≠ 0 and the spatial configuration is nondegenerate, the rank of A is 8. In this case, we can determine the unit vector h in (3.7) up to a sign, and determine R and T̄ uniquely. If T = 0, any unit vector T_s satisfies (3.24) and so the matrix E, and correspondingly the unit vector h, have two degrees of freedom (notice T_s and h are restricted to be unit vectors). Therefore, A in (3.5) has a rank less than or equal to 6. If T = 0, the relative depths of the points cannot be determined. However, the rotation parameters can be determined even if T = 0.
3.4. Optimal Motion Estimation

The optimization is motivated by the following observations on the linear algorithms (including the one presented in Section 3.2): (a) With certain types of motion, even pixel level perturbations (such as the digitization noise of conventional CCD video cameras) may override the information characterized by the epipolar constraint, which is a key constraint used for determining motion and structure by linear algorithms. The epipolar constraint restricts only one of the two components of an image point displacement. The other component is related to the depth of the point and the motion. If this component is also used for motion estimation, the accuracy of the estimated motion parameters can be considerably improved. (b) Existing linear algorithms give closed-form solutions for the motion parameters. However, the constraints in the intermediate parameter matrix (the essential matrix E) are not fully used. The use of these constraints can improve the accuracy of the solution in the presence of noise.

The above considerations are unified under a general framework of optimal estimation: given the noise-contaminated point correspondences, we need the best estimator for the motion and structure parameters. In reality, the image coordinates of an object point, as well as the corresponding displacement vector in the image plane, are the results from a feature detector and the corresponding matcher, whose accuracy is influenced by a variety of factors including lighting, structure of the scene, image resolution and the performance of the feature matching algorithms. Thus, the observed 2-D image plane vectors uᵢ of image 1 and u′ᵢ of image 2 are noise-contaminated versions of the true ones. Let (uᵢ, u′ᵢ) be the observed value of a pair of random vectors (Uᵢ, U′ᵢ). (With n point correspondences over two time instants, we add subscripts i to denote the ith point. A subscript-free letter denotes a general example of the vectors.) What we obtain is a sequence of the observed image vector pairs
u ≜ (u₁ᵗ, (u′₁)ᵗ, u₂ᵗ, (u′₂)ᵗ, . . . , uₙᵗ, (u′ₙ)ᵗ)ᵗ

of a sequence of random vector pairs

U ≜ (U₁ᵗ, (U′₁)ᵗ, U₂ᵗ, (U′₂)ᵗ, . . . , Uₙᵗ, (U′ₙ)ᵗ)ᵗ .

We need to estimate the motion parameter vector m and the 3-D positions of the feature points (scene structure)

x = (x₁ᵗ, (x′₁)ᵗ, x₂ᵗ, (x′₂)ᵗ, . . . , xₙᵗ, (x′ₙ)ᵗ)ᵗ .
We assume that the errors are uncorrelated between the different components of a point and between different points. Let hᵢ(m, x) be the noise-free projection of the ith point in the first image, given motion m and structure x, and h′ᵢ(m, x) be the corresponding projection in the second image. Then, according to the principle of the minimum variance estimator, the optimal estimate of m and x is the one that minimizes

Σᵢ₌₁ⁿ ( ||uᵢ − hᵢ(m, x)||² + ||u′ᵢ − h′ᵢ(m, x)||² )     (3.33)
which is just the sum of discrepancies between the observed projections and the inferred projections. The value of (3.33) measures the differences between the observed images and the inferred images. Equation (3.33) involves both the motion parameters and the 3-D position of every feature point. The minimization is over all the possible motion parameters and scene structures. The parameter space for iteration is huge and the computation is very expensive. However, we do not have to iterate on the structure of the scene. In fact, given motion parameters m, the structure x that minimizes the value of (3.33) can be estimated analytically. That is, we can compute

min_x { ||uᵢ − hᵢ(m, x)||² + ||u′ᵢ − h′ᵢ(m, x)||² } ≜ gᵢ(m)     (3.34)

from a given m. In fact,

min_{m,x} Σᵢ₌₁ⁿ { ||uᵢ − hᵢ(m, x)||² + ||u′ᵢ − h′ᵢ(m, x)||² }
   = min_m { Σᵢ₌₁ⁿ min_x { ||uᵢ − hᵢ(m, x)||² + ||u′ᵢ − h′ᵢ(m, x)||² } } = min_m Σᵢ₌₁ⁿ gᵢ(m) .

So computationally, the structure x will not be included in the parameter space of the iteration. Given an m, x is computed directly as we will discuss in the following paragraph. This drastically reduces the amount of computation. Otherwise it is computationally extremely expensive to iterate on this huge (m, x) space (iteration on n points needs a (3n + 5)-dimensional parameter space!). Since the optimal structure x can be determined from the motion parameters, we can exclude x from the notation for the parameters to be estimated. That is, symbolically, the parameters to be determined are just m.
To derive the closed-form expression for x that gives gᵢ(m) in (3.34), we use the following methods. From the motion parameter vector m and the observed projections of point i, the two observed projection lines are determined. These two observed projection lines do not intersect in general. If the true 3-D point is on the observed projection line of the first image, the discrepancy ||uᵢ − hᵢ(m, x)||² is equal to zero, but ||u′ᵢ − h′ᵢ(m, x)||² generally is not. If the true 3-D point is on the other observed projection line, ||u′ᵢ − h′ᵢ(m, x)||² is equal to zero while ||uᵢ − hᵢ(m, x)||² is not. Given the motion parameters, we need to find a 3-D point for each feature point such that the corresponding term ||uᵢ − hᵢ(m, x)||² + ||u′ᵢ − h′ᵢ(m, x)||² is minimized. Obviously, under a normal configuration the point lies on the shortest line segment L that connects the two observed projection lines, because otherwise the perpendicular projection of a 3-D point onto L is better than the 3-D point. An exact solution of the optimal point requires solving a fourth order polynomial equation. It can be shown that using a reasonable approximation, we can get a closed-form solution. The optimal point is generally not far from the midpoint of the line segment L, unless the distance to the object and the viewing angle differ a lot between the two images. For computational efficiency, we may just use the midpoint of the line segment L as an approximate optimal point, which is the solution in (3.22). Computationally, a two-step approach is proposed here. First, a linear algorithm is applied which gives a closed-form solution. Then in the second step, this solution is used as an initial guess for an iterative algorithm which improves the initial guess to minimize the objective function (3.33). This two-step approach has the following advantages:
(1) A solution is generally guaranteed. The linear algorithm always gives a solution provided that degeneracy does not occur. Unless the noise level is very high, this solution is close to the true one. As long as the initial guess is within the convergence region of a globally optimal point, iteration leads to the optimal solution.

(2) The approach yields reliable solutions. The linear algorithms use only the epipolar constraint and so the solution is sensitive to noise, and the reliability of solutions varies with motion types. The optimization in the second step employs more global constraints and achieves significant improvements over the first step.

(3) The computation is faster than straight iterative methods that start with a “zero” initial guess. Generally, a linear algorithm is fast, and a nonlinear algorithm is slow. When a linear algorithm is followed by a nonlinear algorithm, the amount of computation is not simply equal to the sum of those needed by each algorithm individually. Since the linear algorithm provides a good initial guess, the time taken by the nonlinear algorithm to reach a solution is greatly reduced.
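Returning to the analytic structure estimate used inside gᵢ(m): the approximate 3-D point for one correspondence can be taken as the midpoint of the shortest segment between the two projection lines, as discussed above. The sketch below is such a midpoint computation under assumed ray parametrizations and names of mine.

```python
import numpy as np

def midpoint_structure(X, Xp, R, T):
    """Approximate 3-D point (in the first camera frame) for one correspondence.

    Line 1: s * X through the first projection center (the origin).
    Line 2: -R^t T + t * (R^t X'), the second projection line mapped into frame 1.
    Returns the midpoint of the shortest segment joining the two lines.
    """
    d1 = X / np.linalg.norm(X)
    d2 = R.T @ Xp
    d2 = d2 / np.linalg.norm(d2)
    o2 = -R.T @ T
    w = -o2                                  # o1 - o2, with o1 = 0
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    denom = a * c - b * b
    if abs(denom) < 1e-12:                   # nearly parallel rays
        s, t = 0.0, (d2 @ w) / c
    else:
        s = (b * (d2 @ w) - c * (d1 @ w)) / denom
        t = (a * (d2 @ w) - b * (d1 @ w)) / denom
    p1 = s * d1
    p2 = o2 + t * d2
    return 0.5 * (p1 + p2)
```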
4. Motion Modeling and Prediction

In general, moving objects exhibit a smooth motion, i.e. the motion parameters between consecutive image pairs are correlated. From this assumption and given a sequence of images of a moving rigid object, we determine what kind of local motion the object is undergoing. A Locally Constant Angular Momentum model, or LCAM model for short, is introduced. The model assumes short term conservation of angular momentum and a polynomial curve as the trajectory of the rotation center. This constraint is the precise statement of what we mean by smoothness of motion. However, we allow the angular momentum, and hence the motion characteristics of the object, to change or evolve over the long term. Thus, we do not constrain the object motion by some global model of allowed dynamics. We will give a closed-form solution for the motion parameters and structure from a sequence of images. As a result of the analysis presented in this section, some of the questions that we can answer are: whether there is precession or tumbling; what the precession is if it exists; how the rotation center of the object (which may be an invisible point!) moves in space; what the future motion would probably be; where a particular object point would be located in image frames or in 3-D at the next several time instants; where the object would be if it is missing from an image subsequence; and what the motion before the given sequence could be.

This approach to motion modeling and prediction is based on the two-view motion analysis of image sequences consisting of either monocular images or stereo image pairs. Generally, two-view motion does not represent the actual continuous motion undergone by the object between the two time instants. The physical location of the rotation axis is not determined by such a two-view position transformation. Using a single camera, the 3-D translation and the range of the object can be determined up to a scale factor. If stereo cameras are used, we can determine the absolute translation velocities and the ranges of object points. The approach presented in this section is independent of the type of algorithm used to determine the two-view motion parameters. To be specific, feature points are used for the discussion here. We assume that there is a single rigid object in motion, the correspondences of feature points between images are given, and the motion does not exhibit any discontinuities such as those caused by collisions.

4.1. Motion of a Rigid Body in 3-D
We first present the laws of physics that govern the motion of a rigid body. All external forces acting on a body can be reduced to a total force F acting on a suitable point Q, and a total applied torque N about Q. For a body moving freely in space, the center of mass is to be taken as the point Q. If the body is constrained to rotate about a fixed point, then that point is to be taken as the point Q. That point may move with the supports. Letting m be the mass of the body, the motion
of the center of mass is given by

F = m (d²Q/dt²) .     (4.1)

Let L be the angular momentum of the body. The torque N and the angular momentum L satisfy [39,40]:

N = dL/dt .     (4.2)

The rotation is about the point Q, which will be referred to as the rotation center. In the remainder of this subsection, we concentrate on the rotation part of the motion. The motion of the rotation center Q will be discussed in the next subsection. In matrix notation, the angular momentum L can be represented by L = Gω, or, writing in components:
[ L_x ]   [ g_xx  g_xy  g_xz ] [ ω_x ]
[ L_y ] = [ g_yx  g_yy  g_yz ] [ ω_y ]
[ L_z ]   [ g_zx  g_zy  g_zz ] [ ω_z ]

where

g_zx = g_xz = −∫ zx dm ,   g_zy = g_yz = −∫ zy dm ,   g_yx = g_xy = −∫ xy dm .
The above integrals are over the mass of the body. If the coordinate axes are the principal axes of the body [39,40], the inertia tensor G takes the diagonal form:

G = [ g_xx    0      0   ]
    [   0   g_yy     0   ]
    [   0     0    g_zz  ] .     (4.3)
Referring to a coordinate system fixed on such a rotating body, (4.2) becomes

n_x = g_xx ω̇_x + ω_y ω_z (g_zz − g_yy) ,
n_y = g_yy ω̇_y + ω_z ω_x (g_xx − g_zz) ,
n_z = g_zz ω̇_z + ω_x ω_y (g_yy − g_xx) ,

where (n_x, n_y, n_z)ᵗ = N. These are known as Euler's equations for the motion of a rigid body. These equations are nonlinear and generally have no closed-form solutions. Numerical methods are generally needed to solve them.
Clearly the motion of a rigid body under external forces is complicated. In fact, even under no external forces, the motion remains complex. Perspective projection adds further complexity to the motion as observed in the image. However, in a short time interval, realistic simplifications can be introduced. One simplification occurs if we ignore the impact of the external torque over short time intervals. If there is no external torque over a short time, there is no change in the angular momentum of the object. Thus, if we have a dense temporal sequence of images, we can perform motion analysis over a small number of successive frames under the assumption of locally constant angular momentum. Another simplification occurs if the body possesses an axis of symmetry. The symmetry here means that at least two of g_xx, g_yy, g_zz in (4.3) are equal. Cylinders and disks are such examples. Most satellites are also symmetrical or almost symmetrical in this sense. Under the above two simplifications, Euler's equations are integrable [39,40]. The motion is such that the body rotates about its axis of symmetry m, and at the same time the axis rotates about a spatially fixed axis l. The motion can be represented by a rotating cone that rolls along the surface of a fixed cone without slipping, as shown in Fig. 6, where the body is fixed on the rolling cone, the axis of symmetry coincides with that of the rolling cone, and the center of mass or the fixed point Q of the body coincides with the apices of the cones. Then, the motion of the rolling cone is the same as the motion of the body. Figure 6 gives three possible configurations of the rolling cone and the fixed cone.
Fig. 6. The precessional motion of a symmetrical rigid body.
Let ω_l be the angular velocity at which the rolling cone rotates about l, and ω_m be the angular velocity at which the rolling cone rotates about its own axis of symmetry m. Then the instantaneous angular velocity ω is the vector sum of ω_l and ω_m, as shown in Fig. 6. The magnitudes of ω_m and ω_l are constant. Thus,
the magnitude of the instantaneous angular velocity is also constant. This kind of motion about a point is called precession in the following sections, and it represents the restriction imposed by our model on the allowed object rotation. A special case occurs when m is parallel to l. Then ω is also parallel to l. Therefore, the instantaneous rotation axis does not change its orientation in motion. This type of motion is called motion without precession.

4.2. Motion of Rotation Center
The location of the rotation center Q(t) changes with time. Assume the trajectory of the rotation center is smooth, or specifically, that it can be expanded into a Taylor series:

Q(t) = Σ_{j=0}^{∞} (1/j!) (d^jQ(t₀)/dt^j) (t − t₀)^j .
If the time intervals between image frames are short, we can estimate the trajectory by the first k terms. We get a polynomial of time t. The coefficients of the polynomial are three-dimensional vectors. Letting

(1/j!) d^jQ(t₀)/dt^j = b_{j+1} ,   j = 0, 1, 2, . . . , k − 1 ,

we have

Q_i = b₁ + b₂(t_i − t₀) + b₃(t_i − t₀)² + · · · + b_k(t_i − t₀)^{k−1} .     (4.5)
For simplicity, we assume the time intervals between image frames are a constant c, i.e. t_i = ci + t₀. From (4.5) we get

Q_i = b₁ + b₂(ci) + b₃(ci)² + · · · + b_k(ci)^{k−1} .     (4.6)

Letting a_j = c^{j−1} b_j, j = 1, 2, . . . , k, we get

Q_i = a₁ + a₂ i + a₃ i² + · · · + a_k i^{k−1} .     (4.7)
Equation (4.7) is the model for the motion of the rotation center. The basic assumption we made is that the trajectory can be approximated by a polynomial. If the motion is smooth and the time interval covered by the model is relatively short, Eq. (4.7) is a good approximation of the trajectory. In the sense of dynamics, (4.7) implies that the total force acting on the center of rotation has zero high order temporal derivatives. A polynomial trajectory of center of rotation in (4.7) together with the precession model presented in the previous subsection, gives the complete LCAM model [42]. The model is characterized by locally constant angular momentum, i.e. the angular momentum of the moving object can be treated as constant over short time intervals.
A point should be mentioned here. Though we derive the model from the assumption of constant angular momentum and object symmetry, the condition leading to such motion is not unique. In other words, the motion model we derived applies to any moving object whose rotation can be locally modeled by such motion: a rotation about a fixed-on-body axis that itself rotates about a spatially fixed axis, and whose translation can be locally modeled by a vector polynomial. It is important to motivate the kinematics from dynamic conditions. But in reality, many different dynamic conditions may result in the same type of motion.

Our goal here is to understand the 3-D motion of an object over an extended time period using the two-view motion analysis of images taken at consecutive time instants. Thus we first estimate the motion parameters of the moving object from the images taken at two time instants, using the method presented in the previous section. Such motion parameters give the displacement between two time instants and do not describe the actual motion, since the object can move arbitrarily between the two time instants. The displacement can be represented by a rotation about an axis located at the origin of a world coordinate system, and a translation [41]. We have called this displacement two-view motion. Let the column vector p₀ be the 3-D coordinates of any object point at time t₀; let p₁ be that of the same point at time t₁, R₁ be the rotation matrix from time t₀ to t₁, and T₁ be the corresponding translation vector. Then, p₀ and p₁ are related by

p₁ = R₁ p₀ + T₁     (4.8)

where R₁ represents a rotation about an axis through the origin. Given a set of point correspondences, R₁ and T₁ can be determined by two-view motion analysis. In the case of monocular vision, the translation vector can only be determined up to a positive scale factor, i.e. only the direction of T, T̄ = T/||T||, can be determined from the perspective projection. In Eq. (4.8), letting p₀ be at the origin, it is clear that T₁ is just the translation of the point at the origin. For any point Q₀, we can translate the rotation axis so that it goes through Q₀ and rotate p₀ about the axis at its new location. Mathematically, from (4.8) it follows that

p₁ = R₁(p₀ − Q₀) + (R₁Q₀ + T₁) .     (4.9)

Compared with (4.8), (4.9) tells us that the same motion can be represented by rotating p₀ about Q₀ by R₁, and then translating by R₁Q₀ + T₁. Because Q₀ is arbitrarily chosen, there are infinitely many ways to select the location of the rotation axis. This is an ambiguity problem in motion understanding from image sequences. If we let the rotation axis always be located at the origin, the trajectory described by Rᵢ and Tᵢ, i = 1, 2, 3, . . . , would be like that shown in Fig. 7, which is very unnatural. In Fig. 7 the real trajectory of the center of the body is the dashed line. However, neither the rotation nor the translation components show this trajectory. As
Fig. 7. Trajectory described by Rᵢ and Tᵢ if the rotation axis is always located at the origin.
As we discussed in Section 4.1, the center of mass of a body in free motion satisfies Newton's equation of motion of a particle (4.1). Rotation is about the center of mass (or the fixed point if it exists). Thus, motion should be expressed in two parts: the motion of the rotation center (the center of mass or the fixed point), and the rotation about the rotation center. Let $Q_i$ be the position vector of the rotation center at time $t_i$, $R_i$ be the rotation matrix from $t_{i-1}$ to $t_i$, and $T_i$ be the translation vector from $t_{i-1}$ to $t_i$. From (4.8) it follows that $Q_1 = R_1 Q_0 + T_1$, or

$$-R_1 Q_0 + Q_1 = T_1 \, .$$
Similarly we get equations for the motion from $t_{i-1}$ to $t_i$:

$$-R_i Q_{i-1} + Q_i = T_i \, , \qquad i = 1, 2, \ldots, f \, . \qquad (4.10)$$

Equations (4.10) give the relationship among the locations of the rotation center, the two-view rotation matrices and the two-view translation vectors. Substituting (4.7) into (4.10), we get
$$(I - R_1)a_1 + a_2 + a_3 + \cdots + a_k = T_1 \, ,$$
$$(I - R_2)a_1 + (2I - R_2)a_2 + (4I - R_2)a_3 + \cdots + (2^{k-1}I - R_2)a_k = T_2 \, ,$$
$$\cdots$$
$$(I - R_f)a_1 + (fI - (f-1)R_f)a_2 + (f^2 I - (f-1)^2 R_f)a_3 + \cdots + (f^{k-1}I - (f-1)^{k-1}R_f)a_k = T_f \, . \qquad (4.11)$$

Vector equations (4.11) are referred to as the coefficient equations. Both sides of the equations are three-dimensional vectors. There are $f$ equations in $k$ unknown
three-dimensional vectors. Let $A = (a_1^t, a_2^t, \ldots, a_k^t)^t$, $T = (T_1^t, T_2^t, \ldots, T_f^t)^t$, and $D$ be the coefficient matrix of the unknowns in (4.11). Let the element of $D$ at the $i$th row and $j$th column be the $3 \times 3$ matrix $D_{ij}$, i.e. $D = [D_{ij}]_{f \times k}$. We have

$$D_{ij} = i^{j-1} I - (i-1)^{j-1} R_i \, .$$

We can rewrite the coefficient equations (4.11) as

$$D A = T \, . \qquad (4.12)$$
$D$ and $T$ are determined by two-view motion analysis. The problem here is to determine $A$, the coefficients of the polynomial in (4.7).

4.3. Solutions of the Coefficient Equation

Let $f = k$ in (4.11). Then the matrix $D$ is a square matrix. We wish to know whether the linear equations (4.12) have a solution. If a solution exists, is it unique? If it is not unique, what is the general solution? The solution of the coefficient equations depends on the type of motion, or the rotation matrices $R_i$ and the translation vectors $T_i$. Let us first consider a simpler case, where $k = 2$. This means that the trajectory of the rotation center is locally approximated by a motion of constant velocity. Three frames are used in this case. The coefficient equations become
$$(I - R_1)a_1 + a_2 = T_1 \, , \qquad (4.13)$$
$$(I - R_2)a_1 + (2I - R_2)a_2 = T_2 \, . \qquad (4.14)$$
Solving for a2 in (4.13) and substituting it into (4.14), we get
$$(I - 2R_1 + R_2 R_1)a_1 = (2I - R_2)T_1 - T_2 \, . \qquad (4.15)$$

If $I - 2R_1 + R_2 R_1$ is nonsingular, $a_1$ can be uniquely determined from (4.15):

$$a_1 = (I - 2R_1 + R_2 R_1)^{-1}\big((2I - R_2)T_1 - T_2\big) \, .$$

Then $a_2$ is determined from (4.13):

$$a_2 = T_1 - (I - R_1)a_1 \, .$$
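A compact numerical illustration of this $k = 2$ closed form (a hedged sketch, not the authors' code; the rotation helper and the sample motions are made up for the example):

```python
import numpy as np

def rot_axis(n, theta):
    """Rodrigues rotation about unit axis n by angle theta."""
    n = np.asarray(n, dtype=float) / np.linalg.norm(n)
    K = np.array([[0, -n[2], n[1]], [n[2], 0, -n[0]], [-n[1], n[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

I3 = np.eye(3)
R1, T1 = rot_axis([0, 0, 1], 0.2), np.array([1.0, 0.5, 0.0])
R2, T2 = rot_axis([0, 1, 0], 0.3), np.array([0.9, 0.6, 0.1])

M = I3 - 2 * R1 + R2 @ R1                          # coefficient of Eq. (4.15)
a1 = np.linalg.solve(M, (2 * I3 - R2) @ T1 - T2)   # unique if M is nonsingular
a2 = T1 - (I3 - R1) @ a1                           # back-substitute into Eq. (4.13)
```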
It can be shown [42] that $(I - 2R_1 + R_2 R_1)$ is nonsingular if and only if the following two conditions are both satisfied: (1) the axes of rotations, represented by $R_1$ and $R_2$, respectively, are not parallel; (2) neither rotation angle is zero. Condition (2) is usually satisfied if the motion is not pure translation. If condition (1) is not satisfied, the solution of Eqs. (4.13) and (4.14) is not unique and has some structure. To show this, assume the rotation axes of $R_1$ and $R_2$ are parallel. Let
$w$ be any vector parallel to these axes. Because any point on the rotation axis remains unchanged after rotation, we have $R_1 w = w$, $R_2 w = w$. For any solution $a_1$ and $a_2$, $a_1 + cw$ and $a_2$ is another solution, where $c$ is an arbitrary real constant. Therefore, there exist infinitely many solutions. The following theorem presents the results for the general case.
Theorem 1 [42]. In the coefficient equations, let $f = k$. Define $S_i$ to be a $3 \times 3$ matrix and define the numbers $u_{i,j}$ as in [42]. Then the coefficients satisfy a set of $k$ vector equations, the first of which involves only $a_1$ (cf. Eq. (4.18) below).

If $S_k^0$ is not singular, the first equation given by Theorem 1 uniquely determines $a_1$. Then $a_k, a_{k-1}, \ldots, a_2$ can be determined, sequentially, by the second, third, ..., and last equations in Theorem 1. Thus, if $S_k^0$ is not singular, the solution is unique.
Theorem 2 [42]. In the case of rotation without precession, let $w$ be any column vector parallel to the rotation axes; then

$$S_i^0 w = 0 \, , \qquad (4.16)$$

and for any vector $a$,

$$(S_i^0 a) \cdot w = 0 \, . \qquad (4.17)$$
Using Theorem 1 gives

$$S_k^0 a_1 = -\sum_{l=1}^{k} S_k^l T_l \, . \qquad (4.18)$$
In the case of rotation without precession, Eq. (4.16) implies $S_k^0$ is singular. From (4.17), the left-hand side of (4.18) is orthogonal to $w$. However, if the real trajectory of the rotation center is not exactly a $j$th degree polynomial with $j \le k - 1$ in (4.7), the right-hand side of (4.18) can be any vector, which may not be orthogonal to $w$. This means that no solution exists for Eq. (4.18). If the real trajectory is a $j$th degree polynomial with $j \le k - 1$, then Eq. (4.18) has a solution by our derivation of (4.18). Since Eq. (4.7) is usually only an approximation of the real trajectory, a least-squares solution of (4.18) can serve our purpose. Let $\hat{a}_1$ be a least-squares solution of (4.18), which is solved by using independent columns of $S_k^0$. If the rank of $S_k^0$ is 2, which is generally true for motion without precession, the general solution is then $a_1 = \hat{a}_1 + cw$, where $c$ is any real number. All general solutions $\{\hat{a}_1 + cw\}$ form a straight line in 3-D space. From Eq. (4.7), this line gives the location and direction of the two-view rotation axis of the motion between time instants $t_0$ and $t_1$. From Theorem 2 it follows that
$$S_{k-1}^0 a_1 = S_{k-1}^0 \hat{a}_1 \, , \quad S_{k-2}^0 a_1 = S_{k-2}^0 \hat{a}_1 \, , \quad \ldots \, , \quad S_1^0 a_1 = S_1^0 \hat{a}_1 \, .$$
Based on the equations given by Theorem 1, the unknowns $a_k, a_{k-1}, \ldots, a_2$ are determined without knowing the undetermined number $c$. If the motion is pure translation without rotation, all the rotation matrices $R_i$, $i = 1, 2, \ldots, k$, are the identity matrix $I$. $S_k^0$ is then the zero matrix. The first three columns of $D$ are zero, and $a_1$ cannot be determined by the coefficient equations. From Theorem 1, $a_2, a_3, \ldots, a_k$ can still be determined by the coefficient equations. Because no rotation exists, any point can be considered as a rotation center. Equation (4.7) can be used to approximate the trajectory of any object point. Thus the solutions of the coefficient equations can be summarized as follows. (1) In the case of rotation with precession, the solution of the coefficient equations is generally unique. The trajectory of the rotation center is described by (4.7). (2) In the case of rotation without precession, the general solution of $a_1$ gives the two-view rotation axis of the first two-view motion. All other coefficients $a_2, a_3, \ldots, a_k$ are generally determined uniquely by Theorem 1. Thus, the two-view rotation axes of all two-view motions are determined by (4.7). Because no precession exists, any point on the rotation axis can be considered as the rotation
center. This is the meaning of the general solution $a_1$. Once a particular point on the rotation axis is chosen as the rotation center, its trajectory is described by Eq. (4.7). There are infinitely many possible “parallel” trajectories of the rotation center depending on which point on the axis is chosen as the rotation center. (3) In the case of pure translation without rotation, $a_2, a_3, \ldots, a_k$ can still be determined by the coefficient equations. However, $a_1$ cannot be determined by the coefficient equations. $a_1$ can be chosen to be the position of any object point at time $t_0$. Then Eq. (4.7) describes the trajectory of this point. In the presence of noise, both a large number of point correspondences and a large number of image frames provide overdetermination. The algorithm presented in Section 3 can be used for the closed-form least-squares solution of two-view motion parameters. To use overdetermination based on a large number of frames, we let $f > k$ in the coefficient equations (4.11). In fact, the coefficient matrix $S_k^0$ is essentially a high order difference [42]. $S_k^0$ tends to be ill-conditioned when $k$ gets large. This means $f > k$ is more important when $k$ is large. If $f > k$, Eq. (4.12) can be solved by a least-squares method. We find a solution $A$ to minimize $\|DA - T\|^2$.
In the case of motion with precession, all the columns of $D$ are generally independent. The least-squares solution is

$$A = (D^t D)^{-1} D^t T \, .$$
In the case of motion without precession, the column vectors of $D$ are linearly dependent. This can be shown by letting $a_1$ in Eq. (4.11) be a non-zero vector parallel to the two-view rotation axes; then the linear combination of the first three columns of $D$ by $a_1$ is a zero vector. To get the least-squares solution of the coefficient equations (4.11), the largest set of independent columns of $D$ should be found, or tolerance-based column pivoting should be used. Theorem 1 solves $a_2, a_3, \ldots, a_k$. This means the last $3k - 3$ columns of $D$ are always independent. In the presence of noise the columns of $D$ are very unlikely to be exactly linearly dependent, even in the case of motion without precession.
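A minimal numpy sketch (not the authors' implementation) of assembling the coefficient matrix $D$ with blocks $D_{ij} = i^{j-1}I - (i-1)^{j-1}R_i$ and solving Eq. (4.12) in the least-squares sense; np.linalg.lstsq uses the pseudo-inverse, so it also tolerates the rank deficiency that arises for motion without precession.

```python
import numpy as np

def solve_coefficient_equations(Rs, Ts, k):
    """Least-squares solution of D A = T, Eq. (4.12).

    Rs : list of f two-view rotation matrices R_1..R_f (3x3 each)
    Ts : list of f two-view translation vectors T_1..T_f
    k  : order of the trajectory polynomial in Eq. (4.7)
    Returns the coefficient vectors a_1..a_k stacked in a (k, 3) array.
    """
    f = len(Rs)
    D = np.zeros((3 * f, 3 * k))
    T = np.concatenate(Ts)
    I3 = np.eye(3)
    for i in range(1, f + 1):
        for j in range(1, k + 1):
            Dij = (i ** (j - 1)) * I3 - ((i - 1) ** (j - 1)) * Rs[i - 1]
            D[3 * (i - 1):3 * i, 3 * (j - 1):3 * j] = Dij
    A, *_ = np.linalg.lstsq(D, T, rcond=None)   # handles rank-deficient D
    return A.reshape(k, 3)
```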
4.4. Continuous and Discrete Motions
The LCAM model we discussed is based on continuous precessional motion. We must find the relationship between continuous precession and two-view motion, before we can estimate the precessional parameters of our model based on discrete two-view motions. As we discussed in Section 4.1, a precession can be considered as the motion of a rolling cone which rolls without slipping upon a fixed cone. The angular frequency
Fig. 8. The relation between rotation angles $\theta$ and $\phi$.
at which the symmetrical axis of the rolling cone rotates about the fixed cone is constant. Assume that at time $t_1$ an edge point $A'$ on the rolling cone touches an edge point $A$ on the fixed cone, as shown in Fig. 8. After a certain amount of rolling, the touching points become $B'$ on the rolling cone and $B$ on the fixed cone at time $t_2$. Let $\theta$ be the central angle of points $A'$ and $B'$, and $\phi$ be that of $A$ and $B$. Let $r$ and $r'$ be the radii of circles $O$ and $O'$, respectively. The arc length between $A$ and $B$ is equal to that between $A'$ and $B'$. Thus, $\phi r = \theta r'$, or $\phi \sin\alpha = \theta \sin\beta$, where $\alpha$ and $\beta$ are the generating angles of the fixed cone and the rolling cone, respectively. We get

$$\frac{\theta}{\phi} = \frac{\sin\alpha}{\sin\beta} \, . \qquad (4.19)$$

The precession consists of two rotational components. One is the rotation of the rolling cone about its own symmetrical axis. The other is the rotation of the rolling cone about the fixed cone. From Fig. 8 it can be readily seen that the relative position of the rolling cone and the fixed cone is uniquely determined if the touching points of the two cones are determined. Or alternatively, starting from the previous position, the new position of the rolling cone is determined if the two angles $\phi$ and $\theta$ are determined. Thus, no matter how we order these two rotational components, the final positions are identical as long as the angles $\phi$ and $\theta$ are kept unchanged. We can first rotate the rolling cone about its axis $m$ and then rotate the rolling cone about the axis of the fixed cone, $l$, or vice versa. We hope to find the equivalent two-view rotation axis of this continuous motion between the two frames at time $t_1$ and time $t_2$, respectively, in Fig. 8. If we can find two fixed points which stay in the same positions before and after the motion, then the two-view rotation axis must go through these points. One trivial fixed point is the apex $Q$ of the cones. Another fixed point can be found as follows. In Fig. 9 let the midpoint of arc $AB$ touch the rolling cone (at time $(t_1 + t_2)/2$). Extend line $OB$
Fig. 9. Finding fixed points for two-view rotation.
so that it intersects the plane containing $Q$, $O'$ and $B'$ at a point $P_1$. Extend line $OA$ so that it intersects the plane containing $Q$, $O'$ and $A'$ at a point $P_2$. Draw a circle centered at $O$ and passing through $P_1$ and $P_2$. Then the midpoint $P$ of arc $P_1 P_2$ is a fixed point. This can be seen by noting that the rolling cone can also reach its position at the next time instant $t_2$ in an alternative manner, as follows. First, rotate the rolling cone (slipping along the fixed cone) about $l$ by angle $\phi/2$, thus rotating $P$ to its new position at $P_1$; axis $m$ reaches the position shown in Fig. 9. Then rotate the rolling cone (slipping on the fixed cone) about its own axis $m$ by angle $\theta$. Point $P$ now reaches position $P_2$. Finally, rotate the rolling cone (slipping along the fixed cone) about $l$ again by angle $\phi/2$, taking the rolling cone to the position at time instant $t_2$. This takes the point $P$ back to its starting position. Therefore, the two-view rotation axis found by two-view motion analysis from two image frames goes through $Q$ and $P$. Notice that the angular frequency at which the symmetrical axis of the rolling cone rotates about the fixed cone is constant. From the way of finding $P$, it is clear that the two-view rotation axis also rotates about $l$ by a constant angle between consecutive frames. Thus, we have the following theorem:
Theorem 3. If a rigid body undergoes a precessional motion of the LCAM model, the two-view rotation axis between constant time intervals changes by rotating about the precessional vector by a constant angle. □

Without loss of generality, we assume the time intervals between consecutive image frames are of unit length. We define the precessional vector to be a unit vector $l$ parallel to the symmetrical axis of the fixed cone, the precessional angular frequency $\phi$ to be the angular frequency at which the symmetrical axis of the rolling
cone rotates about the precessional axis, the $i$th body vector $m_i$ to be a unit vector parallel to the symmetrical axis of the rolling cone at time $t_i$, and the body rotation angular frequency $\theta$ to be the angular frequency at which the rolling cone rotates about its symmetrical axis (see Fig. 10).
Fig. 10. Parameters of continuous precession and discrete two-view motion.
From image sequences we find estimates of two-view motion parameters. They are the $i$th two-view rotation axis vector $n_i$, a unit vector parallel to the two-view rotation axis between time instants $t_{i-1}$ and $t_i$; the corresponding $i$th two-view rotation angle $\psi_i$; and the $i$th two-view translation vector $T_i$. Figure 10 shows the precession parameters of continuous motion and discrete two-view motion. Let $R(n, \theta) = [r_{ij}]$ denote the rotation matrix representing a rotation with axis unit vector $n = (n_x, n_y, n_z)$ and rotation angle $\theta$; then $R(n, \theta)$ is given by
$$R(n, \theta) = \begin{bmatrix}
(n_x^2 - 1)(1-\cos\theta) + 1 & n_x n_y(1-\cos\theta) - n_z\sin\theta & n_x n_z(1-\cos\theta) + n_y\sin\theta \\
n_y n_x(1-\cos\theta) + n_z\sin\theta & (n_y^2 - 1)(1-\cos\theta) + 1 & n_y n_z(1-\cos\theta) - n_x\sin\theta \\
n_z n_x(1-\cos\theta) - n_y\sin\theta & n_z n_y(1-\cos\theta) + n_x\sin\theta & (n_z^2 - 1)(1-\cos\theta) + 1
\end{bmatrix} \qquad (4.20)$$
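A direct Python/numpy rendering of Eq. (4.20) (an illustrative sketch, not the authors' code):

```python
import numpy as np

def rotation_matrix(n, theta):
    """Rotation matrix R(n, theta) of Eq. (4.20) for a unit axis n = (nx, ny, nz)."""
    nx, ny, nz = n
    c, s = np.cos(theta), np.sin(theta)
    return np.array([
        [(nx * nx - 1) * (1 - c) + 1, nx * ny * (1 - c) - nz * s, nx * nz * (1 - c) + ny * s],
        [ny * nx * (1 - c) + nz * s, (ny * ny - 1) * (1 - c) + 1, ny * nz * (1 - c) - nx * s],
        [nz * nx * (1 - c) - ny * s, nz * ny * (1 - c) + nx * s, (nz * nz - 1) * (1 - c) + 1],
    ])
```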
Theorem 4. The continuous precession parameters and discrete two-view motion parameters are related by

$$R(n_i, \psi_i) = R(l, \phi)\, R(m_{i-1}, \theta) \, , \qquad (4.21)$$
$$R(n_i, \psi_i) = R(m_i, \theta)\, R(l, \phi) \, . \qquad (4.22)$$
Proof. From time $t_{i-1}$ to time $t_i$, the body moves from its previous position to a new position. From Fig. 10 the new position of the rolling cone (or the body) can be reached in the following way: First, the rolling cone rotates about its body vector $m_{i-1}$ by angle $\theta$. Then, the rolling cone rotates about the precessional
vector $l$ by angle $\phi$. The two-view motion combines these two motions into one, which is the rotation about the two-view rotation axis vector $n_i$ by angle $\psi_i$. We get Eq. (4.21). Similarly, if we change the order of these two rotational components we get Eq. (4.22). □

From Theorem 3, the two-view rotation axis rotates about the precessional vector. Therefore, the precessional vector $l$ is perpendicular to $n_i - n_{i-1}$ and $n_{i-1} - n_{i-2}$. The sign of $l$ is arbitrary. Thus, $l$ can be determined by

$$l = \frac{(n_i - n_{i-1}) \times (n_{i-1} - n_{i-2})}{\|(n_i - n_{i-1}) \times (n_{i-1} - n_{i-2})\|} \, . \qquad (4.23)$$
We will assume that the precessional angular frequency, body rotation angular frequency and two-view rotation angle are not larger than half a turn between every two consecutive frames. This assumption is necessary in practice, since the rotation must be small enough for matching to be possible. The precessional angular frequency $\phi$ is equal to the angle between $n_i \times l$ and $n_{i-1} \times l$:

$$\phi = \cos^{-1} \frac{(n_{i-1} \times l) \cdot (n_i \times l)}{\|n_{i-1} \times l\| \, \|n_i \times l\|} \, . \qquad (4.24)$$
The sign of $\phi$ is the same as the sign of

$$(n_{i-1} \times n_i) \cdot l \, . \qquad (4.25)$$
After $l$ and $\phi$ are found by Eqs. (4.23), (4.24) and (4.25), $R(l, \phi)$ can be calculated by (4.20). $R(m_{i-1}, \theta)$ and $R(m_i, \theta)$ can be determined by (4.21) and (4.22):

$$R(m_{i-1}, \theta) = R^{-1}(l, \phi)\, R(n_i, \psi_i) \, , \qquad (4.26)$$
$$R(m_i, \theta) = R(n_i, \psi_i)\, R^{-1}(l, \phi) \, . \qquad (4.27)$$
We can determine $m_{i-1}$, $m_i$ and $\theta$ by (4.26) and (4.27), because $n$ and $\theta$ can be determined from $R(n, \theta) = [r_{ij}]$ [41]:

$$\theta = \pm\cos^{-1}\left(\frac{r_{11} + r_{22} + r_{33} - 1}{2}\right) \, ,$$
$$n = \frac{[\,r_{32} - r_{23}, \; r_{13} - r_{31}, \; r_{21} - r_{12}\,]^t}{\big\|[\,r_{32} - r_{23}, \; r_{13} - r_{31}, \; r_{21} - r_{12}\,]^t\big\|} \, .$$
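As a small illustration of reading the axis and angle back from a rotation matrix (a sketch assuming the rotation angle is strictly between 0 and $\pi$, so the denominator does not vanish; not the authors' code):

```python
import numpy as np

def axis_angle_from_rotation(R):
    """Recover (n, theta) from R = R(n, theta), assuming 0 < theta < pi."""
    theta = np.arccos((np.trace(R) - 1.0) / 2.0)
    v = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    n = v / np.linalg.norm(v)
    return n, theta
```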
Thus, we get the following theorem.
Theorem 5. The precessional vector, precessional angular frequency, body axes and body rotation angular frequency which define the precession part of the LCAM model can all be determined from three consecutive two-view motions, or four consecutive image frames. □

In addition to these basic parameters, which uniquely determine the motion of the model, some other parameters can also be determined from these basic parameters.
For example, the generating angles $\alpha$ and $\beta$ of the fixed cone and the rolling cone, respectively, in Fig. 8 can also be determined from $l$, $\phi$, $m_i$, $\theta$ and Eq. (4.19).

4.5. Estimation and Prediction
The LCAM model is applied to subsequences of the images successively. The parameters of the model are estimated for every (overlapping) subsequence. The estimated model parameters can then be used to describe the current local motion. The following questions can be answered. Is there precession? If so, what are the precession parameters? What are the current or previous body vectors? What is the body rotation angular frequency? What is the probable motion for the next several time intervals? What are the probable locations of the feature points at the next several time instants? If the moving object is occluded in some of the previous image frames, what are the motion and the locations of these feature points during that time period? The number of frames covered by an LCAM model can be made adaptive to the current motion. The number can be changed continuously to cover as many frames as possible, so long as the constant angular momentum assumption is approximately true during the time period to be covered. The value of the number of frames chosen can be based on the accuracy with which the model describes the current set of consecutive frames. The residuals of least-squares solutions and the variances of the model parameter samples indicate the accuracy. The noise level also affects the residuals and the variances of parameter samples. However, the noise level is relatively constant or can be measured. The resolution of the cameras and the viewing angle covering the object generally determine the noise level. The noise can be smoothed by determining the best time intervals and the number of frames covered by the model, according to the current motion. Because the LCAM model is relatively general, the time interval an LCAM model can cover is expected to be relatively long in most cases.

The following part deals with the estimation of model parameters using overdetermination. Although one can derive a formulation for the minimum variance estimator here, the computation of the estimates requires iterations. We will discuss this optimal solution in Section 4.7. Here we give a closed-form solution that uses overdetermination. The solution can be directly computed without resorting to iterations. After finding the two-view rotation axis vectors $n_1, n_2, \ldots, n_f$, the precessional vector $l$ should be orthogonal to $n_2 - n_1, n_3 - n_2, \ldots, n_f - n_{f-1}$. However, because of noise, this may not be true. Thus, we find $l$ such that the sum of squares of the projections of $l$ onto $n_2 - n_1, n_3 - n_2, \ldots, n_f - n_{f-1}$ is the smallest. Let

$$A = [\,n_2 - n_1, \; n_3 - n_2, \; \ldots, \; n_f - n_{f-1}\,]^t \, .$$
We are to find the unit vector $l$ from

$$\min_{l} \|A l\| \, , \quad \text{subject to } \|l\| = 1 \, . \qquad (4.28)$$
The solution $l$ of (4.28) is the unit eigenvector corresponding to the smallest eigenvalue of $A^t A$. Let the precessional angular frequency determined from (4.24) and (4.25) be $\phi_i$. The precessional angular frequency $\phi$ of the model can be estimated by the mean of the $\phi_i$. Let the body rotation angular frequency determined by (4.27) be $\theta_i$. The body rotation angular frequency of the model can be estimated by the mean of the $\theta_i$. Body vectors are estimated by averaging two consecutive two-view motions using (4.26) and (4.27), respectively. The two-view angular frequency can also be estimated by the mean of the $\psi_i$.
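A compact numpy sketch (not the authors' code) of the closed-form estimation just described; taking the smallest right singular vector of $A$ is one implementation route to the smallest eigenvector of $A^t A$.

```python
import numpy as np

def estimate_precession(ns):
    """Estimate the precessional vector l and angular frequency phi.

    ns : (f, 3) array of two-view rotation axis unit vectors n_1..n_f
    """
    A = np.diff(ns, axis=0)                      # rows are n_{i+1} - n_i
    _, _, Vt = np.linalg.svd(A)
    l = Vt[-1]                                   # solution of Eq. (4.28)

    phis = []
    for i in range(1, len(ns)):
        u, v = np.cross(ns[i - 1], l), np.cross(ns[i], l)
        c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        phi = np.arccos(np.clip(c, -1.0, 1.0))   # Eq. (4.24)
        phi *= np.sign(np.dot(np.cross(ns[i - 1], ns[i]), l))   # sign, Eq. (4.25)
        phis.append(phi)
    return l, np.mean(phis)                      # model phi = mean of the phi_i
```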
According to the motion model, the $(f+1)$st two-view rotation axis vector $n_{f+1}$ is $R(l, (f - i + 1)\phi)\, n_i$ for any $1 \le i \le f$. In the presence of noise, we use the mean over all previous two-view motions to predict the next two-view rotation axis vector:

$$n_{f+1} = f^{-1} \sum_{i=1}^{f} R(l, (f - i + 1)\phi)\, n_i \, .$$
From (4.9) the next position of a point $x_f$ in the $f$th frame is predicted by

$$x_{f+1} = R(n_{f+1}, \psi)\,(x_f - Q_f) + Q_{f+1} \, ,$$

where $Q_f$ and $Q_{f+1}$ are determined by (4.7). The prediction can be made for $p \ge 2$ frames ahead by applying these equations successively.
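A sketch of the axis-prediction step (assumptions: the Rodrigues helper below stands in for Eq. (4.20), and l and phi come from the estimation step above; this is an illustration, not the authors' code):

```python
import numpy as np

def _rot(n, theta):
    """Rodrigues rotation about unit axis n by angle theta."""
    K = np.array([[0, -n[2], n[1]], [n[2], 0, -n[0]], [-n[1], n[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def predict_next_axis(ns, l, phi):
    """Predict n_{f+1} as the mean of R(l, (f - i + 1) phi) n_i over i = 1..f."""
    f = len(ns)
    preds = [_rot(l, (f - i) * phi) @ ns[i] for i in range(f)]  # 0-based i
    n_next = np.mean(preds, axis=0)
    return n_next / np.linalg.norm(n_next)      # renormalize the averaged axis
```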
If the object was occluded in parts of the image sequence, the positions and orientations of the object as well as the locations of the feature points on the object
can be recovered by an interpolation similar to the prediction procedure discussed above. For the motion of the rotation center, occlusion just means that some rows in the coefficient equations are missing. The solution can still be found if we have enough rows. For the precession part of the motion, the interpolation can be made in a way similar to prediction or extrapolation. When making an interpolation we use both the “history” and the “future” of the missing part. For prediction, only the “history” is available. Furthermore, we can also extrapolate backwards to find “history”, i.e. to recall what has not been seen before. The essential assumption is that the motion is smooth.

4.6. Monocular Vision
For the monocular case, 3-D positions can be predicted only up to a global scale factor from two images. When more interframe motions are involved, how many scale factors cannot be determined? The answer is one, provided that at least one point is visible among every three consecutive image frames. In other words, from point correspondences over a monocular image sequence, the depth of any visible point and the translation between any two image frames can be determined up to a scale factor $c$. Therefore, although the 3-D positions of the points can only be determined up to a scale factor, the image coordinates of the points can be predicted without knowing the scale factor, since the scale factor $c$ is canceled out in image coordinates. In the following, we derive these results. Suppose a point $P$ is located at $\mathbf{x}_i = (x_i, y_i, z_i)^t$ at time $t_i$, $i = 0, 1, 2, \ldots$. The image vectors $\bar{\mathbf{x}}_i$ are defined by $\mathbf{x}_i = z_i \bar{\mathbf{x}}_i$. From image frame $f_i$ to image frame $f_{i+1}$ the motion, called the $i$th motion, is a rotation represented by $R_i$ followed by a translation represented by $T_i$, for $i = 1, 2, \ldots, n$. As discussed in Section 3, we can determine the relative depths in (4.29) for $i = 1, 2, \ldots, n$, assuming the point $P$ is visible from $t_0$ to $t_n$. We then have (4.30).
Multiplying both sides for $i = 1$ to $i = k$ yields (4.31).
Letting $\|T_1\| \equiv s$ be the unknown scale factor, (4.31) gives (4.32). That is, the norm of the $k$th translation, $k = 1, 2, \ldots, n$, is a product of the scale factor $s$ and a number (4.33) which can be determined from the relative depths. Therefore, although many interframe motions are involved in a long monocular sequence, only one unknown scale factor exists, instead of many. If the norm of the translation in any interframe motion is known, $s$ is determined from (4.31), and then motion and structure in the entire sequence are determined completely. Similarly, if the absolute depth of a point at a time instant is known, the norm of the translation vector is determined from (4.29), and then so are motion and structure in the entire sequence. Now what happens if there exists no point that is visible through the entire sequence? We can see from (4.30) that (4.31) and (4.32) still hold true if each factor $Z_i/Z_i'$ corresponds to a different point. For $Z_i$ and $Z_i'$ to be determined for a point $P$, the point $P$ must be visible from $t_{i-1}$ to $t_{i+1}$, i.e. in three consecutive image frames. Therefore, as long as through any three consecutive frames there is at least one point visible, the norm of any translation vector can be completely determined from $s$ as in (4.32). Now we have proved the following theorem:
Theorem 6. Suppose the rotation matrix and the direction of the translation between every consecutive monocular image pair can be determined. Then the magnitude of any interframe translation and the depths of visible points can be determined up to the same scale factor $s$, provided that through every three consecutive images at least one point is visible. □

Since the depth of every point can be determined up to the same scale factor $s$, the 3-D position of the points can be predicted up to $s$ as we discussed in the previous section. Since the image coordinates of a point cancel out this scale factor, the image coordinates of the point can be predicted without knowing the scale factor. For example, we can let $s = 1$ and compute the predicted 3-D position in the next frame based on the previous motion trajectory. Then, by projecting the predicted 3-D point onto the image plane, we get the predicted image position of the point. This predicted position is the same for any positive $s$. In the presence of noise, the error in a single point may significantly influence the accuracy of the product in (4.33). Therefore, the product should be determined
based on many points instead of one. For example, the value of $Z_i/Z_i'$ can be averaged over many points before it is used to evaluate (4.33).
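A tiny numeric check (not from the chapter) of the statement that the predicted image position is independent of the unknown positive scale $s$, assuming a simple pinhole projection $(x/z, y/z)$:

```python
import numpy as np

def project(p):
    """Pinhole projection of a 3-D point onto the image plane (assumed model)."""
    return p[:2] / p[2]

p = np.array([0.8, -0.3, 4.0])        # a predicted 3-D point computed with s = 1
for s in (0.5, 1.0, 7.3):             # any positive global scale factor
    assert np.allclose(project(s * p), project(p))
```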
4.7. Optimization

The algorithm discussed above gives closed-form solutions for the model parameters. A simple least-squares solution is obtained when redundant data are available. Those solutions are generally good preliminary estimates. However, since the statistics of the noise distribution are not employed, the solution is not optimal. If iterations are allowed, one can obtain higher accuracy through optimization. The optimization for more than two image frames is very similar to the case of two frames (either monocular or stereo); we just need a natural extension. The two-step approach can be used. First, the algorithm discussed above provides initial guesses. Then the initial guesses are improved through an iterative optimization. Similar to the notation in Section 3.4, let $m$ denote the motion model parameter vector, $x = (x_1, x_2, \ldots, x_n)$ denote the structure of the scene at time $t_0$, $u_{ijk}$ denote the observed $i$th image point in the $j$th camera (e.g. left: 1st, right: 2nd) at time $t_k$ ($k = 0, 1, \ldots, f$), and $h_{ijk}(m, x)$ denote the computed projection of the $i$th point in the $j$th camera at time $t_k$. Suppose that the additive noise in the coordinates of the image points is uncorrelated,

$$u_{ijk} = h_{ijk}(m, x) + b_{ijk} \, .$$

According to minimum variance estimation, the optimal estimate of $m$ and $x$ is the one that minimizes the image plane errors:

$$\sum_{i=1}^{n} \sum_{j=1}^{2} \sum_{k=0}^{f} \|u_{ijk} - h_{ijk}(m, x)\|^2 \, .$$
To keep the dimension of the iterative search space low, only the independent model parameters should be included in $m$. They include the precessional velocity vector defined as $\phi l$, the 0th body velocity vector defined as $\theta m_0$, and the coefficient vectors $\{a_i\}$ in the coefficient equations. Those parameters uniquely determine the two-view motion parameters through (4.21) and (4.22). Since the optimal 3-D position of a point $x_i$ at time $t_0$ depends only on the model parameters and the observations of the point in the image sequences, the optimal $x_i$ can be determined by minimizing

$$\sum_{j=1}^{2} \sum_{k=0}^{f} \|u_{ijk} - h_{ijk}(m, x_i)\|^2 \, . \qquad (4.34)$$

The solution for $x_i$ that minimizes (4.34) can be estimated in closed form as in Section 3.4. Then, the space decomposition technique reduces the minimization to one over the model parameters $m$ alone.
This decomposition significantly reduces the dimension of the search space. If the two-view motion parameters used to compute the model parameters are optimized, the initial guess of $m$ is usually very good. This will significantly reduce the number of iterations in computing the optimal $m$. It is clear that whether using monocular vision or any multi-ocular vision, we just need to change the upper limit of $j$ accordingly. Computationally, the minimization of the above function can be performed using a recursive-batch approach [43], which is a revised version of the Kalman filtering approach [44]. The recursive-batch approach is useful for improving the performance and efficiency when the image sequence is very long, or virtually infinite.
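To convey the overall structure, here is a hedged sketch of the joint image-plane-error minimization (not the chapter's formulation): the projection h and the observation container are placeholders, and scipy.optimize.least_squares is only one possible minimizer. In the chapter's space decomposition, each $x_i$ would instead be solved in closed form for a fixed $m$, so that only $m$ enters the iterative search.

```python
import numpy as np
from scipy.optimize import least_squares

def refine_model(m0, x0, observations, h):
    """Iteratively minimize the summed image-plane error of Eq. (4.34).

    m0           : initial model parameter vector (from the closed-form step)
    x0           : (n, 3) initial 3-D point positions at time t0
    observations : dict mapping (i, j, k) -> observed 2-D image point u_ijk
    h            : placeholder projection h(m, x_i, j, k) -> predicted 2-D point
    """
    n = len(x0)

    def residuals(params):
        m = params[:len(m0)]
        x = params[len(m0):].reshape(n, 3)
        res = []
        for (i, j, k), u in observations.items():
            res.extend(u - h(m, x[i], j, k))
        return np.asarray(res)

    init = np.concatenate([np.asarray(m0), np.asarray(x0).ravel()])
    sol = least_squares(residuals, init)
    return sol.x[:len(m0)], sol.x[len(m0):].reshape(n, 3)
```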
References

[1] B. K. P. Horn and B. G. Schunck, Determining optical flow, Artif. Intell. 17 (1981) 185-203.
[2] A. M. Waxman, An image flow paradigm, in Proc. Workshop on Computer Vision: Representation and Control, Annapolis, MD (IEEE Computer Society Press, Washington D.C., 1984) 49-57.
[3] H.-H. Nagel and W. Enkelmann, An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences, IEEE Trans. Pattern Anal. Mach. Intell. 8 (1986) 565-593.
[4] J. K. Kearney, W. B. Thompson and D. L. Boley, Optical flow estimation: An error analysis of gradient-based methods with local optimization, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1987) 229-244.
[5] D. J. Heeger, Optical flow using spatiotemporal filters, Int. J. Comput. Vision 2 (1987) 279-302.
[6] S. T. Barnard and W. B. Thompson, Disparity analysis of images, IEEE Trans. Pattern Anal. Mach. Intell. 2 (1980) 333-340.
[7] L. Dreschler and H.-H. Nagel, Volumetric model and 3-D trajectory of a moving car derived from monocular TV frame sequences of a street scene, Comput. Graph. Image Process. 20 (1982) 199-228.
[8] E. C. Hildreth, The Measurement of Visual Motion (MIT Press, Cambridge, MA, 1983).
[9] D. Marr and T. Poggio, A theory of human stereo vision, Proc. Royal Society of London B204 (1979) 301-328.
[10] J. E. W. Mayhew and J. P. Frisby, Psychophysical and computational studies towards a theory of human stereopsis, Artif. Intell. 17 (1981) 349-385.
[11] W. E. L. Grimson, From Images to Surfaces: A Computational Study of the Human Early Visual System (MIT Press, Cambridge, MA, 1981).
[12] Y. Ohta and T. Kanade, Stereo by intra- and inter-scanline search using dynamic programming, IEEE Trans. Pattern Anal. Mach. Intell. 7 (1985) 139-154.
[13] N. Ayache and B. Faverjon, Efficient registration of stereo images by matching graph descriptions of edge segments, Int. J. Comput. Vision 1 (1987) 107-131.
[14] H. P. Moravec, Towards automatic visual obstacle avoidance, in Proc. 5th Int. Joint Conf. on Artificial Intelligence (William Kaufmann, Los Altos, CA, 1977).
[15] F. Glazer, G. Reynolds and P. Anandan, Scene matching by hierarchical correlation, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (IEEE Computer Society Press, Washington D.C., 1983) 432-441.
[16] T. D. Sanger, Stereo disparity computation using Gabor filters, Biol. Cybern. 59 (1988) 405-418.
[17] A. D. Jepson and M. R. M. Jenkin, The fast computation of disparity from phase differences, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Diego, CA (IEEE Computer Society Press, Washington D.C., 1989) 398-403.
[18] J. Weng, A theory of image matching, in Proc. Third Int. Conf. on Computer Vision, Osaka, Japan (IEEE Computer Society Press, Washington D.C., 1990) 200-209.
[19] J. J. Hwang and E. L. Hall, Matching of featured objects using relational tables from stereo images, Comput. Graph. Image Process. 20 (1982) 22-42.
[20] W. K. Gu, J. Y. Yang and T. S. Huang, Matching perspective views of a polyhedron using circuits, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1987) 390-400.
[21] H. S. Lim and T. O. Binford, Stereo correspondence: A hierarchical approach, in Proc. Image Understanding Workshop (Science Applications Corp., McLean, VA, 1987) 234-241.
[22] J. Weng, N. Ahuja and T. S. Huang, Two-view matching, in Proc. 2nd Int. Conf. on Computer Vision (IEEE Computer Society Press, Washington D.C., 1988) 64-73. Also, Matching two perspective views, IEEE Trans. Pattern Anal. Mach. Intell. 14 (1992) 806-825.
[23] R. Jain and H.-H. Nagel, On the analysis of accumulative difference pictures from image sequences of real world scenes, IEEE Trans. Pattern Anal. Mach. Intell. 1 (1979) 206-214.
[24] J. W. Roach and J. K. Aggarwal, Determining the movement of objects from a sequence of images, IEEE Trans. Pattern Anal. Mach. Intell. 2 (1980) 554-562.
[25] A. R. Bruss and B. K. Horn, Passive navigation, Comput. Vision Graph. Image Process. 21 (1983) 3-20.
[26] G. Adiv, Determining three-dimensional motion and structure from optical flow generated by several moving objects, IEEE Trans. Pattern Anal. Mach. Intell. 7 (1985) 348-401.
[27] A. Mitiche and J. K. Aggarwal, A computational analysis of time-varying images, in T. Y. Young and K. S. Fu (eds.), Handbook of Pattern Recognition and Image Processing (Academic Press, New York, 1986).
[28] H. C. Longuet-Higgins, A computer program for reconstructing a scene from two projections, Nature 293 (1981) 133-135.
[29] R. Y. Tsai and T. S. Huang, Uniqueness and estimation of 3-D motion parameters of rigid bodies with curved surfaces, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 13-27.
[30] B. L. Yen and T. S. Huang, Determining 3-D motion and structure of a rigid body using the spherical projection, Comput. Vision Graph. Image Process. 21 (1983) 21-32.
[31] J. Q. Fang and T. S. Huang, Some experiments on estimating the 3-D motion parameters of a rigid body from two consecutive image frames, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 547-554.
[32] X. Zhuang and R. M. Haralick, Rigid body motion and the optic flow image, in Proc. IEEE 1st Conf. on Artificial Intelligence Applications, Denver, CO (IEEE Computer Society Press, Washington D.C., 1984) 366-375.
[33] X. Zhuang, T. S. Huang, N. Ahuja and R. M. Haralick, A simplified linear optic flow-motion algorithm, Comput. Vision Graph. Image Process. 42 (1988) 334-344.
[34] A. M. Waxman, B. Kamgar-Parsi and M. Subbarao, Closed-form solutions to image flow equations for 3-D structure and motion, Int. J. Comput. Vision 1 (1987) 239-258.
[35] O. D. Faugeras, F. Lustman and G. Toscani, Motion and structure from point and line matches, in Proc. Int. Conf. on Computer Vision, London, UK (IEEE Computer Society Press, Washington D.C., 1987) 25-34.
[36] J. Weng, T. S. Huang and N. Ahuja, Error analysis of motion parameter determination from image sequences, in Proc. 1st Int. Conf. on Computer Vision, London, UK (IEEE Computer Society Press, Washington D.C., 1987) 703-707.
[37] H. C. Longuet-Higgins, The reconstruction of a scene from two projections - configurations that defeat the 8-point algorithm, in Proc. IEEE 1st Conf. on Artificial Intelligence Applications, Denver, CO (IEEE Computer Society Press, Washington D.C., 1984) 395-397.
[38] J. Weng, T. S. Huang and N. Ahuja, Motion and structure from two perspective views: Algorithms, error analysis and error estimation, IEEE Trans. Pattern Anal. Mach. Intell. 11 (1989) 451-476.
[39] G. R. Fowles, Analytical Mechanics, 3rd ed. (Holt, Rinehart and Winston, New York, 1977).
[40] W. D. Macmillan, Dynamics of Rigid Bodies (McGraw-Hill, New Jersey, 1936).
[41] O. Bottema and B. Roth, Theoretical Kinematics (North-Holland, New York, 1979).
[42] J. Weng, T. S. Huang and N. Ahuja, 3-D motion estimation, understanding and prediction from noisy image sequences, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1987) 370-389.
[43] J. Weng, P. Cohen and N. Rebibo, Motion and structure estimation from stereo image sequences, IEEE Trans. Robotics and Automation 8, 3 (1992) 362-382.
[44] G. S. Young and R. Chellappa, 3-D motion estimation using a sequence of noisy stereo images: Models, estimation, and uniqueness results, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 735-759.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 387-424
Eds. C. H. Chen, L. F. Pau and P. S. P. Wang
© 1998 World Scientific Publishing Company
CHAPTER 2.6

SIGNAL-TO-SYMBOL MAPPING FOR LASER RANGEFINDERS
KENONG WU* and MARTIN D. LEVINE
Centre for Intelligent Machines & Department of Electrical Engineering, McGill University, Montréal, P.Q., Canada H3A 2A7

* Current address: Integrated Surgical Systems, 829 West Stadium Lane, Sacramento, CA 95831, USA.

A new approach for computing qualitative part-based descriptions of 3-D objects is presented. The object descriptions are obtained in two steps: object segmentation into parts and part model identification. Beginning with single- or multi-view range data of a 3-D object, we simulate the charge density distribution over an object's surface, which has been tessellated by a triangular mesh. We detect the deep surface concavities by tracing local charge density minima and then decompose the object into parts at these points. The individual parts are then modeled by parametric geons. The latter are seven qualitative shapes, each of which is formulated by a restricted globally deformed superellipsoid. Model recovery is performed by fitting all parametric geons to a part and selecting the best model for the part, based on the minimum fitting residual. A newly defined objective function and a fast global optimisation technique are employed to obtain robust model fitting results. Experiments demonstrate that this approach can successfully recover qualitative shape models from input data, especially when part shapes are not fully consistent with model shapes. The resultant object descriptions are well suited for symbolic reasoning and fast object recognition.

Keywords: Computer vision, 3-D shape representation, object segmentation, object description, range data, shape characterisation, electrical charge density distribution, volumetric primitives, parametric geons, superellipsoids.
1. Introduction
A major problem in machine vision is the development of an object recognition system which is not based on accurately known models, but rather on coarse, qualitative ones representing classes of objects. Sensor data usually only provide point-by-point measurements such as the distance from the sensor to objects in the viewed scene. Thus, for example, these numerous and unstructured data are not appropriate for representing the environment for a mobile robot executing quick and complicated tasks. On the one hand, such a robot must make use of symbolic models which are concise and organised descriptions of the structure of the world. On the other, the robot must transform sensor data to symbolic descriptions which
are consistent with the models in a stored database and support efficient model matching. Thus this signal-to-symbol mapping is at the heart of any functioning robot carrying out complex tasks. In this chapter, we present a new approach to 3-D shape representation of objects based on parts. The input to our system is a single range image or multiple range images of an object. Our task is twofold. The first is to segment the object into individual parts. The second is to select a particular part model which describes the best shape approximation of each segmented object part from a few predefined model candidates. The segmentation approach works as follows. An object to be segmented is viewed as a charged perfect conductor. It is a well known physical fact that electrical charge on the surface of a conductor tends to accumulate at sharp convexities and vanish at sharp concavities. Thus object part boundaries, which are usually denoted by a sharp surface concavity [1], can be detected by locating surface points exhibiting local charge density minima. Beginning with range data of a 3-D object, we tessellate the object surface with a closed triangular mesh and simulate the charge density distribution over the surface. We then detect the deep surface concavities by tracing local charge density minima and decompose the object into parts at these boundary points. The segmentation method efficiently deals with certain thorny problems in traditional approaches, such as unrealistic assumptions about surface smoothness and instability in the computation of surface features. Part model recovery employs parametric geons as the part models. These are seven qualitative shapes associated with pose and attribute parameters governing model size, tapering rate and bending curvature. Parametric geons are formulated in terms of restricted deformed superellipsoids [2]. The equations of the models provide explicit global constraints on the qualitative shape of the part models. This permits the algorithm to directly compare model shapes with a given shape and restricts the part model to a predefined shape family, no matter how the input data vary. We obtain part models by fitting all parametric geons to a part and selecting a particular model based on the minimum fitting residual. Thus our approach implements explicit shape verification of the resultant part models, obtaining them more robustly and accurately than in previous work. We begin this chapter with a review of previous related research on both part segmentation and model recovery, and make a comparison with our work in Section 2. Then we present the part segmentation method in Section 3. We describe the parametric geon models and the model recovery approach in Sections 4 and 5, respectively. In Section 6 we report experimental results with both single- and multi-view range data. The characteristics of our approach are further discussed in Section 7 and conclusions are drawn in the last section.

2. Related Work

The significance of object descriptions at the part level is well understood [1,3-6]. Many objects consist of parts or components which have perceptual salience
and reflect the natural structure in the world [7]. Building part-based object descriptions for various tasks has been a major strategy in computer vision for many years [3,7-15]. To obtain part-based descriptions, one needs to address the following two points: (1) Which are the parts? and (2) What is the model for each of the parts? The former is the issue of object segmentation into parts (part localisation), while the latter deals with part model recovery (part identification).

2.1. Object Segmentation into Parts

The problem of object segmentation into parts can be stated as follows: Given a set of data points that represent a multi-part object, classify these data into meaningful subsets, each of which belongs to a single physical part of the object. By a physical part, we mean a portion of the object surface which can be distinguished perceptually, geometrically or functionally from the rest of the object. Definitions of parts are discussed in [1,4,6,16]. Part segmentation algorithms can be categorised as being shape- or boundary-based. Shape-based approaches decompose objects into parts by measuring the shape similarity between image data and an arrangement of predefined part models. For example, spheres [17-19], quadrics [20], and superellipsoids [9,21-23] have been used for part segmentation. These approaches first hypothesise an object configuration composed of part models, and then evaluate a measure of the similarity between the hypothesis and the true object shape. If the measure is worse than a preselected threshold, another hypothesis is generated and evaluated until the similarity measure is less than the threshold. The last hypothesis is then adopted as the desired object segmentation. Another type of shape-based segmentation uses an aspect hierarchy of part shapes [13,24,25]. These methods commonly employ a finite but relatively large set of distinctive part models. Each model exhibits a restricted number of configurations of surface patches in all possible views. Thus, even for all models, the number of surface configurations in all possible views is quite limited. The procedure is to first identify surface patches using region growing [26] or edge detection [27], and then group surface patches into a potential part according to the permissible surface configurations. For perfect object shapes, shape-based approaches can be quite efficient. However, problems arise when the part shapes are not very consistent with the available model shapes. This produces non-unique or incorrect part segmentations. Boundary-based methods segment objects into parts by seeking part boundaries instead of complete shapes. This type of approach can segment an object without incorporating part shape information. For example, Koenderink and Van Doorn [4] have proposed parabolic lines as part boundaries. At a parabolic line, one of the principal curvatures [28] of the surface changes from being convex to concave. Rom and Medioni [29] have performed part decomposition based on this theory using range data as input. The drawback of this scheme is that parabolic lines cannot indicate part boundaries on cylindrical surfaces [1]. Also, since this method is based
on the classification of regions of positive and negative Gaussian curvature, it is not clear how to apply it to objects containing planar surfaces. Hoffman and Richards [1] have proposed segmenting objects into parts at deep surface concavities. Ferrie and Levine have analysed surface principal curvatures to locate surface concavity and segment range data containing objects [30]. This strategy has also been applied to the segmentation of edge-junction graphs of object range images [31]. In addition to range data, cross-sections of 3-D objects have also been taken as input [32-34]. In this case, part segmentation is performed by examining the connectivity and shape of cross-sections. Such an approach requires a voxel-based coordinate system to relate the data across the cross-sections. Our approach to part segmentation is consistent with the boundary-based approaches. As indicated above, all previous methods [11,29,30,35] are based on surface curvature, a geometrical property. In contrast, we employ a physical property, the simulated charge density distribution over an object surface, to find part boundaries. This approach has some distinguishing characteristics and advantages. Since the curvature computation involves the first and second partial derivatives of the surface, an assumption on smoothness of the object surface is mandatory [36]. Curvature computation is also very sensitive to local noise [37] and a smoothing operation on the range data is usually required. Alternatively, a larger area or scale may be employed to reduce noise effects. However, selecting a suitable scale is in general a difficult problem. Also, a larger scale will increase the computational time. In contrast, our approach solves an integral equation rather than performing surface curvature computations and thus does not require smoothness of the object surface. Since the charge density computation uses global data, the influence of noise is reduced. Furthermore, although the charge density must be computed on a closed triangular mesh, the variables are restricted only to the surface of objects. Thus, one does not require a voxel-based coordinate system or need to compute the object interior. Because of this, our approach involves many fewer unknown variables than voxel-based approaches [32,33,38].
2.2. Part Identification

The problem of part identification can be stated as follows: Given a set of data points on a particular part and all candidate part models, find the model which is the best description of that part. On the basis of psychological experimentation, Biederman's theory of Recognition-by-Components (RBC) [39] proposed a modest set of volumetric primitives, called geons, as qualitative descriptions of object parts. The theory postulated that if an arrangement of a few geons could be recovered from the line drawings of an object, it could be quickly recognised with great tolerance for variations in viewpoint and shape. Geons have therefore been proposed as a basis for recognition of 3-D objects observed in single 2-D views [13-15,40]. The idea of using a finite set of qualitative shape primitives as part models is also adopted here.
Nearly all research has focused on the recovery of geon models from complete edge maps or ideal line drawings which have depicted objects whose parts are instances of geon models [41]. However, in general, “clean” or complete line drawings of objects cannot be obtained due to the colour and texture of object surfaces or complex illumination configurations. Because of this, and also for practical reasons, some research has focussed on data obtained from laser rangefinders [24,25,31]. In all of the latter, part descriptions were determined in a bottom-up fashion, inferring global properties by aggregating local features. This type of approach will fall short when local features do not fully satisfy the exact definitions of the geons. Clearly, any computer vision system which successfully recovers qualitative descriptions would have to approximate various object shapes by ideal shape models. It would seem that the most popular volumetric model for part shape approximation is the superellipsoid, a parameterised family of closed surfaces [42]. The implicit equation of superellipsoids is given as follows [42]:

$$\left(\left(\frac{x}{a_1}\right)^{2/\epsilon_2} + \left(\frac{y}{a_2}\right)^{2/\epsilon_2}\right)^{\epsilon_2/\epsilon_1} + \left(\frac{z}{a_3}\right)^{2/\epsilon_1} = 1 \, .$$
Here $\epsilon_1$ is the “squareness” parameter in the north-south direction; $\epsilon_2$ is the “squareness” parameter in the east-west direction; $a_1$, $a_2$, $a_3$ are scale parameters along the $x$, $y$, $z$ axes, respectively. The advantage of the superellipsoid model is that, by using only two more parameters than ellipsoids, it can describe a large variety of volumetric shapes. In addition, its mathematical definition provides a useful global constraint which restricts its shape during model recovery. The latter adapts and molds the model to the object shape, thereby reducing the influence of missing data, image noise and minor variations in shape. In this way, an approximate shape description can be obtained efficiently [2,9,11,43,44]. The fundamental distinction between superellipsoids and geons is that the latter is a prescribed set of individual primitive component shapes which are qualitatively discriminable, thereby supporting fast object recognition as indicated by the RBC theory [39]. The only previously reported attempt to derive geons from superellipsoids is due to Raja and Jain [45], in a continuation of previous work on superellipsoid model recovery [2,9,12]. They explored the recovery of 12 geons from single-view range images by classifying the actual parameters of globally-deformed superellipsoids [2]. Although they obtained 89% accuracy for objects with smooth surfaces, they found that the estimated parameters were very sensitive to viewpoint, noise, and objects with coarse surfaces. They also noticed that their major classification errors were due to the misclassification of straight and curved geon cross-sections. This is mainly caused by the Euclidean distance measure they used for classifying part shapes. The problem can be easily illustrated by Fig. 1, which shows a series of superellipses representing the cross-sections of superellipsoids. The shape parameter of the superellipse changes uniformly from 0.1 to 1. Accordingly, the shape changes gradually (row by row) from a square to a circle.
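A small evaluation of the superellipsoid inside-outside function (a sketch based on the implicit equation quoted above, not the authors' code). Values below, equal to, or above 1 indicate points inside, on, or outside the surface; sweeping $\epsilon_2$ from 0.1 to 1 mirrors the square-to-circle cross-section series of Fig. 1.

```python
import numpy as np

def superellipsoid_f(p, a1, a2, a3, eps1, eps2):
    """Inside-outside function of a superellipsoid at point p = (x, y, z)."""
    x, y, z = np.abs(p)                      # the canonical shape is symmetric
    xy = (x / a1) ** (2.0 / eps2) + (y / a2) ** (2.0 / eps2)
    return xy ** (eps2 / eps1) + (z / a3) ** (2.0 / eps1)

# The point (0.9, 0.9, 0) lies inside a square cross-section (small eps2)
# but outside a circular one (eps2 = 1).
for eps2 in (0.1, 0.5, 1.0):
    print(eps2, superellipsoid_f(np.array([0.9, 0.9, 0.0]), 1, 1, 1, 1.0, eps2))
```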
Fig. 1. Classification of cross sections of superellipsoids. A series of shapes of a superellipse is given row by row. The shape parameter of the superellipse changes from 0.1 to 1, and consequently, its shape changes from a square to a circle. The number under each figure indicates the value of the shape parameter. The task is to classify these shapes into two groups, square-like shapes and circle-like shapes. If the classification were based upon the shape parameter, the shapes in the first three rows would be classified into square-like shapes. However, human perception seems to classify more of these shapes into square-like shapes.
The number under each figure indicates the value of the associated shape parameter. If these shapes were to be classified into two groups^a based on the Euclidean distance of the shape parameter, the top three rows would be classified into one group and the rest into the other. However, we clearly observe that the shapes in at least the first four rows are more similar to the square than to the circle. Thus, there is a significant difference in shape discrimination between a Euclidean distance-based method and human perception. Another reason is the ambiguity between superellipsoid shapes and their associated parameters, as noted by Solina and Bajcsy [2]. According to RBC theory [39], “the memory of a slight (sic) irregular form would be coded as the closest regularized neighbor of that form”.

^a Accordingly, the corresponding superellipsoids are classified as one of two geons (a cuboid and a cylinder).

Our work is the
first attempt to accomplish this. The importance of this approach is the ability to achieve an explicit shape verification of the resultant part models. The equations of the parametric geons provide explicit global constraints on the qualitative shape of part models. This constraint permits the algorithm to directly compare the model shapes with a given part shape. The resultant shape must be one of the predefined parametric geon shapes, no matter how the input data vary. Therefore, our approach can compute the qualitative shape models of parts reliably from data representing parts whose shapes are not fully consistent with their models. In our research, we define a new objective function which measures (i) the Euclidean distance from data points to model surfaces and (ii) the squared difference between the normal vectors of a model and the object. Model fitting is performed by minimising this function using a stochastic global optimisation approach (Very Fast Simulated Re-annealing), which statistically guarantees finding the global minimum. A similar approach has been presented by Yokoya et al. [44]. The first term of their objective function was the squared Euclidean distance, and their optimisation technique was classical simulated annealing [46]. We will show that our objective function and optimisation approach significantly improve the efficiency of the model fitting procedure.
3. Part Segmentation

3.1. Motivation

According to Hoffman and Richards [1], the concept of a part is based upon a particular regularity in nature - transversality [47]. The theory states that when two arbitrarily shaped surfaces are made to interpenetrate, they always meet at a contour of concave discontinuity of their tangent planes. This is illustrated in Fig. 2 by ellipsoids. Transversality has been widely applied to part segmentation [11,30,31,34,35]. Interestingly, there exists an analogy between the singularity in surface tangents and the singularity in electrostatics. When a charged conductor with an arbitrary
Fig. 2. Transversality. Two ellipsoids joined together create a contour of concave discontinuity at their intersection.
Fig. 3. Charge distribution over an object. The crosses represent charge on the surface of the object. The charge density is very high and very low at sharp convexities and concavities, respectively. Thus, the object part boundary can be located at local charge density minima.
Fig. 4. Charge densities near edges. (a) An edge formed by two planes with an angle $\beta$. (b) The charge density at $P(\gamma, \beta)$.
shape is in electrostatic equilibriumlb all charge resides unevenly on the outer surface of the conductor [48], as shown in Fig. 3. The charge density is very high at sharp convex edges and corners. Conversely, almost no charge accumulates at sharp concavities. Electrical charge density at sharp edges and corners has been carefully studied by Jackson [49]. An object edge or corner is defined as a C1 discontinuity of an object surface. By ignoring secondary global effects, Jackson has derived an approximate relationship governing the charge density p at an edge formed by two conducting planes, as shown in Fig. 4. Here p is the angle between two planes defining an edge and 7 is the distance from the edge t o a point P , where the charge density is measured. It has been shown [49] that the larger p and the smaller 7 , the greater the charge density at P. For a constant 7 , the charge density increases monotonically as p increases. A theoretical singular behaviour of charge densities bThe conductor is in electrostatic equilibrium when there is no net motion of charge within the conductor.
2.6 Signal-to-Symbol Mapping for Laser Rangefinders
A theoretical singular behaviour of charge densities at edges (as γ → 0) has also been suggested [49]:

\rho(\gamma) \propto \gamma^{\pi/\beta - 1} . \qquad (3.1)
This means that the charge density is infinite, constant and zero when the angle defined by the two planes is convex, flat and concave, respectively. The singular behaviour of charge densities at corners, which is similar to that at edges, has also been investigated [49]. We have observed that at slightly smoothed edges and corners, the positions of the local extrema of charge density remain unchanged. Consequently, by assuming that a multi-part object is a charged conductor, we can detect deep surface concavities, which we have noted delineate part boundaries [1], by isolating significant charge density minima.
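For intuition, the short sketch below evaluates the exponent π/β − 1 of the power-law relation quoted above for a concave, a flat and a convex edge; the specific angles are illustrative assumptions rather than values taken from the chapter.

```python
import math

def edge_exponent(beta):
    """Exponent of the edge charge-density power law rho(gamma) ~ gamma**(pi/beta - 1)."""
    return math.pi / beta - 1.0

# Illustrative edge angles, measured in the field region outside the conductor:
for name, beta in [("concave edge", math.pi / 2),     # exponent > 0: density -> 0 at the edge
                   ("flat surface", math.pi),          # exponent = 0: density constant
                   ("convex edge", 3 * math.pi / 2)]:  # exponent < 0: density -> infinity
    print(f"{name:13s} beta = {beta:.2f} rad, exponent = {edge_exponent(beta):+.2f}")
```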
3.2. Charge Density Computation

Our physical model is the charge density distribution on a charged conductor in 3-D free space, where no other charge or conductors exist. To begin with, we list three physical facts which can be derived from physical laws.
Fact 3.1. In electrical equilibrium, any charge on an isolated conductor must reside entirely on its outer surface [50]. This means that there is no charge inside the conductor. Thus the structure within an object does not affect its charge density distribution. This fact indicates that, under these circumstances, the charge density distribution is a surface property.
Fact 3.2. The surface of any charged conductor in electrical equilibrium is an equipotential surface [50].

Fact 3.3. Conservation of Charge: Charge cannot be created or destroyed, since the algebraic sum of positive and negative charges in a closed or isolated system does not change under any circumstances [51].

These facts provide us with the conditions needed to establish mathematical equations with charge densities as their variables. Consider the electrical potential at the vector position r ∈ R³, produced by a point charge q, located at the vector position r' ∈ R³, as shown in Fig. 5. The corresponding electrical field at r can be calculated by an application of Gauss's law. Thus,

\mathbf{E}(\mathbf{r}) = \frac{q}{4\pi\epsilon_0}\,\frac{\mathbf{r}-\mathbf{r}'}{|\mathbf{r}-\mathbf{r}'|^{3}} . \qquad (3.2)
Here ε₀ is a constant, known as the permittivity of free space. The electrical potential φ(r) at r can be derived by an integration of (3.2) along the dashed line from r₀ ∈ R³, the vector position of the reference point, to r
Fig. 5. Configuration for a point charge. r is the vector position where the electrical potential is observed. r' is the vector position of the source point charge. r₀ is the vector position of the potential reference point.
Fig. 6. Configuration of charge distribution over the surface. O is the origin of the coordinate system.
(see Fig. 5):

\phi(\mathbf{r}) = -\int_{\mathbf{r}_0}^{\mathbf{r}} \mathbf{E}\cdot d\mathbf{l} + \phi(\mathbf{r}_0) = \frac{q}{4\pi\epsilon_0}\left(\frac{1}{|\mathbf{r}-\mathbf{r}'|} - \frac{1}{|\mathbf{r}_0-\mathbf{r}'|}\right) + \phi(\mathbf{r}_0) . \qquad (3.3)
It is customary to choose the reference potential to be zero at |r₀| = ∞. Accordingly, φ(r₀) = 0 and Eq. (3.3) becomes

\phi(\mathbf{r}) = \frac{q}{4\pi\epsilon_0\,|\mathbf{r}-\mathbf{r}'|} . \qquad (3.4)

Secondly, consider that the charge is continuously distributed over the object surface S (see Fig. 6). Thus the electrical potential at r is contributed by all of the charge on S and satisfies the principle of superposition. It can be expressed as follows:

\phi(\mathbf{r}) = \frac{1}{4\pi\epsilon_0}\int_S \frac{\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\, dS' . \qquad (3.5)
Here q = ρ(r')dS', ρ(r') is the charge density at r', and S' is the area over S. Thirdly, according to Fact 3.2, all points on a charged conductor in electrical equilibrium are at the same electrical potential; if we restrict r in Eq. (3.5) to the
conductor surface, φ(r) is constant. Thus, (3.5) may be rewritten as follows:

\int_S \frac{\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\, dS' = V . \qquad (3.6)
Here V = 4πε₀φ(r) is a constant. Since S in Eq. (3.6) is an arbitrary surface, it is impossible to solve the equation analytically. However, we can obtain an approximate solution to Eq. (3.6) by using finite element methods [52], as described in the next section.
3.3. Finite Element Solution

To compute the charge density distribution based on Eq. (3.6), we approximate the 3-D object by a polyhedron, each face of which is a planar triangle which possesses a constant charge density. Then the problem of integration over the complete surface (see Eq. (3.6)) can be converted into a summation of integrations over each triangle. Since the latter can be solved analytically, the charge density on each triangle can be easily computed. The finite element solution is obtained as follows. We tessellate the object surface using a triangular mesh having N planar triangles, T_k, k = 1, ..., N. Each triangle is assumed to have a constant charge density, ρ_k, as shown in Fig. 7. A set of basis functions f_k, k = 1, ..., N is defined on this triangular mesh as follows:

f_k(\mathbf{r}') = \begin{cases} 1 & \mathbf{r}' \in T_k \\ 0 & \text{otherwise.} \end{cases} \qquad (3.7)

Thus the basis function, f_k, is nonzero only when r' is on the triangle T_k, as shown in Fig. 7. Therefore, the charge density ρ(r') can be approximated by a piecewise constant charge density function as follows:
\rho(\mathbf{r}') \approx \sum_{k=1}^{N} \rho_k f_k(\mathbf{r}') . \qquad (3.8)

Fig. 7. Polyhedral approximation of an ellipsoid. When r' is on T_k, f_k = 1 and f_i = 0 (i ≠ k).
Substituting (3.8) into Eq. (3.6), we have

\sum_{k=1}^{N} \rho_k \int_{T_k} \frac{dS'}{|\mathbf{r}-\mathbf{r}'|} = V . \qquad (3.9)

Since the charge density is assumed to be constant on each T_k, we may take r_i as the observation point on each T_i and rewrite Eq. (3.9) as:

\sum_{k=1}^{N} \rho_k \int_{T_k} \frac{dS'}{|\mathbf{r}_i-\mathbf{r}'|} = V , \quad i = 1, \ldots, N . \qquad (3.10)

According to Fact 3.3, the sum of the charges on each triangle equals the total charge on the object surface. Let Q be the total charge on the object surface and S_k be the area of T_k. Then we have
\sum_{k=1}^{N} \rho_k S_k = Q . \qquad (3.11)

Assuming Q is known, and given (3.10) and (3.11), we obtain a set of linear equations with N + 1 unknowns, ρ_1, ..., ρ_N and V. Since the integral in (3.10) can be evaluated analytically [53], the charge density distribution ρ_k and the constant V can be obtained by solving the set of linear equations. Because the potential on a particular triangle is actually contributed by the charge on all of the triangles, the matrix A is dense. In the actual computation, the observation point r_i on each triangular patch is selected at its centroid. The set of linear equations is solved by a conjugate gradient-squared method [54]. To compute the charge density distribution, we need to tessellate a closed triangular mesh on the object surface. A triangular mesh also specifies a data indexing system for object surfaces, which are represented by a set of discrete 3-D points. Thus, triangulation establishes a specific spatial relationship between these points and facilitates the extraction of part boundaries. The objects to be segmented are represented in either single- or multiview range data. The former indicate the 3-D coordinates of data points only on the visible object surfaces. To obtain a closed triangular mesh for this case, we tessellate the visible surface and artificially construct a mesh for the invisible surface. Depending on how this shape completion is done, the resulting mesh will affect the absolute values of charge density on the visible surfaces. However, we observe that the locations of the charge density extrema which indicate object part boundaries do not change. We may combine these two meshes to construct a closed triangular mesh. A detailed description of this mesh construction is given in [55]. Multiview data are obtained by merging range data acquired from several viewpoints. They are a sequence of 3-D data points representing the complete surface of the object. In our experiments, surface tessellation for multiview data was performed by DeCarlo and Metaxas at the University of Pennsylvania using a mesh blending approach [56].
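The assembly and solution of this linear system can be sketched as follows. This is a simplified illustration, not the chapter's implementation: the potential integrals are approximated by a centroid quadrature, with an equivalent-disk rule for the singular self term, rather than by the analytic formulas of [53], and a direct solver stands in for the conjugate gradient-squared method [54].

```python
import numpy as np

def charge_density(verts, faces, Q=1.0):
    """Approximate the equilibrium charge density per triangle of a closed mesh by
    solving Eqs. (3.10)-(3.11) with crude quadrature approximations (see lead-in)."""
    verts, faces = np.asarray(verts, float), np.asarray(faces, int)
    tri = verts[faces]                                       # (N, 3, 3)
    centroids = tri.mean(axis=1)
    areas = 0.5 * np.linalg.norm(np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    N = len(faces)

    A = np.zeros((N + 1, N + 1))
    b = np.zeros(N + 1)
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(d, 1.0)                                 # dummy value; self term set below
    A[:N, :N] = areas[None, :] / d                           # S_k / |r_i - c_k| for i != k
    np.fill_diagonal(A[:N, :N], 2.0 * np.sqrt(np.pi * areas))  # equivalent-disk self integral
    A[:N, N] = -1.0                                          # ... - V = 0 (Eq. (3.10))
    A[N, :N] = areas                                         # sum_k rho_k S_k = Q (Eq. (3.11))
    b[N] = Q

    sol = np.linalg.solve(A, b)
    return sol[:N], sol[N]                                   # densities rho_1..rho_N and V

# Example: a regular octahedron; by symmetry all eight densities should come out equal,
# which is a convenient sanity check.
verts = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
faces = [(0, 2, 4), (2, 1, 4), (1, 3, 4), (3, 0, 4), (2, 0, 5), (1, 2, 5), (3, 1, 5), (0, 3, 5)]
rho, V = charge_density(verts, faces)
print(rho)
```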
Fig. 8. Direct Connection Graph (DCG). (a) A triangular mesh. (b) DCG of the triangular mesh in (a). (c) Subgraphs of (b) after boundary node deletion. Here triangular patches 1, 2, 3 and 8 are assumed to be located on the part boundary.
3.4. Object Decomposition
The object is decomposed into parts after the simulated charge density distribution is obtained. The method is based on a Direct Connection Graph (DCG) defined on the triangular mesh, as shown in Fig. 8. Here nodes represent triangles in the mesh and branches represent the connections between direct neighbors. Two triangles which share two vertices are considered to be direct neighbors. For example, in Fig. 8(a), triangles 1 and 2 are direct neighbors while 2 and 3 are not. Thus the DCG provides a convenient coordinate system on the object surface and indicates the spatial relationship between a triangle and its neighbors. It permits the tracing of the part boundaries on the triangular mesh without employing a voxel-based coordinate system. This significantly reduces the memory required to describe the object and increases the computational speed. For a triangular mesh of multiview range data, we decompose the complete mesh. For single-view range data, only the visible surface of the object is segmented. On the basis of the transversality principle described in Section 3.1, we have assumed that a part boundary is explicitly defined by deep surface concavities. For a complete object the part boundary is a closed contour. This ensures that the decomposition algorithm will be able to segment a part from the rest of the object. The assumption also provides a stopping criterion for the boundary tracing procedure. Since the part boundary is located at local charge density minima, it can be traced along the "valley" of the charge density distribution. We note that for single-view data, the visible surface is not closed, and therefore the part boundary may not be a closed contour. In this case, the tracing process stops when it reaches a mesh boundary, that is, a triangle which has only two direct neighbors. The algorithm examines the charge density on all triangles to find an initial triangle for tracing each boundary, which must satisfy the following conditions:

1. It must be a concave extremum; that is, its charge density must be a local minimum.
2. It must be located at a deep concavity. Thus the charge density on the triangle must be lower than a preselected threshold.c

3. It, as well as its neighbors, must not have been visited before. This ensures that the same boundary will not be traced again.

c This threshold determines when an object should not be decomposed any further. If the charge density at an initial triangle is greater than this threshold, we assume that all the boundary points have been found. The selection of the threshold depends on a priori knowledge of the surface concavity and there is no universal rule for determining it. Currently we choose 1.5 times the lowest charge density on the object surface as the threshold.
Beginning at the initial triangle, the algorithm proceeds to the neighbor with the lowest charge density. During the tracing procedure, all triangles detected on the boundary are marked. These will not be checked again and eventually will be deleted from the DCG. For the mesh constructed from multiview data, the process continues until it returns to the initial triangle. Since we have assumed that a part boundary is closed, this means that all triangles on this part boundary are visited. For a mesh constructed from single-view data, as illustrated in Fig. 8, the process continues until it reaches a triangle (face 3) on the boundary of the mesh. If the initial triangle (for example, face 2) possesses three direct neighbors, the procedure will move in the other direction until reaching a triangle (face 8) on the boundary of the mesh. Thus all triangles on the part boundary are visited. Next the algorithm finds a new initial triangle and traces another boundary. It repeats the same tracing procedure, and finally stops when the charge density at an initial triangle is higher than the preselected threshold. After all triangles on part boundaries have been found, the nodes of the DCG representing these triangles are deleted. In this way, the original DCG is divided into a set of disconnected subgraphs, as shown in Fig. 8(c). Physically the object has been broken into parts. Each object part can be obtained by applying a component labeling algorithm to a subgraph of the DCG. The result is several lists of triangles, each containing those belonging to the same object part.
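A compact sketch of this decomposition step is given below, assuming per-triangle charge densities are already available (for instance from the previous sketch). It builds the DCG, removes triangles at deep charge-density minima and labels the remaining connected subgraphs; note that the explicit valley-tracing described above is simplified here to a plain threshold test, so this is an approximation of the procedure rather than a faithful reimplementation.

```python
import numpy as np
from collections import defaultdict

def direct_connection_graph(faces):
    """DCG: triangles sharing an edge (two vertices) are direct neighbours."""
    edge_owner = defaultdict(list)
    for t, (a, b, c) in enumerate(faces):
        for e in ((a, b), (b, c), (c, a)):
            edge_owner[tuple(sorted(e))].append(t)
    nbrs = defaultdict(set)
    for tris in edge_owner.values():
        for i in tris:
            for j in tris:
                if i != j:
                    nbrs[i].add(j)
    return nbrs

def decompose(faces, rho, factor=1.5):
    """Delete triangles at deep charge-density minima (the part boundary) from the DCG,
    then label the connected components of what remains as the object parts."""
    nbrs = direct_connection_graph(faces)
    threshold = factor * float(np.min(rho))       # 1.5 x the lowest density, as in the chapter
    boundary = {t for t in range(len(faces)) if rho[t] <= threshold}
    labels, current = {}, 0
    for seed in range(len(faces)):
        if seed in boundary or seed in labels:
            continue
        stack = [seed]
        while stack:                              # flood-fill one disconnected subgraph
            t = stack.pop()
            if t in labels or t in boundary:
                continue
            labels[t] = current
            stack.extend(nbrs[t] - boundary)
        current += 1
    return labels, boundary

# Usage (with the mesh and densities of the earlier sketch): decompose(faces, rho)
# would return a single part for the octahedron, since a convex object has no deep concavities.
```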
4. Parametric Geons

4.1. The Model
Parametric geons are a finite set of distinct volumetric shapes, which are used to describe the shapes of object parts. We believe that such model shapes should reflect the essential geometry of objects in the real world. Seven volumetric shapes were chosen, primarily motivated by the art of sculpture, perhaps the most traditional framework for 3-D object representation. One of the most obvious features of sculptured objects is that they consist of a configuration of solids with different shapes and sizes which are joined together but which we can perceive as distinct units. The individual volume is the fundamental unit in our perception of sculptural form, as indeed it is in our perception of fully 3-D solid forms in general [57].
Fig. 9. The seven parametric geons (ellipsoid, cylinder, cuboid, tapered cylinder, tapered cuboid, curved cylinder and curved cuboid, obtained by tapering and bending deformations). ε₁ and ε₂ are the superellipsoid shape parameters.
From a sculptor's point of view, all sculptures are composed of variations of five basic forms: the cube, the sphere, the cone, the pyramid and the cylinder [58,59]. Another important belief in the world of sculpture is that each form originates either as a straight line or a curve [59]. Straightness and curvature are significant for characterising the main axes of elongated objects and were employed in defining the original geon properties [39]. By generalising the five primitive shapes used in sculpture and adding two curved primitives, we arrive at the following seven shapes for parametric geons (see Fig. 9): the ellipsoid, the cylinder,d the cuboid, the tapered cylinder, the tapered cuboid, the curved cylinder and the curved cuboid. These seven shapes are derived from the superellipsoid Eq. (2.1) by (i) specifying the shape parameters ε₁ and ε₂ and (ii) applying tapering and bending deformations [2,42]. Each parametric geon shape can be expressed by a compact implicit function,
g(\mathbf{x}, \mathbf{a}_i) = 0 , \quad i = 1, \ldots, 7 . \qquad (4.1)

d Actually this could be a cylindrical shape with an elliptical cross-section.
Fig. 10. Some other examples of parametric geons. The number beside each shape indicates its geon type: 1 - ellipsoid, 2 - cylinder, 3 - cuboid, 4 - tapered cylinder, 5 - tapered cuboid, 6 - curved cylinder, 7 - curved cuboid.
Here x ∈ R³ and a_i is the nine- to eleven-dimensional parameter vector. The elements of the vector are three scale parameters, six spatial transformation parameters, two tapering parameters if it is a tapered primitive, and one bending parameter if it is a curved primitive. Since these seven shape types are defined quantitatively, their variations can represent a variety of different shapes. Some of this diversity is shown in Fig. 10. The detailed derivations of the implicit and normal equations for all of the seven parametric geons can be found in [55].
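To make this construction concrete, the sketch below evaluates a superellipsoid inside-outside function and a linear tapering deformation of the general kind used for superquadrics [2,42]. The particular tapering form and the parameter names are illustrative assumptions, not the exact equations derived in [55].

```python
import numpy as np

def superellipsoid_F(p, a1, a2, a3, e1, e2):
    """Superellipsoid inside-outside function: F < 1 inside, F = 1 on the surface, F > 1 outside."""
    x, y, z = p
    return (((x / a1) ** 2) ** (1 / e2) + ((y / a2) ** 2) ** (1 / e2)) ** (e2 / e1) \
           + ((z / a3) ** 2) ** (1 / e1)

def untaper(p, kx, ky, a3):
    """Inverse of a simple linear tapering deformation along z (illustrative form)."""
    x, y, z = p
    fx = 1.0 + kx * z / a3
    fy = 1.0 + ky * z / a3
    return np.array([x / fx, y / fy, z])

def tapered_geon_g(p, a1, a2, a3, e1, e2, kx, ky):
    """Implicit function g(x, a) = F(untapered x) - 1 for a tapered primitive."""
    return superellipsoid_F(untaper(p, kx, ky, a3), a1, a2, a3, e1, e2) - 1.0

# A point on the surface of an untapered, cylinder-like geon (e1 = 0.1, e2 = 1) gives g = 0:
print(tapered_geon_g(np.array([1.0, 0.0, 0.0]), 1, 1, 2, 0.1, 1.0, 0.0, 0.0))
```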
4.2. Comparison with the Original Geons

The major distinction between parametric geons and the conventional geons of Biederman [39] is that the latter are defined in terms of certain attributes of volumetric shapes, which do not impose global shape constraints. By contrast, parametric geons are defined in terms of different analytical equations, which do provide such constraints. In addition, the original geons are described in strictly qualitative terms. However, parametric geon descriptions simultaneously include both qualitative and quantitative characterisations of object parts. The geometrical differences between these two sets of primitives are given in Table 1. Certain qualitative properties of the parametric geons have been simplified in comparison with the original geons of Biederman. For example, an asymmetrical cross-section is not used in defining any of the parametric geons because of the symmetrical nature of superellipsoids. The assumption that all parametric geons are symmetrical with respect to their major axes is consistent with the well-known human perceptual tendency toward phenomenological simplicity and regularity [60]. Symmetrical primitives have also been employed in alternatives to the original geons discussed by other researchers [45,61].
Table 1. Difference in qualitative properties between parametric geons and Biederman's original geons.

Attributes                   Parametric Geons              Geons
cross-sectional shape        symmetrical                   symmetrical, asymmetrical
cross-sectional size         constant, expanding           constant, expanding, expanding and contracting
combination of properties    either tapering or bending    both tapering and bending
5. Part Model Recovery
5.1. The Objective Function

The strategy for recovering parametric geons bears some resemblance to that for other parametric primitives. That is, a fitting scheme is used to minimise an objective function which measures some property difference between an object and a model [2,9,44]. However, there is an additional requirement for parametric geon recovery. The process must also produce discriminative information such that the resultant metric data can be converted to a qualitative description. The objective functions studied previously by several researchers were neither intended nor used for this purpose. An approach has been reported which did use fitting residuals to guide object segmentation into parts, but for another purpose [20]. To identify individual qualitative shapes based on fitting residuals, we require that the values of the objective function correctly reflect the difference in size and shape between the object data and the parametric models. Our objective function consists of two terms expressed as follows:
E = t_1 + \lambda\gamma\, t_2 . \qquad (5.1)
The first term, t₁, measures the distance between object data points and the model surface; the second term, t₂, measures the difference between the object and model normals. λ and γ are parameters controlling the contribution made by t₂ to the objective function.

5.1.1. The distance measure

The first term of the objective function is the sum of the distances between the data points and the model surface:

t_1 = \sum_{i=1}^{N} \left| e(\mathbf{d}_i, \mathbf{a}) \right| . \qquad (5.2)

Here N is the number of data points, {d_i ∈ E³, i = 1, ..., N} is the set of data points described in terms of the model coordinate system, and a is the vector of model parameters.
Fig. 11. Defining the objective function. n_m and n_d are the model and object surface normals, respectively. O is the origin of the model. A is the distance between a particular data point and the centre of the model. x_s is a point on the model surface. θ_i is the angle between the model and object surface normals.
For the three regular primitives (ellipsoid, cylinder and cuboid), e(d_i, a) is defined as the Euclidean distance from a data point to the model surface along a line passing through the origin O of the model and the data point [62,63] (see Fig. 11):
where

p = \begin{cases} 2 & \text{for the ellipsoid} \\ 20 & \text{for the cylinder and cuboid.} \end{cases}
A is the distance from d_i to O and g(d_i, a) is an implicit function for a parametric geon. Since tapering or bending significantly complicates the implicit equations of the deformed primitives, in these cases we cannot obtain a closed-form solution for e(d_i, a), as was done in Eq. (5.3). Thus an iterative method would be indicated. However, the objective function evaluation is the largest computational component of the model recovery procedure. Hence, for the sake of simplicity, we compute an approximate distance measure for the tapered and curved models. No iteration is required. First, we apply an inverse tapering or bending transformation to both the data and the model in order to obtain the transformed data d'_i, as shown in Fig. 12; this gives either a regular cuboid or a regular cylinder. Second, we use (5.3) to compute the distance from the transformed data point d'_i to the transformed model surface along a line passing through d'_i and the model origin O. We interpret e(d'_i, a) as the approximation of the distance along a line from d_i to the model surface. Although this approximation creates a small error in the distance measure, it greatly speeds up computation.
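A common closed-form radial distance of this kind (in the spirit of the error-of-fit measures of [62]) scales the data point's distance from the model origin by a power of the inside-outside function. The sketch below assumes that form and should not be read as the exact Eq. (5.3); note that with ε₁ = 1 the exponent corresponds to p = 2 and with ε₁ = 0.1 to p = 20, consistent with the two cases above.

```python
import numpy as np

def inside_outside_F(p, a1, a2, a3, e1, e2):
    """Superellipsoid inside-outside function: F = 1 on the model surface."""
    x, y, z = p
    return (((x / a1) ** 2) ** (1 / e2) + ((y / a2) ** 2) ** (1 / e2)) ** (e2 / e1) \
           + ((z / a3) ** 2) ** (1 / e1)

def radial_distance(p, a1, a2, a3, e1, e2):
    """Approximate Euclidean distance from p to the model surface along the ray
    through the model origin and p (a radial error-of-fit measure)."""
    A = np.linalg.norm(p)                        # distance from the data point to the origin
    F = inside_outside_F(p, a1, a2, a3, e1, e2)
    return A * abs(1.0 - F ** (-e1 / 2.0))       # zero exactly on the surface

# Example: a point 0.5 units outside a unit sphere (a1 = a2 = a3 = 1, e1 = e2 = 1):
print(radial_distance(np.array([1.5, 0.0, 0.0]), 1, 1, 1, 1.0, 1.0))  # prints ~0.5
```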
Fig. 12. The cylinders on the right in (a) and (b) are obtained by applying inverse tapering and bending transformations to the left tapered and curved cylinder, respectively. e(d_i, a) is the Euclidean distance along a line Od_i in the inverse transformed case.
5.1.2. The normal measure

We define the second term (t₂) of the objective function by measuring the squared difference between the surface normal vectors n_d of the object and the surface normal vectors n_m of the model at each corresponding position (see Fig. 11 and Eq. (5.4)); here N is again the number of data points.
In (5.1), γ = (a_x + a_y + a_z)/3, which makes the second term adapt to the size of the parametric geons; a_x, a_y and a_z are the model size parameters. This factor also forces the selection of a model with a smaller size if the object data can be fitted
equally well by a model with different parameter sets. This case can happen when the data on the bottom surface of an object cannot be obtained. However, the size of the model is prevented from being arbitrarily small since the value of the objective function increases if the model size is smaller than the object size. This is similar to the volume factor used in superellipsoid recovery [2]. λ, a weighting constant, controls the contribution of the second term to the objective function. It should be selected based on assumptions about the shape differences between objects and their models. There is no general rule for selecting it. We chose λ = 5 according to a heuristic based on the shape difference between each pair of parametric geons [64].

5.2. Minimising the Objective Function

The procedure for fitting parametric geons is a search for a particular set of parameters which minimises the objective function in (5.1). This function has a few deep and many shallow local minima. The deep local minima are caused by an inappropriate orientation of the model. The shallow minima are caused by noise and minor changes in object shape. In order to obtain the best fit of a model to an object, we need to find model parameters corresponding to the global minimum of the objective function. To accomplish this, we employ a stochastic optimisation technique, Very Fast Simulated Re-annealing (VFSR) [65]. Motivated by an analogy to the statistical mechanics of annealing in solids, the simulated annealing technique uses a "temperature cooling" operation for non-physical optimisation problems, thereby transforming a poor solution into a highly optimised, desirable solution [46]. The salient feature of this approach is that it statistically finds a globally optimal solution. VFSR uses an annealing schedule which decreases exponentially, making it much faster than traditional (Boltzmann) annealing [46], where the annealing schedule decreases logarithmically. In addition, a re-annealing property permits adaptation to changing sensitivities in the multidimensional parameter space.
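The sketch below illustrates the overall fitting loop on a deliberately simple model (a sphere), using SciPy's dual_annealing as a generic stochastic global optimiser in place of VFSR, which is a distinct algorithm [65]; the objective mimics the two-term structure of Eq. (5.1) but is not the chapter's exact function. In the actual method the search is over the nine to eleven parameters of a parametric geon.

```python
import numpy as np
from scipy.optimize import dual_annealing

rng = np.random.default_rng(0)

# Synthetic "range data": noisy points on a sphere of radius 2 centred at (1, 0, 0),
# together with outward surface normals (standing in for object data and its normals).
centre_true, r_true = np.array([1.0, 0.0, 0.0]), 2.0
dirs = rng.normal(size=(200, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
points = centre_true + r_true * dirs + 0.01 * rng.normal(size=(200, 3))
normals = dirs

def objective(params, lam=5.0):
    """Two-term objective in the spirit of Eq. (5.1): an L1 distance term plus a
    size-weighted L2 normal term, here for a simple sphere model (cx, cy, cz, r)."""
    c, r = params[:3], params[3]
    v = points - c
    dist = np.linalg.norm(v, axis=1)
    t1 = np.sum(np.abs(dist - r))                 # L1 point-to-surface distance
    model_n = v / dist[:, None]                   # sphere surface normals at the data points
    t2 = np.sum((normals - model_n) ** 2)         # squared normal difference
    gamma = r                                     # size factor (a sphere has one size parameter)
    return t1 + lam * gamma * t2

bounds = [(-3.0, 3.0)] * 3 + [(0.1, 5.0)]
result = dual_annealing(objective, bounds, maxiter=200)
print(result.x)   # should be close to (1, 0, 0, 2)
```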
5.3. Biasing the Objective Function with Different Norms

We have suggested an L1 norm in (5.2) and an L2 norm in (5.5) to measure differences in distance and orientation, respectively. It is known that the sensitivity of an L2 norm gradually increases [66]. In other words, this norm is insensitive to small values of the objective function and becomes sensitive to outliers. On the other hand, the sensitivity of an L1 norm is the same for all residual values. It is also known that the absolute size of a model is independent of the measurement of the differences between normals. These properties can be used to construct an efficient parameter search during model fitting. Effectively, the procedure automatically endeavours to compute the correct result in what amounts to two successive "stages". In the first "stage", when the fitting procedure begins, the models and objects are not well aligned, so most of the data can be viewed as outliers. Thus
Fig. 13. The curves show the convergence of the distance measure t₁ and the normal measure λγt₂ as they change during the search, plotted against the number of decrements in the objective function. The solid line indicates values of the distance measure. The dotted line gives values for the normal measure. The dashed line presents the values of the complete objective function. These curves were obtained when a curved cylinder model was fit to data from the same type of object.
the second term is much larger than the first, thereby dominating the search. Obviously the exact size of the model has little effect on the second term. Hence, the actual search space mainly involves transformation and deformation parameters, as well as the ratio of the size parameters. Clearly, this search space will be smaller than the entire parameter space. As the fitting procedure progresses, the position, orientation and shape of the model will approach that of the object. Now the contribution of the second term gradually decreases and the influence of the first term becomes progressively larger. When the value of the first term is similar to that of the second, the search enters the second "stage" in which both terms contribute equally to the objective function, and the search space becomes the full parameter space. Thus, a search in the full parameter space without good initial estimates is automatically achieved by a "subspace" search followed by a full-space search with good initial estimates of the transformation parameters, as shown in Fig. 13. A similar two-term objective function for fitting superellipsoids has been reported [44], where both terms invoked L2 norms. Our approach takes advantage of the sensitivity difference between L1 and L2 norms, and therefore is more efficient.

6. Experiments
This section provides experimental results to show various aspects of part segmentation and identification.
6.1. Charge Density versus Curvature

Before demonstrating part segmentation, we compare the charge density approach to the more conventional curvature computation. The latter has been
Fig. 14. A polygonal contour. Because of image quantisation, the boundary of a polygon in an image is jagged.
traditionally used in boundary-based part segmentation [11,29,30,34,35]. In order to clearly illustrate the issues, we performed experiments with 2-D closed contours. The first experiment examined the sensitivity of the charge density and curvature computations using a simple 2-D polygon, as shown in Fig. 14. Due to the image sampling process, the boundary of this polygon is contaminated by high frequency noise. In a similar fashion to computing the incremental curvature [67], we approximate the curvature of a contour based on the rate of change of the discrete tangent at a point on the contour. The increment is 1. A comparison of noise sensitivity for the charge density and curvature for the object contour in Fig. 14 is given in Fig. 15. We show the charge density distribution in the left column and curvature distribution on the right. Without any smoothing operation, all corners on the contour are indicated by the charge density distribution, but concave corners are poorly isolated by the curvature distribution (see the first row). Next we applied lowpass filtering to the discrete Fourier transform of the polygon data to remove the high frequency components. The amount of smoothing was between 1% and 4% of the energy in the Fourier domain. The results are shown in the remaining figures. These experiments clearly illustrate that the charge density computation is more robust with respect to high frequency noise than the curvature computation. In distinction to a local shape computation, such as curvature, the significance of the charge density distribution is its ability to reveal both fine and gross shape information. We demonstrate this by the following two examples. The first example uses a dumbbell-like object with wiggles superimposed (see Fig. 16). The gray levels indicate charge densities, which are normalised to the range between 20 (darkest intensity) and 255 (white). The object contains two kinds of structures. These are: (1) the fine structure, which is represented by small wiggles and (2) the gross structure, which is delineated by the two major components of the dumbbell. Figure 16(b) shows the charge density distribution along the arc length of this object. This curve simultaneously indicates the fine and gross structures of the contour. The dashed line depicts the two gross components defined by the envelope of the charge density distribution. However, the incremental curvature of the contour only indicates the fine structure, as shown in Fig. 16(c). In the second example, we examine a 2-D contour with just six protrusive parts.
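A sketch of the two operations used in this comparison is given below, assuming the contour is an ordered array of 2-D points: Fourier-domain lowpass smoothing that removes a given fraction of the spectral energy, and an incremental curvature estimate from the change of the discrete tangent angle. The exact smoothing criterion and increment used in the chapter may differ.

```python
import numpy as np

def lowpass_smooth(contour, remove_energy=0.02):
    """Lowpass-filter a closed 2-D contour: zero the highest-frequency DFT coefficients
    of the complex contour until roughly `remove_energy` of the total energy is removed."""
    z = contour[:, 0] + 1j * contour[:, 1]
    Z = np.fft.fft(z)
    mag2 = np.abs(Z) ** 2
    order = np.argsort(-np.abs(np.fft.fftfreq(len(Z))))   # highest frequencies first
    removed, budget = 0.0, remove_energy * mag2.sum()
    Zf = Z.copy()
    for idx in order:
        if removed + mag2[idx] > budget:
            break
        removed += mag2[idx]
        Zf[idx] = 0.0
    zs = np.fft.ifft(Zf)
    return np.column_stack([zs.real, zs.imag])

def incremental_curvature(contour, step=1):
    """Approximate curvature as the change of the discrete tangent angle along the contour."""
    nxt = np.roll(contour, -step, axis=0)
    tangent = np.arctan2(nxt[:, 1] - contour[:, 1], nxt[:, 0] - contour[:, 0])
    dtheta = np.diff(np.concatenate([tangent, tangent[:1]]))
    return (dtheta + np.pi) % (2 * np.pi) - np.pi          # wrap to (-pi, pi]

# Example: a unit square sampled at 400 points, smoothed by removing ~3% of the energy.
t = np.linspace(0, 4, 400, endpoint=False)
side = t % 1
square = np.array([[s, 0] if k < 1 else [1, s] if k < 2 else [1 - s, 1] if k < 3 else [0, 1 - s]
                   for k, s in zip(t, side)])
kappa = incremental_curvature(lowpass_smooth(square, 0.03))
print(kappa.max())
```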
Fig. 15. Comparison between charge density and curvature computations for the 2-D contour in Fig. 14. The left and right columns show the charge density and curvature distributions, respectively, plotted against arc length. In the first row, no smoothing was employed. In the other rows, we have applied lowpass filtering to the Fourier components of the contour by removing between 1% and 4% of the total energy in the Fourier domain. At each level of smoothing, the charge density computation is clearly more robust to high frequency noise than the curvature computation.
The gray level intensities superimposed on the contour indicate the charge density in the same way as in Fig. 16(a). Figure 17(b) shows the charge density distribution along the arc length. We observe that the peaks and valleys can indicate not only the contrast caused by the convexities and concavities of the contour shape but also the significance of protrusive parts. For example, the higher the charge density, the larger the part protrusion. The incremental curvature is unable to do so, as shown in Fig. 17(c).
Fig. 16. Fine and gross features. (a) The charge density for a dumbbell with wiggles superimposed. The brightest and darkest intensities indicate the maximum and minimum charge densities, respectively. (b) The charge density distribution over the contour in (a). The arc length is referenced from the highest pixel on the contour and goes counterclockwise. The frequent peaks indicate small wiggles on the contour. The two peaks of the envelope (the dashed line) of the curve denote the two major parts of the dumbbell. (c) The incremental curvature distribution along the contour in (a). In this computation, the smoothing factor was chosen to be 2% and the increment for the curvature computation was taken as 1.
6.2. Part Segmentation

In this section, we will discuss experimental results related to segmentation of 3-D objects into parts. The first object is a vase, consisting of a sphere and a cylinder. The raw range image (sphere+cylinder.Z) originated at the PRIP Lab at Michigan State University, and is shown in Fig. 18(a). The triangular mesh (see Fig. 18(b)) for the object was obtained using a mesh blending approach developed by DeCarlo and Metaxas [56]. There are 480 triangles in the mesh. Figure 18(c) shows the computed charge density distribution over the object surface. The gray level indicates charge density, which is normalised to the range between 0 (darkest intensity) and 250 (white). It can be clearly seen that the lowest charge densities are located at surface concavities, which are at the intersection of the spherical and cylindrical portions of the object. Conversely, since the edge on the top of the
Fig. 17. The charge density and curvature distributions on an image contour. The arc length is referenced from the highest pixel on the contour and goes counterclockwise. (a) An object contour with a superimposed charge density distribution. (b) The charge density distribution along the arc length of the contour. The height of the peaks indicates the significance of the object parts. (c) The incremental curvature along the contour. In the computation, the smoothing factor is 2% and the increment for the curvature computation is 1.
object is sharply convex, the charge density at these points reaches a maximum. Figure 18(d) shows the two segmented parts of the vase. The second 3-D object is a toy bowling pin. The range data (see Fig. 19(a)) were obtained by multiview integration, as described in [55]. The triangular mesh of the object in Fig. 19(b) was computed in the same way as for the previous object. There are 864 triangles in the mesh. Figure 19(c) shows the simulated charge density distribution. Again the charge density distribution clearly indicates the local charge density minima. The result of segmentation is given in Fig. 19(d). The third example involves single-view range data of a stone owl. Figure 20(a) shows the shaded range data and (b) illustrates the triangular mesh for the visible surface. Here 368 triangular facets are used to represent the visible surface. The
Fig. 18. Segmenting a vase. (a) Range data; (b) The triangular mesh tessellation; (c) The computed charge density distribution; (d) Segmented parts.
Fig. 19. Segmenting a bowling pin. (a) Multiview data; (b) The triangular mesh tessellation; (c) The computed charge density distribution; (d) The segmented parts.
Fig. 20. Segmenting a stone owl. (a) Shaded range data of the owl. (b) The triangular tessellation on the visible surface. (c) The charge density distribution. (d) The segmented parts.
closed triangular mesh is composed of 966 triangles. Figure 20(c) illustrates the computed charge density distribution on the surface. It is evident that the lowest charge densities are located at surface concavities while the charge density at convexities exhibits a local maximum. As shown in Fig. 20(d), this object is segmented into three parts, namely, the head, the torso and the feet.
Fig. 21. Segmenting an alarm clock. (a) The shaded range data. (b) The triangular tessellation for the visible surfaces. (c) The charge density distribution. (d) The segmented parts.
In the fourth example, we consider an alarm clock with two ringers on top. Figures 21(a) and (b) show the shaded range data and the triangular mesh, respectively. There are 596 triangles tessellated on the visible surface and 1,878 triangles on the closed mesh. The computed charge density distribution is illustrated in Fig. 21(c) and the segmented parts are given in (d). Here the clock is decomposed into three parts. These results are consistent with our intuition of object parts. The complexity of the charge density computation is governed by the construction of the coefficient matrix for the set of linear equations and the conjugate gradient-squared method used to solve the equations. The complexity of both is of the order of O(N²), where N is the number of triangular facets. On an SGI R8000 workstation, the actual computing time for the charge distribution for the owl is 90 seconds, with about two seconds for surface triangulation, and less than one second for part decomposition.
6.3. Model Recovery

Experiments on recovering parametric geons from single- and multiview range data were conducted. We also investigated the efficiency of the objective function, the discriminative properties of parametric geons, the effect of object shape imperfection and the importance of multiview data for shape approximation. We are interested in examining the residual differences among all fitted models, especially when object data contain noise and object shapes do not exactly conform to those of the parametric geons. Single-part and multi-part objects were used in the experiments. The final data for parametric geon model recovery is a set of 3-D data points.

6.3.1. Using range data of imperfect geon-like objects

The purpose of this experiment was to examine the uniqueness of the shape approximation using parametric geons when given a set of objects whose shapes varied slightly. Here, eleven real bananas were used as objects. Figure 22 shows four of the bananas used in the experiments. We observe that their shapes cannot simply be depicted by any of the parametric geons. The apparently noisy surfaces of the bananas shown in the figure were due to the rangefinder's sampling error.
Fig. 22. Four bananas used in the experiments.
Fig. 23. Fitted models superimposed on the range data obtained from a banana.
This was because the bananas had to be placed relatively far from the rangefinder in order for them to fit within its scanning field-of-view. Four views were used for model recovery in these experiments. Figure 23 shows the results of fitting the seven parametric geons to 3-D data of a particular banana. The lighter shaded volumes are the models obtained by the fitting procedure and the darker sparse spots indicate the input data. (a) through (g) illustrate models of the ellipsoid, the cylinder, the cuboid, the tapered cylinder, the tapered cuboid, the curved cylinder and the curved cuboid superimposed on the 3-D data, respectively. The algorithm selected the curved cylinder shown in (f) as the best model. The numbers at the top left corner indicate the fitting residuals. Clearly this result is consistent with our intuition of the banana's actual shape. Table 2 gives the average, maximum and minimum fitting residuals for all of the bananas for each of the seven models. In order to make an absolute comparison of the fitting residuals, each residual was normalised by the minimum residual among
Table 2. Fitting models to range data of eleven bananas.

Models              1       2       3       4       5       6       7
Mean                3.255   2.889   3.851   3.324   3.611   1.000   2.987
Maximum residual    4.001   3.489   4.717   5.018   4.328   1.000   3.802
Minimum residual    2.656   2.458   3.102   2.464   3.073   1.000   2.385
those obtained for the same banana. The results show that the best model for all bananas is the curved cylinder, which gives the smallest average residual value. Therefore, the parametric geon models and recovery procedures demonstrate robust behavior and uniquely represent the different banana shapes.

6.3.2. Comparing different objective functions

In this experiment, we recovered parametric geons from the multiview range data of the same eleven bananas using just t₁ (see Eq. (5.2)). This objective function measures the sum of the spatial distances from the data points to the model surface along a line passing through a datum and the model centre. It has a significant advantage over others as an objective function for superellipsoid fitting [62]. Although our 3-D data were obtained from multiple views, the data at the bottom of the bananas were still missing, which would cause the model size to be underconstrained. Thus we multiply t₁ by a size factor γ. The effect is the same as that of γ presented in Section 5.1.2. Figure 24(a) shows the fitting residuals obtained using the weighted t₁ with eleven bananas. The circles linked by dotted lines correspond to residuals from one particular set of data. In order to show the differences in the residuals clearly, they were normalised by the minimum residuals obtained from each banana. Although this simplified objective function actually produces the unique shape type, the minimum residuals are not significantly lower than the others. In addition, for some data sets the algorithm required many more evaluations to find the global minimum than the objective function proposed in this chapter. Figure 24(b) shows the fitting residuals obtained using the objective function proposed in this chapter. Here, all minimum residuals are clearly lower than the others. This is because the normal measure makes the global minimum deeper in objective function space than the other local minima. This facilitates global optimisation and makes the fitting residuals more discriminating. It is noted that the curved cuboid would clearly be chosen as the next best if only the distance measure were used, as shown in Fig. 24(a). However, when the normal measure is introduced to the objective function, the curved cuboid would not be chosen as the next best, as shown in Fig. 24(b).
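The selection step described above amounts to normalising each object's seven fitting residuals by their minimum and choosing the smallest; a minimal sketch is given below, using the mean residuals of Table 2 purely as illustrative input.

```python
import numpy as np

GEON_NAMES = ["ellipsoid", "cylinder", "cuboid", "tapered cylinder",
              "tapered cuboid", "curved cylinder", "curved cuboid"]

def select_model(residuals):
    """Normalise the seven fitting residuals by their minimum and return the best geon type."""
    residuals = np.asarray(residuals, dtype=float)
    normalised = residuals / residuals.min()
    return GEON_NAMES[int(np.argmin(normalised))], normalised

# Mean normalised residuals from Table 2 (already normalised; used only as an example):
best, norm = select_model([3.255, 2.889, 3.851, 3.324, 3.611, 1.000, 2.987])
print(best)   # -> "curved cylinder"
```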
Fig. 24. Fitting residuals versus model type for the eleven bananas: (a) using the weighted t₁ only; (b) using the objective function proposed in this chapter.
6.3.3. Comparing single-view and multiview data
This experiment examined the quality of fitting when using single-view data of the same eleven bananas. Figure 25 shows the normalised fitting residuals. The algorithm again selected the curved cylinder as the model for all of the bananas.
Fig. 25. Fitting residuals obtained from single-view data of the eleven bananas.
Fig. 26. Fitting a model to single-view (a) and multiview (c) data of a banana. The left column shows range data and the right column shows the model superimposed on the range data. The model obtained from the single-view data is biased by the available partial shape information.
However, compared with Fig. 24(b), the fitting results are now much more diverse, and the differences between the minimum residuals and the others are significantly reduced. Figure 26 shows the results of fitting a parametric geon model to single-view and multiview data of a banana. Since the banana shape is not regular, the model estimated from single-view data was biased by the available partial shape information. Thus, the experimental results suggest that sufficient data are needed in order to accurately obtain part models.

6.3.4. Comparing perfect and imperfect geon-like objects

Here we compared the parameter dispersion obtained by fitting parametric geons to single-view data of perfect and imperfect geon-like objects. We obtained four sets
Fig. 27. Comparing bananas and plastic tubes. The lighter bar denotes the coefficient of variation of four estimated parameters for the plastic tube. The darker bar indicates the coefficient of variation of four parameters averaged over eleven bananas. The horizontal axis indicates the parameters: 1, 2, 3 are the model size parameters along the X, Y, Z axes, respectively, and 4 is the bending curvature.
of data by scanning a curved plastic tube whose shape resembled a perfect curved cylinder. The data of imperfect geon-like objects consisted of 44 sets of single-view data of the eleven bananas, each of which was scanned from four different views. Three scale parameters along the X, Y, Z axes and the bending curvature parameter were examined. We cannot compare the transformation parameters (see Eq. (4.1)) because the two types of objects are represented in terms of their individual views. Because of differences in size and axis curvature, we used the coefficient of variation, defined as the ratio of the standard deviation and the mean, as the measure of relative dispersion of each estimated parameter. Figure 27 shows that the parameter dispersion is much larger for the banana than the plastic tube. This is because the shape variations of imperfect objects in some views may be more than in other views, and data from perfect geon-like objects in single views contain more consistent information. Thus, the imperfection in shape makes it much more difficult for model recovery to obtain unique quantitative information using single-view data. This also suggests that employing multiview data is very important in parametric model recovery, especially when the object shapes are not highly consistent with the model shapes.
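A minimal sketch of the dispersion measure used here follows: the coefficient of variation (standard deviation divided by the mean) of each estimated parameter across views. The numerical estimates are invented purely for illustration.

```python
import numpy as np

def coefficient_of_variation(samples):
    """Coefficient of variation (std / mean) of each estimated parameter across views."""
    samples = np.asarray(samples, dtype=float)
    return samples.std(axis=0) / samples.mean(axis=0)

# Illustrative estimates of (a_x, a_y, a_z, bending curvature) from four views of one object:
estimates = [[1.02, 0.98, 2.10, 0.21],
             [0.95, 1.05, 1.95, 0.25],
             [1.10, 0.90, 2.20, 0.18],
             [0.98, 1.02, 2.05, 0.23]]
print(coefficient_of_variation(estimates))
```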
Fig. 28. Part-based descriptions of objects. Shaded range data of three objects are shown in the left column. The part-based descriptions of these three objects are presented in the right column.
6.3.5. Using multi-part objects

We have also conducted experiments with multi-part objects, which have been segmented into parts. Figure 28(a) illustrates single-view range data of an object and (b) its parametric geon model consisting of a cylinder and an ellipsoid. Figure 28(c) is one view from a set of multiview range data of a toy bowling pin and (d) is the object model made up of a tapered cylinder and an ellipsoid. Figure 28(e) shows the single-view range data of a stone owl and (f) its model. The head and torso are identified as two curved cylinders and the model of the base is a tapered cylinder. These results again indicate that (1) when the object is composed of perfect geon-like parts, our method obtains the optimum result (see (a) and (b)); (2) when multiview shape information is available but the object is composed of imperfect geon-like parts, our method can also produce a good result (see (c) and (d)); (3) when only partial shape information is available and the object is not composed of perfect geon-like parts, our method still achieves a satisfactory qualitative result (see (e) and (f)).
7. Discussion

The characteristics of our method for part segmentation are further elaborated in this section. The primary distinction between our approach and other boundary-based techniques is the algorithm for surface feature computation. As a result of simulating the electrical charge density distribution, our algorithm has the following interesting properties and advantages:
Uniqueness. It has been proven that the charge density distribution on a charged conductor in electrostatic equilibrium is uniquely determined [68]. Thus given a specific shape of an object, our method obtains a unique description of the surface property and, in turn, obtains a unique segmentation result based on this surface property.

Invariance. The charge density distribution depends completely upon the total charge, as well as the shape and size of an object. Thus, it is independent of the coordinate system chosen for the computation (see Eqs. (3.10) and (3.11)). Also, the relative positions of the extrema in the charge density distribution are not dependent on object size. Therefore, our part segmentation method is invariant to object scale, translation and rotation.

Smoothness. The charge density computation, which is based on integral equations, does not require an assumption on the smoothness of object surfaces. However, the common surface curvature computation, which is based on differentiation, requires a smooth surface, namely, continuous second partial derivatives of the object surface [69].

Computability. The algorithm for the charge density distribution is based upon three physical facts and Gauss's law. There are no crucial user-defined parameters required. This is superior to isotropic diffusion [38], another strategy for surface feature computation, where it is extremely difficult to choose the parameter which controls the duration of the diffusion process.

Scope and sensitivity. It is often required that an object description represent shape at different scales. Descriptions at coarse scales relate to the gross shape features. Details at finer scales include features that are more local. The charge density distribution carries information about scale in a different way from the more common curvature-based approach [70]. Curvature is completely determined by local data, but the charge density distribution is affected by all of the points on the object surface. Note that these points do not contribute equally to the potential at a particular observation point. Instead, their influence is weighted by the reciprocal of the distance between the observation and source points (see Eq. (3.6)). Thus, a charge density computation can simultaneously reveal both gross and fine features of object shapes, as demonstrated in Section 6.1.

8. Conclusion

This chapter presents an approach to qualitative shape representation of 3-D object parts by a finite set of volumetric primitives. This is accomplished in two steps: a physics-based object segmentation into parts followed by a top-down model recovery procedure. The segmentation method is motivated by an analogy between the well-known transversality principle and the singularity of the electrical charge density distribution. Given single- or multiview range data of an object, a triangular mesh is constructed and the charge density distribution is simulated. Object
part boundaries are then detected at the deep surface concavities where the charge density achieves local minima. The object is subsequently broken into parts at these points. Unlike most previous approaches, we do not compute surface curvature. Instead, we solve a set of integral equations over the whole object surface. Because of this, our algorithm does not require an assumption of surface smoothness and is very robust to sensor noise. Once the segmented parts have been obtained, the model recovery procedure is performed. We have proposed parametric geons as part models, which provide qualitative shape classes as well as quantitative size and deformation information. Model recovery is achieved by fitting all parametric geons to the range data of a part and classifying the fitting residuals. An objective function involving a measure of distance and normal differences is optimised by a global optimisation procedure (VFSR). The combination of L1 and L2 norms in the objective function permits an efficient and hierarchical search of the model parameters, resulting in more discriminative fitting residuals. Parametric geon models impose global shape constraints which reduce the influence of sensor noise and minor shape variations during model recovery. This approach achieves an explicit shape verification of resultant descriptions by directly comparing model shapes with a part shape. The consequent part descriptions are well suited for symbolic reasoning and efficient object recognition.
Acknowledgements

We wish to thank Lester Ingber, Jonathan Webb, Dimitris Metaxas, Gerard Blais, Gilbert Soucy and Douglas DeCarlo for their kind help. MDL would like to thank the Canadian Institute for Advanced Research and PRECARN for its support. This work was partially supported by a Natural Sciences and Engineering Research Council of Canada Strategic Grant and an FCAR Grant from the Province of Quebec.
References

[1] D. Hoffman and W. Richards, Parts of recognition, Cognition 18 (1984) 65-96.
[2] F. Solina and R. Bajcsy, Recovery of parametric models from range images: the case for superquadrics with global deformations, IEEE Trans. Pattern Anal. Mach. Intell. 12, 2 (1990) 131-147.
[3] D. Marr and H. K. Nishihara, Representation and recognition of spatial organization of three-dimensional shapes, Proc. Royal Soc. B200 (1978) 269-294.
[4] J. J. Koenderink and A. J. Van Doorn, The shape of smooth objects and the way contours end, Perception 11 (1982) 129-137.
[5] B. Tversky and K. Hemenway, Objects, parts and categories, J. Exp. Psychol. 113, 2 (1984) 169-191.
[6] M. Leyton, Inferring causal history from shape, Cogn. Sci. 13 (1989) 357-389.
[7] R. Bajcsy and F. Solina, Three-dimensional object representation revisited, in Proc. First Int. Conf. Comp. Vision, London, 1987, 231-241.
[8] R. Nevatia and T. O. Binford, Descriptions and recognition of curved objects, Art. Intell. 8 (1977) 77-98.
[9] A. P. Pentland, Recognition by parts, in Proc. First Int. Conf. Comp. Vision, London, Jun. 1987, 8-11.
[10] A. P. Pentland, Automatic extraction of deformable part models, Int. J. Comp. Vision 4 (1990) 107-126.
[11] F. P. Ferrie, J. Lagarde and P. Whaite, Darboux frames, snakes and superquadrics: Geometry from the bottom up, IEEE Trans. Pattern Anal. Mach. Intell. 15, 8 (1993) 771-784.
[12] D. Terzopoulos and D. Metaxas, Dynamic 3-D models with local and global deformations: Deformable superquadrics, IEEE Trans. Pattern Anal. Mach. Intell. 13, 7 (1991) 703-714.
[13] S. J. Dickinson, A. P. Pentland and A. Rosenfeld, 3-D shape recovery using distributed aspect matching, IEEE Trans. Pattern Anal. Mach. Intell. 14, 2 (1992) 174-198.
[14] J. Hummel and I. Biederman, Dynamic binding in a neural network for shape recognition, Psychol. Rev. 99, 3 (1992) 480-517.
[15] R. Bergevin and M. D. Levine, Generic object recognition: Building and matching coarse descriptions from line drawings, IEEE Trans. Pattern Anal. Mach. Intell. 15, 1 (1993) 19-36.
[16] K. Siddiqi, K. J. Tresness and B. B. Kimia, Parts of visual form: Psychophysical aspects, Perception 25, 4 (1994) 399-424.
[17] J. O'Rourke and N. Badler, Decomposition of three-dimensional objects into spheres, IEEE Trans. Pattern Anal. Mach. Intell. 1 (1979) 295-305.
[18] A. P. del Pobil, M. A. Serna and Juan Llovet, A new representation for collision avoidance and detection, in Proc. 1992 IEEE Int. Conf. Robotics and Automation, Nice, France, May 1992.
[19] R. Mohr and R. Bajcsy, Packing volumes by spheres, IEEE Trans. Pattern Anal. Mach. Intell. 5 (1983) 111-116.
[20] A. Gupta and R. Bajcsy, Volumetric segmentation of range images of 3-D objects using superquadric models, CVGIP: Image Understanding 58, 3 (1993) 302-326.
[21] A. Pentland, Part segmentation for object recognition, Neural Comput. 1 (1989) 82-91.
[22] T. Horikoshi and S. Suzuki, 3D parts decomposition from sparse range data using information criterion, in Proc. 1993 IEEE Comp. Soc. Conf. Comp. Vision and Pattern Recogn., New York City, NY, Jun. 1993, 168-173. IEEE Computer Society Press.
[23] F. Solina, A. Leonardis and A. Macerl, A direct part-level segmentation of range images using volumetric models, in Proc. 1994 IEEE Int. Conf. Robotics and Automation, San Diego, CA, May 1994, 2254-2259. IEEE Robotics and Automation Society, IEEE Computer Society Press.
[24] S. J. Dickinson, D. Metaxas and A. Pentland, Constrained recovery of deformable models from range data, in Proc. Int. Workshop on Visual Form, Capri, Italy, May 1994.
[25] N. S. Raja and A. K. Jain, Obtaining generic parts from range data using a multi-view representation, CVGIP: Image Understanding 60, 1 (1994) 44-64.
[26] R. Hoffman and A. K. Jain, Segmentation and classification of range images, IEEE Trans. Pattern Anal. Mach. Intell. 9, 5 (1987) 608-620.
[27] J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. 8, 6 (1986) 679-698.
[28] B. O'Neill, Elementary Differential Geometry (Academic Press, New York and London, 1966).
[29] H. Rom and G. Medioni, Part decomposition and description of 3-D shapes, in Proc. 12th Int. Conf. Pattern Recogn. I, Jerusalem, Israel, Oct. 1994, 629-632. IEEE Computer Society, IEEE Computer Society Press.
[30] F. P. Ferrie and M. D. Levine, Deriving coarse 3-D models of objects, in Proc. IEEE Conf. Comp. Vision and Pattern Recogn., Ann Arbor, Michigan, Jun. 1988, 345-353.
[31] Q. L. Nguyen and M. D. Levine, Representing 3-D objects in range images using geons, Comp. Vision and Image Understanding 63, 1 (1996) 158-168.
[32] B. I. Soroka, R. L. Andesson and R. K. Bajcsy, Generalised cylinders from local aggregation of sections, Pattern Recogn. 13, 5 (1981) 353-363.
[33] T. Phillips, R. Cannon and A. Rosenfeld, Decomposition and approximation of 3-D solids, Comp. Vision, Graphics and Image Proc. 33 (1986) 307-317.
[34] E. Trucco, Part segmentation of slice data using regularity, Signal Proc. 32 (1993) 73-90.
[35] A. Lejeune and F. Ferrie, Partitioning range images using curvature and scale, in Proc. 1993 IEEE Comp. Soc. Conf. Comp. Vision and Pattern Recogn., New York City, NY, Jun. 1993, 800-801. IEEE Computer Society Press.
[36] P. J. Besl and R. C. Jain, Invariant surface characteristics for three dimensional object recognition in range images, Comp. Vision, Graphics and Image Proc. 33, 1 (1986) 33-88.
[37] E. Trucco and R. B. Fisher, Experiments in curvature-based segmentation of range data, IEEE Trans. Pattern Anal. Mach. Intell. 17, 2 (1995) 177-181.
[38] Y. Yacoob and L. S. Davis, Labeling of human face components from range data, CVGIP: Image Understanding 60, 2 (1994) 168-178.
[39] I. Biederman, Human image understanding: Recent research and a theory, Comp. Vision, Graphics and Image Proc. 32 (1985) 29-73.
[40] S. J. Dickinson and D. Metaxas, Integrating qualitative and quantitative shape recovery, Int. J. Comp. Vision 13, 3 (1994) 311-330.
[41] R. C. Munck-Fairwood and L. Du, Shape using volumetric primitives, Image & Vision Computing 11, 6 (1993) 364-371.
[42] A. H. Barr, Superquadrics and angle-preserving transformations, IEEE Comp. Graphics Applications 1 (1981) 11-23.
[43] T. Boult and A. Gross, Recovery superquadrics from depth information, in Proc. AAAI Workshop on Spatial Reasoning and Multisensor Integ., American Association for Artificial Intelligence, 1987, 128-137.
[44] N. Yokoya, M. Kaneta and K. Yamamoto, Recovery of superquadric primitives from a range image using simulated annealing, in Proc. Int. Joint Conf. Pattern Recogn. 1 (1992) 168-172.
[45] N. S. Raja and A. K. Jain, Recognizing geons from superquadrics fitted to range data, Image and Vision Computing 10, 3 (1992) 179-190.
[46] S. Kirkpatrick, C. D. Gelatt Jr. and M. P. Vecchi, Optimization by simulated annealing, Science 220, 4598 (1983) 671-680.
[47] V. Guillemin and A. Pollack, Differential Topology (Prentice-Hall, Englewood Cliffs, NJ, 1974).
[48] F. J. Bueche, Introduction to Physics for Scientists and Engineers, 3rd edn. (McGraw-Hill Book Company, New York, 1980).
[49] J. D. Jackson, Classical Electrodynamics (Wiley, New York, 1975).
[50] R. A. Serway, Physics for Scientists and Engineers, 2nd edn. (Saunders College Publishing, 1986).
[51] R. G. Lerner and G. L. Trigg, Encyclopedia of Physics, 2nd edn. (VCH Publishers, Inc., New York, 1991).
424
K . W u €9 M. D. Levine
[52] P. P. Silvester and R. L. Ferrari, Finite Elements for Electrical Engineering, 2nd edn. (Cambridge University Press, Cambridge, 1990). [53] D. Wilton, S. M. Rao, A. W. Glisson et al., Potential integrals for uniform and linear source distributions on polygonal and polyhedral domains, IEEE Trans. Antennas and Propag. AP-32, 3 (1984) 276-281. [54] R. Barrett, M. Berry, T. F. Chan et al., Templates f o r the Solution of Linear Systems: Building Blocks for Iterative Methook (SIAM, Philadelphia, 1994). (551 K. Wu, Computing Parametric Geon Descriptions of 3-0 Multi-Part Objects, PhD thesis, McGill University, Montreal, Canada, Apr. 1996. [56] D. DeCarlo and D. Metaxas, Adaptive shape evolution using blending, in IEEE Proc. Int. Conf. Comp. Vision (1995) 834-839. [57] L. R. Rogers, Sculpture (Oxford University Press, 1969). [58] B. Putnam, The Sculptor’s Way (Farrar & Rinehart, Inc., 1939). [59] W. Zorach, Zorach Explains Sculpture: What It Means and How It Is Made (Tudor Publishing Company, New York, 1960). [SO] G. Hatfield and W. Epstein, The status of the minimum principle in the theoretical analysis of visual perception, Psychol. Bull. 97, 2 (1985) 155-186. [61] S. J. Dickinson, A. P. Pentland and A. Rosenfeld, A representation for qualitative 3-D object recognition integrating object-centered and viewer-centered models, in K. N. Leibovic (ed.), Vision: A Convergence of Disciplines (Springer-Verlag, New York, 1990). [62] A. D. Gross and T. E. Boult, Error of fit measures for recovering parametric solids, in Proc. 2nd Int. Conf. Comp. Vision, Tampa, Florida, 1988, 69M94. [63] P. Whaite and F. P. Ferrie, From uncertainty to visual exploration, IEEE Trans. Pattern Anal. Mach. Intell. 13, 10 (1991) 1038-1049. [64] K. Wu and M. D. Levine, 3-D object representation using parametric geons, Technical Report TR-CIM-93-13, Centre for Intelligent Machines, McGill University, Montreal, Quebec, Canada, Sept. 1993. [65] L. Ingber, Very fast simulated re-annealing, Math. Comp. Model. 1 2 , 8 (1989) 967-973. [66] F. R. Hammpel, E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions (Wiley, New York, 1986). [67] H. Freeman, Shape descriptions via the use of critical points, Pattern Recogn. 10, 3 (1978) 159-166. [68] W. R. Smythe, Static and Dynamic Electricity, 3rd edn. (McGraw-Hill, Inc., New York, 1968) p. 124. (691 P. J. Besl and R. C. Jain, Intrinsic and extrinsic surface characteristics, in Proc. IEEE Conf. Comp. Vision and Pattern Recogn., San Francisco, CA, Jun. 1985, 226-233. [70] H. Asada and M. Brady, The curvature primal sketch, IEEE Trans. Pattern Anal. Mach. Intell. 8, 1 (1986) 2-14.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 425-451 Eds. C. H. Chen, L. F. Pau and P. S. P. Wang © 1998 World Scientific Publishing Company
CHAPTER 2.7
3-D VISION OF DYNAMIC OBJECTS
SHIN-YEE LU
Lawrence Livermore National Laboratory, P.O. Box 808, L-156, Livermore, CA 94550, USA
and
CHANDRA KAMBHAMETTU
Department of Computer and Information Science, University of Delaware, Newark, DE 19716, USA

Because the real world is three-dimensional and dynamic, imaging technology is progressing from photos to video recorders to 3-D scanners. The next step in imaging technology is the ability to capture, represent, and analyze dynamic 3-D surfaces. High-quality, high-resolution 3-D surface capture, in both the spatial and temporal domains, is approaching commercial reality. Time sequences of high-density 3-D surface data will become increasingly available for analyzing the deformation and movement of dynamic objects. We will refer to such data sets as 3&1/2-D data. This chapter covers 3-D motion data capturing techniques, surface representations, and tracking algorithms, as well as an example of 3-D temporal data - facial expression and lip movement - and a brief discussion of applications.

Keywords: Computer vision, 3-D surface, motion analysis, motion tracking, deformation model.
1. Introduction

Current commercial 3-D motion capturing systems use an optic marker-based technology. To capture human body motion, IR reflective markers are placed at strategic locations on the object and then recorded by CCD cameras. 3-D marker locations are triangulated using images from two different perspectives. From the marker locations and knowledge of human anatomy, a stick figure approximating the body movements is reconstructed by the computer. This approach is accurate but has low spatial resolution. The main users of such systems are hospital and university motion analysis laboratories [1], and race horse and sports training centers. More recently, this type of motion capturing has been used in movie and video animation for live character generation, or "inverse kinematics", putting real, captured object motion into the likes of animated dancing cars. In a different but related arena, 3-D static surface measurement and imaging systems have been developed for industrial manufacturing applications. However, current commercial 3-D surface imaging systems usually require scanning or repeated imaging, and cannot be extended to motion capturing.
Systems that generate high-quality, high-resolution time sequences of surface data of dynamic objects are becoming available. Recently, the system developed at Lawrence Livermore National Laboratory, a 3-D motion camera system [2], has been applied to an animated TV commercial in which the live facial expression of an actor is captured and pasted into a virtual environment. This type of technology will allow true 3-D images of electronic figures digitized from real actors to be manipulated using computer graphic techniques. The result is a realistic, controlled animation of a fantasy world. The 3-D surface motion data capturing approach will impact the medical and sports use of motion analysis. The dense surface coverage provides not only better visualization, but also surface- and volume-related measurements. This approach can potentially be used to analyze skeletal-muscular relationships during movement. However, many difficulties still lie ahead before the 3&1/2-D surface motion data can benefit applications other than animation and reverse engineering. The main difficulties in developing application software are in associating, tracking, and labeling the vast number of data points. These are problems similar to machine vision of 2-D, 2&1/2-D and 3-D data that have been studied by researchers for the last three decades. The purpose of this chapter is to give an overview of 3-D surface motion capturing technology and analysis tools that have been developed in the field of machine vision. Since the availability of 3&1/2-D data is so far limited, the analysis tools mentioned in this chapter are not fully developed to handle complex time sequences of 3-D surfaces. The authors hope that this chapter can serve as a reference for further research in this field.

2. 3-D Motion Capturing Systems

This section describes systems that can capture the intricacy of movement of a dynamic object, such as a human body. There are two general approaches to 3-D surface motion capturing: marker-based and surface-based. A marker-based system attaches external markers on a subject and captures the positions of individual markers. Current systems are capable of handling 30-300 markers. Surface-based systems generate a complete surface representation without using markers. Figure 1 shows a classification of current 3-D motion capturing systems. Marker-based systems capture trajectories of a sparse set of key locations on the surface of the object. These trajectories are used to analyze movement for medical applications or to morph 3-D models for animation. The advantages of marker-based systems are their simplicity in capturing and in developing application software. The drawbacks are that the method is intrusive, results in unrealistic animation, and fails to capture volume-related measurements. Markers attached to the skin are considered inappropriate for accurate medical measurements. The motion of the skin relative to the bones is likely to affect the intended measurement of relative motion between articulate surfaces of a joint [3].
Fig. 1. A classification of 3-D motion capturing systems: non-optics systems and optics systems, the latter subdivided into stereo, structured light, scanning laser, and hybrid approaches.
Surface-based systems attempt to capture the complete 3-D surface in motion. For example, dense range maps of a human facial expression sequence can be obtained at 2-mm spatial resolution and 30-millisecond temporal resolution (see Fig. 3). The detailed surface features provide the possibility of using anatomic features for movement analysis, thus overcoming problems caused by marker use.
2.1. Marker-Based Systems

3-D motion sensors are position sensors that can update the position of a point or a group of points at a specified sampling rate. Early technology attached position sensors to the moving object and transmitted signals electronically using cables. The use of light reflective markers and CCD cameras to capture marker positions provides a wireless system that is easier to use. This new approach to motion capturing has had a great impact on movie and video game animation [4]. This type of system is also widely used by motion analysis laboratories in the medical field to evaluate skeletal-muscular diseases for physical therapy and surgical planning [5]. The passive optic marker-based approach places multiple CCD cameras (typically 5 to 6) around the subject. Infrared LED illuminators are mounted in a ring around the lens of each camera. Light reflective markers are attached to the subject at key locations. The cameras are fitted with filters that enhance signals in the IR regime so that they capture reflective surfaces only. Marker positions in each image are the centroids of their respective blobs in the image. The calculation of 3-D positions of these markers is based on stereo vision. Marker correspondence in two images obtained from adjacent cameras is established first. 3-D positions are then calculated based on the principle of triangulation. These markers are specially designed to minimize the centroiding error. Markers are usually made in a dome shape to appear in an image as close to circles as possible
for the following reasons. First, statistical accuracy of centroiding can be achieved from any direction. Second, centroids of a marker from two different images can correspond to the same spot on the marker. A number of marker-based motion measurement systems are described in [1]. The optic marker-based systems have two main drawbacks: marker drop-out and marker confusion. The triangulation method requires a marker to be in the field of view of two cameras simultaneously to calculate its position. Occlusion causes marker drop-out and results in discontinuity along the trajectory of some markers. Since all markers are identical, marker confusion becomes a difficult problem to solve. Association methods based on optimization, as well as tracking methods based on estimation theory and kinematics models, are often used to generate the correct trajectory of individual markers. Position sensing based on RF technology has the promise to become the next wireless motion capturing technology. RF-based systems are less affected by occlusion. They can also avoid the problem of marker confusion by individually tagging each marker with a different frequency signature [6].
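The triangulation step mentioned above can be sketched as a small linear least-squares problem. The following Python fragment is an illustrative sketch, not the system's own implementation: given the 3x4 projection matrices of two calibrated cameras and the marker centroid in each image, it recovers the 3-D marker position with the standard DLT construction. The camera matrices, baseline, and marker position used in the demo are made-up values.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point seen by two calibrated cameras.

    P1, P2 : 3x4 camera projection matrices.
    x1, x2 : (u, v) pixel coordinates of the marker centroid in each image.
    Returns the 3-D point that best satisfies x ~ P X in a least-squares sense.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],   # u1 * p3 - p1 = 0
        x1[1] * P1[2] - P1[1],   # v1 * p3 - p2 = 0
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Homogeneous solution = right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

if __name__ == "__main__":
    # Two hypothetical cameras: identical intrinsics, 150 mm baseline along x.
    K = np.array([[1200.0, 0, 320], [0, 1200.0, 240], [0, 0, 1]])
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([np.eye(3), np.array([[-150.0], [0.0], [0.0]])])
    X_true = np.array([40.0, -25.0, 1000.0])          # marker 1 m away (mm)
    project = lambda P, X: (P @ np.append(X, 1.0))[:2] / (P @ np.append(X, 1.0))[2]
    print(triangulate(P1, P2, project(P1, X_true), project(P2, X_true)))
```

In practice the same construction is applied to every matched marker centroid pair, frame by frame.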
2.2. Surface-Based Systems

Systems to capture the continuous and complete 3-D surface of objects are becoming available [2]. A survey of range data acquisition methods can be found in [7,8]. However, not all of these methods are designed for real-time acquisition and motion capturing. To capture true motion without motion blur, camera-based methods are preferred, such as stereo vision or structured light systems that do not rely on multiple imaging or scanning motion of the sensor.

2.2.1. Stereo vision

Tutorials on the principles of stereo vision can be found in the chapter on "Projective Geometry and Computer Vision" in this handbook [9] and in [10]. A stereo vision system consists of three major software components: camera calibration, image correspondence, and 3-D reconstruction. Camera calibration establishes the perspective transformation between 2-D image planes and the 3-D world coordinate frame. This transformation is used for the triangulation of matching pixels to derive the corresponding 3-D positions. Calibration of lens distortion should also be included in a stereo calibration procedure [11]. Stereo correspondence is an ill-posed inverse problem. The epi-polar geometry is a rigid constraint that uniquely determines the matching relationship among epi-polar lines (defined by the optical line-of-sight) of two images. The left-right ordering relation constrains the correspondence along the orthogonal direction of epi-polar lines. The left-right ordering constraint is an insufficient constraint, which means that the relationship is not sufficient to match pixels from left to right along epi-polar lines. Also, this relationship does not always hold for a scene that contains isolated objects. However, for a continuous surface when the camera baseline is
small compared with the depth of the volume, the ordering constraint is generally true. A dynamic programming technique is often used to exploit this ordering constraint [12-15]. Image cues such as intensity and intensity gradient are used to complement the above two geometric constraints in many stereo reconstruction algorithms. Image features, such as edges, are extracted from individual images. An image is then represented by a structure of the features, where strong features (high contrast) are matched first, and weaker features embedded between stronger features are matched afterwards [16-18]. Generally, stereo correspondence suffers from high computational complexity and low reliability. The camera resolution also limits the accuracy of triangulation. Figure 2 shows the result of modeling depth resolution under different set-ups when stereo images are matched at pixel resolution. For example, if cameras with 12-mm focal length lenses are separated at a 15-cm baseline distance, the depth resolution is 7 mm when the object is at one meter distance and 27 mm at two meters. This analysis suggests that a video-based 3-D motion camera system needs to be able to resolve depth at sub-pixel accuracy.

Fig. 2. Stereo depth resolution at 1 m and 2 m; top: focal length = 12 mm vs. baseline; bottom: baseline = 100 mm vs. focal length.

2.2.2. Structured light systems

Structured light methods that project textured patterns such as lines, dots, or grids to generate 3-D positions can also be extended for motion capturing. The light source and the camera form the base for triangulation. Structured light methods calibrate the light pattern with the camera image plane, similar to dual camera calibration in stereo vision. The advantage of structured light systems is that the light pattern creates an evenly textured image, and thus an even coverage for reconstruction. Also, intensity changes created by an active light source can be modeled and detected at sub-pixel accuracy.
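Both stereo and structured-light systems ultimately rely on triangulation, so the depth-resolution numbers quoted for the stereo example above follow directly from the imaging geometry. The small sketch below reproduces them under the assumption of a roughly 13 µm pixel pitch, which is an assumed sensor value rather than one given in the chapter.

```python
def depth_resolution(Z, focal_length, baseline, pixel_pitch):
    """Depth error caused by a one-pixel disparity error in a parallel
    triangulation rig: disparity d = f*b/Z, so |dZ| ~ Z**2 * p / (f * b)."""
    return Z ** 2 * pixel_pitch / (focal_length * baseline)

# f = 12 mm and baseline = 150 mm as in the stereo example above; the
# 0.013 mm pixel pitch is an assumption, not a value from the text.
for Z_mm in (1000.0, 2000.0):
    print(Z_mm, depth_resolution(Z_mm, focal_length=12.0, baseline=150.0, pixel_pitch=0.013))
# Roughly 7 mm at 1 m and 29 mm at 2 m, in line with the figures quoted above.
```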
The main difficulty lies in the so-called labeling problem, i.e. to correlate features in an image with the original light pattern. Occlusion and depth changes of the object induce distortion of the line pattern [19-21]. "Jump order" occurs when two different features, such as two separate lines, are seen as the same feature in the image, and results in erroneous depth estimation. Other sources of error, including system calibration, image processing, and errors introduced by the difference in light reflectance due to orientation and curvature of the object, are discussed in [22].

2.2.3. Hybrid systems
A combined stereo/structured light system can reduce the problems associated with either method [23,24]. Strobe light can be used to further reduce motion blur and discomfort for live subjects. Systems that use trinocular or larger camera configurations have also been proposed to increase the accuracy of image correspondence [25]. Near-frame-rate scanning systems can be used for motion capturing, although surface distortion can be expected due to the delay caused by scanning [26-28].

2.2.4. A facial expression example
Figure 3 shows an example of motion data from a facial expression sequence. The data is captured using a 3-D motion camera system developed at Lawrence Livermore National Laboratory. The system is based on a hybrid stereo/structured-light technology. Data is captured at the video rate of 30 frames per second and shown here at every fifth frame (150 milliseconds).
Fig. 3. Motion data of a facial sequence; top: surface normal display at 150-msec., bottom: 3-D rendered image of one frame displayed at different perspectives.
2.3. Data Representation
The direct output of 3-D surfaces from a data acquisition system is a so-called "range map". A range map is a two-dimensional array of values f(u, v), indexed by grid coordinates (u, v), where the value at an array point, f(u, v), is the measured distance of the point to a zero-plane defined by the camera system during calibration. The corresponding surface with respect to a known coordinate system can be described by the explicit parametric form

    s(u, v) = [u  v  f(u, v)]^T .

Polygon mesh and parametric bicubic surface representations commonly used in the field of computer graphics [29] can be generated from the above explicit form.
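As a concrete illustration of that conversion (a sketch, not the system's actual code), a range map f(u, v) can be turned into a polygon mesh by emitting one vertex per grid point and two triangles per grid cell:

```python
import numpy as np

def range_map_to_mesh(f):
    """Convert a range map f[u, v] (2-D array of depths) into a triangle mesh.

    Each grid point (u, v, f[u, v]) becomes a vertex; each grid cell is split
    into two triangles.  Returns (vertices, triangles) as coordinate and index arrays.
    """
    rows, cols = f.shape
    u, v = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    vertices = np.stack([u.ravel(), v.ravel(), f.ravel()], axis=1)

    triangles = []
    for i in range(rows - 1):
        for j in range(cols - 1):
            a = i * cols + j              # linear index of vertex (i, j)
            b, c, d = a + 1, a + cols, a + cols + 1
            triangles.append((a, b, c))   # upper-left triangle of the cell
            triangles.append((b, d, c))   # lower-right triangle of the cell
    return vertices, np.array(triangles)

# Tiny synthetic example: a 3x3 range map gives 9 vertices and 8 triangles.
verts, tris = range_map_to_mesh(np.array([[1.0, 1.1, 1.2],
                                          [1.0, 1.2, 1.3],
                                          [1.1, 1.2, 1.4]]))
print(len(verts), "vertices,", len(tris), "triangles")
```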
3. Motion Tracking

The 3&1/2-D motion data captures surface shape, orientation, and deformation in detail, and conveys a great visualization effect. However, the massive amount of data and the lack of defined features for registration and tracking present a challenge in making use of such data for measurement and analysis. In this section, we will review 3-D computer vision methodologies that may be applicable to motion tracking of 3&1/2-D data sets. The techniques mentioned here are surface geometry-based methods. Intensity-based motion tracking methods, such as photometric analysis and optic flow, are generally used to recover 3-D shape and motion directly from 2-D uncalibrated intensity images [30-32]. Those methods will not be covered here. Dynamic objects such as human bodies are usually too complex to conform to one type of model. The economy of computation generally favors methods that decompose a complex object into simpler components and model individual components independently. Therefore, in this section we will walk through a collection of methods that can be applied to the division of a dynamic 3-D object into piece-wise quasi-rigid or elastic motion segments, and the tracking of these motion segments. We will describe general methods for surface feature extraction, description, segmentation, and tracking, and 3-D surface modeling.

3.1. 3-D Surface Feature Extraction

Coordinate-invariant surface characterization based on differential geometry has been applied for the purpose of 3-D object recognition [34,35]. Objects are segmented into regions of uniform geometric properties and matched with known models. Tracking of quasi-rigid motion can be treated in a similar way. In [34], two surface measurements, the Gaussian surface, K, and the mean surface, H, are defined so that they are invariant to translational and rotational transformations. The Gaussian surface measures the sharpness of a feature, and the sign of the mean surface indicates the direction (inward or outward) of a feature. Both
Gaussian and mean surfaces are defined by the partial derivatives and surface normal of the original surface data. When these measurements are applied to range images, we can assume the surface coordinates (u, v) to be the underlying grid of a range image. Then, the surface is represented by

    s(u, v) = [u  v  f(u, v)]^T ,    (3.1)

where f(u, v) is the depth at the grid point. The partial derivatives and the surface normal become

    s_u = [1  0  f_u]^T ,  s_v = [0  1  f_v]^T ,  s_uu = [0  0  f_uu]^T ,  s_vv = [0  0  f_vv]^T ,  s_uv = [0  0  f_uv]^T ,
    n = [−f_u  −f_v  1]^T / sqrt(1 + f_u^2 + f_v^2) ,    (3.2)

where f_u, f_v, f_uu, f_vv, f_uv are the first and second derivatives. The Gaussian curvature function, K, and the mean curvature function, H, can be defined using these derivatives:

    K = (f_uu f_vv − f_uv^2) / (1 + f_u^2 + f_v^2)^2 ,    (3.3)

    H = [(1 + f_v^2) f_uu − 2 f_u f_v f_uv + (1 + f_u^2) f_vv] / [2 (1 + f_u^2 + f_v^2)^{3/2}] .    (3.4)
Although in theory the Gaussian and Mean curvatures are invariant to surface orientation, calculations based on partial derivatives are, in fact, very sensitive to surface orientation. The results of K and H curvatures derived from local derivatives may not be accurate enough for the purpose of motion tracking. An easy way to reduce this sensitivity is to redefine the underlying grid locally, such that the grid plane is orthogonal to the surface normal of a small patch. This procedure is described in Fig. 4. Gaussian and Mean curvatures provide a basis for segmenting a 3-D surface into regions of geometric features. Table 1 is a guideline for grouping points into surface patches based on curvature signs, where “e” is a small number. Table 1. Grouping of surface types based on signs of curvature measurements.
                      H < -e           -e <= H <= e        H > e
    K > e             peak             (none)              pit
    -e <= K <= e      ridge            flat                valley
    K < -e            saddle ridge     minimal surface     saddle valley
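Table 1 translates directly into a small lookup. The following Python sketch (illustrative only) labels a point from the signs of its K and H values according to the table:

```python
def surface_type(K, H, e=1e-3):
    """Classify a surface point from its Gaussian (K) and mean (H) curvature,
    following the sign combinations of Table 1; `e` is the small threshold."""
    labels = {                        # keyed by (sign of K, sign of H)
        (+1, -1): "peak",         (+1, 0): "(none)",           (+1, +1): "pit",
        ( 0, -1): "ridge",        ( 0, 0): "flat",             ( 0, +1): "valley",
        (-1, -1): "saddle ridge", (-1, 0): "minimal surface",  (-1, +1): "saddle valley",
    }
    sign = lambda x: 0 if abs(x) <= e else (1 if x > e else -1)
    return labels[(sign(K), sign(H))]

print(surface_type(0.02, -0.05))    # -> "peak"
print(surface_type(0.0, 0.0))       # -> "flat"
print(surface_type(-0.04, 0.07))    # -> "saddle valley"
```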
Procedure 1. Calculation of K and H surfaces of a range image.

(1) Let C = {(u+i, v+j, f(u+i, v+j)) | i = −n, ..., 0, ..., n, and j = −n, ..., 0, ..., n} be a surface patch centered at (u, v).
(2) Let (u, v, f(u, v)) be the new origin.
(3) Calculate the row and column derivatives, f_u and f_v, at the new origin by applying the Sobel row kernel

        [ −1  0  1 ]
        [ −2  0  2 ]
        [ −1  0  1 ]

    and column kernel

        [ −1  −2  −1 ]
        [  0   0   0 ]
        [  1   2   1 ]

    to the origin.
(4) Calculate two angles, θ = tan^{-1}(f_v / f_u) and φ = tan^{-1}(sqrt(f_u^2 + f_v^2)).
(5) Define a new reference frame such that the surface normal is directed along the new z-axis by a rotational transformation that rotates around the z-axis by the angle θ, followed by a rotation around the y-axis by φ:

        A = [ cos φ cos θ    −cos φ sin θ    sin φ ]
            [ sin θ           cos θ          0     ]
            [ −sin φ cos θ    sin φ sin θ    cos φ ]

(6) Apply A to C and obtain a new representation C', for which the surface normal is along the z-axis of the new coordinate.
(7) Points on C' are now on an irregular grid. Index points on C' the same way they are indexed on C, i.e. a point s'(i, j) in C' is the same point s(i, j) in C. Convert C' to a regular grid by an interpolation method such as a weighted Gaussian filter.
(8) Calculate the new derivatives f'_u, f'_v using the Sobel kernels on C', and K and H at the point (u, v) using (3.3) and (3.4).

Fig. 4. Calculation of K and H surfaces of a range image by a localized coordinate transformation.
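A direct computation of the K and H maps of a range image from the Monge-patch formulas (3.3) and (3.4) can be sketched as follows. This is a simplified illustration: it uses central differences rather than Sobel kernels and omits the local re-orientation steps (4)-(7) of Procedure 1.

```python
import numpy as np

def curvature_maps(f, spacing=1.0):
    """Gaussian (K) and mean (H) curvature maps of a range image f[u, v],
    using central-difference derivatives and the Monge-patch formulas (3.3)
    and (3.4).  The local re-orientation of Procedure 1 is omitted here."""
    fu, fv = np.gradient(f, spacing)
    fuu, fuv = np.gradient(fu, spacing)
    _, fvv = np.gradient(fv, spacing)
    g = 1.0 + fu ** 2 + fv ** 2
    K = (fuu * fvv - fuv ** 2) / g ** 2
    H = ((1 + fv ** 2) * fuu - 2 * fu * fv * fuv + (1 + fu ** 2) * fvv) / (2 * g ** 1.5)
    return K, H

# Synthetic check: the apex of a hemisphere of radius 10 should give
# K close to 1/100 and |H| close to 1/10.
u, v = np.meshgrid(np.linspace(-5, 5, 101), np.linspace(-5, 5, 101), indexing="ij")
K, H = curvature_maps(np.sqrt(100.0 - u ** 2 - v ** 2), spacing=0.1)
print(K[50, 50], H[50, 50])
```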
3.2. Surface Feature Measurements
Shape measurements of surface patches and distance measurements between surface patches are used to establish the association of 3-D features of an object between time frames. Let S = {p_i | i = 1, ..., n} be a surface, where p_i is a point on the surface and n_i is a unit vector representing the surface normal at p_i.

Centroid:

    c = (1/n) Σ_{i=1}^{n} p_i .    (3.5)

Mean radius:

    r = (1/n) Σ_{i=1}^{n} ||p_i − c|| .    (3.6)

Roundness:

    (3.7)

Average orientation:

    (1/n) Σ_{i=1}^{n} n_i .    (3.8)

Smoothness:

    (3.9)
The distance between two surface patches can be measured as a weighted sum of the above shape feature measurements, or by overlapping two surface patches and computing the Euclidean distance between them. The centroids and average orientations define the rotation and translation necessary to align two surface patches. Let S1 and S2 be two surfaces that are rotated and translated such that the centroids of both surfaces are at the origin and the surface normals are along the z-axis. Then the two surfaces can be represented by S1 = {f1(u, v) | (u, v) ∈ R1} and S2 = {f2(u, v) | (u, v) ∈ R2}, where R1 and R2 are regions on the u–v plane. The distance between the two surfaces is

    d = Σ_{(u,v) ∈ R1 ∩ R2} |f1(u, v) − f2(u, v)| .    (3.10)
Methods that match range data points to parametric surface representations are also useful shape metrics [36-391.
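The patch measurements and the overlap distance can be sketched in a few lines of Python. This is only an illustration of (3.5), (3.6), (3.8) and (3.10); the roundness and smoothness measures of (3.7) and (3.9) are not reproduced here.

```python
import numpy as np

def patch_features(points, normals):
    """Shape measurements of a surface patch in the spirit of (3.5)-(3.9):
    centroid, mean radius about the centroid, and average orientation.
    (The roundness and smoothness measures are not reproduced here.)"""
    c = points.mean(axis=0)                                   # centroid, (3.5)
    mean_radius = np.linalg.norm(points - c, axis=1).mean()   # mean radius, (3.6)
    avg_orientation = normals.mean(axis=0)
    avg_orientation /= np.linalg.norm(avg_orientation)        # average orientation, (3.8)
    return c, mean_radius, avg_orientation

def patch_distance(f1, f2, mask1, mask2):
    """Distance (3.10) between two aligned patches given as range maps f1, f2
    over a common (u, v) grid; mask1 and mask2 mark the regions R1 and R2."""
    overlap = mask1 & mask2
    return np.abs(f1[overlap] - f2[overlap]).sum()

# Tiny example: a flat square patch with upward-pointing normals.
pts = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]])
nrm = np.tile([0.0, 0.0, 1.0], (4, 1))
print(patch_features(pts, nrm))
```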
3.3. Trajectory Estimation
A curve fitting approach to trajectory estimation of a single target point (or token) is described here. We assume that the target point is traveling under constant acceleration; the position at time δt is then

    p_τ(δt) = (1/2) a_τ δt^2 + b_τ δt + c_τ ,    (3.11)

where A_τ = [a_τ  b_τ  c_τ]^T, τ = x, y, z, are the vectors of acceleration, velocity, and position at an initial time for each axis, respectively. To estimate the position at time t+1 requires the positions for the last three time frames (each position p written as a row vector (x, y, z)):

    [ p_{t−2} ]   [ 0         0     1 ]
    [ p_{t−1} ] = [ δt^2/2    δt    1 ] [ A_x  A_y  A_z ] .    (3.12)
    [ p_t     ]   [ 2δt^2     2δt   1 ]

We obtain

    [ A_x  A_y  A_z ] = [ 0         0     1 ]^{-1} [ p_{t−2} ]
                        [ δt^2/2    δt    1 ]      [ p_{t−1} ] .    (3.13)
                        [ 2δt^2     2δt   1 ]      [ p_t     ]

Then the new location can be estimated by letting δt = 1,

    p̂_{t+1} = 3 p_t − 3 p_{t−1} + p_{t−2} .    (3.14)

The same constant acceleration model can be extended to average over more than three points. For example, the prediction from four given past points is

    p̂_{t+1} = (3/4) p_{t−3} − (5/4) p_{t−2} − (3/4) p_{t−1} + (9/4) p_t .    (3.15)

A similar approach can also be extended to a constant jerk model given at least four past locations; with exactly four points,

    p̂_{t+1} = 4 p_t − 6 p_{t−1} + 4 p_{t−2} − p_{t−3} .    (3.16)
The multiple token tracking problem can be formulated as an optimization problem. There are three types of constraints that are generally true: individual tokens follow a smooth trajectory; the spatial relationship between tokens remains stable; and matching tokens are similar in shape. These constraints can be violated when dealing with nonrigid dynamic objects. However, they are assumed to be true when the sampling interval is much smaller than the velocity of movement. The basic token association problem is defined as follows. Assume that there are two sets of observable tokens Tt 3 {a(i)li = 1 , . . . , N t } and Tt+l = { b ( j ) l j = 1 , . . . ,N t + l } taken at two consecutive time instances t and t 1. The problem is to define an Nt by Nt+l association matrix A that minimizes association errors, where {w(i,j)li = 1 , . . . ,N , , j = 1 , . . . , N t + l } are elements in A, and
+
w(2,j)
=
1, if a ( i ) associate with b ( j ) 0, otherwise.
(3.17)
A legitimate association matrix satisfies the condition that each element in
rt
has at most one match in -rt+l, and vice versa: Nt
Nt+l
c v ( i , j )5 1,Vj = 1,,.., Nt+l, and i= 1
w(i,j)
5 1,Vi = 1,..., N t .
(3.18)
j= 1
The association errors can be defined as a weighted sum of shape similarity and the distance between actual positions and positions on the estimated trajectory. Association methods developed for 3-D object recognition can be extended to motion tracking [40-42]. We will describe the design of a Boltzmann machine for computing association probabilities. We adopt the formulation of a Boltzmann machine for
data association described in [43] here. A reference to Boltzmann machines can be found in [44]. A Boltzmann machine is a network with N_t by N_{t+1} binary elements, and an associated energy function:
    (3.19)

where ρ(i, j) is the cost of associating a(i) with b(j), and T_a is the annealing temperature. The first term is the total cost of association. The second and third terms provide inhibition to the neurons so that penalties are given to invalid association matrices. The Boltzmann machine attempts to find a global minimum of the energy function by performing simulated annealing to avoid convergence to local minima. On each iteration, the network picks a neuron v(i, j) at random and attempts to change its state. If the resulting energy decreases, the network will accept this transition. However, if the change in state causes an increase in energy, the network will sometimes accept the transition. The probability that a positive transition will be accepted is governed by the global temperature parameter, T_a. At each temperature in the annealing schedule, the neurons, v(i, j), are activated according to the following rule:
    v(i, j) ← { 1 − v(i, j),  with probability G_{i,j} P_{i,j} ,
              { v(i, j),      with probability 1 − G_{i,j} P_{i,j} ,    (3.20)
where G_{i,j} is the probability of selecting neuron v(i, j) for possible activation, and P_{i,j} is the probability of an activating or deactivating transition of the neuron. In [43], these two probabilities are defined as

    (3.21)

and

    P_{i,j} = 1 / (1 + e^{ΔE_{i,j}/T_a}) ,    (3.22)
AEi,.j(A,Ta) (1 - 2 ~ ( i , j ) )
(3.22)
2.7 3-0 Vision of Dynamic Objects 437
Procedure 2. Token association using a Boltzmann machine.

(1) Define two token sets, τ_t and τ_{t+1}, and the pair-wise association cost ρ(i, j).
(2) Initialize the association matrix A, such that v(i, j) = 1 if i = j, and 0 otherwise.
(3) Define constants A and B. Calculate the total energy using (3.19).
(4) Initialize the temperature T_a.
(5) For i = 1 to N_t, for j = 1 to N_{t+1}:
    (5.1) compute ΔE_{i,j}(A, T_a) using (3.23);
    (5.2) if ΔE_{i,j} < 0, then
        (5.2.1) v(i, j) = 1 − v(i, j) and E = E + ΔE_{i,j};
    else
        (5.2.2) compute P_{i,j} using (3.22),
        (5.2.3) generate a random number ξ between (0, 1),
        (5.2.4) if P_{i,j} ≥ ξ, update the transition using step (5.2.1).
(6) Decrease the temperature T_a.
(7) If T_a is greater than a preset threshold, repeat step (5); otherwise, stop.

Fig. 5. Procedure for token association using a Boltzmann machine.
The acceptance of a positive transition (increase in energy) is achieved by comparing the probability of transition, P_{i,j}, with a random number, ξ. If P_{i,j} ≥ ξ, the transition is accepted; otherwise it is rejected. The annealing schedule generally assumes an exponentially decaying model that starts with a high temperature, such that neurons are more likely to escape their current state, and gradually cools down and settles into a configuration that is close to the global minimum. A step-by-step procedure is listed in Fig. 5.
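The following Python sketch mirrors the structure of Procedure 2. It is not the published formulation: the energy it minimizes is an assumed quadratic form (association cost plus penalties A and B when a row or column does not contain exactly one active neuron), standing in for (3.19), and a uniform neuron selection is used in place of (3.21).

```python
import numpy as np

def boltzmann_association(cost, A=2.0, B=2.0, T0=5.0, cooling=0.95, T_min=0.05, seed=0):
    """Token association in the spirit of Procedure 2 (Fig. 5).

    cost[i, j] plays the role of the pair-wise association cost rho(i, j).
    The energy here is an assumed quadratic stand-in for (3.19): total
    association cost plus penalties A and B for rows/columns whose number of
    active neurons differs from one.
    """
    rng = np.random.default_rng(seed)
    n, m = cost.shape
    V = np.eye(n, m)                                  # step (2): v(i, j) = 1 iff i == j

    def energy(V):
        assoc = (cost * V).sum()                      # total cost of association
        row_pen = A * ((V.sum(axis=1) - 1.0) ** 2).sum()
        col_pen = B * ((V.sum(axis=0) - 1.0) ** 2).sum()
        return assoc + row_pen + col_pen

    T = T0
    while T > T_min:                                  # annealing schedule, steps (6)-(7)
        for i in range(n):
            for j in range(m):
                V_new = V.copy()
                V_new[i, j] = 1.0 - V_new[i, j]       # candidate state flip
                dE = energy(V_new) - energy(V)        # brute-force stand-in for (3.23)
                # Downhill moves are always taken; uphill moves with the
                # logistic probability of (3.22).
                p = 1.0 / (1.0 + np.exp(np.clip(dE / T, -50.0, 50.0)))
                if dE < 0 or p >= rng.random():
                    V = V_new
        T *= cooling
    return V

# Three tokens per frame; the cheap pairing is 0->0, 1->2, 2->1.
cost = np.array([[0.1, 5.0, 5.0],
                 [5.0, 5.0, 0.2],
                 [5.0, 0.3, 5.0]])
print(boltzmann_association(cost))
```

With these made-up costs the annealing is expected to settle on the low-cost pairing, illustrating how the penalties discourage unmatched or doubly matched tokens.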
4. Surface Nonrigid Motion Analysis The general problem of nonrigid motion analysis can be classified into representation and analysis of nonrigid shape and motion. In [33] nonrigid motion is further classified into articulated, quasi-rigid, isometric, homothetic, conformal, and elastic motion, with increasing computational complexity. Rigid motion preserves the 3-D distances between any two points of an object, where the transformations only include rotation and translation. The object does not stretch or bend, hence some of the differential geometric invariant properties remain constant. For example, curvature, fundamental forms, etc. can be used to track the motion. Articulated motion is piece-wise rigid motion. This class of nonrigid motion of an object is closest to rigid motion behavior, where the object’s rigid parts are connected by nonrigid joints. Examples of this type of motion include animal (skeletal) body motion and robot manipulators. Quasi-rigid motion restricts the deformation to
a small amount. A general nonrigid motion is quasi-rigid when viewed in a sufficiently short time interval, e.g. between image frames when the sampling rate is high enough. Isometric motion is nonrigid motion which preserves length along the surface as well as angles between curves on the surface. Homothetic motion is a uniform expansion or contraction of a surface. Conformal motion is nonrigid motion which preserves angles between curves on the surface during motion, but not lengths. Elastic motion is a nonrigid motion whose only constraint is some degree of continuity or smoothness. There are many unified approaches that can combine various aspects of representation and analysis. Also, many approaches typically concentrate on global and local modeling of both shape and motion. Study of both real world examples and application domains indicates that a generalized motion model for nonrigid motion behavior is hard to estimate. One has to study the application domain carefully before designing a specific nonrigid motion model. We will first cover the previous work in motion modeling, then present our work in nonrigid motion analysis. Deformable models are used for "fitting the data" from two different representations: free-form and parametric [45]. Free-form deformable models have no global structure of the template, and are constrained by local continuity and smoothness constraints. Such models use salient features (lines, edges, etc.) and/or physical properties of the object such as the material's elasticity. Some examples of free-form deformable models are snakes [46], balloons [47], deformable templates [48], deformable super-quadrics [49], and modal models [50]. Parametric deformable models use certain shape information and use either a collection of parameterized curves or a parameterized prototype template to encode the object shape. Parametric deformable models include splines, super-quadrics, and hyper-quadrics, etc. [36]. The ultimate goal of these approaches is to use the estimated model for both reconstruction and analysis. The deformations of the estimated template or model should be able to describe the variety of object deformations. Such a model provides for recognition, analysis, and compression of the data. Parametric models are generally inadequate for analysis and representation of complex, dynamic, real world objects because they solve the fitting problem by functional minimization using partial shape knowledge. Ideally, they require a well-defined set of curves for a given shape. On the other hand, free-form deformable models assume local continuity, smoothness, and availability of elastic properties of the object at hand. In general, surface deformation is formulated as a dynamic problem of nodal displacements, as the result of external loads:
    M d^2U/dt^2 + C dU/dt + K U = R ,    (4.1)
where U is a displacement vector, M , C, and K are mass, damping, and material stiffness between each point within the body, respectively. R is the force acting on the node.
4.1. Parametric Deformable Models
Super-quadrics are an important class of modeling primitives that has recently received increased attention. They are a family of parametric shapes derived from the parametric forms of the quadric surfaces. The super-quadric ellipsoid in canonical position can be generated from [51]

    x(η, ω) = [ a1 cos^{ε1}(η) cos^{ε2}(ω),   a2 cos^{ε1}(η) sin^{ε2}(ω),   a3 sin^{ε1}(η) ]^T    (4.2)

for −π/2 ≤ η ≤ π/2, −π ≤ ω ≤ π, where 0 ≤ a1, a2, a3 ≤ 1 are aspect ratio parameters, and ε1, ε2 ≥ 0 are "squareness" parameters in the latitude and longitude planes. Equation (4.2) generates a surface that is super-ellipsoidal, having the implicit form

    ((x/a1)^{2/ε2} + (y/a2)^{2/ε2})^{ε2/ε1} + (z/a3)^{2/ε1} = 1 .    (4.3)
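Equation (4.2) can be sampled directly to generate a super-ellipsoid surface. The sketch below is illustrative only; the signed-power helper is an implementation choice to keep the surface well defined when the cosines and sines are negative.

```python
import numpy as np

def superellipsoid(a1, a2, a3, e1, e2, n_eta=33, n_omega=65):
    """Sample the super-quadric ellipsoid of Eq. (4.2) on a regular
    (eta, omega) grid, using the signed power f(x, p) = sign(x)*|x|**p."""
    f = lambda x, p: np.sign(x) * np.abs(x) ** p
    eta = np.linspace(-np.pi / 2, np.pi / 2, n_eta)      # latitude parameter
    omega = np.linspace(-np.pi, np.pi, n_omega)          # longitude parameter
    eta, omega = np.meshgrid(eta, omega, indexing="ij")
    x = a1 * f(np.cos(eta), e1) * f(np.cos(omega), e2)
    y = a2 * f(np.cos(eta), e1) * f(np.sin(omega), e2)
    z = a3 * f(np.sin(eta), e1)
    return x, y, z

# e1 = e2 = 1 gives an ordinary ellipsoid; values near 0.1 give box-like shapes.
x, y, z = superellipsoid(1.0, 0.6, 0.4, 1.0, 1.0)
# For e1 = e2 = 1 the implicit form (4.3) reduces to the ordinary ellipsoid
# equation, so every sampled point should satisfy it up to rounding:
lhs = ((x / 1.0) ** 2 + (y / 0.6) ** 2) ** 1.0 + (z / 0.4) ** 2
print(np.allclose(lhs, 1.0))
```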
Hyper-quadrics are generalizations of super-quadrics that allow smooth deformation from shapes with convex polyhedral bounds. They may be composed of any number of terms, and the modeled shapes are not necessarily symmetric [52]. Terzopoulos et al. developed hybrid models that can recover the geometric shape of the object at different time instants using both intrinsic and extrinsic constraints. In [49], deformable super-quadrics are presented as a unified motion and shape model. The parameters that can be controlled with this model are global motion, global deformation, and local deformations. These classified motions and deformations are combined into an overall set of parameters that is estimated simultaneously through manipulation of dynamic equations. Chen and Huang [53] developed a similar approach, however using decomposition-based motion models (global rigid motion, global deformation, local rigid motion, and local deformation) and a priori motion patterns. In [49], a parameterized model, such as a super-quadric ellipsoid, is used as an initial model, and surface measurements are matched to the initial model through a deformation process that contains four components: translation, rotation, global deformation, and local deformation. A surface point on the deformed model with respect to a reference frame is related to the model frame as
    x = c + R p ,    (4.4)
where c and R are the translation vector and the rotational matrix between the two frames, and p is the same surface point in the model frame. Assume that p is the result of displacement of a point s on the model, then
    p = s + d ,    (4.5)

where d is the displacement in the model frame.
The deformation dynamics (4.1) now has the following components: q_c and q_θ are the global rigid motion (translation and rotation) coordinates, q_s are the global deformation coordinates, and q_d are the local deformation coordinates of the model. Assuming that the initial model is a super-ellipsoid (4.3), the global deformation is characterized by the parameter vector q_s = (a1, a2, a3, ε1, ε2)^T, and the overall state is q = (q_c^T, q_θ^T, q_s^T, q_d^T)^T. One of the difficulties with this approach is to determine all four displacement components without prior knowledge. If q_c and q_θ can be assumed, the motion can be represented by just the deformation.
4.2. Eigen-Problem Approach
The eigen-problem approach uses finite element methods (FEM) to generate eigen-vectors that correspond to modal shapes, where the lowest frequency modes are the rigid body modes of translation and rotation [54,55]. The lower to higher frequency modes represent gross to fine deformation. Borrowing notions from structural mechanics, the M, C, and K matrices for the three-dimensional case are:

    (4.6)

where H is the dm × d displacement interpolation matrix, B is the dm × 2d strain displacement matrix, d is the dimensionality of the element, and m is the total number of elements in the assemblage. ρ is the material-specific density and κ is the material-specific internal damping constant. Following the definitions given in [54], the displacement interpolation matrix for a two-dimensional four-node surface element can be defined as
    H = (1/4) [ h1  0   h2  0   h3  0   h4  0
                0   h1  0   h2  0   h3  0   h4 ] ,    (4.7)

where

    h1 = (1 + x)(1 + y) ,   h2 = (1 + x)(1 − y) ,   h3 = (1 − x)(1 − y) ,   h4 = (1 − x)(1 + y) .
Note that the H matrix is a dn × d matrix, where n is the number of nodes in the finite element design. H^{(i)} in (4.6) is the H matrix for the ith element in the assemblage, and is obtained by rearranging the H matrix such that structural nodal points that match with element nodal points contain the elements in the appropriate column, and zero otherwise. The design of a three-dimensional finite element model and the corresponding H, B, and E matrices can be found in [54-56]. Pentland further applied the modal analysis approach in FEM to diagonalize the matrices in the equilibrium equation, and used the free vibration equilibrium equation with damping neglected to obtain a generalized eigen-problem. That is,

    M d^2U/dt^2 + K U = 0 ,    (4.8)
and the solution can be postulated to be of the form

    U = φ sin ω(t − t_0) ,    (4.9)

where φ is a vector of order n = m × d, and ω a constant identified to represent the frequency of vibration. Therefore,

    K φ = ω^2 M φ .    (4.10)

The eigen-problem yields n eigen solutions (ω_1^2, φ_1), (ω_2^2, φ_2), ..., (ω_n^2, φ_n), and the eigen-vectors are ortho-normal, i.e.

    φ_i^T M φ_j = 1 if i = j, and 0 otherwise,    (4.11)

and 0 ≤ ω_1^2 ≤ ω_2^2 ≤ ... ≤ ω_n^2. The eigen-vector φ_i is called the mode shape vector, and ω_i is the corresponding frequency of vibration. Each φ_i consists of the (x, y, z) displacements for each node. The lowest frequency modes are the rigid body modes of translation and rotation. The next-lowest frequency modes are smooth, whole body deformations. The high frequency modes are fine deformations that define the details. Using these modes we can define a transformation matrix that diagonalizes the stiffness, damping and mass matrices:

    Φ = [φ_1  φ_2  ...  φ_n] ,    (4.12)

    Ω = diag(ω_1^2, ω_2^2, ..., ω_n^2) .    (4.13)
Then,

    Φ^T K Φ = Ω ,    (4.14)

and

    Φ^T M Φ = I .    (4.15)
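Numerically, (4.10)-(4.15) amount to a generalized symmetric eigen-problem, which standard libraries solve directly. The following sketch uses made-up stiffness and mass matrices for a toy four-node structure (a real application would assemble them from finite elements as in (4.6)):

```python
import numpy as np
from scipy.linalg import eigh

# Made-up symmetric stiffness K and mass M for a 4-node toy "structure".
K = np.array([[ 2., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])
M = np.diag([1.0, 1.0, 1.0, 0.5])

# Solve K phi = w^2 M phi, Eq. (4.10); eigenvalues are returned in ascending
# order, so the lowest-frequency (grossest) modes come first.
w2, Phi = eigh(K, M)

# The mode-shape matrix Phi diagonalizes both matrices, cf. (4.14)-(4.15).
print(np.allclose(Phi.T @ M @ Phi, np.eye(4)))     # Phi^T M Phi = I
print(np.allclose(Phi.T @ K @ Phi, np.diag(w2)))   # Phi^T K Phi = Omega
```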
To apply the eigen-problem approach to the analysis of 3-D surface measurements of an object, Pentland formulated the problem to be that of fitting the measurements to an initial model and defining the load matrix, R, to be the disparity of the fit. Thus the solution to the fitting problem is obtained by solving the following equilibrium equation:

    K U = R ,    (4.16)

or

    U = Φ (Ω + I_6)^{-1} Φ^T R ,    (4.17)

where I_6 is a matrix whose first six diagonal elements are ones, and the remaining elements are zero.

4.3. A Feature Correspondence Example
The feature correspondence problem in nonrigid motion is approached using both shape modeling and a localized tracking technique. Figure 6 shows two distinct facial expressions of the same subject: normal and smiling. These are rendered images of surface data acquired using a Cyberware laser scanner. These data were provided by Dr. T. S. Huang of the University of Illinois. Nonrigid motion analysis can be used in tracking such facial movements. We will present a brief algorithm for estimating point correspondences and motion parameters using the unit-normal changes under small deformation of surfaces [57].
Fig. 6. 3-D rendered facial images: normal (left) and smiling (right).
The surface under consideration is defined in Monge patch representation as

    z = f(x(u, v), y(u, v)) ,   x = u, y = v .    (4.18)
The algorithm involves three basic steps, described as follows.

Procedure 3. Nonrigid motion tracking using point correspondence.

(1) Unit-normals and coefficients of the first fundamental form, E, F, G, are computed at each point of the surface before and after motion. This is typically done by invariant fitting of a quadratic surface over a square window and calculating the directional derivatives of this surface.
(2) Possible point correspondences are hypothesized using a local search area around the point under consideration. The amount of deviation of the hypothesized point correspondence from the correct correspondence, assuming a linear displacement function, is then given by

    ER = Σ_C [(ER1)^2 + (ER2)^2] ,

where ER1 and ER2 for each point in a local neighborhood defined by the template area, C, can be computed from the unit normals and the first fundamental form coefficients before and after motion. These deviation measures are minimized w.r.t. the unknowns in the displacement function, resulting in a system of linear simultaneous equations. By solving these equations, the set of unknowns can be estimated for all hypotheses.
(3) The error ER is calculated for each hypothesized correspondence by using the estimated unknowns of that hypothesis. The least of these errors corresponds to the correct point correspondence. However, there may be more than one hypothesis corresponding to the minimum error. In such a case, one may choose to output all the probable correspondences, or apply regularization techniques for a better correspondence estimation.

Fig. 7. Nonrigid motion tracking using point correspondence.
These deviation measures are minimized w.r.t. the unknowns in the displacement function resulting in a system of linear simultaneous equations. By solving these equations, the set of unknowns can be estimated for all hypotheses. ( 3 ) Error E R is calculated for each hypothesis correspondence by using the estimated unknowns of each hypothesis. The least of these errors corresponds to the correct point correspondence. However, there may be more than one hypothesis corresponding to a minimum error. In such a case, one may choose to output all the probable correspondences, or apply regularization techniques for a better correspondence estimation. Fig. 7. Nonrigid motion tracking using point correspondence.
The motion parameter is a linear displacement function given by, S(U,V ) 3 ( a i U -k
biv
+ C i , a j ' u + bjv + C j , a k u + bkW + C k ) .
(4.19)
The nine unknowns in (4.19) define the nonrigid motion involved at each point. The unit normal vector at a given point on the surface before motion is represented by

    η = [η_i  η_j  η_k] ,    (4.20)

and the unit normal of the point after motion is given by

    η' = [η'_i  η'_j  η'_k] .    (4.21)

The coefficients of the first fundamental form of the surface can be denoted by E, F, G, where

    E = 1 + f_u^2 ,   F = f_u f_v ,   G = 1 + f_v^2 .    (4.22)

The results of feature correspondence between the normal and smiling faces are presented in Fig. 8.
Fig. 8. The result of tracking facial movements using feature correspondence.
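Step (1) of Procedure 3 and the quantities in (4.20)-(4.22) can be computed directly from a range map. The following is a minimal sketch using simple central differences rather than the invariant quadratic fitting mentioned in the procedure:

```python
import numpy as np

def monge_normal_and_first_form(f, i, j, spacing=1.0):
    """Unit normal (4.20) and first fundamental form coefficients (4.22)
    of a Monge patch z = f(u, v) at grid point (i, j), using central
    differences for f_u and f_v."""
    fu = (f[i + 1, j] - f[i - 1, j]) / (2.0 * spacing)
    fv = (f[i, j + 1] - f[i, j - 1]) / (2.0 * spacing)
    normal = np.array([-fu, -fv, 1.0])
    normal /= np.linalg.norm(normal)
    E, F, G = 1.0 + fu ** 2, fu * fv, 1.0 + fv ** 2
    return normal, (E, F, G)

# Tilted plane z = 0.5 u: the normal should be proportional to (-0.5, 0, 1).
u, v = np.meshgrid(np.arange(5, dtype=float), np.arange(5, dtype=float), indexing="ij")
print(monge_normal_and_first_form(0.5 * u, 2, 2))
```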
Nonrigid motion analysis is applied to other types of data, such as satellite images. Figure 9 shows the tracking results of two instances of Hurricane Luis (Sept. 6th 1995), approximately 1 minute apart. These images were acquired by the Goes-9 satellite during a rapid scan mode (i.e. images were acquired every minute). These data were provided by Dr. Frederick Hasler of NASA-Goddard. The tracking started with target points chosen along a regular grid initially. Tracking is done by matching points throughout the time sequence. The two images at the left hand side are the original images, with target point overlay, of two consecutive instances in the middle of the sequence. The two images at the right hand side show only the tracked target points. Nonrigid motion analysis of such cloud images is extremely useful in meteorological and climate applications [65]. For example, point correspondence information between the time-sequential cloud images gives an estimation of cloud-drift and velocity. These velocities, in turn, are useful for cloud model verification, weather prediction and climate baseline pathology.
Fig. 9. Point correspondence applied to cloud tracking using satellite images.
5. Applications
Marker-based systems are widely used by the entertainment and medical industries. Commercial 3-D animation packages generally include the ability to generate 3-D surface motion sequences by animating 3-D models using trajectory data in real time. This new animation technology has resulted in the proliferation of movie animation, TV advertisements, and arcade games. This development drastically reduces production time and increases realism in animation. Kinematics modeling based on motion data can be used for motion synthesis and allows animators to combine realism and creativity [58-60]. However, in this method of animation, model generation and trajectory generation are two separate processes. Therefore, the movement is coarse, jittery and far from real. Marker-based motion analysis systems also prevail at large medical centers, whose orthopedic departments use them for diagnostics and posture and gait research.
Angles and angular velocities of key joints such as ankles, knees, and hips are measured to analyze gait patterns of disabled patients. About 0.3% of children suffer from cerebral palsy. Surgical planning based on gait analysis has effectively reduced the need for repeated surgery. Similar analysis is now being applied to the wrist motion employed in typing, for carpal tunnel syndrome prevention. Some other applications of marker-based systems include sports analysis for training and equipment design, and race horse training. However, the use of markers is considered too intrusive in applications like telecommunications and multimedia. The possibility of using motion data for telecommunications and multimedia was discussed in Ref. [61]. Facial expression capture is an essential technology to enable online virtual reality. The rendering of virtual environments requires display of facial images and expressions from arbitrary points of view with arbitrary lighting conditions. Hence, low-cost, high-quality facial expression capture is essential to the development of a new communication and entertainment paradigm. For animation applications, surface motion data can be used as dynamic models for cutting and pasting to create realistic animation in a fantasy world [62]. Research in nonrigid motion analysis is driven by real-life examples and applications. Specialized and complex shape changes of the object at hand are seen in real examples such as waving trees, melting snow, deforming clouds, and moving people and animals. Each such example has its own method of nonrigid movement. Applications of nonrigid motion analysis include a wide variety of areas such as biomedical image analysis, model-based image compression, virtual reality registration, part manufacturing, and meteorology. Specific applications include gait, heart, face, lip and tongue motion analysis, and cloud and polar ice motion analysis. The advantages of having a complete surface representation in biomechanic applications are the availability of volume-related estimates such as torque, the possibility of using anatomic markers, and realistic visualization. Although skeletal structures such as joints are generally well-pronounced during movement, the problem of identification and tracking remains a challenge for 3-D motion vision. The availability of 3-D surface motion data allows the possibility of expanding biomechanic motion analysis from a pure skeletal model to a skeletal-muscular interrelationship model. Applications include muscular evaluation for ergonomics. Workers can be prescreened for their strength and physical condition to avoid work-related injuries. The analysis of facial movement is useful in many applications ranging from recognition of facial expressions to speech recognition, speech therapy, and medical treatments for skin burn and facial injury victims. Surface motion data can be used for detailed analysis of lip movement in speech, for both speech therapy and recognition. Figure 10 shows the lip position of two frames of a 3-D facial sequence: the subject says the "v" sound and the "zh" sound in the word "vision". These 3-D facial images are generated from LLNL's 3-D motion camera system. Sixteen target points around the lips are extracted. The images at the left-hand side of the figure
Fig. 10. Lip movement from a 3-D facial expression sequence. Frame 32: the subject says the "v" sound in the word "vision"; Frame 38: the subject says the "zh" sound in the word "vision".
are surface normal representations, with and without overlay of the target points. The drawings at the right-hand side are 2-D projections of the target points at four different perspectives: front, side, top, and isometric views. Notice that the syllable "zh" requires a wider mouth opening (see front view), and forward protrusion of the lower lip (see side view), than "v". Nonrigid motion analysis proves very useful in the study of tongue motion in speech pathology. Figure 11 shows surfaces of a tongue in motion. These images are reconstructed from multiple coronal cross-sectional slices of the tongue [63]. Surfaces were reconstructed for sustained vocalizations of American English sounds. These data were provided by Dr. Maureen Stone of the University of Maryland Medical School.
Other applications of 3-D motion vision are design and rapid prototyping in manufacturing. The design of personal assistive equipment and facilities such as prostheses, gas masks, orthopedic shoes, and ergonomic work cells requires analysis of the dynamic interaction between the equipment and the human body. The use of personalized electronic mobile mannequins in computer-aided design can become the way of the future. Other nonhuman-related surface motion analyses include strain and stress analysis of materials and structures for design and safety purposes [64]. Examples are stress analysis of air bags and car crash analysis.

References

[1] J. S. Walton, ed., Image-Based Motion Measurement, SPIE Proceedings (Aug. 1990, La Jolla, CA).
[2] S. Y. Lu and R. K. Johnson, A true 3D motion camera from Lawrence Livermore National Laboratory, Advanced Imaging (1995) 51-54.
[3] T. W. Calvert and A. E. Chapman, Analysis and synthesis of human movement, Handbook of Pattern Recognition and Image Processing: Computer Vision (Academic Press, 1994).
[4] A. Maddocks, Live animation from real-world images: The poetry of motion,
Advanced Imaging (July 1994). [5] D. H. Sutherland and K. R. Kaufman, Motion analysis: Lower extremity, Orthopedic Rehabilitation. [6] Micropower Impulse Radar (MIR), Lawrence Livermore National Laboratory web page, http://
[email protected].
[7] D. Nitzan, Three-dimensional vision structure for robot applications, IEEE Trans. PAMI 10, 3 (1988) 291-309.
[8] D. Braggins, 3-D inspection and measurement: Solid choice for industrial vision, Advanced Imaging (1994) 369-393.
[9] R. Mohr, Projective geometry and computer vision, Handbook of Pattern Recognition and Computer Vision, C. H. Chen, L. F. Pau and P. S. P. Wang (eds.) (World Scientific, 1993) 369-392.
[10] L. L. Grewe and A. C. Kak, Stereo vision, Handbook of Pattern Recognition and Image Processing: Computer Vision (Academic Press, 1994) 239-317.
[11] R. Y. Tsai, A versatile camera calibration technique for high accuracy 3-D machine vision metrology using off-the-shelf TV cameras and lenses, IEEE J. Robotics Automation RA-3, 4 (1987) 323-344.
[12] Y. Ohta and T. Kanade, Stereo by intra- and inter-scanline search using dynamic programming, IEEE Trans. PAMI 7, 2 (1985) 139-154.
[13] D. J. Kriegman, E. Triendl and T. O. Binford, Stereo vision and navigation in buildings for mobile robots, IEEE Trans. on RA 5, 6 (1989) 792-803.
[14] P. Fua, A parallel stereo algorithm that produces dense depth maps and preserves image features, Machine Vision and Applications 6 (1993) 35-49.
[15] A. F. Laine and G. Roman, A parallel algorithm for incremental stereo matching on SIMD machines, IEEE Trans. RA 7, 1 (1991) 123-191.
[16] W. E. L. Grimson, Computational experiments with a feature based stereo algorithm, IEEE Trans. PAMI 7, 1 (1985) 17-31.
[17] K. L. Boyer and A. C. Kak, Structural stereopsis for 3-D vision, IEEE Trans. PAMI 10, 2 (1988) 144-166.
[18] J. J. Rodriguez and J. K. Aggarwal, Matching aerial images to 3-D terrain maps, IEEE Trans. PAMI 12, 12 (1990).
[19] S. M. Dunn and R. L. Keizer, Measuring the area and volume of the human body with structured light, IEEE Trans. SMC 19, 6 (1989) 1350-1364.
[20] G. C. Stockman et al., Sensing and recognition of rigid objects using structured light, IEEE Control Systems Magazine (1988) 14-22.
[21] J. J. Le Moigne and A. M. Waxman, Structured light patterns for robot mobility, IEEE J. RA 4, 5 (1988) 541-548.
[22] Z. Yang and Y. F. Wang, Error analysis of 3D shape construction from structured lighting, Pattern Recogn. 29, 2 (1996) 189-206.
[23] H. K. Nishihara, Practical real-time imaging stereo matcher, Optical Engineering 23, 5 (1984) 536-545.
[24] S. J. Gordon and F. Benayad-Cherif, 4DI - A real-time three-dimensional imager, SPIE 2348 (1995) 221-225.
[25] T. Kanade, M. Okutomi and T. Nakahara, A multiple-baseline stereo method, Proc. DARPA Image Understanding Workshop (1992) 409-426.
[26] F. Blais and M. Rioux, BIRIS: A simple 3-D sensor, SPIE Optics, Illumination, and Image Sensing for Machine Vision 728 (1986) 235-242.
[27] T. Kanade, A. Gruss and L. R. Carley, A very fast VLSI rangefinder, Proc. IEEE Int. Conf. on Robotics and Automation, Sacramento, CA (1991).
[28] J. Kramer, P. Seitz and H. Baltes, Inexpensive range camera operating at video speed, Applied Optics 32, 13 (1993) 2323-2330.
[29] J. D. Foley, Computer Graphics: Principles and Practice, 2nd edn. (Addison-Wesley, 1990).
[30] B. K. P. Horn and B. G. Schunck, Determining optical flow, Artificial Intelligence 17 (1981) 185-203.
[31] B. G. Schunck, The image flow constraint equation, Computer Vision, Graphics, and Image Processing 35 (1986) 20-46.
[32] S. V. Fogel, The estimation of velocity vector fields from time-varying image sequences, CVGIP: Image Understanding 53, 3 (1991) 253-287.
[33] C. Kambhamettu et al., Nonrigid motion analysis, Handbook of Pattern Recognition and Image Processing: Computer Vision (Academic Press, 1994) 405-426.
[34] P. J. Besl and R. C. Jain, Invariant surface characteristic for 3-D object recognition in range images, Computer Vision, Graphics, and Image Processing 33 (1986) 33-80.
[35] S. Z. Li, Toward 3-D vision from range images: An optimization framework and parallel networks, CVGIP: Image Understanding 55, 3 (1992) 231-260.
[36] R. M. Bolle and B. C. Vemuri, On three-dimensional surface reconstruction methods, IEEE PAMI 13, 1 (1991) 1-13.
[37] J.-P. Thirion and A. Gourdon, The 3-D marching lines algorithm and its application to crest lines extraction, INRIA Report No. 1672 (1992).
[38] J.-P. Thirion and S. Benayoun, Image surface extremal points, new feature points for image registration, INRIA Report No. 2003 (1993).
[39] A. Matheny and D. B. Goldgof, The use of three- and four-dimensional surface harmonics for rigid and non-rigid shape recovery and representation, IEEE PAMI 17, 10 (1995) 967-981.
[40] W. Y. Kim and A. C. Kak, 3-D object recognition using bipartite matching embedded in discrete relaxation, IEEE PAMI 13, 3 (1991) 224-251.
[41] P. J. Besl and N. D. McKay, A method for registration of 3-D shapes, IEEE PAMI 14, 2 (1992) 239-256.
[42] L. G. Shapiro and R. M. Haralick, Structural descriptions and inexact matching, IEEE Trans. PAMI 3 (1981) 504-519.
[43] R. A. Iltis and P. Y. Ting, Computing association probabilities using parallel Boltzmann machines, IEEE Trans. on Neural Networks 4, 2 (1993) 221-233.
[44] E. Aarts and J. Korst, Simulated Annealing and Boltzmann Machines (Wiley, 1989).
[45] A. K. Jain, Y. Zhong and S. Lakshmanan, Object matching using deformable templates, IEEE PAMI 18 (1996) 267-278.
[46] M. Kass, A. Witkin and D. Terzopoulos, Snakes: Active contour models, Int. J. Computer Vision 1, 4 (1988) 321-331.
[47] L. D. Cohen and I. Cohen, Deformable models for 3-D medical images using finite elements and balloons, Proc. ICPR (1992) 592-598.
[48] A. L. Yuille, D. S. Cohen and P. W. Hallinan, Feature extraction from faces using deformable templates, Proc. IEEE Conf. on Computer Vision and Pattern Recognition (1989) 104-109.
[49] D. Terzopoulos and D. Metaxas, Dynamic 3-D models with local and global deformations: Deformable super-quadrics, IEEE PAMI 13, 7 (1991) 703-714.
[50] A. Pentland, Automatic extraction of deformable part models, IJCV (1990) 107-125.
[51] F. Solina and R. Bajcsy, Recovery of parametric models from range images: The case for super-quadrics with global deformations, IEEE PAMI 12, 2 (1990) 131-147.
[52] A. J. Hanson, Hyper-quadrics: Smoothly deformable shapes with convex polyhedral bounds, CVGIP 44 (1992) 592-598.
2.7 3-0 Vision of Dynamic Objects 451 [53] C . W. Chen, T. S. Huang and M. Arrott, Modeling, analysis, and visualization of left ventricle shape and motion by hierarchical decomposition, IEEE Trans. PAMI 16, 4 (1994) 342-356. [54] A. Pentland and S. Sclaroff, Closed-form solutions for physically based shape modeling and recognition, IEEE PAMI 13, 7 (1991) 715-729. [55] A. Pentland and B. Horowitz, Recovery of nonrigid motion and structure, IEEE PAMI 13, 7 (1991) 73tP742. [56] K. Bathe, Finite element procedures in engineering analysis (Prentice-Hall, 1982). [57] C . Kambhamettu, D. B. Goldgof and M. He, Determination of motion parameters and estimation of point correspondences in small nonrigid deformations, Proc. ZEEE Conf. on Computer Vision and Pattern Recognition (1994) 943-946. [58] J.-T. Lee and T. L. Kunii, Model-based analysis of hand posture, IEEE Computer Graphics and Applications (1995) 77-86. [59] H . 3 . KOand N. I. Badler, Animation human locomotion with inverse dynamics, IEEE Computer Graphics and Applications (1996) 5C59. (601 M. van de Panne, Parameterized gait synthesis, IEEE Computer Graphics and Applications (1996) 4tP49. [61] T. Huang and P. Stucki, Introduction to the special section on 3-D modeling in image analysis, ZEEE PAMI 15, 6 (1993) 529-530. (621 New system for 3-D motion capture, Tech Watch, Computer Graphics World (1996). [63] M. Stone and A. Lundberg, Three-dimensional tongue surface shapes of english consonants and vowels, J. of Acoustical Society of America (1996) 3728-3737. (641 P. F. Luo et al. Accurate measurement of three-dimensional deformations in deformable and rigid bodies using computer vision, Experimental Mechanics (1993) 123-131. [65] C. Kambhamettu, K. Palaniappan and A. F. Hasler, Automated cloud-drift winds from Goes-819, SPIE-Int. Symp. on Optical Science, Engineering and Instrumentation (1996).
PART 3 RECOGNITION APPLICATIONS
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 455-471 Eds. C. H. Chen, L. F. Pau and P. S. P. Wang © 1998 World Scientific Publishing Company
CHAPTER 3.1 PATTERN RECOGNITION IN NONDESTRUCTIVE EVALUATION OF MATERIALS
C. H. CHEN
Department of Electrical and Computer Engineering, University of Massachusetts Dartmouth, N. Dartmouth, MA 02747, USA

Nondestructive evaluation (NDE) of materials is now widely used in defect detection and classification, material characterization and discrimination, and machine inspection. To improve the capability of NDE instrumentation, pattern recognition is the key to automating the inspection process as well as providing information for making objective decisions. This chapter presents the major activities and results of pattern recognition in NDE, including both traditional statistical pattern recognition and modern neural network classifiers for ultrasonic, acoustic emission and eddy current signals and radiographic images.
Keywords: Nondestructive evaluation, ultrasonic signals, acoustic emission, eddy current, radiographic images, defect detection, statistical and neural network classifiers.
1. Introduction
Nondestructive evaluation (NDE), also called nondestructive testing (NDT) or nondestructive inspection (NDI), of materials refers to the characterization, discrimination and prediction of material defects nondestructively. Early efforts in NDE were limited to visual examination or simply tapping the material being tested. It was only about fifty years ago, when ultrasonic sensors and radiographic methods were developed, that more reliable NDE instrumentation became available. This progress was marked by the birth of the American Society of Nondestructive Testing (ASNT) in Boston in 1941. Since the late 70's, when digital processing of signals became popular, pattern recognition techniques have been increasingly used in NDE problems. Some early examples of pattern recognition in NDE using eddy current and acoustic emission signals are [1-4]. Generally speaking, the goals of pattern recognition and signal processing in NDE are to improve inspection reliability, to improve defect detection and characterization, to automate inspection tasks, and to generate information about the material properties to assess the remaining life of a structure. The materials tested vary from bonding adhesives, ceramics and composites to metals, weldings and many others, for a wide range of applications in the aerospace industry, chemicals, electronics, nuclear reactors, ships, bridges, etc.
NDE sensors and methods include: ultrasonics, acousto-ultrasonics, radiography (X-ray, neutron, etc.), acoustic emission, eddy current, magnetics, visual, liquid penetrants, lasers and holograms, electromagnetics, thermal imaging, computed tomography, etc. Some useful references on NDE methods are [5-8]. Also, ASNT has published a series of NDT handbooks that describe the NDE methods in detail. For each method, features can be extracted from the digitized waveform or image data, and pattern recognition techniques are employed to help achieve one or more of the goals stated above. In the last few years, there has been significant progress in demonstrating the feasibility of pattern recognition in NDE. Many challenges remain, however, before reliable recognition procedures can be fully implemented in operational NDE systems. In this chapter emphasis will be placed on pattern recognition with ultrasonic NDE, though other methods like eddy current, acoustic emission and radiography will also be examined. Examples are based on the specific system making use of the software package IUNDE (for interactive ultrasonic NDE), which we have developed [9,10].

2. Pattern Recognition in Ultrasonic NDE Systems
Ultrasonic methods of nondestructive testing are comparatively less expensive, easy to use and reliable. The center frequency of transducers employed in NDE typically ranges from 2 MHz to 15 MHz or higher, depending on the size of defects considered. A single trace of ultrasonic pulse echo is called an A-scan. A sequence of pulse echoes taken as the transducer moves along a line is called a B-scan. The peak value of the pulse echo can give an indication of a hidden flaw or defect. The thickness can also be measured from the spacing between successive echoes. Mathematically, the characterization of the defect from the pulse echo is an inverse problem which is not easy to solve except for some very simple defect geometries. To formulate this as a pattern recognition problem, we assume that the defect can be classified as one of several specified categories. Then features are extracted from the digitized data for classification. A simpler recognition problem can be the detection of the existence of a significant defect. The recognition method may also be used to estimate the most probable defect size. Figure 1 is an example of a complete ultrasonic NDE system. The system consists of an ultrasonic pulser/receiver, an IBM PC/AT compatible, a high resolution digital scope which communicates with both pulser/receiver and computer, and an immersion tank for immersion mode operation, in addition to the usual contact mode operation. A typical set of test specimens with different hidden defect geometries in aluminum blocks is shown in Fig. 2. Here T15 refers to a transducer of center frequency 15 MHz. For each defect geometry, an A-scan signal is obtained. By using the system as shown in Fig. 1, a C-scan ultrasonic image can be generated by using a mechanical movement of the immersion probe and a peak detector. A high resolution acoustic image may be obtained by further using digital image processing.
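To make the A-scan quantities above concrete, the following minimal sketch estimates the peak echo amplitude and the thickness implied by the spacing between successive echoes. The rectified-signal envelope, the acceptance threshold and all parameter names are illustrative assumptions, not part of the IUNDE system.

```python
import numpy as np

def ascan_peak_and_thickness(ascan, fs, velocity, threshold=0.5):
    """Estimate the flaw indication and the wall thickness from one A-scan.

    ascan     : 1-D array of digitized pulse-echo samples
    fs        : sampling rate in Hz (e.g. 50e6 for a 50 MHz digitizer)
    velocity  : longitudinal sound velocity in the material, m/s
    threshold : fraction of the global peak used to accept echoes
    """
    env = np.abs(ascan)                          # crude envelope (rectified signal)
    peak_amp = env.max()                         # peak echo value (flaw indication)
    above = env >= threshold * peak_amp          # samples belonging to strong echoes
    # starts of contiguous above-threshold runs
    edges = np.flatnonzero(np.diff(above.astype(int)) == 1) + 1
    starts = np.concatenate(([0], edges)) if above[0] else edges
    echo_idx = []
    for s in starts:
        e = s
        while e < len(env) and above[e]:
            e += 1
        echo_idx.append(s + np.argmax(env[s:e]))  # strongest sample of each echo
    if len(echo_idx) < 2:
        return peak_amp, None
    # round-trip time between successive echoes gives thickness = v * dt / 2
    dt = np.mean(np.diff(echo_idx)) / fs
    return peak_amp, velocity * dt / 2.0
```

For example, with a 50 MHz sampling rate and a longitudinal velocity of about 6300 m/s (typical for aluminum), back-wall echoes spaced 3.2 microseconds apart would indicate a thickness of roughly 10 mm.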
Fig. 1. The block diagram of a complete high resolution ultrasonic NDE system.
Image segmentation techniques can also be used to isolate the defect regions of hidden flaws not visible from the material surfaces. Software support is much needed throughout the entire system. The digital system shown in Fig. 1 has the advantages of providing high resolution data using a high sampling rate (typically 50 MHz), automatic processing of such data, and flexibility to be configured for different requirements. The digital operations can also compensate to some extent for the limitations of the pulser/receiver bandwidth and the transducer frequency response. Of course the best data are provided by the best instruments, which can be very expensive. The detailed operation of this system is described in [9,10]. Other NDE systems that incorporate pattern recognition and artificial intelligence capability employ the software packages ICEPAK [11] and TestPro [12]. Before features are computed, signal processing functions are needed to present different knowledge domains of the signal [13], such as the time domain, the frequency domain, the correlation structure, and the impulse response. Features can then be extracted from these domains. A detailed discussion of feature extraction and pattern classification is presented in the next section. On the subject of signal processing in NDE, there are now extensive research publications (see e.g. [14-16]). Here spectral analysis, deconvolution and time-frequency analysis are the three main topics. There is a close relationship between the spectrum and the hidden geometrical defects, as presented in the work on ultrasonic spectroscopy [17,18]. Modern spectrum analysis makes it possible to determine a high resolution and reliable spectrum [19]. Good spectral features can be determined from the high resolution spectrum. However, the problem is the lack of a quantitative relationship
Fig. 2. Test specimens with different hidden-defect geometries, insonified by a 15 MHz transducer with couplant applied on the contact surface: (a) a flawless sample (T15A0); (b) a flat cut (T15A1); (c) an angular cut (T15A2); (d) a circular hole (T15A3).
available, except in very special cases [20], to calculate defect parameters from the spectrum. The main objective of deconvolution is to determine the impulse response of the defect [21-23], which can be the internal discontinuities of a test specimen. With the effect of the source signal removed, the impulse response presumably gives a better characterization of the defect. For thickness and depth measurement, the impulse response can be very reliable. It is difficult to determine other parameters such as the size and shape of a defect from the impulse response. For different defect categories, the impulse responses should be different and thus a good set of features may be derived from the impulse responses. In general it is difficult to obtain an accurate impulse response, especially in the presence of noise. However, deconvolution always brings out some useful information not available from the original signal. To illustrate the signal processing functions provided by IUNDE, Fig. 3 shows, for the test specimens described in Fig. 2, signal processing results in spectrum analysis and deconvolution. The ultrasonic signals normally experience a nonhomogeneous medium and are thus time varying in nature, which requires a time-frequency representation. The short-term Fourier analysis, the Wigner distribution, and the wavelet transform are most often used in time-frequency analysis. The pseudo-Wigner distribution can be used to decompose an ultrasonic signal into components [24] and to estimate accurately the bondline reflector thickness [25]. Both the wavelet transform and the Wigner distribution are also useful to extract features for ultrasonic signal classification [26]. The work reported on time-frequency analysis with ultrasonic signals so far has been very limited, and this is a research topic that remains to be explored.
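The deconvolution step just described is, in its simplest frequency-domain form, a Wiener filter applied to the test and reference A-scans. The fragment below is a generic sketch of that operation only; the regularization constant q and the function names are assumptions, and the actual procedures of [21-23] are considerably more elaborate.

```python
import numpy as np

def wiener_deconvolve(test_sig, ref_sig, q=1e-2):
    """Estimate the defect impulse response by Wiener deconvolution.

    test_sig : A-scan from the test specimen
    ref_sig  : A-scan from a reference (flawless) specimen
    q        : noise-desensitizing constant added to the reference power
    """
    n = len(test_sig)
    Y = np.fft.rfft(test_sig, n)
    X = np.fft.rfft(ref_sig, n)
    # Wiener filter: H(f) = Y(f) X*(f) / (|X(f)|^2 + q); q prevents blow-up
    # at frequencies where the reference spectrum is weak.
    H = Y * np.conj(X) / (np.abs(X) ** 2 + q)
    return np.fft.irfft(H, n)
```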
3. Feature Extraction and Pattern Classification in Ultrasonic NDE

Some of the best features that have good physical significance can be listed as follows:

(1) amplitude ratio, defined as the ratio of the cumulative sum of the absolute amplitude of the test specimen to that of a reference specimen.
(2) frequency ratio, defined as the ratio of the cumulative sum of the amplitude spectrum of a higher frequency band to that of a lower frequency band.
(3) maximum peak correlation value and the root mean square value of the correlation function between the test specimen and the reference specimen.
(4) the kurtosis and skewness, normally used as parameters of a probability density, of an amplitude spectrum. Kurtosis is the ratio of the fourth moment to the square of the second moment. Skewness is the ratio of the third moment to the 1.5 power of the second moment of the amplitude spectrum.
(5) the pulse duration, defined as the time difference between the intercepts of the pulse envelope with a line at 10% of the peak amplitude in each waveform.
(6) the fractional powers over a certain number (say 8) of frequency bands.
(7) the peak spectral value, and the bandwidth of the power spectrum.

It is noted that IUNDE can automatically provide a tabulation of some of the features listed above, for a given ultrasonic trace, to form a feature vector, in addition to the statistical parameters. Other mathematical features are (see e.g. [13,26,27]):

(1) width between major peaks of the impulse response, and the power deconvolution coefficients, defined as the mean value of the deconvolved amplitude spectrum,
(2) number of peaks above 25% of the maximum signal amplitude,
(3) the position and half pulse width of the largest peak,
(4) percentage of area under the largest peak,
(5) rise time and fall time of the largest peak,
(6) distance between the two largest peaks and from the largest to the third peak,
(7) percentage of partial power in the first and fourth octant in the cepstral domain,
(8) number of peaks above the base line in the phase domain,
(9) fractal dimension of the signal,
(10) multichannel time/frequency decomposition features [28],
(11) partial sum of the derivative of the Wigner distribution, and the Laplacian of the Wigner distribution,
etc. Many other choices are possible which fully exploit the time, frequency and time-frequency domain structures of the signal. These features are of course useful for waveform classification in general.
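The sketch below computes a few of the features listed above from a pair of A-scans. It is only one plausible reading of the definitions (in particular, kurtosis and skewness are taken over the amplitude spectrum viewed as a distribution across frequency); the band split, names and normalizations are illustrative rather than the exact IUNDE tabulation.

```python
import numpy as np

def ultrasonic_features(test_sig, ref_sig, fs, split_hz):
    """A few of the physically meaningful ultrasonic features listed above."""
    feats = {}
    # (1) amplitude ratio: cumulative absolute amplitude, test vs. reference
    feats["amplitude_ratio"] = np.sum(np.abs(test_sig)) / np.sum(np.abs(ref_sig))

    spec = np.abs(np.fft.rfft(test_sig))
    freqs = np.fft.rfftfreq(len(test_sig), d=1.0 / fs)
    # (2) frequency ratio: high-band spectral sum over low-band spectral sum
    feats["frequency_ratio"] = spec[freqs >= split_hz].sum() / spec[freqs < split_hz].sum()

    # (4) kurtosis and skewness of the amplitude spectrum (central moment ratios)
    mean = np.sum(freqs * spec) / spec.sum()
    mu2 = np.sum((freqs - mean) ** 2 * spec) / spec.sum()
    mu3 = np.sum((freqs - mean) ** 3 * spec) / spec.sum()
    mu4 = np.sum((freqs - mean) ** 4 * spec) / spec.sum()
    feats["kurtosis"] = mu4 / mu2 ** 2
    feats["skewness"] = mu3 / mu2 ** 1.5

    # (5) pulse duration: time between the 10%-of-peak crossings of the envelope
    env = np.abs(test_sig)
    idx = np.flatnonzero(env >= 0.1 * env.max())
    feats["pulse_duration"] = (idx[-1] - idx[0]) / fs

    # (7) peak spectral value and its frequency
    feats["peak_spectral_value"] = spec.max()
    feats["peak_frequency"] = freqs[np.argmax(spec)]
    return feats
```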
As an example of the "feature-based" method in ultrasonic NDE, an experiment was conducted [10] on plastic balls of diameters 8/32", 12/32", 16/32" and 1" suspended inside a water-filled immersion tank. The controlled geometries created provide artificial discontinuities between the water and the balls. A tabulation of major feature values such as the amplitude ratio, frequency ratio, kurtosis of spectrum, and frequency of peak power shows an almost linear relationship between the feature values and the ball diameter. This, along with similar experiments, demonstrates that vector measurements of the same defect geometry tend to be clustered together.

Fig. 3. Illustrations of several IUNDE signal processing functions for the test signal T15A2 and the reference signal T15A0: the raw signals, power spectra (FFT, Burg and chirp-Z), cross-correlation, cross-spectrum and Wiener filtering, impulse responses, the analytic signal, and the Hilbert transform.

In pattern classification, only a small number of carefully selected features are used. Evaluation of individual features can be done by using the Fisher criterion [14], or by examining empirically their effectiveness from the percentage error. The traditional pattern recognition techniques typically employed in NDE classification [29] are the minimum distance classifier, the Fisher linear discriminant, the maximum likelihood classifier, and the K-means clustering algorithm. For the hidden defect geometries shown in Fig. 2, both the nearest neighbor decision rule and neural networks have been employed for classification experiments [10,30]. The neural network classifiers used include the RCE (Reduced Coulomb Energy) network of Nestor, Inc. [31], and the popular multilayer backpropagation network. Both the nearest neighbor decision rule and the two neural networks provide the same 83.3% correct recognition for the limited data available. Some other results of using neural networks in ultrasonic NDE classification are reported in [32-34]. Generally speaking, different neural networks require different training times. The best available classification results from different networks are fairly similar, and may not always be significantly better than the best available result from the traditional statistical classifiers. However, neural networks do not require any assumption about the probability distributions of the data, and are computationally very efficient. For the limited amount of data available, which is typically the case in ultrasonic NDE, neural networks with proper training offer the possibility to greatly outperform the statistical classifiers. Classification of defects in composite materials, for example, offers a particular challenge, and neural network classifiers, when fully developed, will provide a means to determine whether the material is suitable for the intended structural application [32].

4. Pattern Recognition in Eddy Current NDE

The use of eddy current is another cost effective method to detect and locate defects, especially if they are near the metal surface. For this reason, eddy current probes have been used in nuclear reactor and power plant inspection for detection of cracks and corrosion, well before the ultrasonic method became popular in NDE in the early 80's. Eddy current signals can be displayed as impedance contour plots, or as a sequence of waveforms. Features extracted can take the form of Fourier descriptors (see e.g. [35]) for the impedance contour, or waveform features. Consider the eddy current scan along the circumference of a tube (Y-direction), as the probe moves along the axis (X-direction). The spectral features of a single scan (see e.g.
[36]) can form a feature vector including as components the frequency of the spectral peak in the amplitude spectrum of the Y-component signal, the ratio of the X and Y spectral amplitudes at this frequency, the phase difference between the X and Y spectral phases at this frequency, and the amplitude spectral areas in three frequency ranges, normalized with respect to the total spectral area. Other procedures for extraction of mathematical features are available. When multi-frequency probes are used, features must be extracted to include essential information from all probe frequencies. Feature dimension reduction is always needed to simplify the classifier structure. Based on the signal extrema as features, the eddy current methods can also be used to classify objects with different shapes such as holes and cracks [37]. Early eddy current signal classification made use of the adaptive learning network [38] and the Fisher linear discriminant approach [1]. More recently, neural networks have proven to be very effective [39] for eddy current signal classification using Fourier descriptors as features. In one experiment reported, a set of 61 signals was generated by scanning each defect several times, and each signal is represented by eight Fourier descriptors which serve as input to the neural network classifier. A two-layered artificial neural network was trained by the backpropagation algorithm, using a training set of 40 signals, to identify four different classes. The remaining 21 signals were used for testing. The neural network classifier classified all signals correctly, while the classification using the k-means algorithm had one error. It is noted that one advantage of the neural network is that it easily accommodates a large-dimensional feature set. The network itself can also give an indication of which features are most important, as larger weights are established on such features after significant training.
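One common way to obtain such Fourier descriptors is to treat the X and Y probe signals as a complex impedance-plane contour and keep the magnitudes of its low-order DFT coefficients, as sketched below. The choice of eight descriptors follows the experiment above; the centering and normalization steps are assumptions, since the exact preprocessing used in [39] is not spelled out here.

```python
import numpy as np

def fourier_descriptors(x_component, y_component, n_desc=8):
    """Fourier descriptors of an eddy-current impedance-plane contour.

    The X and Y probe signals are combined into a complex contour, its DFT is
    taken, and the low-order coefficient magnitudes are kept, normalized so
    the descriptors are insensitive to overall instrument gain.
    """
    contour = np.asarray(x_component) + 1j * np.asarray(y_component)
    contour = contour - contour.mean()        # remove the operating point (translation)
    coeffs = np.fft.fft(contour)
    mags = np.abs(coeffs[1:n_desc + 1])       # drop the DC term, keep n_desc harmonics
    return mags / (mags[0] + 1e-12)           # scale-normalize by the first harmonic
```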
5. Pattern Recognition in Acoustic Emission NDE

Pattern recognition methods have been used for classifying acoustic emission (AE) signals according to their source types [40,41]. Simple time and frequency domain features of the AE waveforms are used in the classification to distinguish one type from another. Sources of the AE in the monitoring application considered are crack growth, crack face rubbing, fastener fretting, mechanical impacts, electrical transients, and hydraulic noise. For each AE event, two simultaneous waveforms were recorded, one from each transducer. For the waveform of an individual transducer, the major features are arrival time and amplitude of the peak, amplitude in six time intervals around and after the trigger, energy in several frequency bands, and total energy in the spectrum. The two-transducer features include ratio of peak amplitudes, ratio of energy in signals, ratio of energies in frequency bands, etc. The mean and variance of each feature are computed, from which the distance of each event from a given class is computed as the sum of squares of the difference between the value of the feature and the mean, normalized by the feature variance. The decision is based on the minimum distance.
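This variance-normalized minimum-distance rule is easy to state in code. The sketch below is a generic rendering of it, with hypothetical data structures (a dictionary of per-class feature matrices); it is not the implementation of [40,41].

```python
import numpy as np

def train_class_statistics(features_by_class):
    """Per-class mean and variance of each AE feature.

    features_by_class : dict mapping a class label to an (n_events x n_feats) array
    """
    return {c: (f.mean(axis=0), f.var(axis=0) + 1e-12)
            for c, f in features_by_class.items()}

def classify_ae_event(event_features, class_stats):
    """Assign an AE event to the class with minimum variance-normalized distance,
    i.e. the sum over features of (x - mean)^2 / variance, as described above."""
    distances = {c: np.sum((event_features - mean) ** 2 / var)
                 for c, (mean, var) in class_stats.items()}
    return min(distances, key=distances.get)
```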
In another reported work, neural networks using backpropagation training are used for leak location [42], i.e. to monitor the source of leaks in the shell of a space station. Again neural networks are shown to be feasible for pattern recognition with AE signals.

6. Pattern Recognition in Radiographic NDE
The improvement in radiography has now made it cost effective to use radiographic NDE even in real-time operation [43]. The pictorial information about the defects provided by the imagery certainly far exceeds what is available from waveform data. For pattern recognition, the imagery must be digitized first. But then the large repertoire of image enhancement and processing techniques can be used to greatly improve our ability to interpret the defects and the material properties. To detect the local inhomogeneities in a fiber reinforced composite structure, the segmentation of both the X-ray radiograph and the ultrasonic C-scan of the same specimen was considered by Jain et al. [44,45]. Four segmentation techniques, viz. simple thresholding, the adaptive thresholding scheme of Chow and Kaneko (see e.g. [46]), the iterative conditional modes method of Besag [47], and the adaptive thresholding scheme of Yanowitz and Bruckstein [48], were examined, but found to be unsatisfactory. A new algorithm based on the adaptive thresholding method proposed by Yanowitz and Bruckstein and the Canny edge detector [49] was proposed for segmentation of real X-ray and C-scan images. Further improvement was obtained by a simple fusion technique which takes the X-ray and C-scan images and the location distribution histogram of events from the acoustic emission to separate the real defects from the spurious components resulting from segmentation, so that a more complete defect map of the specimen can be obtained.
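For orientation only, the fragment below shows the most basic form of such a segmentation step: a fixed global threshold followed by connected-component extraction and small-region removal. It is a schematic stand-in, not the Yanowitz-Bruckstein/Canny algorithm of [44,45]; the threshold, the minimum area and the assumption that defects are darker than the background are all illustrative.

```python
import numpy as np
from scipy import ndimage

def segment_defects(image, threshold, min_area=20):
    """Very simple radiograph segmentation: threshold, label, drop small regions.

    image     : 2-D array of grey levels (defects assumed darker than background)
    threshold : global grey-level threshold
    min_area  : smallest connected component kept as a defect candidate
    """
    binary = image < threshold                   # candidate defect pixels
    labels, n = ndimage.label(binary)            # connected components
    sizes = ndimage.sum(binary, labels, index=np.arange(1, n + 1))
    keep = np.isin(labels, np.flatnonzero(sizes >= min_area) + 1)
    return np.where(keep, labels, 0)             # labeled map of surviving regions
```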
To illustrate the use of computer vision in radiographic NDE, we use the excellent results from [44] (courtesy of A. K. Jain), which are shown in Figs. 4-7.

Fig. 4. Set of test specimens showing defects (cut plies, Teflon insert, and a 5 mm diameter hole) (after [44], reprinted with permission).
Fig. 5. Results for the specimen CHS (after [44], reprinted with permission).
Fig. 6. Results for the specimen CTHS (after [44], reprinted with permission).

Figure 4 shows the six test specimens with three kinds of defects inserted: cut plies, Teflon insert, and a 5 mm diameter hole. Figures 5-7 show the results for three of the specimens (CHS, CTHS, and TS). In each figure, (a) is the acoustic emission graph, (b) is the C-scan image, (c) is the segmentation of the C-scan image, (d) is the extraction of the defects (in white) in the C-scan image, (e) is the X-ray image,
(f) is the segmentation of the X-ray image, (g) is the extraction of the defects (in white) in the X-ray image, and (h) is the defect map of the specimen. It is noted that the defect map is obtained by drawing an ellipse around each identified defect [44]. The ellipse is centered at the centroid of the connected component and its major and minor axes are determined from the eigenvectors and eigenvalues of the connected component, averaged over the two images.
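The ellipse construction described above can be sketched as follows for a single connected component. The factor relating eigenvalues to axis lengths (here two standard deviations) is an assumption, since the passage does not fix it, and the averaging over the X-ray and C-scan images is omitted.

```python
import numpy as np

def component_ellipse(mask, n_std=2.0):
    """Fit an ellipse to one binary connected component (a defect region).

    Returns the centroid, the major/minor half-axis lengths, and the
    orientation of the major axis, obtained from the eigen-decomposition of
    the component's pixel-coordinate covariance matrix.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    centroid = pts.mean(axis=0)
    cov = np.cov((pts - centroid).T)              # 2 x 2 covariance of pixel coordinates
    evals, evecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    minor, major = n_std * np.sqrt(evals)         # half-axis lengths
    angle = np.arctan2(evecs[1, 1], evecs[0, 1])  # orientation of the major axis
    return centroid, major, minor, angle
```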
Fig. 7. Results for the specimen TS (after [44], reprinted with permission).
An interesting comparison between the neural network and Markov random field image segmentation is made in [50] for radiographic images. The paper also mentioned several hybrid approaches. Recently a hybrid approach using a class-sensitive neural network for segmentation of wafer inspection images [51] has been proposed. Neutron radiography testing is, for certain applications, a useful alternative to the
standard X-ray testing methods. Digital radiography certainly has an advantage [52], for example in bridge inspection. The enormous progress on computer vision as reported in Part 2 of this book will certainly be significant in making digital radiography a dominant approach in NDE.
7. Comments and Further Work

Undoubtedly most progress on pattern recognition in NDE has been made only very recently. The interest in and demand for reliable pattern recognition in NDE will certainly increase. This presents both a challenge and an opportunity to pattern recognition and related areas. A fully automated NDE system is not an unreasonable expectation and several successes have been reported. A major effort is needed to improve the recognition reliability under the cost constraint. Only a few NDE methods have been considered in this chapter. There are numerous NDE instruments which can use digital processing and recognition to improve effectiveness. Also, for one type of specimen being tested, several NDE methods can be feasible. Fusion of information from various NDE sensors or methods can be the solution for the best NDE results in many problems. This is certainly another area that requires a major research effort.
Acknowledgements

The author gratefully acknowledges Prof. A. K. Jain for permission to use Figs. 4-7, and for supplying the original illustrations for these figures. This work was supported by the Information Research Laboratory, Inc. through SBIR Phase III with internal R&D funding.
References
[1] P. G. Doctor and T. P. Harrington, Analysis of eddy current data using pattern recognition methods, in Proc. IEEE 5th Int. Joint Conference on Pattern Recognition, Miami, FL, Dec. 1980, 137-139.
[2] R. Shanker et al., Feasibility of using adaptive learning methods for eddy current signal analysis, Electric Power Research Institute Report EPRI NP-723, TPS77-723, Mar. 1978.
[3] R. K. Elsley and L. J. Graham, Pattern recognition techniques applied to sorting acoustic emission signals, in IEEE Ultrasonic Symp. Proc., Sept.-Oct. 1976, 147-150.
[4] W. Y. Chan, D. R. Hay, C. Y. Suen and O. Schwelb, Application of pattern recognition techniques in the identification of acoustic emission signals, in Proc. IEEE 5th Int. Joint Conference on Pattern Recognition, Miami Beach, FL, Dec. 1980, 108-111.
[5] R. Halmshaw, Non-destructive Testing (Edward Arnold, London, 1987).
[6] R. C. McMaster, Nondestructive Testing Handbook, Vols. 1 and 2 (American Society for Nondestructive Testing, Columbus, OH, 1959).
[7] J. Krautkramer and H. Krautkramer, Ultrasonic Testing of Materials, Fourth Ed. (Springer-Verlag, New York, 1990).
[8] D. E. Bray and R. K. Stanley, Nondestructive Evaluation: A Tool in Design, Manufacturing and Service (McGraw-Hill, New York, 1989).
[9] C. H. Chen, On a high resolution ultrasonic inspection system, in D. O. Thompson and D. E. Chimenti (eds.), Review of Progress in Quantitative Nondestructive Evaluation (QNDE), Vol. 9A (Plenum Press, New York, 1990) 959-965.
[10] C. H. Chen, High resolution ultrasonic spectroscopy system for nondestructive evaluation, Final Report on Contract DAAL04-88-C-0003, submitted to US Army Materials Technology Lab., Jan. 1991.
[11] D. R. Hay et al., ICEPAK (Intelligent Classifier Engineering Package) pattern recognition software package, Tektrend International Inc., 1992.
[12] A. N. Mucciardi et al., TestPro software package, Informetrics, 1989.
[13] C. H. Chen, Pattern recognition for the ultrasonic nondestructive evaluation of materials, Int. J. Pattern Recogn. Artif. Intell. 1 (1987) 251-260.
[14] C. H. Chen, Signal processing in nondestructive evaluation of materials, in C. H. Chen (ed.), Signal Processing Handbook (Marcel Dekker, New York, 1988) 661-682.
[15] C. H. Chen, High resolution spectral analysis NDE techniques for flaw characterization, prediction and discrimination, in C. H. Chen (ed.), Signal Processing and Pattern Recognition in Nondestructive Evaluation of Materials (Springer-Verlag, Berlin-Heidelberg, 1988) 155-173.
[16] C. H. Chen, Time series analysis for ultrasonic nondestructive testing, in C. H. Chen (ed.), Applied Time Series Analysis (World Scientific, Singapore, 1989) 121-136.
[17] O. R. Gericke, Determination of the geometry of hidden defects by ultrasonic pulse analysis testing, J. Acoustical Society of America 35 (1963) 364-368.
[18] O. R. Gericke, Ultrasonic spectroscopy, U.S. Patent 3538753, Nov. 1970.
[19] C. H. Chen and W. L. Hsu, Modern spectral analysis for ultrasonic NDT, in Proc. Conf. on Nondestructive Testing of High-Performance Ceramics, Boston, MA, Aug. 1987, 401-407.
[20] K. Honjoh, Y. Sudoh and J. Masuda, Evaluation technique for metal by analysing ultrasonic spectrum, in J. Boogaard and G. M. van Dijk (eds.), Proc. 12th World Conference on Non-Destructive Testing (Elsevier Science Publishers, Amsterdam, 1989).
[21] C. H. Chen and S. K. Sin, On effective spectrum-based ultrasonic deconvolution techniques for hidden flaw characterization, J. Acoustical Society of America 87 (1990) 976-987.
[22] C. H. Chen and S. K. Sin, High-resolution deconvolution techniques and their applications in ultrasonic NDE, Int. J. Imaging Systems and Technology 1 (1989) 223-242.
[23] S. K. Sin and C. H. Chen, A comparison of deconvolution techniques for the ultrasonic nondestructive evaluation of materials, IEEE Trans. Image Process. 1 (1992) 3-10.
[24] P. Flandrin, Non-destructive evaluation in the time-frequency domain by means of the Wigner-Ville distribution, in C. H. Chen (ed.), Signal Processing and Pattern Recognition in Nondestructive Evaluation of Materials (Springer-Verlag, Berlin-Heidelberg, 1988) 109-116.
[25] C. H. Chen and J. C. Guey, On the use of Wigner distribution in ultrasonic NDE, in D. O. Thompson and D. E. Chimenti (eds.), Review of Progress in Quantitative Nondestructive Evaluation, Vol. 11A (Plenum Press, New York, 1992) 967-974.
[26] C. H. Chen and G. G. Lee, On the wavelet transform and its applications to ultrasonic NDE, unpublished report submitted to the US Army Materials Technology Lab., Jun. 1992.
[27] R. W. Y. Chan, D. R. Hay, J. R. Matthews and H. A. MacDonald, Automated ultrasonic system for submarine pressure hull inspection, in C. H. Chen (ed.), Signal Processing and Pattern Recognition in Nondestructive Evaluation of Materials (Springer-Verlag, Berlin-Heidelberg, 1988) 176-186.
[28] M. Desai and D. J. Shazeer, Acoustic transient analysis using wavelet decomposition, Proc. IEEE Conf. on Neural Networks for Ocean Engineering, Washington, D.C., Aug. 1991, 29-40.
[29] Nondestructive Testing Handbook, Vol. 7: Ultrasonic Testing, second edn., chapter on pattern recognition methods (American Society of Nondestructive Testing, 1991).
[30] C. H. Chen, Applying and validating neural network technology for nondestructive evaluation of materials, in Proc. IEEE Systems, Man and Cybernetics Society Conf., Cambridge, MA, Nov. 1989.
[31] Nestor Development System (NDS 1000), Nestor, Inc., 1988.
[32] C. H. Chen, Application of neural networks and fuzzy logic in nondestructive evaluation of materials, in C. H. Chen (ed.), Fuzzy Logic and Neural Network Handbook (McGraw-Hill, New York, 1996) 20.1-20.10.
[33] K. Shahani, L. Udpa and S. S. Udpa, Time delay neural networks for classification of ultrasonic NDT signals, in D. O. Thompson and D. E. Chimenti (eds.), Review of Progress in Quantitative Nondestructive Evaluation, Vol. 11A (Plenum Press, New York, 1992) 693-670.
[34] M. Kitahara et al., Neural network for crack-depth determination from ultrasonic backscatter data, in D. O. Thompson and D. E. Chimenti (eds.), Review of Progress in Quantitative Nondestructive Evaluation, Vol. 11A (Plenum Press, New York, 1992) 701-708.
[35] S. S. Udpa, Signal processing for eddy current nondestructive evaluation, in C. H. Chen (ed.), Signal Processing and Pattern Recognition in Nondestructive Evaluation of Materials (Springer-Verlag, Berlin-Heidelberg, 1988) 129-144.
[36] J. E. S. Macleod, Pattern classification in the automatic inspection of tubes scanned by a rotating eddy-current probe, in Proc. IEEE 5th Int. Joint Conf. on Pattern Recognition, Munich, Germany, 1982, 214-216.
[37] K. Grotz and T. W. Guettinger, Fast pattern recognition method for eddy current testing, in D. O. Thompson and D. E. Chimenti (eds.), Review of Progress in Quantitative Nondestructive Evaluation, Vol. 11A (Plenum Press, New York, 1992) 919-926.
[38] A. N. Mucciardi, Elements of learning control systems with applications to industrial processes, in Proc. IEEE Conf. on Decision and Control, New Orleans, LA, 1972.
[39] L. Udpa and S. S. Udpa, Eddy current defect characterization using neural networks, Materials Evaluation 48 (1990) 342-353.
[40] L. J. Graham and R. K. Elsley, AE source identification by frequency spectral analysis for an aircraft monitoring application, J. Acoustic Emission 2 (1983) 47-55.
[41] R. K. Elsley and L. J. Graham, Pattern recognition in acoustic emission experiments, in Proc. SPIE Symp. on Pattern Recognition and Acoustic Imaging, Newport Beach, CA, Feb. 1987.
[42] M. A. Friesel et al., Acoustic emission applications on the NASA space station, in D. O. Thompson and D. E. Chimenti (eds.), Review of Progress in Quantitative Nondestructive Evaluation, Vol. 11A (Plenum Press, New York, 1992) 725-732.
[43] R. Grimm, Progress in real-time radiography, in J. Boogaard and G. M. van Dijk (eds.), Proc. 12th World Conf. on Non-Destructive Testing (Elsevier Science Publishers, Amsterdam, 1989).
[44] A. K. Jain, M.-P. Dubuisson and M. S. Madhukar, Multi-sensor fusion for nondestructive inspection of fiber reinforced composite materials, in Proc. of the American Society for Composites, Albany, NY, Oct. 1991, 941-950.
[45] A. K. Jain and M.-P. Dubuisson, Segmentation of X-ray and C-scan images of fiber reinforced composite materials, Pattern Recogn. 25 (1992) 257-270.
[46] A. Rosenfeld and A. C. Kak, Digital Picture Processing, Vols. 1 and 2 (Academic Press, Orlando, FL, 1982).
[47] J. Besag, On the statistical analysis of dirty pictures, J. Roy. Stat. Soc. 48 (1986) 259-302.
[48] S. D. Yanowitz and A. M. Bruckstein, A new method for image segmentation, Comput. Graph. Image Process. 46 (1989) 82-95.
[49] J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. 8 (1986) 679-698.
[50] F. G. Smith, K. R. Jepsen and P. F. Lichtenwalner, Comparison of neural network and Markov random field image segmentation techniques, in D. O. Thompson and D. E. Chimenti (eds.), Review of Progress in Quantitative Nondestructive Evaluation, Vol. 11A (Plenum Press, New York, 1992) 717-724.
[51] C. H. Chen, G. H. You and P. S. King, Pattern wafer image representation using neural networks, in Image Processing: Theory and Applications, edited by G. Vernazza (Elsevier Science Publishers, Amsterdam, 1993) 213-217.
[52] G. Thomas et al., Nondestructive evaluation techniques for enhanced bridge inspection, in D. O. Thompson and D. E. Chimenti (eds.), Review of Progress in Quantitative Nondestructive Evaluation, Vol. 13B (Plenum Press, New York, 1994) 2083-2090.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 473-505 Eds. C. H. Chen, L. F. Pau and P. S. P. Wang © 1998 World Scientific Publishing Company
CHAPTER 3.2 DISCRIMINATIVE TRAINING - RECENT PROGRESS IN SPEECH RECOGNITION
SHIGERU KATAGIRI
ATR Interpreting Telecommunications Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
E-mail: katagiri@hip.atr.co.jp
and
ERIK MCDERMOTT
ATR Human Information Processing Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
E-mail: mcd@hip.atr.co.jp
Research in speech recognition possesses specific technical issues based on the acoustic and linguistic nature of speech patterns, and also includes general design issues that are common to a wide range of pattern recognition. In particular, over the recent decade, there has been remarkable progress in discriminative training methodology for speech recognition. Several new, powerful discriminative design methods were developed. Some of them were defined in the framework of recently developed Artificial Neural Networks; some were defined through the re-investigation of the traditional discriminant function approach to pattern recognition. The evaluation of these recent methods is still ongoing. However, many experimental results have clearly demonstrated their superiority to the conventional methods based on the Maximum Likelihood estimation principle. The utility of these methods is not limited to speech recognition; they can be applied to most pattern recognition domains. It should thus be worthwhile giving in this chapter a comprehensive introduction to these recent, discriminative methods for training speech recognizers. The chapter is organized as follows. After the Introduction in Section 1, we summarize in Section 2 the fundamentals of classifier design from the viewpoint of Bayes decision theory. In Section 3, we survey the recent embodiments of discriminative training. Several important cases, including the Time-Delay Neural Networks and the Shift-Tolerant Learning Vector Quantization, both epoch-makers in the recent research history of speech recognition, are introduced and their advantages and disadvantages are discussed. In addition to the approaches described in Section 3, there is another important family of discriminative training methods based on the Generalized Probabilistic Descent method. For clarity, we dedicate one section, i.e., Section 4, to an overview of this family of methods. In Section 5, some suggestions are made for further study, and the chapter is summarized in Section 6.

Keywords: Speech recognition, discriminative training, artificial neural networks, learning vector quantization, generalized probabilistic descent method, minimum classification error learning, discriminative feature extraction.
1. Introduction

Speech (pattern) recognition is a process of mapping a dynamic (variable-durational) wave pattern sample, which belongs to one of M speech classes, to a class index C_j (j = 1, ..., M). We specifically consider the design problem of training the adjustable parameter set Φ of a recognizer, aiming to achieve the optimal (best in recognition accuracy for all future samples) recognition decision status. One of the fundamental approaches to this problem is the Bayes approach using the following Bayes decision rule, whose rigorous execution is well known to lead to the above-cited optimal status:

C(u_1^{T_0}) = C_i ,  iff i = arg max_j p(C_j | u_1^{T_0}) ,   (1.1)
where u_1^{T_0} is a dynamic speech wave sample with length T_0 and where C(·) represents the recognition operation. The training goal in this approach is to find a state of Φ that enables the corresponding estimate p_Φ(C_j | u_1^{T_0}) (a function of Φ) to precisely approximate the true a posteriori probability p(C_j | u_1^{T_0}), or in other words, to adjust Φ so that p_Φ(C_j | u_1^{T_0}) can approximate p(C_j | u_1^{T_0}) as precisely as possible. In reality, however, it is rarely attempted to take this direct way of estimating the a posteriori probability functions for the original wave samples, due to several reasons including the following: (1) the original speech wave signal is usually too redundant and often possesses information irrelevant to recognition, resulting in a heavy computational load in the recognition process and low recognition accuracy, and (2) it is almost impossible to know the true functional form of the probability function. Actually, in most cases, a speech recognizer is a modular system, consisting of (1) a feature extractor (feature extraction subprocess) and (2) a classifier (classification subprocess) that is further divided into a language model and an acoustic model. Let us represent the adjustable parameter sets of the feature extractor, the acoustic model, and the language model by Ψ, Λ, and Υ, respectively: Φ = Ψ ∪ Λ ∪ Υ. The feature extraction subprocess converts a speech wave sample to a dynamic pattern x_1^T = (x_1, x_2, ..., x_t, ..., x_T) that is a sequence of static (fixed-dimensional) F-dimensional acoustic feature vectors, where T is the duration of the dynamic pattern and x_t is the t-th feature vector of the sequence. The feature vector is generally made up of cepstrum or filter-bank output coefficients. Ψ is the trainable parameter set, such as lifter and filter-bank functions, that controls the nature of the feature vectors. The terms "cepstrum" and "lifter" will be described in somewhat more detail in Section 4. Next, the classification subprocess assigns a class index to this converted feature pattern. This assignment is generally performed by using the classification rule
C(x_1^T) = C_i ,  iff i = arg max_j p(x_1^T | C_j) p(C_j) ,   (1.2)
which is conceptually equivalent to (1.1). Note in (1.2) that in accordance with the Bayes rule of probability, the a posteriori probability is replaced by conditional
probability (density) and a priori probability, which both can be easily, though not accurately, estimated based on the well-analyzed Maximum Likelihood (ML) method. In fact, the conditional probability density p(x_1^T | C_j) is often computed by the estimate p_Λ(x_1^T | C_j) using the acoustic model of Hidden Markov Models (HMMs); Λ corresponds to a model parameter set including, for example, the HMM state transition probabilities and the mean vectors of a Gaussian continuous HMM. Also, the a priori probability p(C_j) is often computed by the estimate p_Υ(C_j) using a language model such as a bigram model or a Probabilistic Context-Free Grammar; Υ then corresponds to the probability values themselves of the bigram model and the grammar. In the same sense as (1.1), substituting accurate estimates for the probabilities in (1.2) enables one to achieve the optimal, minimum classification error status, conditioned by the selection of the feature extraction subprocess. Thus, it seems plausible to consider the accurate estimation of these conditional and a priori probabilities to be a desirable design objective. Most traditional speech recognizers are designed based on the design principle of this ML approach (to classification); specifically, the Expectation-Maximization method or Segmental k-means Clustering is used for training the HMM acoustic model, and the Inside-Outside algorithm is used for designing the Probabilistic Context-Free Grammar. Note here that the Minimum Distortion principle, which underlies the design of the reference pattern models of a distance classifier that was widely used prior to HMM classifiers, is fundamentally equivalent to this ML principle. However, this conventional ML-based approach has often been less than satisfactory. ML-designed recognizers have generally required a large set of trainable classifier parameters in order to achieve high classification accuracy, requiring a large set of design samples. In fact, behind these negative experimental findings there is the fundamental problem that the functional form of the class distribution (the conditional probability density) function to be estimated is intrinsically rarely known, and that the likelihood maximization of these estimated functions, performed to model the entire class distribution class-by-class, is not direct with regard to the minimization of classification errors (the accurate estimation of class boundaries). Therefore, due to the intrinsic limitation of the ML-based approach, research interest in its alternative, i.e. discriminative training, has gradually increased over the last decade. The techniques of feature extraction and probability estimation are described in detail in textbooks such as [1].
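In schematic form, the plug-in version of rule (1.2) looks as follows. The scoring functions are placeholders for whatever acoustic and language models are used (e.g. HMM log-likelihoods and bigram log-probabilities); this is not a particular toolkit's API.

```python
import numpy as np

def classify(x, acoustic_log_score, language_log_prior, classes):
    """Plug-in version of rule (1.2): argmax_j  log p(x | C_j) + log p(C_j).

    acoustic_log_score(x, c) : estimate of log p_Lambda(x | C_c), e.g. an HMM score
    language_log_prior(c)    : estimate of log p_Upsilon(C_c), e.g. a bigram prior
    """
    scores = [acoustic_log_score(x, c) + language_log_prior(c) for c in classes]
    return classes[int(np.argmax(scores))]
```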
2. Classifier Design Based on Bayes Decision Theory

2.1. Bayes Decision Theory
For the general task of classifying M-class patterns, the Bayes theory is formulated as follows (see [2] for details). In accordance with the conventional presentation of this theoretical framework, we here assume a sample to be static (fixed-dimensional). A static feature pattern x is given. To measure the accuracy,
the theory first introduces an (individual) loss ℓ_k(C(x)) that is incurred by judging the given pattern's class C_k to be one of the M possible classes, where C(·) denotes the classification operation as in (1.1). It is obvious that the accuracy of the task should be evaluated over all of the possible samples. Thus, using statistical theory, the theory next introduces an expected loss incurred by classifying x, called the conditional risk, as
L(C(x) | x) = Σ_k ℓ_k(C(x)) 1(x ∈ C_k) p(C_k | x)   (2.1)
and also introduces an expected loss associated with C(·), called the overall risk, as
L = ∫ L(C(x) | x) p(x) dx ,   (2.2)
where 1(d) is the following indicator function:

1(d) = 1 (if d is true), 0 (otherwise).   (2.3)
The accuracy is accordingly defined by the overall risk. The smaller the overall risk, the better its corresponding classification result. A desirable decision is thus the one that minimizes the overall risk. Consequently, the theory leads to the following well-known Bayes decision rule.
Bayes decision rule: To minimize the overall risk, compute all of the (M) possible conditional risks L(C_i | x) (i = 1, ..., M) and select the C_j for which L(C_j | x) is minimum.

2.2. Minimum Error Rate Classification

Using the general concept of loss enables one to evaluate classification results flexibly. This is one of the big attractions of the Bayes decision theory. However, in practice, a particular loss that evaluates results evenly for all of the possible classes has been used widely. The most natural example of this type of loss is the following error count loss
ℓ_k(C(x)) = 0 (if C(x) = C_k), 1 (otherwise),   (2.4)

and then its corresponding conditional risk becomes

L(C(x) | x) = 1 - p(C(x) | x)   (2.5)
and the overall risk becomes the average probability of error. Thus, the minimization of the overall risk based on (2.4) leads to the minimization of the average probability of error, in other words, the minimum error rate classification (e.g. [2]). According to the above Bayes decision rule, it is obvious that the desirable classification decision is the one that minimizes the conditional risk (2.5), or maximizes
the a posteriori probability p(C(x) | x). As far as minimum error classification is concerned, the best classifier design is theoretically achieved by simulating (2.5) as accurately as possible. Hence this way of thinking justifies the ML-based approach of aiming at a correct estimation of the a posteriori probability, or its corresponding a priori probability and conditional probability. However, as cited before, an accurate estimation of these probability functions is difficult in realistic situations where only limited resources are available, and generally this approach does not achieve satisfactory design results.
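A compact way to see the relationship between (2.1)-(2.5) is the sketch below: given (estimated) posteriors, the Bayes rule picks the minimum-conditional-risk class, and with the error-count loss of (2.4) this reduces to picking the maximum a posteriori class. The matrix-valued loss is included only as an illustrative generalization.

```python
import numpy as np

def bayes_decision(posteriors, loss=None):
    """Bayes decision rule over M classes.

    posteriors : length-M array of (estimated) p(C_k | x)
    loss       : optional M x M matrix, loss[i, k] = cost of deciding C_i when the
                 true class is C_k; None means the 0/1 error-count loss of (2.4)
    """
    posteriors = np.asarray(posteriors, dtype=float)
    if loss is None:
        # with the error-count loss the conditional risk is 1 - p(C_i | x),
        # so minimizing it is the same as taking the maximum a posteriori class
        return int(np.argmax(posteriors))
    risks = loss @ posteriors          # conditional risk L(C_i | x) for every i
    return int(np.argmin(risks))
```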
2.3. Discriminative Training

Originally, the probability functions are given for the task at hand. In the ML-based approach, the estimation of these functions constitutes the classifier design phase. Taking into account the fact that these given, true functions are unobservable and essentially difficult to estimate accurately, one may find an alternative approach to the design; that is, one can consider that the design should involve an accurate evaluation based on (2.4) for every sample. This is the very concept of the discriminative training design. However, unfortunately, this point of view has not been explicitly indicated in most conventional discriminative training design attempts. Usually, the design formalization has started with the following decision rule, selecting arbitrary types of discriminant functions, which are not necessarily the probability functions, and also selecting some optimization criteria which are used to design these discriminant functions but do not necessarily have a direct relationship to (2.4):
C(x) = C_i ,  iff i = arg max_j g_j(x; Λ) .   (2.6)
The implementation of discriminative training is basically characterized by the following factors:

- functional form of the discriminant function (classifier structure or measure),
- design (training) objective,
- optimization method,
- consistency with unknown samples (robustness, generalization capability, or estimation sensitivity).
Classifier structure, which determines the type of Λ, is generally selected based on the nature of the patterns, and this selection often determines the selection of the measure or the selection of g_j(x; Λ). The linear discriminant function is a typical example of a classical classifier structure, and the measure used therein is a linearly-weighted sum of input vector components. A distance classifier that uses the distance between an input and a reference vector as the measure is another widely used example. The design objective is a function used for evaluating a classification result in the design stage and it corresponds to the concept of loss. Usually, an individual
loss that is a design criterion for an individual design sample is first introduced; the individual loss for x (∈ C_k) is denoted by ℓ_k(x; Λ). It should be noted here that this individual loss used for the design phase is a function of Λ. As discussed in the previous section, a natural form of the individual loss is the classification error count:

ℓ_k(x; Λ) = 0 (if C(x) = C_k), 1 (otherwise).   (2.7)
Next, similar to the overall risk, an ideal overall loss, i.e. the expected loss, is defined by using the individual loss as

L(Λ) = Σ_k ∫ ℓ_k(x; Λ) 1(x ∈ C_k) p(x) dx .   (2.8)

However, since sample distributions are essentially unknown, it is impossible to use this expected loss in practice. For this reason, an empirical average loss, defined in the following, is usually used:

L_0(Λ) = (1/N) Σ_{n=1}^{N} Σ_k ℓ_k(x_n; Λ) 1(x_n ∈ C_k) ,   (2.9)
where the subscript n of x_n explicitly means that the sample x_n is the n-th sample of the finite (consisting of N samples) design sample set X = {x_1, ..., x_n, ..., x_N}. In the case of using the error count loss of (2.7), this empirical average loss becomes the total count of classification errors measured over X. In addition to (2.7), several different types of loss forms have been used, e.g. the Perceptron loss, the squared error loss (between a discriminant function and its corresponding supervising signal), and mutual information. However, these are sub-optimal selections, and it is known that design results obtained by using them are not necessarily consistent with the minimum classification error probability situation (e.g. see [2]). Optimization is the process of finding, in a competitive manner with regard to classes, the state of Λ that minimizes the loss over X, and its embodiments can be grouped into heuristic methods and mathematically-proven algorithms. They are also categorized from another point of view, i.e. the batch type search versus the adaptive (sequential) search. Among many approaches such as Error Correction Training, Stochastic Approximation, and Simulated Annealing, the methods based on gradient search over the average empirical loss surface have been widely used due to their practicality and mathematical validity [2-5]. Originally, the purpose of design is to realize the state of Λ that leads to accurate classification over all of the samples of the task at hand, not just the given design samples. It should be recalled that the expected loss was defined as an ideal overall loss. Thus it is desirable that the design result obtained by using design samples be consistent with unknown samples too. Obviously, a pursuit of this nature requires some assumptions concerning unobservable unknown samples, or the entire sample
distribution of the task. Some additional information is needed that brings (2.9) closer to (2.8). In contrast with ML-based modeling, which introduces a parametric probability function in order to estimate the overall sample distribution, measures to increase the consistency in discriminative training are generally moderate: the loss is individually defined for every design sample, and the design is fundamentally performed over the given design samples, as shown in the use of the average empirical loss (2.9). In many cases of discriminative training, consistency is merely a result of the selection of the discriminant function or the measure. We have considered the classification of static patterns in this section. The issues discussed above also hold true in the classification of dynamic patterns: the discussion did not require the static nature of patterns.
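The following toy sketch illustrates a discriminative (gradient-search) design loop in the spirit of this section: linear discriminant functions g_j(x; Λ), a sigmoid as a smoothed stand-in for the error-count loss so that gradients exist, and plain stochastic gradient descent over the design samples. The misclassification measure and the smoothing constant are illustrative choices only; Section 4 discusses the proper Generalized Probabilistic Descent formulation.

```python
import numpy as np

def train_discriminative(X, labels, n_classes, epochs=20, lr=0.05, alpha=5.0):
    """Minimize a smoothed empirical classification-error loss by gradient descent.

    Discriminant functions: g_j(x; Lambda) = w_j . x + b_j.
    Misclassification measure: d_k(x) = max_{j != k} g_j(x) - g_k(x).
    Sample loss: sigma(alpha * d_k), a differentiable surrogate for the 0/1 count.
    """
    dim = X.shape[1]
    W = np.zeros((n_classes, dim))
    b = np.zeros(n_classes)
    for _ in range(epochs):
        for x, k in zip(X, labels):
            g = W @ x + b
            j = np.argmax(np.where(np.arange(n_classes) == k, -np.inf, g))  # best rival
            d = g[j] - g[k]                       # misclassification measure
            s = 1.0 / (1.0 + np.exp(-alpha * d))  # smoothed loss value
            grad = alpha * s * (1.0 - s)          # d(loss)/d(d)
            W[k] += lr * grad * x; b[k] += lr * grad   # push the correct class up
            W[j] -= lr * grad * x; b[j] -= lr * grad   # push the best rival down
    return W, b
```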
3. Embodiments of Discriminative Training

The unifying thread of discriminative training approaches is that they incorporate an explicit comparison among the task categories during the design phase. The speech recognition community has long been aware of the need for this kind of class-competitive system design. In the following we describe some of the landmarks in discriminative training applications to speech recognition.
3.1. Artificial Neural Networks for Speech Recognition

Since the late 1980s, there has been a proliferation of work in applying Artificial Neural Networks (ANNs) to problems in speech recognition. For a variety of tasks, ANN applications have shown great promise compared to the dominant approach to speech recognition, ML-based hidden Markov modeling. At the same time, ANNs typically require longer training times, and perhaps more careful tuning, compared to their HMM counterparts. Perhaps due to these reasons, for the large speech databases in use today (e.g. the Switchboard database with 240 hours of speech data), there are few complete systems based primarily on ANN techniques. However, one of the best results for the TIMIT database was obtained using a recurrent neural network [6], and the same system has obtained good results on the large Wall Street Journal database. In spite of practical difficulties in using ANNs for speech recognition, there is still considerable interest in this domain. Furthermore, a large body of work exists connecting ANNs with the HMM framework. In the following we introduce multi-layered, feed-forward artificial neural networks, and describe a few landmark applications to speech recognition. As this is a tremendously active area of research, the approaches described here should only be viewed as representative, not comprehensive.
3.1.1. Background: multi-layer perceptrons

Work in ANN models began more than 40 years ago with the work of Hebb, McCulloch & Pitts, Rosenblatt, Widrow & Hoff, Amari, Kohonen, and other pioneers (e.g. [7]). In the mid 1980s, work by such researchers as Hopfield, Rumelhart
and McClelland, and Sejnowski led to a resurgence of interest in ANN models of machine learning (e.g. [8]). In particular, during this time, the Back-propagation algorithm for training a Multi-Layer Perceptron (MLP), originally proposed in 1974 [9], was rediscovered and popularized [8]. Back-propagation is essentially a stochastic gradient descent procedure, "back-propagating" the gradient of the loss from the output nodes to the input nodes, modifying inter-node connection weights using the chain rule of differential calculus along the way. This method is typically used to minimize the mean squared error loss between the actual network output and the desired output. Successful minimization of this loss indicates that the network was able to learn the task at hand. The representational power of the MLP is considerable. It has been shown that a network with a sufficient number of layers and nodes can realize arbitrarily complex decision regions in the pattern space [10], and can approximate a wide class of functions [11]. Though there is no guarantee that Back-propagation (or similar optimization procedures) will always make full use of this representational power, practical experience has shown that ANNs trained with Back-propagation on the mean squared error loss are able to learn a wide range of tasks [8]. In the context of pattern recognition, the discriminative power of ANNs therefore derives from both the intrinsic representational power of these networks and the manner in which the networks are trained, i.e. the discriminative training using the mean squared error loss. Usually, there is one output node for each category. During training for a given training sample, the desired output is 1 for the correct category's output node, and 0 for all other nodes. Thus the network is explicitly asked to recognize a training sample as either in-category or out-of-category. Training is performed for all categories simultaneously, and not in the category-dependent approach used in most ML applications. As a result, an MLP trained with the minimization of the mean squared error loss is typically more discriminative than a classifier trained category by category using ML-based methods.
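The following is a minimal numpy sketch of that training scheme: a single hidden layer, sigmoid units, 1-of-M target vectors, and back-propagation of the squared-error gradient. The layer size, the learning rate and the omission of bias terms are arbitrary simplifications for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, targets, n_hidden=16, epochs=100, lr=0.1, seed=0):
    """One-hidden-layer perceptron trained by back-propagation on the MSE loss.

    X       : n_samples x n_inputs
    targets : n_samples x n_classes, 1 for the correct class and 0 elsewhere
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))
    W2 = rng.normal(scale=0.1, size=(n_hidden, targets.shape[1]))
    for _ in range(epochs):
        for x, t in zip(X, targets):
            h = sigmoid(x @ W1)                  # hidden activations
            y = sigmoid(h @ W2)                  # output activations (one per class)
            # back-propagate the squared-error gradient through the sigmoids
            delta_out = (y - t) * y * (1.0 - y)
            delta_hid = (W2 @ delta_out) * h * (1.0 - h)
            W2 -= lr * np.outer(h, delta_out)
            W1 -= lr * np.outer(x, delta_hid)
    return W1, W2
```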
3.1.2. Time-delay neural network
As described above, a speech wave pattern is usually represented in the feature extraction stage as a dynamic sequence of static spectral feature vectors, and a task of the classifier is to classify such dynamic sequence patterns. An ANN-based classifier must thus somehow pick out speech events, each corresponding to a phoneme or word, from a continuous or semi-continuous stream of feature vectors. Such a system should be shift-tolerant, in that the precise position of a speech event in the speech signal does not affect the recognition of that event. The approach adopted in the highly influential Time Delay Neural Network (TDNN) [12] is to feed the network a stream of 15 spectral feature frames of 10 msec each (for a total of 150 msec of input speech), and constrain it to learn only from 30 msec input windows of speech that are shifted, 10 msec at a time, over the whole speech input. After the input layer, the first intermediate or hidden layer of the network is able to group together features learned by the first layer and form new,
Fig. 1. Time Delay Neural Network for the recognition of /b/, /d/ and /g/ (150 msec of speech input; final phoneme activations at the top).
abstract features: but again, these intermediate abstractions are constrained to windows of 5 frames from the first layer (i.e. to windows of 5 windows of 30 msec of speech). The next layer connects this level of abstract feature representation to the output nodes, which correspond to the target phonemes of the task. The structure of TDNN is illustrated for a 3 class phoneme recognition problem in Fig. 1. The manner in which the connections are constrained forces the network to learn representations of speech events that are roughly independent of where those events occur in the input. This provides the network with a degree of shift-tolerance. Waibel et al. compared a TDNN and a discrete HMM on problems in Japanese phoneme recognition [12,13]. TDNN performed significantly better than its HMM counterpart on the tasks examined, achieving recognition performances in the high 90 percent range, compared to the low 90 percent range displayed by the HMMs used in the experiments. Furthermore, analysis of network performance suggested that the intermediate layers of the network are able to build abstract representations of speech based on the output from the early layers. Thus the lowest layer of the network performs a local pattern analysis, and the higher levels build on that until
finally abstract representations in the next-to-last layer are linked to the output nodes representing the individual phonemes of interest.
It should be noted that these early experiments were performed in a speaker-dependent manner, and the phoneme events used in the experiments were pre-segmented by hand. The task was therefore simpler and easier than "normal" speech recognition tasks. Nonetheless, these results serve to illustrate the TDNN concept, and suggested that ANN-based approaches to speech recognition might outperform the conventional HMM approach.

3.1.3. Expanding the scope of ANNs for speech recognition: incorporation of dynamic programming
The TDNN in its original form showed great promise for the small phoneme recognition experiments described above. It should be recalled that the training and testing data for these experiments consisted of 150 msec speech tokens, generated from labeled speech data. In a practical speech recognizer, it is unrealistic to assume that the speech signal can be accurately segmented into its constituent phonemes. A recognizer must be able to handle speech input that may contain a variable number of phonemes or words with unknown segment boundaries. The classic solution to this central problem in speech recognition is the use of Dynamic Programming (DP) to consider multiple, competing segmentations and to choose the likeliest one to generate the classifier output, in an efficient manner (see, for example, [14]). There have been two main approaches to incorporating DP in ANNs: (1) using ANNs to provide local DP/HMM scores, and (2) training the ANN globally in the context of the DP operation. In the following, we briefly describe these two approaches by introducing typical implementations.
Bourlard and Wellekens [15] suggested early on that the output of an MLP could be used to provide the local phonemic distance scores needed in a DP-based recognizer. In a further study, the same authors made explicit the probabilistic interpretation of neural network outputs described above, and showed how local outputs (based on a few frames of input speech) could be used in an HMM [16]. The motivation for doing this is that an ANN would provide more discriminative local probabilities, which could then be integrated for word or sentence recognition using the familiar HMM framework. The ANNs in this study are not limited to single frames (feature vectors) of speech, but can incorporate neighboring frames to include more context, as described above for the TDNN, or even recurrent, feed-back connections that act as a memory of previous frames (the extent to which the memory of the network can extend backwards in time is limited by the number of parameters of the network, and the difficulty or ease of training such a network). Many such "hybrid" models have been proposed. For instance, the recurrent system described in [6] is closely related to the above proposal. Similarly, the TDNN was used to provide local phonemic scores in a DP-based LR-parser and applied successfully to large vocabulary isolated word recognition tasks [17].
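As a rough illustration of the first approach, the sketch below combines per-frame network outputs by DP over a simple left-to-right state sequence; the scaled-posterior trick and the toy model topology are our assumptions, not the exact systems of [15,16].

import numpy as np

def viterbi_score(frame_log_scores, log_transitions):
    """DP combination of frame-level scores over a left-to-right state model.

    frame_log_scores: (T, S) log local scores per frame and state, e.g. taken
                      from an MLP as log(P(state | frame) / P(state)).
    log_transitions:  (S, S) log transition scores.
    Returns the best overall log score and the best state sequence.
    """
    T, S = frame_log_scores.shape
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    dp[0, 0] = frame_log_scores[0, 0]              # start in the first state
    for t in range(1, T):
        for s in range(S):
            prev = dp[t - 1] + log_transitions[:, s]
            back[t, s] = int(np.argmax(prev))
            dp[t, s] = prev[back[t, s]] + frame_log_scores[t, s]
    path = [S - 1]                                 # must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return dp[T - 1, S - 1], path[::-1]

def recognize(frame_posteriors, state_priors, word_models):
    """Pick the word whose state sequence best explains the frame-level MLP outputs.
    word_models maps each word to (state index array, log transition matrix)."""
    scaled = np.log(frame_posteriors + 1e-12) - np.log(state_priors + 1e-12)
    scores = {w: viterbi_score(scaled[:, states], trans)[0]
              for w, (states, trans) in word_models.items()}
    return max(scores, key=scores.get)

Here the network is trained independently of the DP super-structure, which is exactly the limitation that motivates the globally trained systems described next.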
Though in [16] and [6] recurrent connections can provide the network with a potentially very large context on which to base its frame-by-frame probabilities/distances, these methods still require an independent super-structure, such as an HMM or a DP-based classifier, to integrate the frame-by-frame results into an overall recognition result for the entire speech input. The training of the ANN in each case is performed independently of this super-structure. An attractive alternative is to incorporate the recognizer super-structure (typically based on DP) into the training of the ANN. This is the approach adopted in the Multiple-State TDNN (MS-TDNN) [18,19]. DP is applied to local TDNN phonemic or sub-phonemic (in the following, "state") outputs to generate an overall recognition output (e.g. a word or a string of words) that represents the best phonemic segmentation among all possible segmentations. In this forward recognition phase, the approach is nearly identical to the above integrations of ANNs with DP-based segmentation or HMMs. However, the key point is that during the training phase, the error function in MS-TDNN is defined over the entire input, and explicitly reflects the loss (the mean squared error in [18], but mutual information in [19]) at the level of the final category outputs of the problem. Furthermore, unlike the original TDNN, the final output (and therefore the loss) incorporates the DP process. The gradient of the loss is now propagated over the best segmentation (the frame-by-frame state sequence with the best overall score) before being propagated down to the usual TDNN connections. For a given frame, the gradient of the loss is propagated only to the TDNN corresponding to the state that is on the best DP path at that frame. Adapting the connection weights in this way is thus much more directly related to the final object of interest (for example, a sequence of words) than training based on the local, frame-by-frame loss definition described in the previous section. The MS-TDNN is therefore an example of a complete, ANN-based paradigm for the design of a speech recognizer.
3.2. Other "Neural" Approaches: Prototype/Kernel-Based Methods for Speech Recognition
In the preceding section, the focus was on recognizer structures based on MLPs. For nearly all the applications reviewed, network nodes essentially follow the Perceptron model: they perform a linear integration of input activations, and pass the result through a nonlinear function. There are, however, many other recognizer structures, still considered by their creators to be in some loose sense analogous to biological neural networks, that do not follow this model. In particular, networks whose representations are formed around prototypes or kernels, acting as (multiple) averages of data from the task in question, are a popular approach and have been extensively studied [20,21]. Compared to the Perceptron-based approach, these methods typically require more parameters (they are closely related to "memory-based" approaches such as the classic Parzen window approach to density estimation), but have the advantage of rapid training: it is usually easy to
Fig. 2. LVQ2 adaptation, one-dimensional, two-class problem (reference vectors m_i and m_j; the actual decision boundary is adapted toward the Bayes decision boundary).
obtain a representative set of averages from training data for the task at hand; these averages then form an initial set of prototypes upon which the network can build to refine its representations so as to better suit the task. In this section we focus on one important approach in this family of ANNs, the Learning Vector Quantization (LVQ) algorithm.

3.2.1. Learning vector quantization
Kohonen presents several versions of LVQ [22,23]; here we focus on the most purely discriminant of them, LVQ2. In LVQ2, each category to be learned is assigned a number of reference (prototype) vectors, each of which has the same number of dimensions as the input vectors of the categories. In the recognition stage, an unknown input vector is classified as the category of the reference vector that has the smallest Euclidean distance to that input vector. This classification scheme amounts to partitioning the vector space into regions, or cells, defined by the individual reference vectors. LVQ training adjusts the reference vectors so that each input vector has a reference vector of the right category as its closest reference vector. Kohonen, in his formulation of the LVQ2 algorithm, is particularly concerned with the problem of approximating the decision lines corresponding to minimum error rate classification. Figure 2 helps to illustrate vector adaptation in LVQ2, for a simple one-dimensional situation. For a given training vector x, three conditions must be met for learning to occur: (1) the nearest class must be incorrect; (2) the next-nearest class (found by searching the reference vectors in the remaining classes) must be correct; (3) the training vector must fall inside a small, symmetric window defined around the midpoint of the reference vectors m_i and m_j (corresponding to categories C_i and C_j), this midpoint (in higher dimensions, mid-plane) being the decision boundary effected by the two vectors. If these conditions are met, the incorrect reference vector is moved further away from the input, while the correct reference vector is moved closer, according to:
m_i(t + 1) = m_i(t) - α(t)(x(t) - m_i(t)) ,    (3.1)

m_j(t + 1) = m_j(t) + α(t)(x(t) - m_j(t)) ,    (3.2)
where x is a training vector belonging to class j, m_i is the reference vector for the incorrect category, m_j is the reference vector for the correct category, and α(t) is a monotonically decreasing function of time. These requirements, taken together, are aimed at ensuring that the decision line between the two vectors will eventually approximate the Bayes decision boundary that corresponds to the minimum classification error probability, at the place where the joint density functions cross. At the time of its proposal, there was no proof that LVQ2 did in fact converge to the optimal Bayes configuration, but it is now known that one can view a slightly modified LVQ as a practical implementation of the Generalized Probabilistic Descent method (GPD) used with a Minimum Classification Error (MCE) criterion [24,25], which will be presented in Section 4. In a statistical benchmark comparing LVQ, Back-propagation and Boltzmann machines, it was found that the simple, quickly trained LVQ performed extremely well [22]. The ease of implementation of this algorithm, and its high classification power, have spawned a tremendous number of applications, in speech recognition as well as in other areas.

3.2.2. Shift-tolerant LVQ for speech recognition
As we have described, the LVQ algorithm in its basic form is a method for static pattern recognition. In applying it to problems in speech recognition, one must somehow expand its scope of application to handle a stream of dynamically varying patterns. As we saw above, this is the same issue that faced MLPs, and it spawned a variety of solutions. Among the extremely large number of applications of LVQ to problems in speech recognition, we focus here on a method that closely follows the TDNN model, the Shift-Tolerant LVQ (STLVQ) architecture [26,27]. Figure 3 shows the architecture of the recognition system used for the same /b/, /d/, /g/ task discussed above for the TDNN. Each category is assigned a number of reference vectors. The LVQ training procedure is then applied to speech patterns that are stepped through in time, thus providing the system with a measure of shift-tolerance similar to that of the TDNN. To achieve this effect while at the same time capturing a sufficient amount of temporal context for the recognizer, McDermott et al. defined a 7-frame window which is shifted, one frame at a time, over the 15-frame speech token. Each window position yields an input vector of 112 dimensions (7 frames x 16 channels). Given this input vector, LVQ2 is applied as described above. This moving window scheme requires a slightly different recognition procedure than simply finding the closest vector, as there are now several closest vectors, one for each window position. For each window position we calculate the distance between the input vector and the closest reference vector within each category. From this distance measure, each category is assigned an activation value that is high for small distances, low for large distances. After the window has been shifted
Fig. 3. System architecture, /b/, /d/, /g/ task (15 frames at a 10 msec frame rate; 112 input values from filterbank coefficients; per-category activations over time; final activations).
over all 15 frames, the activations obtained at each window position are summed, for each category. The category with the highest overall activation is chosen as the recognized category. This architecture was successfully trained and tested on the same phoneme recognition tasks used in the evaluation of the TDNN. Though much simpler than the TDNN architecture, STLVQ yielded very good results on these tasks, comparable with those obtained for the TDNN. These results are one set out of many showing the power of discriminative, neurally-inspired methods based on prototype modeling.
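As a concrete illustration, the sketch below combines the LVQ2 update of Eqs. (3.1)-(3.2) with the shift-tolerant, sliding-window recognition just described; the window-condition threshold and the distance-to-activation mapping exp(-d) are our simplifying assumptions rather than the exact choices of [26,27].

import numpy as np

def lvq2_step(x, label, protos, proto_labels, alpha, window=0.3):
    """One LVQ2 adaptation step, cf. Eqs. (3.1)-(3.2).
    protos: (P, D) reference vectors; proto_labels: (P,) their categories."""
    d = np.linalg.norm(protos - x, axis=1)
    i, j = np.argsort(d)[:2]                        # nearest and next-nearest prototypes
    # Adapt only if the nearest is wrong, the next-nearest is correct, and x falls
    # inside a symmetric window around the mid-plane of the two reference vectors.
    if proto_labels[i] != label and proto_labels[j] == label:
        if min(d[i] / d[j], d[j] / d[i]) > (1.0 - window) / (1.0 + window):
            protos[i] -= alpha * (x - protos[i])    # push the incorrect prototype away
            protos[j] += alpha * (x - protos[j])    # pull the correct prototype closer

def stlvq_recognize(frames, protos, proto_labels, n_classes, win=7):
    """Shift-tolerant recognition: slide a win-frame window over the token, score each
    position by the closest prototype per category, and sum the activations."""
    activations = np.zeros(n_classes)
    for t in range(len(frames) - win + 1):
        x = frames[t:t + win].reshape(-1)           # e.g. 7 frames x 16 channels = 112 dims
        dist = np.linalg.norm(protos - x, axis=1)
        for c in range(n_classes):
            activations[c] += np.exp(-dist[proto_labels == c].min())
    return int(np.argmax(activations)), activations

The prototypes here live in the 112-dimensional window space, so the same lvq2_step can be applied directly to the window vectors produced during the sliding-window pass.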
3.2.3. Expanding the scope of LVQ for speech recognition: incorporation of dynamic programming
As with the TDNN, the question that we then asked about STLVQ was how to extend its scope to handle not just labeled phoneme data, but streams of speech containing variable numbers of phonemes or words. Again, the classic approaches to this fundamental problem were adopted: it was shown how LVQ could be integrated both with the probabilistic HMM framework and with a distance-based DP framework. We briefly describe two of these possibilities. The first way to combine the discriminant power of LVQ with the dynamic modeling capability of an HMM is to use LVQ to design a discriminative HMM codebook for a discrete-density HMM. This was the approach taken in a number of studies [28-30]. It was shown that this approach yielded improvements over the k-means clustering approach to HMM codebook design, and could be used for general speech recognition tasks such as word or phrase recognition. Another way was to use the HMM or DP-based segmentation to convert a dynamic speech sample to a corresponding static feature pattern, suited for use with the LVQ-trained classifier [31,32]. Compared to the first case, this enabled one to use the discriminative power of LVQ more directly for the entire input pattern, not just for local spectral frames.
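A minimal sketch of the first combination (an LVQ-trained codebook feeding a discrete-density HMM); the k-means initialization and the HMM training itself are assumed and not shown.

import numpy as np

def quantize(frames, codebook):
    """Map each spectral frame to the index of its nearest codeword.
    The codebook is first initialized by k-means and then refined with LVQ2 updates
    driven by phoneme labels (e.g. with lvq2_step above); the resulting symbol
    sequences are then used to train an ordinary discrete-density HMM."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(d, axis=1)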
3.3. Discriminative Hidden Markov Models
In response to the criticisms of conventional ML-based HMMs, several discriminative versions of the HMM have been investigated. In the following we review two of these alternative approaches that attempt to overcome the criticisms: Corrective Training and discriminative training using the mutual information criterion, i.e. Maximum Mutual Information (MMI) training.

3.3.1. Corrective training
The Corrective Training method is an attempt to overcome the limitations of using ML when the model (say an HMM) is incorrect [33,34]. It is a heuristic re-estimation procedure whose goal is to optimize recognition accuracy on the training data. The procedure is to perform a recognition pass over the training data, keeping track of the words that were mis-recognized (or nearly mis-recognized). The model parameters are adjusted to increase the probability of the correct words and reduce the probability of mis-recognized words; this cycle is then repeated. In outline, this procedure is very similar to the LVQ procedure, and to the MCE framework that we will describe below. However, there is no clear notion in Corrective Training of how the parameter adjustments relate to the function being optimized, or even what this function is exactly. In practice, it has been found to be effective in improving HMM performance compared to ML or MMI training.

3.3.2. Maximization of mutual information
Like Corrective Training, the MMI criterion has been used as an alternative to ML training to overcome some of the problems mentioned above, in particular the problem of using ML with an incorrect model. MMI has produced improvements in recognition accuracies in several speech recognition systems using HMMs or ANNs [19,35-38].
The MMI criterion derives from an information-theoretic approach to classifier design. The idea is to find the model parameters Λ that minimize the conditional entropy H_Λ(C|X) of the random variable C given the random variable X. In speech recognition, this corresponds to finding a value for Λ that provides as much information as possible about the class random variable C given the input pattern random variable X. The goal in MMI is to minimize the conditional entropy, or equivalently in most speech recognition problems, to maximize the mutual information, which in turn can be achieved by choosing the Λ that maximizes the model-based mutual information I_C(X; Λ).
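The displayed definition is not legible in the copy at hand; in the standard formulation (notation ours), the model-based mutual information maximized over a training set {(x_n, c_n)} can be written as

\[
I_C(X;\Lambda) \;\approx\; \frac{1}{N}\sum_{n=1}^{N} \log
\frac{p_\Lambda(x_n \mid c_n)\,P(c_n)}{\sum_{j} p_\Lambda(x_n \mid C_j)\,P(C_j)} ,
\]

i.e. the likelihood of the correct class is pushed up relative to the "evidence" p_Λ(x_n) summed over all classes.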
Contrasting I_C(X; Λ) with the ML criterion shows the difference between MMI and ML. The latter is only concerned with maximizing the class-conditional probability p_Λ(X = x | C = c), while the former maximizes the difference (in the log domain) between p_Λ(X = x | C = c) and the "background" probability p_Λ(X = x). Thus the MMI criterion appears to be more discriminative than the ML criterion. The advantage of MMI-based optimization is that even if the model defined by Λ is incorrect, maximizing I_C(X; Λ) makes more sense than maximizing the likelihood of an incorrect model. For all the interesting information-theoretic motivation and asymptotic properties of MMI, there is still no direct link between optimizing a model using the MMI criterion and minimizing the probability of classification error (see Section 5.2). This is especially evident when using an incorrect model. Indeed, as in the case illustrated above for ML, it is possible to find situations where, though a classifier is powerful enough to separate the categories optimally, training the same classifier using MMI fails to produce the optimal solution, even with large amounts of training data [39]. The fact that MMI is maximal when the true probabilities are learned suggests that the approach shares the limitations of ML.

4. Generalized Probabilistic Descent Method
4.1. Overview
As described in Section 3, the recent discriminative training methods led to a number of successful results. However, the resulting recognizers were not necessarily satisfactory. In particular, they sometimes resulted in less robust recognition; i.e. they generalized poorly to testing data, while achieving rather high accuracy over training data. The main possible causes of this are summarized as follows:

• Empirical rules such as Corrective Training and LVQ did not have enough of a mathematical basis to guarantee design optimality.
• As is well known, the minimization of the squared error loss between a classifier output and its corresponding supervising signal, which is generally used for MLP-based classifiers, is not necessarily equivalent to the minimization of misclassifications [2].
• MMI training does not necessarily imply the minimization of misclassifications.
• Efforts were largely limited to acoustic modeling, often to the modeling of static acoustic feature vectors, and therefore lacked the global scope of designing the overall process of recognizing dynamic speech wave patterns.
Therefore, there was a clear need to alleviate these problems in the discriminative training approach. One solution is the Generalized Probabilistic Descent method (GPD). GPD was originally developed as a renewed version of the classical, adaptive discriminant function design method called the Probabilistic Descent Method (PDM) [40]. A key point in this revision is to overcome the lack of smoothness of the original PDM formalization (smoothness will be defined in Section 4.2.2) [24,41]. To achieve this, GPD uses the L_p norm form and a sigmoidal function specifically for approximating the classification error count loss of (2.7). In the beginning, the GPD research focus was on analyzing implementational possibilities and on investigating the relation between GPD and the powerful ANN training algorithm LVQ [25]. After [42], however, the focus moved to investigating the significance of using the proposed, smooth GPD formalization with the smooth error count loss. It was shown that in principle, GPD design using the smoothed error count loss enables one to arbitrarily approximate the optimal, minimum classification error probability situation [42]. In deference to this important finding, GPD came to be sometimes called the Minimum Classification Error/Generalized Probabilistic Descent (MCE/GPD) method [43]. GPD has been extensively applied in various design experiments for speech pattern classification (e.g. [44-48]). The superiority of this method, which guarantees the theoretical validity of the training target, is being proven experimentally as well. Recent success in large-scale, real-world applications provides clear evidence of the promising nature of this rather new design method [49]. Since GPD originates in PDM, most of these implementations adaptively update the classifier parameters. However, the formalization concept of GPD can be applied to batch-type updating, such as the Steepest Descent method, without any loss of its mathematical rigor; indeed, several examples such as [48] fully achieved the design goal of GPD by using the standard Steepest Descent optimization method. As said before, a recognizer usually consists of a feature extractor module and a classifier module, and the classifier is further divided into an acoustic model module and a language model module. Obviously, all of these modules are jointly involved in the final classification decision. Thus, in principle, all of these modules should
be jointly optimized so as to increase classification accuracy, or in other words, to minimize classification errors. According to this understanding, GPD was extended as Discriminative Feature Extraction (DFE), which trains both the front-end feature extractor and the post-end classifier consistently under the single design objective of minimizing classification errors [43,50]. A key point in this extension is the use of the chain rule of differential calculus: the trainable parameters of the feature extraction module are updated by using the classification result information back-propagated from the classifier module. In the development of DFE, the design focus was extended to the feature extractor, which affects the difficulty (or ease) of the post-end classification decision. One may easily note here that the same extension can be directed to the language model and other modules, if there are any. Actually, such an extension for the language module was proposed in parallel with the development of DFE [51,52]. In addition to many experimental evaluation reports in the literature, both GPD and DFE are being further evolved. For example, GPD has been successfully re-formulated for the special recognition environment called keyword spotting [53,54], and DFE has been studied from the viewpoint of the Subspace Method [55]; also, DFE has been further revised as Discriminative Metric Design (DMD) [56]. Therefore, GPD, which was originally developed for classifier design, is now forming a new, large family of discriminative design methods for pattern recognition. In the following subsections, we present the essence of the GPD-based training methodology by selectively introducing implementation examples and several important topics.

4.2. Formalization Fundamentals
4.2.1. Distance classifier for classifying dynamic patterns: preparation
GPD can basically be implemented for any reasonable classifier structure. To introduce the formalization fundamentals of GPD, we use a distance classifier, which has been widely used for classifying dynamic speech patterns. Let us assume that an input speech wave sample u is converted to its corresponding feature vector sequence pattern, i.e. x_1^T = f(u), where f(·) is a feature extraction function. A distance classifier consists of trainable reference patterns

Λ = { r_j^b ; j = 1, ..., M, b = 1, ..., B_j } ,    (4.1)
where r_j^b is C_j's bth finite-length dynamic reference pattern, defined in the same sample space as x_1^T. A classification decision is made by comparing an input (feature vector sequence) pattern with each reference pattern using some distance measure. The most natural choice here is the decision rule

C(x_1^T) = C_i   iff   i = arg min_j g_j(x_1^T; Λ) ,    (4.2)
which is essentially the same as (2.6), assuming that the pattern is dynamic and that the discriminant function g_j(x_1^T; Λ) is a class-oriented distance measure between x_1^T and the set of C_j's references {r_j^b}_{b=1}^{B_j}. There are many possibilities for defining the class-oriented distance measure. A typical case is based on the Nearest Neighbor concept and is given as follows. First, a Dynamic Programming (DP) matching path and its corresponding path distance are introduced between an input pattern and each reference pattern. The path distance along the θth matching path of r_j^b is given as

D_θ(x_1^T, r_j^b) = Σ_τ w_{j,τ} δ_{ω(j,b,τ,θ)} ,    (4.3)

where x_{ω(j,b,τ,θ)} is the ω(j,b,τ,θ)th component vector of x_1^T, to which the τth component vector of r_j^b, r_{j,τ}^b, corresponds along the θth matching path of r_j^b; δ_{ω(j,b,τ,θ)} is a local distance, defined as the squared Euclidean distance between these corresponding component vectors; and w_{j,τ} is a weight coefficient. Next, using the path distances, the distance between an input and each reference pattern, i.e. a reference pattern distance, is defined as the smallest (best) path distance for that reference pattern:

d_Λ(x_1^T, r_j^b) = min_θ { D_θ(x_1^T, r_j^b) } ,    (4.4)

where the min operation is executed by DP-matching. Lastly, the discriminant function that represents the degree to which an input belongs to each class is defined as the reference pattern distance of the closest (in the sense of (4.4)) reference to the input:

g_j(x_1^T; Λ) = d_Λ(x_1^T, r_j^b̂) ,   b̂ = arg min_b d_Λ(x_1^T, r_j^b) .    (4.5)

Accordingly, based on the rule (4.2), the classifier assigns an input to the class having the smallest reference pattern distance among all the (B_1 + B_2 + ... + B_M) possible reference pattern distances.
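A minimal sketch of this distance classifier for dynamic patterns, using a plain symmetric DTW recursion in place of the weighted path distance of (4.3)-(4.4); uniform weights w_{j,τ} = 1 and the local path constraints are simplifying assumptions.

import numpy as np

def dp_distance(x, r):
    """Best (minimum) path distance between feature sequences x (T x d) and r (N x d),
    with squared Euclidean local distances; plays the role of (4.3)-(4.4)."""
    T, N = len(x), len(r)
    local = ((x[:, None, :] - r[None, :, :]) ** 2).sum(axis=2)
    dp = np.full((T, N), np.inf)
    dp[0, 0] = local[0, 0]
    for t in range(T):
        for n in range(N):
            if t == 0 and n == 0:
                continue
            prev = min(dp[t - 1, n] if t > 0 else np.inf,
                       dp[t, n - 1] if n > 0 else np.inf,
                       dp[t - 1, n - 1] if t > 0 and n > 0 else np.inf)
            dp[t, n] = local[t, n] + prev
    return dp[T - 1, N - 1]

def classify(x, references):
    """references[j] is the list of class C_j's reference patterns {r_j^b}.
    Implements the decision rule (4.2) with the discriminant (4.5)."""
    g = [min(dp_distance(x, r) for r in refs) for refs in references]
    return int(np.argmin(g)), g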
4.2.2. Emulation of the decision process
The design target for the above classifier is to find the optimal status of Λ. According to the discriminative training concept, one may attempt to achieve the optimal situation by evaluating the loss for each training input pattern. Given a training sample, its corresponding loss, such as the Perceptron loss or the squared error loss, would be computed and the resulting adjustment amount would be fed back to update the reference patterns. Apparently, such an approach works in practice. However, it actually suffers from a serious mathematical problem: the computation of the adjustment amount is necessarily based on the gradient of the loss, and the loss for the above classifier is not differentiable with respect to the trainable parameters, the reference patterns. The "min" operation included in the discriminant function (see (4.4) and (4.5)) is not smooth,
or in other words not even first-order differentiable, with respect to the reference patterns, and consequently the loss is not smooth either. From the viewpoint of mathematical rigor, a desirable formalization of the training method should obviously overcome this lack of smoothness.
To solve the above problem, GPD makes use of the smooth L_p norm function and the smooth sigmoidal function in its formulation. The method first replaces the discriminant function (4.5) by

g_j(x_1^T; Λ) = [ Σ_b { D(x_1^T, r_j^b) }^(-ζ) ]^(-1/ζ) ,    (4.6)

where ζ is a positive constant, and D(x_1^T, r_j^b) is a generalized reference pattern distance between x_1^T and r_j^b, defined as

D(x_1^T, r_j^b) = [ Σ_θ { D_θ(x_1^T, r_j^b) }^(-ξ) ]^(-1/ξ) ,    (4.7)

where ξ is also a positive constant. The most important development concept of GPD is to simulate the entire classification decision process in the training procedure. To do this, GPD introduces a smooth misclassification measure as follows: for x_1^T ∈ C_k,

d_k(x_1^T; Λ) = g_k(x_1^T; Λ) - [ (1/(M-1)) Σ_{j≠k} { g_j(x_1^T; Λ) }^(-μ) ]^(-1/μ) ,    (4.8)

where μ is a positive constant. Clearly, this misclassification measure expresses the decision operation as a scalar value; i.e. d_k(·) > 0 emulates a misclassification, and d_k(·) < 0 emulates a correct classification. Then, the decision result can be directly evaluated by embedding the misclassification measure in a loss,

ℓ_k(x_1^T; Λ) = l_k( d_k(x_1^T; Λ) ) ,    (4.9)
where l_k is a smooth, monotonically increasing function of the misclassification measure. The individual losses, one for each design sample, should be reduced with some optimization method. To do this, GPD uses the probabilistic descent theorem [40,42].

Theorem 1. [Probabilistic descent theorem] Assume that a given design sample x_1^T(t) belongs to C_k. If the classifier parameter adjustment δΛ(x_1^T(t), C_k, Λ) is specified by

δΛ(x_1^T(t), C_k, Λ) = -ε U ∇ℓ_k(x_1^T(t); Λ) ,    (4.10)

where U is a positive-definite matrix and ε is a small positive real number, then

E[δL(Λ)] ≤ 0 .    (4.11)

Furthermore, if an infinite sequence of randomly selected samples x_1^T(t) is used for learning (designing) and the adjustment rule of (4.10) is utilized with a corresponding learning weight sequence ε(t) which satisfies

Σ_{t=1}^{∞} ε(t) → ∞   and   Σ_{t=1}^{∞} ε(t)² < ∞ ,    (4.12)

then the parameter sequence Λ(t) (the state of Λ at t), updated according to

Λ(t + 1) = Λ(t) + δΛ(x_1^T(t), C_k, Λ(t)) ,    (4.13)

converges with probability one at least to a Λ* that corresponds to a local minimum of the expected loss L(Λ).

Since all of the elemental functions, such as the discriminant function and the misclassification measure, are smooth, the adjustment based on (4.10) can be rigorously applied to the trainable parameters, i.e. the reference patterns. The resulting adjustment rule is accordingly given as
r_{j,τ}^b(t + 1) = r_{j,τ}^b(t) - ε(t) ∂ℓ_k(x_1^T(t); Λ) / ∂r_{j,τ}^b ,    (4.14)

where the partial derivative is evaluated with the chain rule through (4.9), (4.8), (4.6), (4.7) and (4.3), the individual factors appearing in this derivative being given by (4.15)-(4.18), and ω(j, b, τ, θ) indicates the component vector index of x_1^T(t) to which the τth component vector of r_j^b, r_{j,τ}^b, corresponds along the θth matching path of r_j^b. In (4.14), the adjustment is done for all of the possible paths and all of the possible
reference patterns. This is a remarkable distinction from the conventional distance classifier, in which the adjustment is selectively done through the "min" operations. Treating the w_{j,τ}'s as adjustable parameters as well, one can obtain an adjustment rule similar to (4.14), though we omit the result; see [57] on this point.
The use of the L_p norm form affords one an interesting flexibility in the implementation. Let ζ and ξ approach ∞ in (4.6) and (4.7). Then, clearly, (4.6) approximates the operation of searching for the closest reference pattern, and (4.7) the operation of searching for the best matching path. Also, controlling μ in (4.8) enables one to simulate various decision rules. In particular, when μ approaches ∞, (4.8) comes to resemble rule (4.2).

4.2.3. Selection of loss functions
The formalization described so far has not specified the type of loss function, such as the classification error count loss or the squared error loss. Among the many possible selections of loss forms, GPD often uses a smooth classification error count loss such as

l_k(d_k(x_1^T; Λ)) = 1 / (1 + exp(-α d_k(x_1^T; Λ) + β)) ,    (4.20)

where α and β are constants. This special selection plays a significant role in the analytic study of discriminative training and also in the practical use of GPD. Actually, one may note that infinite repetition of the GPD adjustment using (4.20) leads at least to a local minimum of the expected smooth error count loss in the probabilistic descent sense, and that this status is linked to the ideal, minimum error probability situation. In this section, we briefly describe this point, following the result in [42].
Let us assume that the probability measure of a dynamic (feature vector sequence) pattern can be computed appropriately through the probability computation using an HMM, and also that the related conditional probability and joint probability are properly defined. Then, the discriminant function is given as

g_j(x_1^T; Λ) = p_Λ(C_j | x_1^T) ,    (4.21)

where p_Λ(C_j | x_1^T) is an estimate of the a posteriori probability. For simplicity, we first assume that the true functional form, determined by Λ, of the probability is known. Then, defining the misclassification measure, for example, as

d_k(x_1^T; Λ) = -g_k(x_1^T; Λ) + max_{j≠k} g_j(x_1^T; Λ) ,    (4.22)

we can rewrite the expected loss, defined by using the smooth classification error count loss (4.20) in the GPD formalization, as follows:
L(Λ) = Σ_{k=1}^{M} ∫_Ω l_k(d_k(x_1^T; Λ)) dp(x_1^T, C_k) ,    (4.23)

where Ω is the entire sample space of the dynamic patterns x_1^T (T < T_max), and it is assumed that dp(x_1^T, C_k) = p(x_1^T, C_k) dx_1^T. By controlling the smoothness of functions such as the L_p norm and the sigmoidal function, one can make L(Λ) in (4.23) approximate the expected classification error count arbitrarily closely. Based on this fact, the status of Λ that corresponds to the minimum of L(Λ) in (4.23), which is achieved by adjusting Λ, is clearly equal to the Λ* that corresponds to the true probability, or in other words, achieves the maximum a posteriori probability situation. Accordingly, it turns out that the minimum of L(Λ) can get indefinitely close to the minimum classification error probability

L* = Σ_{k=1}^{M} ∫_{Ω_k} dp(x_1^T, C_k) ,    (4.24)

where Ω_k is the partial space of Ω that causes a classification error according to the maximum a posteriori probability rule.
This result may sound quite natural and trivial. However, it seems the result was the first to show that one could achieve the minimum classification error probability situation through discriminative training, which had long been considered a deterministic and empirical training framework. The assumption that Λ is known is obviously impractical. However, recent results concerning the approximation capability of ANNs and Gaussian kernel functions have provided useful suggestions for studying this inadequacy. Actually, based on recent results such as [11,58], one can argue that an HMM with sufficient adjustable parameters has the fundamental capability of modeling an (unknown) true probability function. If L(Λ) has a unique minimum, and if Λ and L(Λ) are monotonically related, then the minimum corresponds to the case in which g_j(x_1^T; Λ) is equal to the true probability function. It thus turns out that under this assumption, which is softer (more realistic) than that of the above paragraph, even if the true parametric form of the probability function is unknown, GPD-based discriminative training enables one to fundamentally achieve the minimum classification error probability situation.
As above, GPD training using the smooth classification error count loss, i.e. MCE/GPD, successfully provided the fundamental link between discriminative training and the optimal Bayes situation for classification. The effect of MCE/GPD training will be illustrated in the following subsection.

4.2.4. Design optimality in practical situations
In realistic situations where only finite design samples are available, the state of Λ that can be achieved by the probabilistic descent training is at most a locally optimal solution over the set of design samples. Moreover, in the case of a finite number of training repetitions, Λ does not necessarily achieve even the local optimum solution. However, the training algorithms of GPD, such as (4.14), have been shown to be quite useful even in these realistic settings. The discussion in the previous subsection assumed the impractical condition that design samples and classifier parameters are available in sufficient quantity. However, it has also become clear that MCE/GPD possesses a high degree of utility and originality in the practical circumstances in which only finite resources are available. In fact, the more limited the design resources are, the more distinctive from other methods the MCE/GPD result is. A reason for this practical utility is that MCE/GPD always directly pursues the minimum classification error situation under the given circumstances, while the ML-based approach aims at estimating the entire probability function, and the conventional discriminative training methods aim at minimizing an average empirical loss that is not necessarily consistent with the classification error count. The smoothness incorporated in the GPD formalization also contributes to the practical utility, by substantially increasing the number of training samples that effectively contribute to each adjustment and by softening the empirical average loss. This point is related to the global search problem for the optimal loss status and to the increase of training robustness; details are described in [59].
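Before turning to the experiments, here is a minimal sketch of MCE/GPD training for a simplified version of the above classifier: one prototype per class, static input vectors, squared Euclidean distances, and no DP-matching, so only the structure of (4.5), (4.8), (4.20) and the update (4.10)/(4.14) is illustrated; the constants mu, alpha and beta are illustrative.

import numpy as np

def mce_gpd_step(x, k, protos, eps, mu=2.0, alpha=1.0, beta=0.0):
    """One probabilistic-descent update for a one-prototype-per-class distance classifier
    with the smoothed misclassification measure and sigmoidal error-count loss."""
    M = len(protos)
    diffs = protos - x
    g = (diffs ** 2).sum(axis=1)                   # class discriminants (distances), cf. (4.5)
    others = np.array([j for j in range(M) if j != k])
    # Smoothed "best competitor" distance over the incorrect classes, cf. (4.8).
    comp = (np.mean(g[others] ** (-mu))) ** (-1.0 / mu)
    d_k = g[k] - comp                              # > 0 emulates a misclassification
    l = 1.0 / (1.0 + np.exp(-alpha * d_k + beta))  # smooth error count loss, cf. (4.20)
    dl_dd = alpha * l * (1.0 - l)
    # Chain rule back to the prototypes, cf. (4.14).
    grad = np.zeros_like(protos)
    grad[k] = dl_dd * 2.0 * diffs[k]               # correct prototype is pulled toward x
    w = (g[others] ** (-mu - 1.0)) * (comp ** (mu + 1.0)) / len(others)
    grad[others] = -dl_dd * 2.0 * diffs[others] * w[:, None]   # competitors pushed away
    protos -= eps * grad
    return float(l)

Repeating this update over randomly selected design samples with a decreasing step size eps(t), as required by (4.12), realizes the probabilistic descent procedure for this toy classifier.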
4.3. GPD-based Classifier Design
We report on the utility of GPD by summarizing the experimental results of training speech pattern classifiers with MCE/GPD on two tasks: (1) classifying the 9-class spoken American E-rhyme letters (E-set task) and (2) classifying 41-class spoken Japanese phonemes (P-set task). In the experiments, the multi-reference distance classifier represented in (4.1) was used. The reference patterns were first initialized by using the conventional, modified k-means clustering, and were then adjusted based on rules that were basically the same as (4.14).

4.3.1. E-set task
The E-set data consisted of the 9-class E-rhyme letter syllables {b, c, d, e, g, p, t, v, z}. Each sample was recorded over dial-up telephone lines from one hundred (50 male and 50 female) untrained speakers. Speaking was done in the isolated
word mode. Since all of the samples included the common phoneme {e} and they were recorded over telephone lines, this task was intrinsically rather difficult. Indeed, it has been reported that the achievable conventional recognition rate on this task was usually slightly higher than 60% and at most 70% with a larger-size HMM recognizer. Recognition experiments were done in the multi-speaker mode; i.e. each speaker uttered each of the E-set syllables twice, once for designing and once for testing. Thus, for every class, the design and testing data sets each consisted of 100 samples. The recognizer possesses several factors, such as the number of reference patterns, that may affect its recognition capability. To investigate these factors thoroughly, this task was carefully tested by several separate research groups. We summarize here the results of these separately-conducted experiments [45,57,60]. In [45], the modified k-means clustering results ranged from 55.0% to 59.8% in the case of using one reference pattern for every class. For this small-size classifier, MCE/GPD successfully achieved rates ranging from 74.2% to 75.4%. In the case of using three reference patterns for every class, the modified k-means clustering resulted in rates in the range of 64.1% to 64.9%, compared with 74.0% to 77.2% for MCE/GPD. Again, the superiority of MCE/GPD is clearly demonstrated in this larger-size case. References [60] and [57] investigated an implementation somewhat different from (4.14), using an exponential-form distance measure and also treating the weights w_{j,τ} as adjustable. In this case too, MCE/GPD raised the roughly 60% accuracy of the modified k-means clustering, obtained with only one reference pattern per class, to 79.4%, and reached a remarkable 84.4% by using four references for every class.

4.3.2. P-set task
The data for this task consisted of 41-class phoneme segments included in the ATR speech database, which has been widely used as a standard, large-scale Japanese speech database. Each phoneme sample was extracted from 5,240 common words, spoken by a male speaker, by using manually-set acoustic-phonetic labels (about 26,000 samples in total). This sample set was split into two independent sets of roughly equal size: one for designing and one for testing. The experiments were conducted in the speaker-dependent mode. According to [45], the MCE/GPD-trained classifier that used five reference patterns per class (ten for every vowel class) achieved an accuracy of 96.2%. Here too, compared with the 86.8% for the Segmental k-means Clustering that was used for system initialization, we can clearly see the utility of MCE/GPD.

4.4. Discriminative Feature Extraction
After [50], the training concept of DFE has been quite extensively investigated in various speech recognition tasks. The main difference in training algorithm formulation between the GPD-based classifier design (in Section 4.2) and DFE is that DFE trains the feature extractor parameter set together with the classifier parameter
Fig. 4. A typical example of DFE-designed lifter shape: (top) the DFE-designed lifter issued from a uniform initialization, over quefrencies 0-140; (bottom) its locus on the lower quefrency region (0-30).
set under the single training objective of minimizing a loss function. Among many experimental examples, we introduce one example of applying DFE to cepstrum-based speech recognition. The cepstrum is an acoustic feature vector, widely used in the speech recognition field, that is computed by applying the inverse Fourier transform to the logarithmic power spectrum of a speech fragment (usually of 20-30 msec). It is actually an inverse concept of the spectrum, and accordingly "quefrency" is introduced as the counterpart of frequency. The cepstrum components in the low-quefrency region correspond to the envelope shape of the corresponding logarithmic power spectrum, and it has been demonstrated that these components possess information useful for classifying phonemes. A question here is how one can extract a desirable set of low-quefrency components. Usually, a lifter, which is a quefrency-domain filter, is used for this purpose, and the design of lifter shape and size has been one of the important research issues in the speech recognition area. Many investigations have been carried out, but most of them were inadequate due to the lack of a direct link between the lifter design and the design of the post-end classifier [61]. In [50], DFE was used to determine a useful shape for the lifter attached to its post-end MLP classifier. The task used for evaluation was to recognize static 128-dimensional cepstrum vector patterns of 5 Japanese vowels; a set of 1,750 patterns was used for training and another set of 1,750 patterns was used for testing. The conventional setting of a uniform-shape lifter (1 over the low-quefrency region and 0 elsewhere) provided 85.5% over the testing data, after some effort to experimentally find an appropriate lifter length. In contrast, DFE achieved 88.7% over the same testing data set, automatically determining a suitable shape for the lifter. Figure 4 shows a typical example of a DFE-designed
lifter. It de-emphasizes two quefrency regions, i.e. (1) the high-quefrency region, which corresponds to pitch harmonics and fine spectral structure, and (2) the lowest-quefrency region (the 0-2 quefrency region), which is dominated by the bias and slant of the overall spectrum, while enhancing the 3-20 quefrency region, which mainly corresponds to the vowel class identity information, i.e. the formant structure. The obtained lifter shape, as well as the improvement in recognition accuracy, demonstrates that the DFE training successfully extracts features suitable for classification and accordingly facilitates the recognition decision.
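A minimal sketch of the DFE idea in this setting: a trainable per-quefrency weight vector (the lifter) feeding a linear classifier, with both updated from the same MCE-style loss by the chain rule; the linear classifier and the simple best-competitor misclassification measure are simplifying assumptions rather than the MLP setup of [50].

import numpy as np

def dfe_step(cep, k, lifter, W, b, eps):
    """One joint DFE update: classification error information is back-propagated
    through the classifier into the feature extractor (the lifter)."""
    y = lifter * cep                              # feature extraction: liftered cepstrum
    g = W @ y + b                                 # class scores (larger = more confident)
    g_comp = g.copy()
    g_comp[k] = -np.inf
    j = int(np.argmax(g_comp))                    # best competing class
    d = g[j] - g[k]                               # misclassification measure (> 0: error)
    l = 1.0 / (1.0 + np.exp(-d))                  # smooth error count loss
    s = l * (1.0 - l)                             # dl/dd
    grad_lifter = s * (W[j] - W[k]) * cep         # chain rule into the lifter weights
    # Classifier update.
    W[k] += eps * s * y
    b[k] += eps * s
    W[j] -= eps * s * y
    b[j] -= eps * s
    # Feature extractor update.
    lifter -= eps * grad_lifter
    return float(l)

Because both modules are driven by one loss, the lifter is free to de-emphasize quefrency regions that do not help the classifier, which is exactly the behavior reported for the DFE-designed lifter above.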
5. Remarks
The research efforts of the last decade, which started with the development of ANN-based recognizers, have led to sound improvements in discriminative training for speech recognition. In particular, the development of GPD has provided a general and systematic framework for investigating the four major technical issues (see Section 2.3), and has made the practical contribution of increasing the achievable recognition accuracy of existing speech recognizers. However, there are obviously still many unsolved research subjects in this context. In this section, we summarize some of these subjects for future study. Each of the following subsections focuses on one selected subject, though the discussions in the subsections are closely related to each other.

5.1. Selection of Discriminant Function
The functional form of the discriminant function is determined by the system structure, or by the measure that represents the degree to which an input pattern belongs to one of the possible classes. Usually, the framework of the system module, such as the distance classifier or the filter-bank feature extractor, has been selected empirically, and this selection consequently determines the functional and computational form of the resulting discriminant function. In most cases, including the new GPD-based family, the discriminative training method simply aims to optimize the adjustable parameters, which determine the state of the preset functional form of the discriminant function. Thus, optimizing the functional form itself remains a fundamental difficulty for discriminative training methods. The conventional selection of functional forms, based on scientific knowledge and experience, has been successful to some extent. However, it is not guaranteed that these existing selections provide a truly desirable form for the recognition decision, and therefore the development of new methods to design the optimal functional form is clearly needed. A possible solution would rely on persevering analysis and modeling of speech samples: the system structure, or in other words the discriminant function form, must reflect the nature of the samples at hand. This issue also has a close link to the issue of consistency with unknown samples. The use of modeling suited for recognition will increase sample separability and accordingly facilitate classification.
5.2. Selection of Design Objective
The discussions in the previous sections naturally suggest that a desirable design objective (loss form) is the smooth classification error count loss. However, this new objective still leaves open the question of how to control the smoothness. Actually, this control has usually been done empirically or heuristically. Essentially, the issue of smoothness control is tightly linked to such important issues as the availability of resources (trainable parameters and design samples), the optimization scheduling, and the consistency with future samples. Moreover, especially in speech recognition, there are several possible linguistics-oriented choices of design objective, e.g. a phoneme error count loss and a word error count loss. A system having high phoneme recognition accuracy is often expected to recognize words and sentences accurately; in practice, however, this expectation does not always hold true. Therefore, it is not so easy to conclude that one type of objective, such as the smooth error count loss, is the best selection. For further study, we summarize in the following paragraphs the relationship between the case of using the smooth error count loss and MMI training, which is another important discriminative training approach.
The training target of MMI is to select the state of Λ so as to increase, as much as possible, the mutual information between a sample x_1^T ∈ C_k and its class C_k, defined as

I_k(x_1^T; Λ) = ln [ p_Λ(x_1^T | C_k) / Σ_{j=1}^{M} P(C_j) p_Λ(x_1^T | C_j) ] .    (5.1)

Note here that, as in most classifier design examples, only the conditional probabilities are functions of Λ. Let us consider the effect of maximizing the mutual information in the GPD framework. For convenience, we use the negative of the mutual information; the goal of the GPD design is then to minimize this negative measure. The negative mutual information can be rewritten as

-I_k(x_1^T; Λ) = -ln p_Λ(x_1^T | C_k) + ln [ P(C_k) p_Λ(x_1^T | C_k) + Σ_{j≠k} P(C_j) p_Λ(x_1^T | C_j) ] .    (5.2)

Here, defining the logarithmic likelihood ln p_Λ(x_1^T | C_k) as the discriminant function, one can treat the right-hand side of (5.2), restricted to the competing classes, as a kind of misclassification measure:

d_k(x_1^T; Λ) = -ln p_Λ(x_1^T | C_k) + ln [ Σ_{j≠k} P(C_j) p_Λ(x_1^T | C_j) ] .    (5.3)

Then, the inequality

-I_k(x_1^T; Λ) ≥ d_k(x_1^T; Λ)    (5.4)
holds true, and therefore maximizing the mutual information leads at least to minimizing the misclassification measure (5.3). Consequently, it turns out that a classifier design based on MMI has the same effect as a GPD design that uses the misclassification measure (5.3) and the linear loss function. Note here that the loss used is not a smooth error count but a simple linear function of the misclassification measure.
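As a quick sanity check of inequality (5.4) under the forms of (5.1)-(5.3) given above, the following snippet (ours) evaluates both sides for an arbitrary three-class example:

import numpy as np

rng = np.random.default_rng(0)
priors = np.array([0.5, 0.3, 0.2])
lik = rng.random(3) + 1e-3        # p_Lambda(x | C_j) for one sample x with true class k = 0
k = 0
neg_I = -np.log(lik[k]) + np.log(np.dot(priors, lik))        # -I_k, cf. (5.2)
d_k = -np.log(lik[k]) + np.log(np.dot(priors[1:], lik[1:]))  # d_k,  cf. (5.3)
assert neg_I >= d_k  # (5.4): the extra correct-class term inside the log can only increase -I_k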
5.3. Openness Problem of the Recognition Task
Usually, the formalization of pattern classification assumes that the number of classes is preset and that the samples to be classified are extracted (segmented) beforehand. That is, classification is formalized as a closed task (in the sense that the classification task framework is closed). However, in realistic situations, patterns of the prescribed classes inevitably coexist with patterns that do not belong to any of the prescribed classes, and the recognition of such embedded class patterns is often called an open task (in the sense that the classification framework is open). For example, speech patterns spoken by certain speakers, which are the recognition targets, usually coexist with various other types of acoustic signals, such as speech utterances of other speakers and background noise. There have been two major solutions to the openness problem of the speech recognition task. The first is to selectively extract the target speech patterns (segments) prior to classification. Standard acoustic techniques, such as an appropriate design of the microphone, are used for this purpose; some advanced techniques such as keyword spotting should also be investigated. The second solution is to formalize the open task as a closed task by incorporating an additional class, sometimes called a garbage class, alongside the original preset classes. A question here is how one can apply discriminative training, for which significant classification power has been demonstrated but whose formalization is originally suited to a closed task setting, to open tasks. Following the word spotting strategy, GPD has been extended to minimum spotting error training [53,54]. Also, several discriminative training methods have been tested for recognizers having a garbage model. These recent attempts have produced some interesting results, but they are still at a preliminary stage. Serious research efforts must be made to cope with this realistic problem in speech recognition.

6. Summary
In this chapter, we have summarized the recent status of research on discriminative training for speech recognition. The approach based on the discriminative
training was enlivened by the renewed interest in ANNs, and it has been greatly elaborated from various viewpoints, including the GPD methodology. Discriminative training is now being established as a standard approach to speech recognizer design, at least for some limited frameworks of usage. However, in order to develop recognizers that can appropriately handle more realistic circumstances, such as open task settings, the approach must clearly be advanced further.
Acknowledgements We thank our colleagues Alain Biem, Eric Woudenberg, and Hideyuki Watanabe for their valuable discussions and assistance during the preparation of materials for this chapter.
References
[1] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition (Prentice-Hall, 1993).
[2] R. Duda and P. Hart, Pattern Classification and Scene Analysis (John Wiley and Sons, 1973).
[3] N. Nilsson, The Mathematical Foundations of Learning Machines (Morgan Kaufmann Publishers, 1990).
[4] S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. PAMI PAMI-6 (1984) 721-741.
[5] K. Fukunaga, Introduction to Statistical Pattern Recognition (Academic Press, 1972).
[6] A. J. Robinson, An application of recurrent nets to phone probability estimation, IEEE Trans. Neural Networks 5 (1994) 298-305.
[7] F. Rosenblatt, Principles of Neurodynamics (Spartan Books, 1959).
[8] D. E. Rumelhart et al., Learning representations by error propagation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. McClelland (eds.) (MIT Press, 1986) 318-362.
[9] P. J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences (PhD dissertation, Harvard University, 1974).
[10] R. Lippmann, An introduction to computing with neural nets, IEEE ASSP Magazine (1987) 4-22.
[11] K. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural Networks 2 (1989) 183-191.
[12] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. Lang, Phoneme recognition: neural networks vs. hidden Markov models, Proc. IEEE ICASSP88 1 (1988) 107-110.
[13] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. Lang, Phoneme recognition using time-delay neural networks, IEEE Trans. ASSP ASSP-37 (1989) 328-339.
[14] H. Ney, The use of a one-stage dynamic programming algorithm for connected word recognition, IEEE Trans. ASSP 32 (1984) 263-271.
[15] H. Bourlard and C. J. Wellekens, Speech pattern discrimination and multilayer perceptrons, Computer Speech and Language 3 (1989) 1-19.
[16] H. Bourlard and C. J. Wellekens, Links between Markov models and multilayer perceptrons, IEEE Trans. PAMI 12 (1990) 1167-1178.
[17] H. Sawai, TDNN-LR continuous speech recognition system using adaptive incremental training, Proc. IEEE ICASSP91 (1991) 53-56.
[18] P. Haffner, M. Franzini and A. Waibel, Integrating time alignment and neural networks for high performance continuous speech recognition, Proc. IEEE ICASSP91 (1991) 105-108.
[19] P. Haffner, A new probabilistic framework for connectionist time alignment, Proc. ICSLP94 (1994) 1559-1562.
[20] T. Kohonen, Self-Organization and Associative Memory (2nd edn.) (Springer, 1988).
[21] F. Girosi and T. Poggio, Networks and the best approximation property, Biological Cybernetics 63 (1990) 169-176.
[22] T. Kohonen, G. Barna and R. Chrisley, Statistical pattern recognition with neural networks: benchmarking studies, Proc. IEEE ICNN88 I (1988) 61-68.
[23] T. Kohonen, The self-organizing map, Proc. IEEE 78 (1990) 1464-1480.
[24] S. Katagiri, C.-H. Lee and B.-H. Juang, A generalized probabilistic descent method, Proc. ASJ Fall Conf. (1990) 141-142.
[25] S. Katagiri, C.-H. Lee and B.-H. Juang, Discriminative multi-layer feed-forward networks, in Neural Networks for Signal Processing, S.-Y. Kung and B.-H. Juang (eds.) (IEEE, 1991) 11-20.
[26] E. McDermott and S. Katagiri, Shift-invariant, multi-category phoneme recognition using Kohonen's LVQ2, Proc. IEEE ICASSP89 (1989) 81-84.
[27] E. McDermott and S. Katagiri, Shift-tolerant LVQ for phoneme recognition, IEEE Trans. SP 39 (1991) 1398-1411.
[28] H. Iwamida, S. Katagiri, E. McDermott and Y. Tohkura, A hybrid speech recognition system using HMMs with an LVQ-trained codebook, Proc. IEEE ICASSP90 1 (1990) 489-492.
[29] D. G. Kimber, M. A. Bush and G. N. Tajchman, Speaker-independent vowel classification using hidden Markov models and LVQ2, Proc. IEEE ICASSP90 1 (1990) 497-500.
[30] G. Yu, W. Russell, R. Schwartz and J. Makhoul, Discriminant analysis and supervised vector quantization for continuous speech recognition, Proc. IEEE ICASSP90 2 (1990) 685-688.
[31] S. Katagiri and C.-H. Lee, A new hybrid algorithm for speech recognition based on HMM segmentation and learning vector quantization, IEEE Trans. SAP 1 (1993) 421-430.
[32] A. Duchon and S. Katagiri, A minimum-distortion segmentation/LVQ hybrid algorithm for speech recognition, J. Acoust. Soc. Jpn. (E) 14 (1993) 37-42.
[33] T. Applebaum and B. Hanson, Enhancing the discrimination of speaker independent hidden Markov models with corrective training, Proc. IEEE ICASSP89 (1989) 302-305.
[34] L. Bahl, P. F. Brown, P. V. de Souza and R. L. Mercer, A new algorithm for the estimation of hidden Markov model parameters, Proc. IEEE ICASSP88 (1988) 493-496.
[35] L. Bahl, P. F. Brown, P. V. de Souza and R. L. Mercer, Maximum mutual information estimation of hidden Markov parameters for speech recognition, Proc. IEEE ICASSP86 (1986) 49-52.
[36] P. F. Brown, The acoustic-modeling problem in automatic speech recognition, PhD Thesis, CMU-CS-87-125 (Carnegie Mellon University, 1987).
[37] Y. Normandin, Hidden Markov models, maximum mutual information estimation, and the speech recognition problem, PhD Thesis (McGill University, 1991).
[38] V. Valtchev, J. J. Odell, P. C. Woodland and S. J. Young, Lattice-based discriminative training for large vocabulary speech recognition, Proc. IEEE ICASSP96 II (1996) 605-609.
504
S. Katagiri & E. McDermott
[39] P. S. Gopalakrishnan, D. Kanevsky, A. Nadas, D. Nahamoo and M. A. Picheny, Decoder selection based on cross-entropies, Proc. ZEEE ICASSPSS I (1988) 2CL23. [40] S. Amari, A theory of adaptive pattern classifiers, IEEE Trans. EC EC-16 (1967) 299-307. [41] S. Katagiri, C.-H. Lee and B.-H. Juang, New discriminative training algorithms based on the generalized probabilistic descent method, in Neural Networks for Signal Processing, s.-Y. Kung and B.-H. Juang (eds.) (IEEE, 1991) 299-308. 142) B.-H. Juang and S. Katagiri, Discriminative learning for minimum error classification, IEEE Trans. S P 40 (1992) 3043-3054. (431 S. Katagiri, B.-H. Juang and A. Biem, Discriminative feature extraction, in Artificial Neural Networks for Speech and Vision, R. Mammone (ed.) (Chapman and Hall, 1994) 278-293. [44] W. Chou, B.-H. Juang and C.-H. Lee, Segmental GPD training of HMM based speech recognition, in Proc. ZEEE ZCASSP92 1 (1992) 473-476. [45] T. Komori and S. Katagiri, GPD training of dynamic programming-based speech recognizers, J. Acomt. Soc. Jpn. (E) 13 (1992) 341-349. [46] E. McDermott and S. Katagiri, Prototype-based discriminative training for various speech units, Proc. IEEE ICASSP92 1 (1992) 417-420. [47] E. McDermott and S. Katagiri, Prototype based discriminative training for various speech units, Computer Speech and Language 8 (1994) 351-368. [48] D. Rainton and S. Sagayama, Minimum error classification training of HMMsimplementation details and experimental results, J. Acomt. Soc. Jpn. (E) 13 (1992) 379-387. [49] B.-H. Juang, Automatic speech recognition: problems, progress & prospects, Handout at the 1996 IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing ( 1996). [50] A. Biem and S. Katagiri, Feature extraction based on minimum classification error/ generalized probabilistic descent method, in Proc. IEEE ICASSP93 2 (1993) 275-278. [51] X. Huang, M. Belin, F. Alleva and M. Hwang, Unified stochastic engine (USE) for speech recognition, Proc. IEEE ICASSP93 2 (1993) 636-639. (521 K.-Y. Su, T.-H. Chiang and Y.-C. Lin, A unified framework to incorporate speech and language information in spoken language processing, PTOC.IEEE ICASSPS2 1 (1992) 185-188. [53] T . Komori and S. Katagiri, A minimum error approach to spotting-based speech recognition, IEICE Trans. Inf. €4 Syst. E78-D (1995) 1032-1043. [54] T. Komori and S. Katagiri, A novel spotting-based approach to continuous speech recognition: minimum error classification of keyword-sequences, J. Acous. SOC.J p n . (E) 16 (1995) 147-157. [55] H. Watanabe and S. Katagiri, Discriminative subspace method for minimum error pattern recognition, in Neural Networks for Signal Processing V , F. Girosi, J. Makhoul, E. Manolakos and E. Wilson (eds.) (1995) 77-86. [56] H. Watanabe, T. Yamaguchi and S. Katagiri, A novel Approach to pattern recognition based on discriminative metric design, in Neural Networks for Signal Processing V , F. Girosi, J. Makhoul, E. Manolakos and E. Wilson (eds.) (1995) 48-57. [57] P.-C. Chang and B.-H. Juang, Discriminative training of dynamic programming based speech recognizers, ZEEE Trans. S A P 1 (1993) 135-143. [58] H. Sorenson and D. Alspach, Recursive Bayesian estimation using Gaussian sums, Automatica 7 (1971) 465-479. [59] S. Katagiri, A unified approach to pattern recognition, Proc. ZSANN94 (1994) 561-569.
3.2 Discriminative Training
-
Recent Progress i n . . . 505
[60] P.-C. Chang and B.-H. Juang, Discriminative template training for dynamic programming speech recognition, Proc. ICASSP92 (1992) 493-496. [61] Y. Tohkura, A weighted cepstral distance measure for speech recognition, IEEE Trans. ASSP 35 (1987) 301-309.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 507-534. Eds. C. H. Chen, L. F. Pau and P. S. P. Wang. © 1998 World Scientific Publishing Company
CHAPTER 3.3
STATISTICAL AND NEURAL NETWORK PATTERN RECOGNITION METHODS FOR REMOTE SENSING APPLICATIONS*
JON ATLI BENEDIKTSSON
Laboratory for Information Technology and Signal Processing, Department of Electrical and Computer Engineering, University of Iceland, Hjardarhagi 2-6, 107 Reykjavik, Iceland

Classification of remote sensing data by statistical methods and neural networks is discussed. For the statistical methods both pixel and spatial classifiers are considered. The performance of enhanced statistics is investigated in terms of feature extraction for the statistical classifiers. The feature extraction methods reviewed and applied are decision boundary feature extraction and discriminant analysis. The classification results obtained by using enhanced statistics in classification of hyperdimensional data are excellent and show the classifiers to be able to distinguish between classes with similar spectral properties. Classification methods based on consensus from several data sources are also considered with respect to classification of hyperdimensional data. The consensus theoretic methods need weighting mechanisms to control the influence of each data source in the combined classification. The weights are optimized in order to improve the combined classification accuracies. A nonlinear method which utilizes a neural network is used and gives excellent results in experiments.
Keywords: Classification, data fusion, feature extraction, neural networks, remote sensing, statistical enhancement.
1. Introduction
Remote sensing is the science of deriving information about an object from measurements made at a distance from the object, i.e. without actually coming in contact with it [1]. Usually remote sensing data are image data acquired from sensors on earth-orbiting satellites or airplanes. The sensors operate in the visible to microwave spectral range. The data obtained are pixels (picture elements) which are spectral in nature and give different characteristics of the data at different frequencies. The pixels are of some specific resolution (typically 10 to 30 m square) depending on the sensor used. Often data from multiple spectral ranges are used (multispectral data), since such data provide an efficient way of obtaining a wide variety of information from the data. For classification, the classes are specified by the analyst (information classes) and must be of information value.

*Supported in part by the Icelandic Research Council and the Research Fund of the University of Iceland.

The spectral
responses of the materials in the data characterize the information classes. However, the materials do not have a single spectral response; rather, they have a set of spectral responses that characterize the material. It is important to note that there is no such thing as a pure pixel in remote sensing data. The value of a pixel reflects what is on a specific area on the ground, but the size of this area depends on the resolution of the sensor. Consequently, mixed pixel considerations [2] are important in remote sensing research. Currently available remote sensing data are, e.g. multispectral, multitemporal, hyperdimensional or data from multiple sources. It is important to be able to design classifiers that can handle all these data types in order to extract as much information as possible from the data, and perform classification with sufficient accuracy. The multisource data can be data from different spectral sensors, geographic data, and even nonnumerical data such as ground cover maps. Methods need to be developed that can handle all these different data types and weight them according to their importance. On the other hand, the major problem for hyperdimensional data is that in all practical applications a limited number of training samples are available. A limited number of training samples usually poses a problem for the estimation of parameters such as mean vectors and covariance matrices which are used in designing a statistical classifier. If the parameters are not carefully estimated, the classifiers may not generalize well, i.e. they possibly give low accuracy for data outside of a training set or may not even work at all. Since there are usually never enough training samples available for classification of hyperdimensional data, it is necessary to apply methods that are based on estimating the best possible statistics. Here, a recently proposed method of enhanced statistics [3] is investigated in order to see how well the method performs on hyperdimensional data in terms of classification accuracy for several different classification methods. Neural networks for classification of remote sensing and geographic data will be discussed and applied in experiments. The principal reason for using neural network methods for classification is that these methods are distribution-free. Since remote sensing and geographic data are in general of multiple types, the data of various types can have different statistical distributions. The neural network approach does not require explicit modeling of each data type. In addition, neural network methods have been shown to approximate posterior probabilities in the mean-squared sense [4]. Therefore, neural networks are desirable alternatives to conventional statistical methods when it is difficult to model the data explicitly. Neural networks can be looked at as optimizers and may be used to combine and optimize the classification capabilities of several individual classifiers based on consensus theory [5,6]. Here, the use of neural networks for such optimization will be discussed. The organization of the chapter is as follows. First, statistical classifiers will be discussed. Feature extraction for statistical classifiers is then reviewed, followed by a discussion on statistical enhancement. Thereafter, the neural network approach is introduced and then a section on consensus theory follows. Finally, the methods are applied in experiments and conclusions are given.
2. Statistical Classifiers

Several different types of statistical classifiers have been proposed [7-10]. The two major categories are supervised and unsupervised classifiers. Supervised classifiers are based on learning with a teacher in contrast to unsupervised schemes that have no teacher. Unsupervised classifiers such as clustering algorithms are important in remote sensing especially for preprocessing. Here, however, supervised classifiers will almost solely be discussed. Supervised statistical classifiers can be split into two main groups, pixel and spatial classifiers, both of which will be reviewed here. Several variants of these general types have been proposed. One important group is layered or tree classification algorithms [11-13]. Only the most commonly used statistical classifiers in remote sensing are discussed in this section.
2.1. The Gaussian Maximum Likelihood Classifier

A pattern is most commonly thought of as something having spatial or geometrical character. An n-dimensional pattern in the remote sensing context is a measurement vector X:

X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}

where x_i, i = 1, ..., n, corresponds to the ith measurement (the measurement from the ith wavelength band or sensor channel) on a given ground resolution element. In the statistical framework, a two-class pattern recognition problem can be formulated as

p_1(\omega_1 \mid X) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; p_2(\omega_2 \mid X)     (2.1)

where X is a measurement vector, ω_i denotes class i, and p_i(ω_i|X) is the conditional probability of class i given X (posterior probability) [7,10]. According to (2.1) a measurement vector X is said to belong to the class i which maximizes the posterior probability p_i(ω_i|X). Using Bayes' theorem, (2.1) can be written for pattern recognition purposes as

p_1(X \mid \omega_1)\, P_1 \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; p_2(X \mid \omega_2)\, P_2     (2.2)

where P_i stands for the prior probability of class ω_i. Equation (2.2) is the most common form of the Bayes classifier for minimizing error [2]. To use (2.2) it is necessary to know the class conditional probability density functions p_i(X|ω_i). Either parametric or nonparametric approaches can be used for that purpose [7,8].
In the parametric case, it is often convenient to use the Gaussian assumption, i.e. to assume the classes are Gaussian distributed. The multidimensional Gaussian distribution is given by

p_i(X \mid \omega_i) = \frac{1}{(2\pi)^{n/2}\,|\Sigma_i|^{1/2}} \exp\!\left[-\tfrac{1}{2}(X - M_i)^T \Sigma_i^{-1} (X - M_i)\right]     (2.3)

for a pattern from class ω_i with mean vector M_i, covariance matrix Σ_i, and inverse covariance matrix Σ_i^{-1}, where T is the transpose operator. By taking logarithms of both sides in (2.3) and inserting the result into (2.2), straightforward calculations lead to the Gaussian ML (maximum likelihood) classifier (shown here for a two-class problem)

\ln|\Sigma_1| + (X - M_1)^T \Sigma_1^{-1}(X - M_1) - 2\ln P_1 \;\underset{\omega_1}{\overset{\omega_2}{\gtrless}}\; \ln|\Sigma_2| + (X - M_2)^T \Sigma_2^{-1}(X - M_2) - 2\ln P_2 \,.     (2.4)

Equation (2.4) can easily be extended to multiclass problems.
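For readers who prefer a computational view, a minimal NumPy sketch of the discriminant in (2.4) is given below. The function names, and the assumption that class means, covariance matrices and prior probabilities have already been estimated from training samples, are illustrative only and are not part of the original text.

import numpy as np

def gaussian_ml_discriminant(X, mean, cov, prior):
    # ln|Sigma_i| + (X - M_i)^T Sigma_i^{-1} (X - M_i) - 2 ln P_i, as in (2.4); smaller is better
    d = X - mean
    sign, logdet = np.linalg.slogdet(cov)
    return logdet + d @ np.linalg.inv(cov) @ d - 2.0 * np.log(prior)

def classify_ml(X, means, covs, priors):
    # Assign X to the class whose discriminant value is smallest (largest posterior)
    scores = [gaussian_ml_discriminant(X, m, c, p) for m, c, p in zip(means, covs, priors)]
    return int(np.argmin(scores))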
2.2. The Minimum Euclidean Distance Classifier

For a two-class problem, the minimum Euclidean distance (MD) classifier is based on the decision rule

(X - M_1)^T (X - M_1) \;\underset{\omega_1}{\overset{\omega_2}{\gtrless}}\; (X - M_2)^T (X - M_2) \,.     (2.5)
When X is Gaussian distributed and Σ_1 = Σ_2 = I, where I is the identity matrix, the Gaussian ML classifier becomes the MD classifier.
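The corresponding minimum-distance rule of (2.5) needs only the class mean vectors; a short illustrative sketch (again assuming NumPy and using hypothetical names) is:

import numpy as np

def classify_md(X, means):
    # Minimum Euclidean distance rule (2.5): pick the class with the closest mean
    d2 = [float((X - m) @ (X - m)) for m in means]
    return int(np.argmin(d2))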
2.3. The ECHO Classifier

The ML and MD classifiers above are based on extracting spectral variations from the data. Those classifiers are sometimes referred to as pixel classifiers. Other methods are based on using spatial information or context in classification of image data. The spatial information can be grouped in two distinct domains: (1) the information of spatial correlation of gray variations, and (2) the information of spatial dependency of class labels [14]. Neither the gray variations nor the class labels of the image are deterministic. Therefore, a stochastic model is often desirable for representing the random variations in either case, and a random field model has become popular in this regard [14-16]. The basic idea of random field modeling is to capture the intrinsic character of images with relatively few parameters in order to understand the nature of the image and provide either a model of gray value dependencies for texture analysis or a model of context for decision making. By introducing these spatial parameters along with spectral models into the classification process, improvement in classification accuracy can be achieved.
The ECHO (extraction and classification of homogeneous objects) classifier [17] is an example of a spatial classifier which has been applied successfully in classification of remote sensing data. The ECHO classifier incorporates not only spectral variations but also spatial ones in the decision-making process. It uses a two-stage process, first segmenting the scene into statistically homogeneous regions, then classifying the data based upon a ML classification scheme. The ECHO classifier uses exactly the same training procedures and statistics as a conventional ML pixel classifier. However, it offers selectable parameters for the analyst to vary the degree and character of the spatial relationship used in the classification. By selecting these parameters, one can vary the spatial characteristics used from little or none to those incorporating image texture over a block area surrounding each pixel. With a proper choice of parameter values, the ECHO classifier usually provides higher accuracies than pixel classifiers and frequently requires less computation time for a given classification task.
3. Feature Extraction

Feature extraction is an important preprocessing method for classification and can be viewed as finding a set of vectors that represent an observation while reducing the dimensionality. In pattern recognition, it is desirable to extract features that are focused on discriminating between classes. Although a reduction in dimensionality is desirable, the error increment due to the reduction in dimension has to be kept small, i.e. the reduction should not sacrifice the discriminative power of the classifiers. The development of feature extraction methods has been one of the most important problems in the field of pattern analysis and has been studied extensively. Feature extraction methods can be both unsupervised and supervised, and also linear and nonlinear. Here we concentrate on linear feature extraction methods. Below, three different linear feature extraction methods are discussed. For all these methods a feature matrix is defined and the eigenvalues of the feature matrix are ordered in a decreasing order along with their corresponding eigenvectors. The number of new dimensions corresponds to the number of eigenvectors selected [7]. The transformed data are determined by

Y = \Phi X \,,     (3.1)
where Φ is the transformation matrix composed of the eigenvectors of the feature matrix, X is the pattern in the original feature space, and Y is the transformed pattern in the new feature space.

3.1. Principal Component Analysis
One of the most widely used transforms for signal representation and data compression is the principal component (Karhunen-Loeve) transformation. To find the necessary transformation from X to Y in (3.1), the global covariance matrix for the original data set, Σ_X, is estimated. Then the eigenvalue-eigenvector
decomposition of the covariance matrix Σ_X is determined, that is,

\Sigma_X = \Phi^T \Lambda \Phi \,,     (3.2)

where Λ is a diagonal matrix with the eigenvalues of Σ_X in decreasing order and Φ^T is a normalized matrix with the corresponding eigenvectors of Σ_X. With this choice of the transformation matrix in (3.1), it is easily seen that the covariance matrix for the transformed data is Σ_Y = Λ. Although the principal component transformation is optimal for signal representation in the sense that it provides the smallest mean squared error for a given number of features, the features defined by this transformation are not optimal with regard to class separability [7]. In feature extraction for classification, it is not the mean squared error but the classification accuracy that must be considered as the primary criterion for feature extraction.
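A minimal sketch of the principal component transform of (3.1)-(3.2) is given below, assuming NumPy and a data matrix with one pattern per row; the mean removal step and the function name are illustrative choices, not prescribed by the text.

import numpy as np

def pca_transform(X, n_features):
    # Estimate the global covariance Sigma_X and keep the leading eigenvectors
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]               # decreasing order, as in (3.2)
    Phi = eigvecs[:, order[:n_features]].T          # rows of Phi are the selected eigenvectors
    return Xc @ Phi.T                               # Y = Phi X applied to every pattern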
3.2. Discriminant Analysis

The principal component transformation is based upon the global covariance matrix. Therefore, it is explicitly not sensitive to inter-class structure. It often works as a feature reduction tool because classes are frequently distributed in the direction of the maximum data scatter. Discriminant analysis is a method which is intended to enhance separability. A within-class scatter matrix, Σ_W, and a between-class scatter matrix, Σ_B, are defined [7]:

\Sigma_W = \sum_i P_i \Sigma_i     (3.3)

\Sigma_B = \sum_i P_i (M_i - M_0)(M_i - M_0)^T     (3.4)

M_0 = \sum_i P_i M_i \,.     (3.5)
The criterion for optimization may be defined as

J = \mathrm{tr}\!\left(\Sigma_W^{-1} \Sigma_B\right) \,,     (3.6)

where tr(·) denotes the trace of a matrix. New feature vectors are selected to maximize the criterion. The necessary transformation from X to Y in (3.1) is found by taking the eigenvalue-eigenvector decomposition of the matrix Σ_W⁻¹Σ_B and then taking the transformation matrix as the normalized eigenvectors corresponding to the eigenvalues in a decreasing order. However, this method does have some shortcomings. For example, since discriminant analysis mainly utilizes class mean differences, the feature vectors selected by discriminant analysis are not reliable if the mean vectors are close to one another. Since the lumped covariance matrix is used in the criterion,
discriminant analysis may lose information contained in class covariance differences. Also, the maximum rank of Σ_B is K − 1 for a K-class problem since Σ_B is dependent on M_0. Usually Σ_W is of full rank and, therefore, the maximum rank of Σ_W⁻¹Σ_B is K − 1. This indicates that at maximum K − 1 features can be extracted by this approach. Another problem is that the criterion function in (3.6) generally does not have a direct relationship to the error probability.

3.3. Decision Boundary Feature Extraction

Lee and Landgrebe [18] showed that both discriminantly informative features and discriminantly redundant features can be extracted from the decision boundary itself. They also showed that discriminantly informative feature vectors have a component which is normal to the decision boundary at least at one point on the boundary. Further, discriminantly redundant feature vectors are orthogonal to a vector normal to the decision boundary at every point on the boundary. In [18], a decision boundary feature matrix (DBFM) was defined to extract discriminantly informative features and discriminantly redundant features from the decision boundary. The rank of this DBFM is the smallest dimension where the same classification accuracy can be obtained as in the original feature space, and the eigenvectors of the DBFM corresponding to non-zero eigenvalues are the necessary feature vectors to achieve the same classification accuracy as in the original feature space [18]. A nonparametric procedure is used to find the decision boundary numerically. The normal vectors to the decision boundary, N_i, are estimated using a gradient approximation. Then the effective decision boundary feature matrix is estimated using the normal vectors as

\Sigma_{EDBFM} = \frac{1}{L} \sum_{i=1}^{L} N_i N_i^T \,,

where L is the number of points found on the effective decision boundary.
Next, the eigenvalue-eigenvector decomposition of the effective decision boundary feature matrix, Σ_EDBFM, is calculated and the normalized eigenvectors corresponding to non-zero eigenvalues are used as the transformation matrix from X to Y in (3.1). Theoretically, the eigenvectors corresponding to non-zero eigenvalues will give the same classification accuracy as in the original feature space. However, in practice, few eigenvalues are actually zero. Thus, a threshold is used.
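Once the unit normal vectors N_i to the effective decision boundary have been estimated numerically, the remaining steps of decision boundary feature extraction reduce to building the matrix above and thresholding its eigenvalues. The sketch below assumes NumPy, that the normals are supplied as rows of an array, and that the threshold is expressed as a fraction of the total eigenvalue sum; all names are hypothetical.

import numpy as np

def dbfe_transform(normals, energy=0.95):
    # Effective decision boundary feature matrix: average outer product of the unit normals
    N = np.asarray(normals)                         # shape (L, n)
    edbfm = (N.T @ N) / N.shape[0]
    eigvals, eigvecs = np.linalg.eigh(edbfm)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep enough eigenvectors to reach the requested fraction of the eigenvalue sum
    k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), energy)) + 1
    return eigvecs[:, :k].T                         # rows are the selected feature vectors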
4. Statistical Enhancement of Remotely Sensed Data

A well-trained classifier must successfully model the distribution of the entire data set, but the modeling must be done in such a way that the different classes of interest are as distinct from one another as possible. Therefore, it is desired to have the density function of the entire data set modeled as a mixture of class densities [3], i.e. for a K-class problem

p(X \mid \theta) = \sum_{i=1}^{K} \alpha_i \, p_i(X \mid \theta_i) \,,     (4.1)
where X is the measured feature vector, p is the probability density function describing the entire data set to be analyzed, α_i is the weighting coefficient for class i, and p_i is a class conditional density. For a well-trained classifier, the probability density function of the entire data set, p(X|θ), can be modeled by a combination of K Gaussian densities. Therefore, in (4.1), θ_i contains the mean vector and covariance matrix for a Gaussian component. However, it is critical that (4.1) is a good model for the classes in the data. Shahshahani and Landgrebe [3] accomplished the modeling by an iterative calculation based on both the training samples and a systematic sampling of all the pixels in the scene. Their method is called enhanced statistics. In the method, the statistics are adjusted or enhanced so that, while still being defined by the training samples, the collection of class conditional statistics better fits the entire data set. This amounts to a hybrid, supervised/unsupervised training scheme with at least three possible benefits [3]: (1) The process tends to make the training set more robust, providing improved generalization to data other than the training samples, (2) the process tends to mitigate the Hughes phenomenon [7], thus allowing one to obtain greater accuracy with a limited training set, and (3) an estimate is obtained for the prior probabilities of the classes. The estimate is a result of the use of the unlabeled samples, and is something that cannot be computed from training samples alone.
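To make the mixture model of (4.1) concrete, the sketch below performs one EM-style update of the mixing weights and Gaussian component parameters from a set of unlabeled samples. It is only an illustration of the kind of iterative calculation involved; the actual enhanced-statistics procedure of [3] additionally anchors the components to the training samples. NumPy and SciPy are assumed, and all names are hypothetical.

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, alphas, means, covs):
    # E-step: posterior probability of each Gaussian component for every unlabeled sample
    K = len(alphas)
    dens = np.column_stack([alphas[i] * multivariate_normal.pdf(X, means[i], covs[i])
                            for i in range(K)])
    post = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixing weights (prior estimates), means and covariance matrices
    Nk = post.sum(axis=0)
    new_alphas = Nk / X.shape[0]
    new_means = [(post[:, i] @ X) / Nk[i] for i in range(K)]
    new_covs = [((X - new_means[i]).T * post[:, i]) @ (X - new_means[i]) / Nk[i]
                for i in range(K)]
    return new_alphas, new_means, new_covs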
5. Neural Network Classifiers

Neural network classifiers have been demonstrated to be attractive alternatives to conventional classifiers for classification of remote sensing and geographic data [19-21]. A neural network is an interconnection of processing units called neurons. Each neuron receives input signals, x_j, j = 1, 2, ..., N, which represent the activity at the input or the momentary frequency of neural impulses delivered by another neuron to this input [22]. In the simplest formal model of a neuron, the output value or the frequency of the neuron, o_i, is often approximated by the function

o_i = C\,\phi\!\left(\sum_{j=1}^{N} w_{ij} x_j - \theta_i\right) \,,     (5.1)

where C is a constant and φ is a nonlinear function, e.g. the threshold function which takes the value 1 for positive arguments and 0 (or −1) for negative arguments. The w_ij's are called synaptic efficacies or weights, and θ_i is a threshold. A one-layer neural network has only one layer of weights and no hidden neurons, but a multilayer network has many layers of weights and one or more layers of hidden neurons [23]. In the neural network approach to pattern recognition the neural network operates as a black box which receives a set of input vectors (observed signals) and produces responses o_i from its output neurons i (i = 1, ..., L, where L depends on the number of information classes). A general idea followed in training of neural networks is that the desired outputs are either o_i = 1, if neuron i is active for the current input vector, or o_i = 0 (or −1) if it is inactive. The weights are
learned through an adaptive (iterative) training procedure in which a set of training samples is presented to the input. A neural network gives an output response for each sample. The actual output response is compared to the desired response for the input, and the error between the desired output and the actual output is used to modify the weights in the neural network. The training procedure ends when the error is reduced to a prespecified threshold or it cannot be minimized any further. Then, all of the data are fed into the network to perform the classification, and the network provides at the output the class representation for each input vector. Neural network classifiers are distribution-free and that is very important, especially when parametric modeling cannot be applied. A neural network with one layer can be used to discriminate linearly separable data (two-layer neural networks can form decision regions which are convex). By applying neural networks with two or more layers, arbitrarily shaped decision regions can be formed. The backpropagation algorithm (or the multilayer perceptron) [22] is the best known neural network algorithm. It is a multilayer neural network algorithm which can be used to discriminate data that are not linearly separable. A problem with backpropagation, however, is that its training process can be computationally very complex and convergence may be slow since the training process is iterative. This is a serious drawback, especially when the dimensionality of the data is very high. Most neural network methods are based on the optimization (minimization) of a cost functional. The most commonly used minimization approach is gradient descent optimization of the cumulative squared error at the output of the network. Gradient descent has been shown to be computationally wasteful. Here, we apply the less wasteful conjugate-gradient optimization [24].
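Before turning to remote sensing applications, a literal reading of the neuron model in (5.1) is easy to express in code; the threshold nonlinearity and the constant C below are placeholders, NumPy is assumed, and the names are hypothetical.

import numpy as np

def neuron_output(x, w, theta, C=1.0):
    # o_i = C * phi(sum_j w_ij x_j - theta_i), with phi the threshold nonlinearity of (5.1)
    return C * (1.0 if w @ x - theta > 0 else 0.0)

def layer_outputs(x, W, thetas):
    # One layer of neurons sharing the same nonlinearity; W holds one weight vector per neuron
    return np.array([neuron_output(x, w, t) for w, t in zip(W, thetas)])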
5.1. Neural Networks in Classification of Remote Sensing Data

Several authors have used the neural network approach in classification of remote sensing data. Most of the neural networks have been based on backpropagation. Paola and Schowengerdt [25] have reviewed the use of backpropagation for remote sensing and have concluded that, although the neural network algorithm has several unique capabilities, it will become a useful tool in remote sensing only if it is made faster, more predictable, and easier to use. Wilkinson et al. [26] have successfully integrated neural and spatial approaches for classification. Benediktsson et al. [21] have used several different networks for classification of hyperdimensional data, and Gopal and Fischer [27] have used different neural learning schemes in comparison with statistical methods and found the fuzzy ARTMAP to perform best in classification of an urban area. Neural networks have particularly shown promise in the fusion of multisource data. The principal reason for using neural network methods for classification of multisource remote sensing/geographic data is that these methods are distribution-free. Since multisource data are in general of multiple types, the data from the various sources can have different statistical distributions. The neural network approach
does not require explicit modeling of the data from each source. In addition, neural network methods have been shown to approximate class-conditional probabilities in the mean-squared sense [4]. Consequently, there is no need to treat the data sources independently as in many statistical methods. Benediktsson et al. [5] successfully used a parallel neural network architecture based on backpropagation and compared their results to statistical methods. Wan and Fraser [28] have proposed a method which can be used for data fusion, multitemporal classification, and contextual classification based on self-organizing feature maps (SOM). Chiuderi et al. [29] used Kohonen maps and counterpropagation in data fusion. Carpenter et al. [30] have proposed the use of ART neural networks in classification of multisource data.

6. Hybrid Statistical/Neural Network Methods Based on Consensus Theory

6.1. Consensus Theory
Consensus theory [31-35] is a well-established research field involving procedures with the goal of combining single probability distributions [36] to summarize estimates from multiple experts (data sources). Consensus theory relies on the assumption that the experts make decisions based on Bayesian decision theory and is related to the theory of stacked generalization [37]. In most consensus theoretic methods each data source is at first considered separately. For a given source an appropriate training procedure can be used to model the data by a number of source-specific densities that will characterize that source. The data types are assumed to be very general. The source-specific classes or clusters are therefore referred to as data classes, since they are defined from relationships in a particular data space. In general there may not be a simple one-to-one relation between the user-desired information classes and the set of data classes available since the information classes are not necessarily a property of the data. In consensus theory, the information from the data sources is aggregated by a global membership function, and the data are classified according to the usual maximum selection rule into the information classes. The combination formula obtained is called a consensus rule. Several consensus rules have been proposed. Probably the most commonly used consensus rule is the linear opinion pool (LOP) which has the following (group probability) form for the information class ω_j if n data sources are used:
C_j(X) = \sum_{i=1}^{n} \lambda_i \, p_i(\omega_j \mid x_i) \,,     (6.1)

where X = [x_1, ..., x_n] is a compound vector consisting of observations from all the data sources and the λ_i (i = 1, ..., n) are source-specific weights which control the relative influence of the data sources. The weights are associated with the sources in the global membership function to express quantitatively the goodness of each source.
The linear opinion pool has a number of appealing properties. For example, it is simple, yields a probability distribution, and the weight λ_i reflects in some way the relative expertise of the ith expert. Also, if the data sources have absolutely continuous probability distributions, the linear opinion pool gives an absolutely continuous distribution. In using the linear opinion pool, it is assumed that all of the experts observe the input vector X. Therefore, (6.1) is simply a weighted average of the probability distributions from all the experts and the result is a combined probability distribution. The linear opinion pool, though simple, has several weaknesses; e.g. it shows dictatorship when Bayes' theorem is applied, i.e. only one data source will dominate in making a decision. It is also not externally Bayesian (does not obey Bayes' rule). The reason it is not externally Bayesian is that the linear opinion pool is not derived from the joint probabilities using Bayes' rule. Another consensus rule, the logarithmic opinion pool (LOGP), has been proposed to overcome some of the problems with the linear opinion pool. The logarithmic opinion pool can be described by

L_j(X) = \prod_{i=1}^{n} p_i(\omega_j \mid x_i)^{\lambda_i}     (6.2)

or

\log L_j(X) = \sum_{i=1}^{n} \lambda_i \log p_i(\omega_j \mid x_i) \,,     (6.3)

where λ_1, ..., λ_n are weights which should reflect the goodness of the data sources.
The logarithmic opinion pool differs from the linear opinion pool in that it is unimodal and less dispersed. Also, the logarithmic opinion pool treats the data sources independently. Zeros in the LOGP are vetoes; i.e. if any expert assigns p_i(ω_j|x_i) = 0, then L_j(X) = 0. This dramatic behavior is a drawback if the density functions are not carefully estimated. The logarithmic opinion pool is externally Bayesian, but it is computationally more complicated than the linear opinion pool.
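Both consensus rules are straightforward to evaluate once each data source has produced class posteriors. A sketch of (6.1) and (6.2)/(6.3) follows; NumPy is assumed, the posteriors are stacked one row per source, and the function names are hypothetical.

import numpy as np

def linear_opinion_pool(posteriors, weights):
    # (6.1): weighted average of the source-specific posteriors, one row per source
    return np.average(posteriors, axis=0, weights=weights)

def log_opinion_pool(posteriors, weights):
    # (6.2)/(6.3): weighted product; a zero posterior from any source vetoes the class
    return np.prod(posteriors ** np.asarray(weights)[:, None], axis=0)

The final decision is then the usual maximum selection rule over the information classes, e.g. np.argmax(log_opinion_pool(P, lam)) for a stacked posterior array P and weight vector lam.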
6.2. Weight Selection Schemes in Consensus Theory

The previous section focused on consensus rules, but the weight selection schemes for these rules were not addressed. The weight selection schemes in consensus theory should reflect the goodness of the separate input data sources, i.e. relatively high weights should be given to data sources that contribute to high accuracy. There are at least two potential weight selection schemes. The first scheme is to select the weights such that they weigh the individual data sources but not the classes within the sources, e.g. use heuristic measures which rank the data sources according to their goodness. These heuristic measures might be, e.g. stage-specific classification accuracy of training data, overall separability or equivocation [31].
The second scheme is to choose the weights such that they not only weigh the individual data sources but also the classes within the sources [5]. This scheme consists of defining a function
Y = f(Z, Λ) ,

where Z contains source-specific posterior discriminative information like p_i(ω_j|x_i) or log(p_i(ω_j|x_i)) and Λ is a weight matrix corresponding to the source-specific weights in (6.1) and (6.3). Here we are interested in the case where f is a nonlinear functional. Then, e.g. a neural network can be used to obtain a mean-square estimate of f, and the consensus theoretic classifiers with equal weights may be considered to preprocess the data for the neural network. The neural network is trained to learn the mapping from the source-specific posterior discriminative information to the information classes. Therefore, the neural network can be used to optimize the classification capability of the consensus theoretic classifiers.
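The nonlinear mapping Y = f(Z, Λ) can be realized by any trainable model that accepts the stacked source-specific discriminative information as input. The sketch below wires source-specific log-posteriors into a small multilayer network; scikit-learn is used purely for brevity and is an assumption of this illustration — the experiments reported later rely on conjugate-gradient backpropagation instead — and all names are hypothetical.

import numpy as np
from sklearn.neural_network import MLPClassifier

def consensus_features(source_posteriors, floor=1e-12):
    # Stack log-posteriors from all data sources into one feature vector per pixel;
    # source_posteriors is a list of arrays, each of shape (n_pixels, n_classes)
    return np.hstack([np.log(np.maximum(p, floor)) for p in source_posteriors])

# Hypothetical usage on labeled training pixels:
# Z_train = consensus_features([p_source1, p_source2, p_source3])
# combiner = MLPClassifier(hidden_layer_sizes=(40,), max_iter=1000).fit(Z_train, labels)
# predicted = combiner.predict(consensus_features([q_source1, q_source2, q_source3]))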
7. Experimental Results

The area used in the experiments is the region surrounding the volcano Hekla in Iceland [38,39]. Hekla is one of the most active volcanoes in Iceland. It sits on the western margin of the Eastern volcanic zone in South Iceland. Hekla is a ridge, built by repeated eruptions on a volcanic fissure, and reaches about 1500 m in elevation and about 1000 m above the surroundings. AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) data from the area were collected on June 17th 1991, which was a cloud-free day in the area covered. The AVIRIS sensor operates in the visible to mid infrared wavelength range, i.e. from 0.4 μm to 2.4 μm. It has 224 data channels and utilizes 4 spectrometers. During the data collection in Iceland in 1991, spectrometer 4 was not working properly. This particular spectrometer operates in the near-infrared wavelength range, from 1.84 μm to 2.4 μm (64 data channels). These 64 data channels were deleted from the data set along with the first channels of all the other spectrometers, which were blank. When the noisy and blank data channels had been removed, 157 data channels were left. Four full AVIRIS frames were used in the data analysis. Each frame consisted of 614 columns and 512 lines. Two data channels from the AVIRIS data are shown in Figs. 1 and 2. Figure 1 shows data channel 26 from the visible region of the spectrum (centered at 0.64 μm) while Fig. 2 illustrates data channel 138 from the mid infrared range (centered at 1.64 μm). The difference in spectral responses is apparent for these two data channels. For example, the white regions at the center of Fig. 1 are snow. The spectral response of snow is very low in the mid infrared range of the spectrum. Thus, the corresponding regions are dark in Fig. 2.
Fig. 1. AVIRIS data channel number 26.

Fig. 2. AVIRIS data channel number 138.
Table 1. Training and test samples for information classes in the experiment on the AVIRIS data.

 #   Information Class                         Training Size   Test Size
 1   Andesite Lava from 1991                        1659          1511
 2   Andesite Lava from 1980                        1182          1162
 3   Andesite Lava from 1970                         978           922
 4   Old Unvegetated Andesite Lava                  2562          2444
 5   Andesite Lava with Sparse Moss Cover           1008          1008
 6   Andesite Lava with Moss Cover                   528           495
 7   Andesite Lava with Thick Moss Cover            2863          2733
 8   Lichen Covered Basalt Lava                     1674          1023
 9   Rhyolite                                        202           202
10   Hyaloclastite                                  2062          1979
11   Scoria                                          275           275
12   Lava Covered with Tephra and Scoria             350           350
13   Volcanic Tephra                                1654          1608
14   Snow                                            528           484
15   Firn and Glacier Ice                            242           216
     Total                                         17767         16412
Fifteen information classes were defined in the area, and 34179 samples were selected from the classes. Approximately 50% of the reference samples were used for training, and the rest were used to test the data analysis algorithms (see Table 1). The statistical analyses were performed using MultiSpec^a on a Power PC Macintosh computer. The data were assumed to be Gaussian distributed. However, a problem with using conventional multivariate statistical approaches such as the Gaussian ML method for classification of hyperdimensional data is that these methods rely on having nonsingular class-specific covariance matrices. When n features are used, the training data for each class must include at least n + 1 independent samples to ensure that the matrices will be nonsingular. Therefore, the estimated covariance matrices may become singular in high-dimensional cases involving limited training samples.

^a MultiSpec is an interactive computer program for analysis of hyperdimensional remote sensing data which was developed by David A. Landgrebe and Larry Biehl of the School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana, U.S.A. For further information see http://dynamo.ecn.purdue.edu/~biehl/MultiSpec.html.

The statistical analysis of the data was performed in the following manner. First, DBFE (decision boundary feature extraction) and DA (discriminant analysis) feature extraction were performed. For both methods, the cumulative eigenvalues (variance) for the different feature sets are shown in Table 2. From Table 2 it is clear that much smaller feature sets were obtained with the DA
Table 2. Number of features as a function of the cumulative eigenvalues.

Cumulative Eigenvalues   DBFE   DA
        20%                 1    1
        30%                 2
        35%                 3
        40%                 4
        45%                 5
        50%                 7
        55%                 8
        60%                10
        65%                12
        70%                15
        75%                18
        80%                23    2
        85%                28    3
        90%                36    4
        95%                51    7
        99%                86   15
approach, as was expected. The feature sets in Table 2 were classified using the MD classifier, the Gaussian ML method and the ECHO classifier based on both original statistics and enhanced statistics. The results of the statistical classifications are shown in Figs. 3 to 16. Figures 3 and 4 show the training accuracies as a function of the number of features for the DBFE and DA methods, respectively, when original statistics were used, while Figs. 5 and 6 show the corresponding test results. Figure 7 displays the time of classification and training for the classification methods as a function of the number of features. From these figures it is evident that the ECHO method gave the best accuracies both for training and test data regardless of the feature extraction method applied. The ML classifier also showed good performance in terms of classification accuracies, but the MD method did not achieve high accuracies, as was expected. It is important to note that the ECHO method was also faster in training and classification than the ML algorithm when more than 50 features were used (see Fig. 7). The Hughes phenomenon was observed for all classification methods in Figs. 5 and 6, i.e. the test classification accuracies decreased if the number of features increased significantly when the statistics were based on the same training data in low- and high-dimensional spaces. Figures 8 and 9 show the training classification accuracies as a function of the number of features when enhanced statistics were used for the feature sets obtained by the DBFE and DA methods. The corresponding test results are shown in Figs. 10
Fig. 3. Classification accuracies for training data using original statistics and DBFE as a function of the number of features.

Fig. 4. Classification accuracies for training data using original statistics and DA as a function of the number of features.
and 11. From the results in Figs. 8 to 11 it can be seen that the ECHO method with enhanced statistics gave the highest accuracies for both training and test data regardless of the feature extraction method used. A similar trend was also seen for
Fig. 5. Classification accuracies for test data using original statistics and DBFE as a function of the number of features.
Fig. 6. Classification accuracies for test data using original statistics and DA as a function of the number of features.
original statistics. In addition the classification accuracies of the ECHO method improved the most by using enhanced statistics instead of original statistics. This result was observed regardless of the feature extraction method used.
Fig. 7. Time of classification for the different classifiers as a function of the number of features used.
Fig. 8. Classification accuracies for training data using enhanced statistics and DBFE as a function of the number of features.
When many features were used, the Hughes phenomenon was observed even for enhanced statistics. For example, the test accuracy of the ECHO with the DBFE (in Fig. 10) had its maximum at only 20 features and the Hughes phenomenon
Fig. 9. Classification accuracies for training data using enhanced statistics and DA as a function of the number of features.
Fig. 10. Classification accuracies for test data using enhanced statistics and DBFE as a function of the number of features.
was observed after that. Although the ECHO classifier showed excellent and improved performance in terms of classification accuracies with enhanced statistics, it is important to note that enhancing the statistics is a computationally intensive
Fig. 11. Classification accuracies for test data using enhanced statistics and DA as a function of the number of features.
Fig. 12. Time needed for enhancing of statistics as a function of the number of features.
iterative process. As seen in Fig. 12, the time needed for the enhancement increased significantly as the number of features grew. Here the same conditions (parameters) were used for the enhancement regardless of the number of features, resulting
Fig. 13. Number of iterations needed for enhancing of statistics as a function of the number of features.
(in most cases) in a reduced number of iterations needed as a function of the number of features (see Fig. 13). The probability maps obtained in classification were much improved by using enhanced statistics instead of original statistics, as can be seen in Figs. 14 and 15. These figures show that the classification accuracies are only one indicator to measure the performance of a particular classifier. In the ideal classification case, each pixel would be classified with high confidence, i.e. the likelihood of a pixel belonging to one particular class would be high. A probability map gives this information and displays the highest likelihood of classification for each pixel. In Fig. 14, the probability map for the ECHO classifier based on original statistics and 51 DBFE features is shown. In the figure, dark tone indicates low likelihoods in contrast to the light tone which indicates high likelihoods. It can be seen that in the image most of the highlighted areas are rectangular, which is due to the fact that these are the training areas. By looking just outside these areas the colors become darker, i.e. the highest likelihood for a particular pixel outside of a training area is in most cases significantly lower than inside a training area. It is important to design classifiers for the whole data set, i.e. classify pixels outside of the training areas with high likelihoods. That is accomplished by the use of enhanced statistics. Figure 15 shows the enhanced probability map corresponding to the original probability map in Fig. 14. Figure 15 is much lighter, which demonstrates the desired effect of enhanced statistics, i.e. the highest likelihoods for the pixels in the image have increased and the light rectangular areas are not as easily detected as before. It is interesting to note that the darkest areas in the enhanced image correspond to snow which was, however, classified correctly with high accuracy.
Fig. 14. Probability map: Original statistics.
Fig. 15. Probability map: Enhanced statistics.
Fig. 16. Classified map using 51 DBFE channels and the ECHO classifier.
Figure 16 shows a classified map generated by using the DBFE feature set corresponding to 95% cumulative eigenvalues and the ECHO classifier trained with enhanced statistics (corresponding to the probability map in Fig. 15). The classified map is not only excellent according to overall classification accuracies (training accuracy: 99.7%, test accuracy: 96.2%) but also according to geologists who know the area well. The map can in many respects be considered to be more detailed than existing geological maps. In fact, in the experiments the ECHO and ML classifiers could accurately distinguish between several geological units with very similar spectral properties. This was in particular the case for the classes Andesite Lava from 1991, 1980 and 1970 (classes 1, 2 and 3). In order to compare the statistical methods applied above to the consensus theoretic based LOGP, the data were subdivided into several uncorrelated "data sources". The correlations between the spectral channels can be visualized as shown in Fig. 17, where the brightness indicates correlation. The lighter the tone, the more correlated are the spectral bands (the black regions from channels number 106 to 110 are water absorption bands [7]). By analyzing Fig. 17, it was determined that three data sources should be used in the consensus theoretic classification. These three data sources (or feature sets) were not highly correlated with other data sources: (1) data source number 1: data channels number 1 to 53 (in the visible region of the spectrum), (2) data source number 2: data channels number 54 to 105 (in the near infrared region), and (3) data source number 3: data channels number 111 to 150 (mid infrared region). Therefore, data source number one consisted of 53 data channels, data source number two of 52 data channels, and data source number three of 40 data channels. The information classes were modeled by the Gaussian
Fig. 17. The correlation image for the 157 data channels in the AVIRIS imagery.
Table 3. Overall accuracies for ML classification methods applied to the individual sources.

Method                      Training Accuracy   Test Accuracy
Source #1 (53 channels)           98.2%             86.1%
Source #2 (52 channels)           95.4%             74.1%
Source #3 (40 channels)           88.4%             65.3%
Number of Samples                 17767             16412
distribution for all the data sources. The classification accuracies for the individual data sources using the ML approach are listed in Table 3. The classification accuracies obtained with the training data were used to select a heuristic weighting for the LOGP [31]. The heuristic weights were selected as 0.45 for source number one, 0.4 for source number two, and 0.35 for source number three. The LOGP was also optimized with conjugate gradient backpropagation (CGBP). The results for the LOGP and the statistical classifiers were also compared to results obtained by the use of individual conjugate gradient backpropagation with 40 hidden neurons (CGBP40) and conjugate gradient backpropagation with no hidden layer (CGBP0), trained on the whole data set (157 data channels). The CGBP40 achieved relatively high test accuracies (85.4%), higher than the ones achieved by the CGBP0 (78.9%), ML (82.2%), ECHO (82.2%), and MD (62.2%) algorithms applied to the 157 data channels. However, all the LOGP versions, i.e. the versions
based on equal weights, heuristic weights, and optimized by the CGBP, outperformed the individual CGBP40 in terms of test accuracies. The LOGP optimized with the CGBP0 achieved the highest overall test accuracy for the methods based on the 157 data channels (93.9%). It is interesting that the results obtained by using the CGBP0 (training: 97.4%, test: 93.9%) for optimization were better in terms of training and test accuracies than the optimization results achieved with the CGBP40 (training: 94.3%, test: 91.0%), which may indicate that the optimization problem is linear in this case. Another indication for the linearity is the fact that the heuristic and equal weighting of the LOGP also gave high accuracies (training: 99.7%, test: 93.0%). These results demonstrate that the LOGP is an attractive method for classification of hyperdimensional data. It avoids the Hughes phenomenon by using subsets of the data channels for training. Therefore, it is computationally much less demanding than the Gaussian ML applied to the whole data set. Here, the LOGP methods also showed excellent improvement in terms of test accuracies when compared to the single source classifications in Table 3. Although the individual multilayer CGBP40 classifier performed well in this experiment, it was outperformed by other classifiers in terms of accuracies.
Table 4. Overall training and test accuracies for the classification methods applied in the experiments.

Method                               Training Accuracy   Test Accuracy
ML (157 channels)                         100.0%             82.2%
MD (157 channels)                          68.0%             62.2%
ECHO (157 channels)                       100.0%             82.2%
CGBP40 (157 channels)                      95.7%             85.4%
CGBP0 (157 channels)                       90.1%             78.9%
DA-ML (Original)                           95.2%             89.2%
DBFE-ML (Original)                         99.9%             94.3%
DA-ML (Enhanced)                           91.5%             86.8%
DBFE-ML (Enhanced)                         99.4%             94.5%
DA-ECHO (Original)                         98.2%             96.1%
DBFE-ECHO (Original)                       99.9%             94.5%
DA-ECHO (Enhanced)                         97.2%             96.7%
DBFE-ECHO (Enhanced)                       99.7%             96.2%
LOGP (Equal Weights)                       99.7%             93.0%
LOGP (Heuristic Weights)                   99.7%             93.0%
LOGP (Optimized with CGBP40)               94.3%             91.0%
LOGP (Optimized with CGBP0)                97.4%             93.9%
Number of Samples                          17767             16412
The neural network approach should be considered more appropriate for classification of data that cannot be modeled relatively easily by a statistical model, e.g. multisource data, or be used as an optimizer for other classifiers. The summarized training and test classification results for the experiments are listed in Table 4. In the table, "Enhanced" and "Original" DA and DBFE methods refer to the feature sets determined with 95% cumulative eigenvalues. For the DBFE method this feature set consisted of 51 dimensions but the DA set was seven-dimensional. From the table it can be seen that the ML and ECHO classifiers outperformed all other methods in terms of classification accuracies of training data when the reduced feature sets were used. On the other hand, these methods were not as successful as the multilayer neural network and the consensus theoretic classifiers in terms of test accuracies when the original 157 data channels were used.

8. Conclusion

The use of enhanced statistics and feature extraction for AVIRIS data has been discussed. The classification results obtained by using enhanced statistics were excellent, especially for the ECHO classifier. The results demonstrate that enhanced statistics, feature extraction, and the choice of classification method are all important when hyperdimensional data, such as AVIRIS data, are analyzed. The experimental results also showed that consensus theoretic classification methods can be appropriate when hyperdimensional data, such as AVIRIS data, are classified. Consensus theory overcomes two of the problems with conventional statistical classifiers. First, using subsets of the data as individual data sources reduces the computational burden of a multivariate statistical classifier. Second, a smaller data set can also help in providing better statistics for the individual sources when the same number of training samples are used as for the original data. Here, hybrid statistical/neural network consensus theoretic classifiers outperformed all other methods based on the original 157 band data set in terms of classification accuracies of test data. These results not only demonstrate that the hybrid consensus theoretic approach is attractive for classification but also show the optimization capabilities of neural networks. However, statistical methods based on feature extraction and enhanced statistics usually give better performance when a convenient statistical model can be determined.
Acknowledgements This work was done in part while Dr. Benediktsson was a visiting scholar in the School of Electrical and Computer Engineering, Purdue University, W. Lafayette, Indiana. The assistance of Prof. David A. Landgrebe and Larry Biehl of the School of Electrical and Computer Engineering is in particular acknowledged. The author also wishes to thank his colleagues at the University of Iceland, Dr. Johannes R. Sveinsson and Dr. Kolbeinn Arnason for their contribution.
References

[1] D. A. Landgrebe, The quantitative approach: concept and rationale, in Remote Sensing - The Quantitative Approach, P. H. Swain and S. Davis (eds.) (McGraw-Hill, New York, 1978) 1-20.
[2] B. M. Shahshahani and D. A. Landgrebe, Classification of multi-spectral data by joint supervised-unsupervised learning, Technical Report TR-EE 94-1, School of Electrical Engineering, Purdue University, 1994.
[3] B. M. Shahshahani and D. A. Landgrebe, The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon, IEEE Trans. Geoscience and Remote Sensing 32 (1994) 1087-1095.
[4] D. W. Ruck, S. K. Rogers, M. Kabrisky, M. E. Oxley and B. W. Suter, The multilayer perceptron as an approximation to a Bayes optimal discrimination function, IEEE Trans. Neural Networks 1 (1990) 296-298.
[5] J. A. Benediktsson, J. R. Sveinsson, O. K. Ersoy and P. H. Swain, Parallel consensual neural networks, IEEE Trans. Neural Networks 8 (1997) 54-64.
[6] J. A. Benediktsson, J. R. Sveinsson and P. H. Swain, Hybrid consensus theoretic classification, IEEE Trans. Geoscience and Remote Sensing 37 (1997) 833-843.
[7] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. (Academic Press, New York, 1990).
[8] J. A. Richards, Remote Sensing Digital Image Analysis: An Introduction, 2nd ed. (Springer-Verlag, Berlin, 1993).
[9] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis (John Wiley and Sons, New York, 1973).
[10] P. H. Swain, Fundamentals of pattern recognition in remote sensing, in Remote Sensing - The Quantitative Approach, P. H. Swain and S. Davis (eds.) (McGraw-Hill, New York, 1978).
[11] S. R. Safavian and D. A. Landgrebe, A survey of decision tree classifier methodology, IEEE Trans. Syst. Man Cybern. 21 (1991) 660-674.
[12] C. E. Brodley and P. E. Utgoff, Multivariate decision trees, Machine Learning 19 (1995) 45-77.
[13] B. D. Ripley, Pattern Recognition and Neural Networks (Cambridge University Press, New York, 1996).
[14] Y. Juhng, Bayesian contextual classification of noise-contaminated multi-variate images, Ph.D. Thesis, School of Electrical Engineering, Purdue University, 1994.
[15] B. Jeon and D. A. Landgrebe, Classification with spatio-temporal interpixel class dependency contexts, IEEE Trans. Geoscience and Remote Sensing 30 (1992) 663-672.
[16] A. H. Schistad Solberg, T. Taxt and A. K. Jain, A Markov random field model for classification of multisource satellite imagery, IEEE Trans. Geoscience and Remote Sensing 34 (1996) 100-113.
[17] R. L. Kettig and D. A. Landgrebe, Classification of multispectral image data by extraction and classification of homogeneous objects, IEEE Trans. Geoscience Electronics 14 (1976) 19-26.
[18] C. Lee and D. A. Landgrebe, Feature extraction based on decision boundaries, IEEE Trans. Pattern Anal. Mach. Intell. 15 (1993) 388-400.
[19] J. A. Benediktsson, P. H. Swain and O. K. Ersoy, Neural network approaches versus statistical methods in classification of multisource remote sensing data, IEEE Trans. Geoscience and Remote Sensing 28 (1990) 540-552.
[20] J. A. Benediktsson, P. H. Swain and O. K. Ersoy, Conjugate-gradient neural networks in classification of multisource and very-high dimensional remote sensing data, Int. J. Remote Sensing 14 (1993) 2883-2903.
534
J. A . Benediktsson
[21] J. A. Benediktsson, J. R. Sveinsson and K. Arnason, Classification and feature extraction of AVIRIS data, IEEE %ns. Geoscience and Remote Sensing 33 (1995) 1194-1205. [22] C. M. Bishop, Neural Networks for Pattern Recognition (Clarendon Press, Oxford, 1995). [23] D. E. Rumelhart, G. E. Ilinton and R. J. Williams, Learning internal representation by error propagation, in Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Vol. 1,D. E. Rumelhart and J. L. McClelland (eds.) (MIT Press, Cambridge, MA 1986) 318-362. [24] E. Barnard, Optimization for training neural nets, IEEE Trans. Neural Networks 3 (1992) 232-240. [25] J. D. Paola and R. A. Schowengerdt, A review and analysis of backpropagation neural networks for classification of remotely-sensed multi-spectral imagery Int. J . Remote Sensing 16 (1995) 3033-3058. [26] G. G. Wilkinson, F. Fierens and I. Kanellopoulos, Integration of neural and statistical approaches in spatial data classification, Geog. Syst. 2 (1995) 1-20. [27] S. Gopal and M. Fischer, A comparison of three neural network classifiers for remote sensing classification. Proc. 1996 Int. Geoscience and Remote Sensing Symposium (IGARSS’96), Lincoln, Nebraska, May 27-31, 1996, 787-789. [28] W. Wan and D. Fraser, A SOM framework for multisource data and contextual analysis, Proc. 7th Australasian Remote Sensing Conf., 1994, 145-150. [29] A. Chiuderi, S. Fini and V. Cappellini, An application of data fusion to landcover classification of remote sensed imagery: a neural network approach, Proc. 1994 IEEE Int. Conf. Multisensor Fusion and Integration for Intell. Syst. Las Vegas, Nevada, 1994, 756-762. [30] G. A. Carpenter, M. N. Gjaja, S. Gopal and C. E. Woodcock, ART neural networks for remote sensing: vegetation classification from Landsat TM and terrain data, Proc. 1996 Int. Geoscience and Remote Sensing Symp. (IGARSS’96), Lincoln, Nebraska, May 27-31, 1996, 529-531. [31] J. A. Benediktsson and P. H. Swain, Consensus theoretic classification methods, ZEEE Trans. Syst. Man Cybern. 22 (1992) 688-704. [32] C. Berenstein, L. N. Kanal and D. Lavine, Consensus rules, in Uncertainty in Artificial Intelligence, L. N. Kanal and J. F. Lemmer (eds.) (North Holland, New York, 1986). [33] R. F. Bordley, Studies in mathematical group decision theory, Ph.D. Thesis, University of California, Berkeley, 1979. [34] C. Genest and J. V. Zidek, Combining probability distributions: a critique and annotated bibliography, Stat. Sci. 1 (1986) 114-118. [35] R. L. Winkler, Combining probability distributions from dependent information sources, Mgmt. Sci. 27 (1981). 479-488. [36] R. A. Jacobs, Methods for combining experts’ probability assessments, Neural Comput. 3 (1995) 867-888. [37] D. H. Wolpert, Stacked generalization, Neural Networks 5 (1992) 241-259. (381 J. A. Benediktsson, K. Arnason, A. Hjartarson and D. A. Landgrebe, Classification and feature extraction with enhanced statistics, Proc. 1996 Int. Geoscience and Remote Sensing Symp. (IGARSS’96), Lincoln, Nebraska, May 27-31, 1996, 414-416. [39] J. A. Benediktsson and J. R. Sveinsson, Optimized classification of data from multiple sources, Proc. NORSIG’96 Helsinki, Finland, September 24-27, 1996, 75-78.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 535-565
Eds. C. H. Chen, L. F. Pau and P. S. P. Wang
© 1998 World Scientific Publishing Company
CHAPTER 3.4
MULTI-SENSORY OPTO-ELECTRONIC FEATURE EXTRACTION NEURAL ASSOCIATIVE RETRIEVER
H.-K. LIU and Y.-H. JIN
Department of Electrical Engineering, University of South Alabama, Mobile, Alabama 36688-0002, USA

NEVILLE I. MARZWELL
Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California 91109-8099, USA

and SHAOMIN ZHOU
Beckman Instruments, Brea, California, USA

Optical pattern recognition and neural associative memory are important research topics for optical computing. Optical techniques, in particular those based on the holographic principle, are useful for associative memory because of their massive parallelism and high information throughput. The objective of this chapter is to discuss system issues including the design and fabrication of a multi-sensory opto-electronic feature extraction neural associative retriever (MOFENAR). The innovation of the approach is that images and/or 2-D data vectors from a multiple number of sensors may be used as input via an electrically addressed spatial light modulator (SLM), and hence processing can be accomplished in parallel with high throughput. A set of Fourier transforms of reference inputs can be selectively recorded in the hologram. Unknown images/data can then be applied to the MOFENAR for recognition. When convergence is reached after iterations, the output can either be displayed or used for post-processing computations. Experimental results are included that demonstrate the ability of the system to recognize and/or restore input images.

Keywords: Multi-sensory opto-electronic feature extraction, neural associative retriever, optical pattern recognition, associative memory, Fourier transforms.
1. Introduction
Optical pattern recognition and neural associative memory are important research areas for optical computing. Optical techniques [1-19], in particular those based on the holographic principle, are useful for associative memory because of their massive parallelism and high information throughput. Some examples include associative memory systems using four-wave mixing and self-pumped phase conjugation in photorefractive crystals and planar holograms. Associative holographic memory has also been proposed via angularly-multiplexed reference beams generated by optical fibers, conventional mirrors and beam splitters. There has been
much interest in the possibility of utilizing the advantages of optics to implement various computing architectures that display characteristics superior to those achieved using electronics. Optics offers the potential of more parallel input and processing architectures, and greater flexibility of design for high density interconnect structures. For the purpose of illustrating the operation of optical pattern recognition systems, several optoelectronic architectures have been described. One neural network architecture described in this chapter, referred to as "MOFENAR" (Multi-sensory Opto-electronic Feature Extraction Neural Associative Retriever), utilizes currently available spatial light modulator technology together with holographic and photorefractive material devices. These elements function together to allow associative memory operation through many parallel channels at once. The network allows the input of data simultaneously from different sensor sources. These data are compared simultaneously, with composite associations being made, based upon a previously recorded interconnect hologram located at the Fourier plane. The system capability makes possible the application of the MOFENAR architecture to the analysis of many different types of real input signals, either sequentially or simultaneously.
Fig. 1. A multi-input multi-channel optical pattern recognition system.
An example of the overall system architecture is represented by the block diagram shown in Fig. 1. The top part of the figure shows four CCD camera and novelty sensor units, each of which receives input independently of the others. When a new event appears in the field of view of one of the sensors, the sensor will alert the switching control unit with a time and date signal generator. The control unit can remotely control the video recorder for record keeping and for displaying the input signals from a space shuttle, as shown in the upper right corner of Fig. 1. In the meantime, the input signal is applied to the input ports of an electronically addressed spatial light modulator, as shown in the lower part of the system, which is redrawn in Fig. 2 and described below. Figure 2 shows a multi-channel pattern classification system architecture. A laser beam from a HeNe laser is collimated by a spatial filter and lens combination and is then divided into two beams, an object beam and a reference beam. The object beam illuminates the input pattern and is Fourier transformed by the lens at its focal plane, where a holographic matched filter is recorded. A multi-channel replication holographic optical element replicates the input into an N x N array of Fourier transforms at the matched filter plane. In the multi-channel pattern classification system the reference beam is angularly multiplexable. During the recording of the matched filter, only one object beam at a time is selected by an electronically controlled shutter-array pinhole spatial filter, and one reference beam angle is used to make the hologram. A different reference beam angle is used to record a different group of objects until all the objects are recorded. The developed hologram becomes the multi-channel matched filter. In the reading process, the matched filter is replaced exactly at its original location and the reference beam is turned off.
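The record-then-read sequence just described is, at its core, Fourier-plane matched filtering. The following minimal numerical sketch, with small hypothetical arrays standing in for the optical fields (not the authors' actual setup), shows how the correlation peak at the output plane identifies the stored pattern that best matches an input:

import numpy as np

rng = np.random.default_rng(1)
N = 64
# Three hypothetical reference patterns stored in the multi-channel matched filter.
references = [rng.standard_normal((N, N)) for _ in range(3)]
filters = [np.conj(np.fft.fft2(ref)) for ref in references]   # recording: conjugate spectra

probe = references[1] + 0.3 * rng.standard_normal((N, N))     # noisy version of pattern 1

# Reading: correlate the probe against every stored filter; the largest correlation
# peak at the output plane identifies the best-matching stored pattern.
peaks = [np.abs(np.fft.ifft2(np.fft.fft2(probe) * H)).max() for H in filters]
print(int(np.argmax(peaks)))   # expected: 1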
Fig. 2. A multi-channel pattern classification system architecture.
The signal representing the correlation between the input pattern and the pattern used in the recording of the matched filter can be found at the focal plane of the regenerated reference beam. These abilities allow the proposed architecture to be applied to many current fields of interest, including parallel database search, image and signal understanding and synthesis, robotic manipulation and locomotion, and natural language processing, in addition to real-time multi-channel pattern recognition. The MOFENAR architecture is based on iterative optical data array comparisons, and relies upon principles basic to optical correlation architectures such as the Vander Lugt and joint Fourier transform correlators. The MOFENAR, however, displays a high-level capability of optically multiplexing each input, comparing it with many different reference matrices, and creating an oscillatory optical resonance mode in which input variation and error are eliminated in the reconstructed output through a convergent nonlinear optical thresholding process. In the following, Sec. 2 contains a description of a MOFENAR prototype breadboard, including an analysis of its operational capability. Section 3 describes the specific hardware requirements for the MOFENAR prototype, including optical and electronic components, and both input and output data formats. Results of the neural associative retrieval system's computer simulation are interpreted and conclusions drawn from the simulation are presented. Laboratory experimental results of single-sensory and multi-sensory opto-electronic feature extraction neural associative retrievers are presented to support the theoretical predictions. Section 4 is the conclusion.

2. System Design and Operation
2.1. Experimental System Design
The MOFENAR may be described as a bi-directional feedforward/feedback architecture with multi-channel input and processing capability, as shown in Fig. 3.
Fig. 3. Prototype “MOFENAR” Neural Net Architecture.
An expanded, collimated laser beam is used to provide the input to the system. An Argon-ion (514 nm) and a frequency-doubled YAG (532 nm) laser are candidates to be selected as the source. The optimal operating wavelength and laser power are to be determined by the characteristics of the specific SLM and PCM obtained for construction of the Phase II prototype design. The input laser beam is spatially filtered and collimated to the required diameter to optimally couple the laser beam to the input spatial light modulator (SLM). In the signal processor, the input is arranged in a two-dimensional array. This data pattern is then electronically transferred to the SLM. The SLM at plane P1 spatially encodes this pattern onto the input laser beam. The laser beam passes through a beamsplitter (BS1) and a Fourier lens (L1), and is focused at the Fourier plane of L1 (P2). BS1 is a polarizing cube beamsplitter placed between the SLM device and L1. This beamsplitter transmits all incident light polarized in one particular orientation, and reflects at ninety degrees all incident light polarized in the orthogonal orientation. This device performs in the same manner regardless of the propagation direction of the incident light. The "Interconnect Matrix Hologram" (I.M.H.) is placed at the Fourier plane of L1. This hologram consists of a multiplexed array of transmission functions achieved via a translatable mask. This device multiplies the incident focused pattern from the input SLM by its transmission function. After the focused input pattern passes through and is multiplied by the I.M.H., it is again Fourier transformed by a second single convex lens, L2. At the Fourier plane of this lens, the distribution of light consists of the image of the input patterns recorded in the I.M.H. At this point a photorefractive optical crystal is placed, oriented, and illuminated by two opposing laser beams in such a way as to create a counter-propagating "phase conjugate beam" to each of the incident signal beams. This "phase conjugate mirror" (PCM1) also introduces a nonlinear weighting of reflectivity coefficients, which is dependent upon the intensity of the incident signal beam. This effect results in more intense incident signals being more heavily weighted than less intense signals. If PCM1 has a natural thresholding effect, then an additional device is not needed. Otherwise, a thresholding device is required (not shown) in front of PCM1. The counter-propagating beams generated by PCM1 exactly retrace their incident paths through the I.M.H. and onto the polarizing cube beamsplitter. The counter-propagating beam's polarization orientation at this point is orthogonal to the orientation of the original beam. This is accomplished by correct orientation of polarization of PCM1's "pump" beams, from which the phase conjugate beam is created. Thus, the counter-propagating phase conjugated beams are polarized orthogonally with respect to the input beam, and upon impinging the polarizing cube beamsplitter from the opposite direction, are reflected at ninety degrees. This reflected phase conjugate beam now passes through a second cube beamsplitter (BS2). This beamsplitter is not a polarizing cube beam splitter like BS1, but is structured such that a constant percentage of incident light is transmitted, and the
remainder is reflected, regardless of the incident beam's polarization orientation or direction of propagation. The transmission/reflection ratio may be chosen to be any desired value, and this will be maximized to the largest value possible while still providing sufficient reflected intensity to be recorded by a detector. This small reflected portion is imaged onto a high resolution camera placed one focal length (in optical path length) beyond the second convex lens. The large portion of the phase conjugated beam that is transmitted by the nonpolarizing cube beamsplitter is imaged onto a second PCM (PCM2), also located beyond BS2. PCM2 is identical in operation to PCM1, with the exception that the phase conjugate beam which is produced is polarized in the same orientation as the incident beam, unlike PCM1. This is accomplished by orienting the pump beams' polarization parallel to that of the incident beam. The counter-propagating phase conjugated beam produced by PCM2 retraces the optical path of the incident beam until it is again reflected by PCM1 and reverses its direction. Thus an iterative oscillatory optical wave is created, whose amplitude continuously gains intensity with each pass (assuming the amplification of the PCMs is greater than the losses in a single path) in a manner determined by the nonlinearity of the PCMs' reflectivity until the gain is saturated. At this point a steady state oscillation is established, and a constant output intensity distribution is detected by the camera, giving the neural network's "answer" to the question of the input's information content.

2.2. Experimental System Operation
The architecture described above demonstrates the associative recall capabilities of the MOFENAR neural net system, along with its fault-tolerance and error correcting capability. Large data storage capacity will be demonstrated in the form of the Interconnect Matrix Hologram. It can be seen by examining Fig. 3 that the two PCMs of the MOFENAR architecture form a partially reflective cavity in which the input beam will oscillate. The simplest example of such oscillation, and how it produces a useful optical output signal, is given here. If an input pattern or binary vector set is addressed to the input spatial light modulator, and the SLM is illuminated by the input laser, the pattern or vector set will be optically introduced into the MOFENAR system. The resultant intensity distribution is retro-reflected by the phase conjugate mirror, which can be described as imparting a "time reversal" upon the incident pattern, thus sending it back along the exact path by which it came. Also performed at this point, by the nature of the phase conjugation process, is a nonlinear thresholding of the optical intensity at plane P3. This thresholding is performed to weigh the most intense parts of the reconstructed image more heavily, to remove spurious signals and noise. By more heavily weighting the stronger signals, and "time reversing" the result, the original input image corresponding to these reconstructions will be favored.
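The thresholded, iterative favoring just described can be caricatured numerically. The sketch below is a simplified model, not the authors' simulation code: each loop iteration stands for one round trip between the PCMs, with the circulating field projected onto a set of hypothetical stored patterns and re-synthesized with an intensity-dependent weighting, so that the best-matching stored pattern gradually dominates.

import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_stored = 1024, 8
stored = rng.choice([-1.0, 1.0], size=(n_stored, n_pixels))   # hypothetical stored patterns
state = stored[3] + 1.5 * rng.standard_normal(n_pixels)       # noisy/partial input field

for _ in range(10):                              # one pass = one round trip between the PCMs
    overlaps = stored @ state                    # overlap of the field with each stored pattern
    weights = overlaps * np.abs(overlaps) ** 2   # intensity-dependent (nonlinear) gain
    state = weights @ stored                     # re-synthesized field, dominated by strong matches
    state /= np.linalg.norm(state)               # gain saturation

print(int(np.argmax(np.abs(stored @ state))))    # expected: 3 (the closest stored pattern)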
The "time reversed" beam created by PCM1 will, however, not travel back to the original input plane. Because of the intentional orthogonality of the pump beams used to create the phase conjugated, or "time reversed", input beam, this "time reversed" beam is itself polarized orthogonally to the incident beam, and is thus reflected upon its return by the polarizing cube beamsplitter in front of the I.M.H. This beam then travels through an equivalent distance to the plane of the second phase conjugate mirror, PCM2. At this point, a small constant percentage of the "time reversed" beam has been diverted to the output detector. This output consists of a reconstruction of the input which is more similar to its closest matches within the reference array. By successive iteration through the resonant cavity of the neural net, the output approaches exactly the reference pattern most closely represented by the input. The major attribute of the neural network is the high rate of information processing that may be achieved. This rate is a function of the degree of parallelism, or number of channels that are implemented, as well as the speed with which the architecture may cycle through successive input patterns. The two independent elements affecting the system speed are the input spatial light modulator and the phase conjugate mirrors. High resolution spatial light modulators are available which can operate continuously at several hundred frames per second, and operation at over two thousand frames per second has been demonstrated. There is evidence [20-33] indicating that the phase conjugate mirrors will pose a lower limit on system speed than the spatial light modulator, and we therefore examine here the parameters involved in the PCM response time, which is a function of the incident optical intensity, the pump beam optical intensity, and the required reflective gain of the PCM. We may calculate the optical gain required of the two PCMs by considering the losses involved in a single pass through the neural architecture. We refer again to Fig. 3. By requiring that the power at successive equivalent points in the iterative optical path be equal, we ensure a steady optical oscillation, which will converge to the desired output pattern, as described earlier. This analysis allows us to then solve for the required gain of the PCMs, which in turn gives indications of the time required by the PCMs to achieve such a gain. We characterize the individual components' power outputs, or throughputs for the passive components, by using values that are representative of what is currently available:

I_in = input laser intensity = 0.1 W/cm²
I_pump = PCM pump beam intensity = 1.0 W/cm²
T_SLM = input SLM transmission = 0.10
T_BS1 = beamsplitter 1 transmission = 0.90
T_IMH = average interconnect matrix hologram transmission = 0.10
G_PCM1 = phase conjugate mirror 1 gain = G1
T_BS2 = beamsplitter 2 transmission = 0.90
G_PCM2 = phase conjugate mirror 2 gain = G2
Using these values, together with the assumption that the I.M.H. multiplexes the input signal into an 11 x 11 array of patterns (see Section 3.1.2), we first find that the optical intensity in one of the recalled images of interest to the right of the I.M.H. in Fig. 3 is

I = (I_in x T_SLM x T_BS1 x T_IMH) / (# orders)
  = (100 mW x 0.1 x 0.9 x 0.1) / 121
  ≈ 7.44 µW.                                                        (1)
Following the reconstruction plane of interest through the prototype architecture, and imposing the same intensity value upon the light for the successive passage through the same point of the system (when the light has reflected off both PCMs one time, and is at the same point just beyond the I.M.H.), we obtain

I x G_PCM1 x T_IMH x T_BS1 x T_BS2 x G_PCM2 x T_BS2 x T_BS1 x T_IMH = I,            (2)

or

G_PCM1 x G_PCM2 x (T_IMH x T_BS1 x T_BS2)² = 1,                                     (3)

which requires that

(G_PCM1)(G_PCM2) ≈ 152.                                                             (4)
Thus to overcome the losses in each iterative path through the neural net architecture, a combined gain of 152 is required. This gain may be split between the two PCMs, and the factorization will be determined by the available intensity at each PCM along with the associated response time to achieve the required total gain. This is as far as a quantitative temporal analysis of the MOFENAR performance can practically go, for the following reasons. Photorefractive four wave mixing (the basis of phase conjugation) is a physical process which achieves a steady state optical condition when the input signal is applied constantly. The investigations that have been reported in the literature into the response of photorefractive crystals have for the most part concentrated on this steady state operation, where reflective gains of up to 100 have been reported. For the MOFENAR neural net architecture, we are instead concerned with the response of the photorefractive elements when presented with pulses of short duration, on the order of or less than the "response time" of the material itself.
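For convenience, the single-pass intensity of Eq. (1) and the round-trip gain condition of Eqs. (2)-(4) can be checked with a few lines of arithmetic. This is only a sketch using the representative component values listed above; the reflection efficiency of BS1 on the return path is assumed equal to its quoted 0.90 transmission.

# Loss/gain budget for the MOFENAR cavity, using the representative values above.
I_in   = 100e-3      # input laser intensity, W/cm^2 (0.1 W/cm^2 = 100 mW/cm^2)
T_SLM  = 0.10        # input SLM transmission
T_BS1  = 0.90        # beamsplitter 1 efficiency (transmission; reflection assumed equal)
T_BS2  = 0.90        # beamsplitter 2 transmission
T_IMH  = 0.10        # average interconnect matrix hologram transmission
orders = 11 * 11     # 11 x 11 replicated Fourier patterns

# Eq. (1): intensity in one recalled image just beyond the I.M.H.
I_single = I_in * T_SLM * T_BS1 * T_IMH / orders
print(f"{I_single * 1e6:.2f} uW")               # ~7.44 uW

# Eqs. (2)-(4): combined PCM gain needed to close the loop after one round trip.
G_required = 1.0 / (T_IMH * T_BS1 * T_BS2) ** 2
print(f"G_PCM1 * G_PCM2 ~ {G_required:.0f}")    # ~152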
3. Hardware Considerations and Experimental Demonstrations
The proposed MOFENAR architecture consists of currently available optoelectronic devices and device technology components combined to achieve high speed pattern recognition and optical signal processing. The hardware elements available today are described, and the system integration and software design requirements are discussed, in the following sections.

3.1. Hardware Considerations
As presented earlier, an important aspect of the proposed architecture is that it calls for readily available devices and device technology to construct a breadboard prototype system. The major elements of the system include the spatial light modulator, the multifocus Fourier transform architecture, the interconnect matrix hologram, and the photorefractive phase conjugate mirrors to be used at the ends of the iterative optical cavity. A discussion of the current performance capability and availability of each follows.

3.1.1. Spatial light modulator
The input data to the MOFENAR originates at one or more sensors measuring the field of interest. The signals from these sensors must be incorporated onto the input light beam. This is accomplished through the use of an addressable spatial light modulator. Such a device transfers an input optical or electronic signal onto a coherent collimated beam of light, in most cases a laser. Optically addressed devices include the Hughes Liquid Crystal Light Valve (LCLV). Electrically addressed devices include the Magneto-Optic SLM (MOSLM), the STC Ferroelectric Liquid Crystal SLM (FLCSLM), the liquid crystal television (LCTV) SLM, and the Texas Instruments Deformable Mirror Device (DMD). The areas which can be used to quickly determine the apparent device of choice are speed and availability. The liquid crystal devices are currently capable of no more than 100 frames/second. The higher speed devices are the Texas Instruments DMD and the MOSLM. We select the MOSLM as an example for the purpose of discussion. The MOSLM has also been shown to be a more robust device than any of the liquid crystal SLMs. Results of successful military shock and vibration tests performed upon the 128 x 128 array, as well as its documented wide temperature range of operation, were recently presented. Table 3.1 gives specifications of the MOSLM and the STC FLCSLM for comparison. The MOSLM outperforms the STC device not only in speed, but is also available in higher resolution, and achieves much higher contrast. The STC ferroelectric SLM demonstrates greater transmission and wavelength flexibility. The MOSLM device consists of an array of individually addressable pixels. These pixels may be switched to two oppositely magnetized homogeneous states, as well as to a heterogeneously magnetized state, referred to as the "neutral" magnetization state. The pixels operate by rotating the plane of polarization of incident
Table 3.1. Characteristics of two available 2-D spatial light modulators.

Performance Category     MOSLM               STC FLCSLM
Resolution               48², 128², 256²     128²
Contrast Ratio           > 10,000:1          > 100:1
Transmission             > 14%               > 65%
Wavelength Range         514 to 850 nm       0.5 to > 1 µm
Frame Rate               > 1000 f/s          > 100 f/s
light clockwise or counterclockwise, depending upon which of the two homogeneous magnetic states the pixel is switched to. When an incident linearly polarized beam is transmitted through the device, the output polarization is rotated clockwise or counterclockwise in a manner dictated by the pixels' switched states. This polarization modulation is translated to an amplitude or phase modulation by placing an analyzer polarizer in the path of the transmitted beam. When the polarizer's transmission axis is oriented perpendicular to one of the two output polarization angles, the light passing through pixels switched to that state is extinguished, while that passing through pixels switched to the opposite homogeneous state is partially transmitted. This modulation is referred to as "binary amplitude only" modulation, and is shown in Fig. 4.
Fig. 4. Principle of Operation of the MOSLM.
If, alternatively, the analyzer polarizer's transmission axis is oriented perpendicular to the bisector of the two output polarization angles, the transmitted amplitude is the same for all pixels, but the phase of light from pixels switched to one homogeneous state is 180 degrees out of phase with that of the light from pixels switched to the opposite homogeneous magnetization state. This modulation is referred to as "binary phase only" modulation. For non-binary operation, lower-speed liquid crystal devices need to be considered.
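The two analyzer settings can be verified with a small polarization calculation. This is only a sketch; the ±15° rotation angle is a hypothetical value, not a specification of any particular MOSLM.

import numpy as np

theta = np.deg2rad(15.0)                 # hypothetical MOSLM polarization rotation (+/- theta)
states = {"+": +theta, "-": -theta}      # the two homogeneous magnetization states

def transmitted(pol_angle, analyzer_angle):
    # Scalar field after an ideal analyzer: projection of the polarization vector.
    return np.cos(pol_angle - analyzer_angle)

# Binary amplitude modulation: analyzer perpendicular to one of the output polarizations.
analyzer_amp = +theta + np.pi / 2
print({s: round(float(transmitted(a, analyzer_amp)), 3) for s, a in states.items()})
# -> "+" state extinguished (0.0), "-" state partially transmitted

# Binary phase modulation: analyzer perpendicular to the bisector of the two outputs.
analyzer_phase = np.pi / 2
print({s: round(float(transmitted(a, analyzer_phase)), 3) for s, a in states.items()})
# -> equal magnitudes, opposite signs (a 180-degree relative phase shift)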
3.1.2. Multifocus Fourier transform device
To accomplish the desired replication of the input pattern's two-dimensional Fourier transform, the inherent structure of the MOSLM spatial light modulator may be utilized. However, to achieve intensity uniformity throughout the replication plane, an alternative holographic solution would consist of custom manufacturing a single hologram that would be placed after the input MOSLM, and which would result in the required uniform replication of the Fourier transformed input pattern. Figure 5 shows an example of Fourier replication due to pixelation of an input function. The computer simulation program used to demonstrate this feature assumes a fixed focal length, and thus only shows a fixed portion of the replicated field, but in fact the replication period may be adjusted at will by varying the Fourier lens.
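The replication period is set by the grating equation at the Fourier plane, spacing = λf/d, where d is the pixel pitch and f the focal length of the Fourier lens. A quick numerical illustration follows; the 633 nm wavelength matches the measurement quoted below, while the pixel pitch and focal length are purely illustrative assumptions.

# Spacing of the replicated Fourier patterns produced by the SLM pixel grid.
wavelength   = 633e-9     # m (as in the 633 nm measurement quoted below)
pitch        = 76.2e-6    # m, hypothetical MOSLM pixel spacing
focal_length = 0.300      # m, hypothetical Fourier-lens focal length

spacing = wavelength * focal_length / pitch
print(f"replication spacing = {spacing * 1e3:.2f} mm")   # ~2.49 mm for these values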
Fig. 5. Example of Fourier pattern replication due to input SLM pixel array.
The MOSLM, as well as other similar devices, contains opaque rows and columns between the rows and columns of active pixels. This becomes a two-dimensional grating when the MOSLM is used in a Fourier system, such as here, and the result is a replication at the Fourier plane of whatever image is programmed onto the MOSLM. The spacing of the replications depends upon the focal length of the Fourier transforming lens, the spacing of the pixels in the MOSLM, and the incident wavelength. Thus by choosing the focal length of the Fourier lens, we achieve a two-dimensional periodic replication of the desired Fourier transform pattern. The intensity of the replicated Fourier transforms falls off with increasing order. This falloff was experimentally measured using a 128 x 128 MOSLM array and an incident 633 nm laser beam, and the values are given in Table 3.2, along with the required recording time for each to cancel the intensity variation. If one assumes that the efficiency of the recorded hologram is proportional to the recording time, then by inverting the intensity value of each order, and using that number as a relative recording time, the net transmission will be a constant. Thus the I.M.H. hologram recorded using these first five orders of the MOSLM would require a recording time up to 100 times longer for the outer orders than for the strongest zero order replicated pattern.
Table 3.2. Experimental 128 x 128 array replicated output intensities, and calculated corrective recording times.

Order    Measured Intensity    Recording Time
 0       47.5 µW               1.00 T(0)
+1       27.3                  1.74
-1       35.5                  1.34
+2        3.00                 15.8
-2        7.95                  6.00
+3        1.40                 33.9
-3        2.00                 23.8
+4        0.45                105.0
-4        0.90                 52.8
+5        0.85                 55.9
-5        2.30                 20.7
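The corrective recording times in Table 3.2 are simply the zero-order intensity divided by each measured intensity (recording time taken to be inversely proportional to incident intensity), which can be checked directly:

# Corrective recording times T(m) proportional to I(0) / I(m), using the measured
# intensities from Table 3.2 (microwatts), listed in the order 0, +1, -1, ..., +5, -5.
measured_uw = [47.5, 27.3, 35.5, 3.00, 7.95, 1.40, 2.00, 0.45, 0.90, 0.85, 2.30]
times = [measured_uw[0] / i for i in measured_uw]
print([round(t, 2) for t in times])
# -> [1.0, 1.74, 1.34, 15.83, 5.97, 33.93, 23.75, 105.56, 52.78, 55.88, 20.65]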
Binary optic gratings are another applicable technology which potentially could be used to complement the effect of the SLM grating itself, making the replication intensity more uniform.
3.1.3. Interconnect matrix hologram (I.M.H.)
This element, placed at the Fourier plane of the MOFENAR architecture, must enact the required phase and amplitude modulation upon the incident optical beam to satisfy the learning rules which are to be implemented within the MOFENAR system. This element must also display high resolution recording capability in order to enact the desired transmission function at each of the locations of the replicated optical Fourier transform incident from the input MOSLM. The device will be encoded in the following manner. Refer to Fig. 6. A desired reference data pattern, a(x, y) + b(x, y), is first encoded on the spatial light modulator, which in turn encodes the pattern onto the
Fig. 6. Interconnect Matrix Hologram Recording Process.
laser beam. This pattern is then optically Fourier transformed and replicated, as discussed above. An aperture is placed just in front of the I.M.H., equal in size to a single replicated Fourier pattern, and centered over one of the replicated order locations. A holographic recording is then made of the joint Fourier transform intensity pattern. Note that only the section of the I.M.H. illuminated by the aperture is recorded. After recording one Fourier hologram in the I.M.H., the input laser is blocked, the input MOSLM is reprogrammed with a different desired pattern (either a(x, y), b(x, y) or both are changed), and the aperture is translated to an adjacent location on the I.M.H. This process is repeated throughout the plane of the I.M.H., until a complete array of reference holograms is achieved. The recording time is varied to correct for the incident intensity variation. By using a recording time that is inversely proportional to the intensity of the particular replicated order, the efficiency of the I.M.H. may be altered to cancel this variation and produce equivalent outputs for any of the recorded orders. Table 3.2 presents calculated recording times for the case of the 128 x 128 MOSLM array. Noting the dramatic fall off in intensity for the device in going from the zero order to the plus and minus fifth orders, this seems a reasonable limit of replications to attempt in a prototype MOFENAR architecture. This results in 121 different reference data patterns being stored in the I.M.H. When these data patterns have been recorded, the aperture is removed. Because the non-uniformity of the diffraction pattern may cause a problem in the reading of the hologram and in the amplification, reflection, and thresholding operations, it may be necessary that an HOE be utilized to achieve a more uniform replication. The drawback here is the reduction of light throughput, as limited by the HOE's diffraction efficiency.
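The exposure schedule just described can be summarized procedurally. The sketch below is a one-dimensional simplification along a single row of orders, and the three helper functions are stand-ins for real SLM/stage/shutter control, not an actual instrument interface.

# Sketch of the I.M.H. exposure loop: one reference pattern per replicated order,
# with exposure time inversely proportional to that order's incident intensity.
measured_uw = [47.5, 27.3, 35.5, 3.00, 7.95, 1.40, 2.00, 0.45, 0.90, 0.85, 2.30]

def program_slm(pattern_id):  print(f"SLM <- reference pattern {pattern_id}")
def move_aperture(order):     print(f"aperture -> replicated order {order}")
def expose(seconds):          print(f"expose hologram for {seconds:.2f} s")

def record_imh(n_patterns, base_exposure_s=1.0):
    for order in range(min(n_patterns, len(measured_uw))):
        program_slm(order)                                             # load a(x, y) + b(x, y)
        move_aperture(order)                                           # aperture over this order
        expose(base_exposure_s * measured_uw[0] / measured_uw[order])  # corrective time

record_imh(n_patterns=11)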
3.1.4. Phase conjugate mirrors
Another, and perhaps the most challenging, element of the proposed breadboard prototype is the pair of phase conjugate mirrors (PCMs). These devices produce a phase conjugated, opposing beam to that which is incident upon the surface of the mirror. Such a beam exactly retraces the path of the incident beam, and therefore may be described as a "time reversed" equivalent of the incident beam. It is this "time reversal" of the incident wave that makes possible the iterative optical processing of the input pattern. The architecture that has been studied over the last several years to accomplish phase conjugation is known as "four wave mixing", and is shown in Fig. 7. Interference of the incident signal beam and a plane wave pump beam 1 forms an intensity pattern in the crystal. This intensity distribution creates a charge migration in the photorefractive crystal that results in a corresponding refractive index change. Pump beam 2, also a plane wave, enters the crystal from the direction opposite to pump beam 1, and is diffracted by the index modulation. The result
Fig. 7. Phase conjugate mirror architecture.
of this diffraction is the creation of a phase conjugated beam counter-propagating along the path of the incident signal beam. If the intensity of pump beam 2 is much greater than that of the signal beam, the diffracted intensity of the phase conjugated beam may exceed that of the incident signal beam, and gain is achieved. This gain is simply a result of coupling between the incident and plane wave beams. Such amplification has been previously shown in BaTiO3. Here we present some specific points regarding the performance and characteristics of four photorefractive materials used in four wave mixing experiments.
LiNbO3 is a well studied material, and is perhaps more readily available than others. It has a small electro-optic coefficient, fast response time, and small diffraction efficiency. The response time increases for smaller grating periods, while sensitivity (gain) increases with period. Sensitivity is low due to small carrier drift and diffusion lengths.
BaTiO3 has a large electro-optic coefficient, and relatively slow response times (100 ms for approximately 100 mW/cm² intensity). Its response time decreases for smaller grating periods. Its gain increases for smaller grating periods, and for less intense signal beams. Amplified phase conjugation has been demonstrated in BaTiO3, and this material is capable of very good performance under correct conditions.
BSO has a small electro-optic coefficient, poor sensitivity, and fast response time. It is strongly optically active, and poorly understood (e.g. input-output polarization relationship, transmitted-diffracted beam interaction, polarization states).
SBN is another material that is not well documented. It has been suggested [21] that it can be made to perform comparably to BaTiO3. It has a slow response time (seconds at 1 W/cm² intensity).
In general, for all of the above materials the steady state gain saturates for large pump/input beam intensity ratios (I_pump/I_input >> 1). As this ratio gets smaller, e.g. less than 1000, the gain decreases. Two requirements of the PCM for the MOFENAR application include the ability to achieve gain between the incident and reflected phase conjugate beams, as well as the ability to perform nonlinear optical thresholding upon the incident intensity distribution. The first requirement is necessary due to attenuating effects present
elsewhere within the cavity, namely losses due to absorption and scattering. The second requirement exists to achieve the desired self correcting, adaptive capability described above. A third requirement of the PCM is that it respond at least as quickly as the input pattern cycles. This is seen by recalling from above that it is the incident beam itself which forms the grating within the PCM that creates the phase conjugate beam. Therefore every time the input changes, and thus the incident beam distribution changes, the desired grating in the PCM must be reformed before the new incident beam will be phase conjugated. Finally, an important aspect of this element of the MOFENAR architecture is the availability of photorefractive crystals. An investigation into availability led to the conclusion that there are in fact few sources available to those interested in simply purchasing crystals to experiment with. Much of the research that has been performed in this area has been done by those who have the capability to grow such devices themselves (or who are directly associated with those who can grow the devices). There are devices available, however, and one centimeter cube single crystals of both BaTiO3 and BSO can be obtained within a reasonable amount of time.

3.2. Experimental Demonstrations
There are many examples of the applications of optical pattern recognition techniques [34-52]. In this section, we present two optical pattern recognition experiments to demonstrate the feasibility of the MOFENAR. First we discuss the input/output format of the system. The input to the MOFENAR architecture consists of electrical signals from one or more detectors, which are fed to the input MOSLM. The formatting of these signals requires standard, consistent mapping onto the MOSLM for efficient associative recall to be achieved. To accomplish this input data formatting for the SLM, software may be written which processes the individual signals from each of the utilized detectors. Frame grabbers are available for the acquisition, digitization, and processing of input images and signals. Real images may also be processed by the MOFENAR architecture. Examples of real image input and multi-sensor input architecture that are feasible with the neural net are shown in Figs. 8(a) and (b). The output of the MOFENAR architecture consists of the reconstructed reference pattern which is found by the iterative optical neural network algorithm implemented in the MOFENAR system. This output may be read directly in the case of real image analysis and pattern recognition, or may be analyzed with dedicated electronic circuitry custom tailored to the specific application in which the MOFENAR is utilized. The optically generated output of the MOFENAR may also be analyzed optically, using standard optical processing architecture such as the Vander Lugt correlator or a digital optical signal processing network.
Fig. 8. (a) Real image input to the MOFENAR neural net; and (b) One or multi-sensor MOFENAR input architecture.
Fig. 9. Optical feedback SOFENAR using a volume holographic medium and a conventional mirror.
First, the experimental demonstration of an architecture for a single-sensory opto-electronic feature extraction neural associative retriever (SOFENAR) is accomplished. Then, a MOFENAR with electronic feedback and angular or spatial multiplexing is demonstrated.
3.2.1. Single-sensory opto-electronic feature extraction neural associative
retriever (SOFENAR)
First, we discuss two kinds of SOFENARs, with optical feedback and electronic feedback, respectively. The experimental demonstration for the scheme with optical feedback is presented.

a. SOFENAR with Optical Feedback
An optical feedback SOFENAR is shown in Fig. 9. In the recording (memorizing) step, the shutter is turned on, and the object beam carrying complete input information passes through a Fourier transform lens (FL) and interferes with a reference beam to write a dynamic grating in a holographic medium. The grating includes three components, |Õ|² + |r|², Õr*, and Õ*r, where Õ represents the Fourier transform of the object O, r denotes the reference beam, and the symbol "*" means phase conjugation. In the retrieval step, the shutter is turned off, and the object beam carrying incomplete input information (O') reads out the hologram to generate a part of the related reference beam (r'); the reference beam is counter-propagated by using a mirror (M) to read out the same hologram again, so that the corresponding complete object, F{Õ'Õ*|r|²Õ}, can be obtained at the output plane via a beam splitter (BS), where Õ' denotes the Fourier transform of the incomplete input object O' and F represents the Fourier transform operation. In this scheme, the holographic medium can be either a planar or a volume material. Figure 10 shows experimental results of 2-D image associative sensing using the optical feedback SOFENAR; two images are shown in (a) and (b), respectively. The left part and right part represent an incomplete input object and a completely reconstructed object, respectively. The vertical reversal of the reconstructed objects comes from the reverse imaging of the optical setup. In addition to the images shown in Fig. 10, the image of a finger print is used to demonstrate the feasibility of the system. Results are shown in Figs. 11-15.
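The two-step recall chain just described (incomplete object → reconstructed reference → complete object) can be imitated with a much-simplified scalar simulation. The arrays below are hypothetical stand-ins for the optical fields, and the peak-selection step mimics the nonlinear favoring of the strongest reconstructed reference (supplied optically or electronically in the real systems); without it, the linear read-out only approximates the stored image.

import numpy as np

rng = np.random.default_rng(0)
N = 64
stored = (rng.random((N, N)) < 0.3).astype(float)   # complete stored object O (hypothetical)
partial = stored.copy()
partial[:, N // 2:] = 0.0                            # e.g. right half blocked, as in Fig. 14

# First read-out: the partial object illuminates the O*r grating term; in the reference
# arm this appears as the cross-correlation of the partial and stored objects.
corr = np.fft.ifft2(np.fft.fft2(partial) * np.conj(np.fft.fft2(stored)))

# Keep only the dominant correlation peak (the "recognized" reference beam).
ref = np.where(np.abs(corr) >= 0.9 * np.abs(corr).max(), corr, 0.0)

# Second read-out: the counter-propagated reference reads the hologram again and
# reconstructs the complete object; with a sharply peaked reference, output ~ stored.
output = np.abs(np.fft.ifft2(np.fft.fft2(ref) * np.fft.fft2(stored)))
print(round(float(np.corrcoef(output.ravel(), stored.ravel())[0, 1]), 3))   # close to 1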
Fig. 10. Experimental results. (a) Image object 1, and (b) Image object 2. The left part and the right part are the incomplete input objects and reconstructed complete objects, respectively.
Fig. 11. The original image of a finger print stored in a PC and reproduced from a laser printer.
Fig. 12. The retrieved image of the finger print with an original input that is limited by an aperture of 1/5 of the diameter centered at the image.
Fig. 13. The retrieved image of the finger print with an original input that is limited by an aperture of 1/2 of the diameter centered at the image.
Fig. 14. The retrieved image of the finger print with an original input with its right half blocked.
Fig. 15. The retrieved image of the finger print with an original input with its left half blocked.
Fig. 16. The original image of a resolution chart stored in a PC and reproduced from a laser printer.
Fig. 17. The retrieved image of the resolution chart with an original input that is limited by an aperture of 1/5 of the diameter centered at the image.
Fig. 18. The retrieved image of the resolution chart with an original input that is limited by an aperture of 1/2 of the diameter centered at the image.
An Air Force resolution chart is used to show the resolution of the system in the retrieved image. Results are shown in Figs. 16-18. From these results, we may conclude that when the aperture diameter is 1/5 of the image diameter, only incomplete retrieval can be obtained. When the diameter is increased to 1/2, reasonable retrieval is obtainable. Also, the image retrieval process is asymmetrical in the horizontal direction due to the thick volume recording property of the photorefractive crystal.
Fig. 19. Electronic feedback SOFENAR. The holographic medium is either a thick volume or a thin planar material.
b. Electronic Feedback SOFENAR
The electronic feedback SOFENAR is shown in Fig. 19. In the memorizing step, both shutter 1 and shutter 2 are turned on, and the interference grating is recorded in a holographic medium, the object beam passing through FL1. In the sensing step, shutter 1 is turned off and shutter 2 is turned on at the beginning; the incomplete input object generates a partial reference beam incident on a detector. The electronic signal output from the detector passes through a thresholding device and a controller to reverse the states of these two shutters as long as the intensity of the reconstructed reference beam is larger than a certain value set in advance. The original reference beam then passes through shutter 1 to read out the complete object at the output plane via FL2. The SOFENARs described above can be expanded to become a MOFENAR by using angularly and/or spatially multiplexed recording with either optical and/or electronic feedback arrays. We first present the architectures of the electronic feedback MOFENAR with angular multiplexing and with spatial multiplexing, and then show an experimental feasibility demonstration.

3.2.2. Multi-sensory opto-electronic feature extraction neural associative retriever (MOFENAR)
a. Electronic feedback angularly-multiplexed MOFENAR
The architecture of the optical feedback SOFENAR (see Fig. 9) may be modified to form a MOFENAR by using angularly multiplexed reference beams in a volume
Fig. 20. Angularly multiplexed MOFENAR with an electronic feedback, a photorefractive crystal, and a Dammann grating.
holographic medium, as shown in Fig. 20. An N x M (1 x 3 in Fig. 20) Dammann grating (DG) is used to generate an N x M array of angularly multiplexed replicates of an incident plane-wave reference. Lenses L2 and L3 are used to image the rear surface of the DG onto a photorefractive crystal (PRC). The passage of each angularly multiplexed reference beam in the array can be individually controlled by using a corresponding shutter array (SA), which is located on the focal plane of L2. In the recording step, a series of inputs from a spatial light modulator (SLM) are Fourier transformed with lens L1 and recorded on the PRC, each with a different reference beam selected by controlling the SA. In the retrieval step, all elements of the SA are turned off, and an incomplete object input from the SLM is Fourier transformed and used to read out a part of the related reference beam, which is then focused by lens L5 and detected with a certain element of the detector array (DA). The electronic signal output from the detector element is then used to control the SA and SLM (after removing the incomplete input object from the SLM and turning on the corresponding element of the SA) via the threshold device and the controller. The specific reference beam passing through the SA generates a complete object out of the incomplete input object. This output is obtained at the image plane of lens L4.

b. Electronic feedback spatially-multiplexed MOFENAR
b1. Architecture
An electronic feedback spatially-multiplexed MOFENAR can be constructed by using spatially-multiplexed reference beams in a planar holographic medium, as shown in Fig. 21. In this architecture, the configuration of the reference beam is the same as that of the previous figure. A Dammann grating (DG2) is used to generate an N x M array of spatially-multiplexed replications of the Fourier transform of the input object. A shutter array (SA2) is used to individually control the propagation state (passing or blocking) of each replicated beam, and only one replicate at a time is allowed to address the holographic medium. If one input is recorded at one location
Fig. 21. Spatially-multiplexed MOFENAR with electronic feedback, a planar holographic medium, and two Dammann gratings.
of the hologram, then N x M different inputs may be recorded (e.g. N = M = 3 in Fig. 21). At the beginning of the retrieval, SA2 is turned on and SA1 is turned off. The partial reference beam generated by the related incomplete input is then detected, thresholded, and used to turn off SA2 and turn on the element of SA1 to read out the corresponding complete image.

b2. Experimental Results
The experimental set-up of a three-channel electronic-feedback MOFENAR is shown in Figs. 21 and 22. Attention is called to Fig. 22. The laser beam comes from the 514.5 nm line of a Coherent Innova 306 Ar+ laser; it passes through a half-wave plate (HP1) and is split by a polarizing beam splitter (PBS) into two beams: an ordinary beam and an extraordinary beam. The transmitted extraordinary beam is expanded, spatially-filtered, and collimated using lens L1, a pin-hole spatial filter, and lens L2. This is the object beam. A double Mach-Zehnder interferometer is used to provide three input paths for inputs A, B, and C. Three shutters SH1-SH3 can be used to select the specific input for recording and recalling. The input object beam is then Fourier transformed by lens L3 and applied to a photorefractive crystal (PRC). The reflected ordinary beam from the PBS is converted to an extraordinary beam using a half-wave plate (HP2), which then illuminates a 1 x 3 optical fan-out element (OFE) and generates three reference beams with equal intensities. The transmission of the three reference beams is controlled by three shutters (SH4-SH6). During learning, the six shutters are turned on a pair at a time (SH1 and SH4, SH2 and SH5, SH3 and SH6) to record three individual complete input objects via angular multiplexing in the PRC. At the beginning of the retrieval process, shutters SH4-SH6 are turned off, and one of the shutters SH1-SH3 (e.g. SH1) is turned on to apply an incomplete object (e.g. A) to reconstruct the corresponding reference
Fig. 22. The experimental set-up of a MOFENAR with electronic feedback. HP1-HP2: half-wave plates; M1-M7: mirrors; BS1-BS4: beam splitters; L1-L5: lenses; SH1-SH6: shutters; PBS: polarizing beam splitter; PRC: photorefractive crystal; PH: pin-hole spatial filter; OFE: optical fan-out element (1 x 3); D1-D2: detectors; A-C: inputs.
Fig. 23. The three reference beams reconstructed by three complete inputs corresponding respectively to the image of space shuttle DISCOVERY (left spot); ATLANTIS (middle spot); and COLUMBIA (right spot).
beam and, unavoidably due to cross-talk, some other reference beams as well. These reference beams are sensed by a photo-detector array D1 with a proper threshold to suppress the cross-talk terms. The correctly associated reference beam, selected after thresholding, is then fed back to turn on the corresponding shutter (e.g. SH4) to reconstruct the complete image of the object (e.g. A) through lens L5 onto detector array D2. In the experiment, the objects chosen are three different space shuttle images (i.e. DISCOVERY, ATLANTIS, and COLUMBIA). Figure 23 shows the three reconstructed reference beams when the three incomplete objects are applied.
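The electronic feedback step amounts to a simple threshold-and-select rule on the D1 readings. The sketch below illustrates that logic only; the readings are hypothetical numbers, and the channel indices (0, 1, 2 for SH4, SH5, SH6) are an illustrative mapping, not a real instrument interface.

# Threshold-and-select logic of the electronic feedback loop.
def select_channel(reference_strengths, threshold):
    """Index of the winning reference beam, or None if everything is cross-talk."""
    best = max(range(len(reference_strengths)), key=lambda k: reference_strengths[k])
    return best if reference_strengths[best] >= threshold else None

# Hypothetical D1 readings for an incomplete DISCOVERY input: channel 0 dominates,
# channels 1 and 2 carry only cross-talk.
d1_readings = [0.82, 0.17, 0.11]
channel = select_channel(d1_readings, threshold=0.5)
print(channel)   # 0 -> open the corresponding shutter (SH4) to recall the complete image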
Fig. 24. Associative retrieval experiment; input with DISCOVERY. (a) Incomplete input, (b) reconstructed reference, and (c) retrieved.
Fig. 25. Associative retrieval experiment; input with ATLANTIS. (a) Incomplete input, (b) reconstructed reference, and (c) retrieved.
Fig. 26. Associative retrieval experiment; input with COLUMBIA. (a) Incomplete input, (b) reconstructed reference, and (c) retrieved.
Figures 24-26 show three sets of associatively retrieved images. In each of the figures, (a) shows the incomplete input object; (b), the reconstructed reference beam; and (c), the retrieved complete object.

4. Conclusions
We have presented the results of a thorough theoretical and experimental investigation on the characteristics and capabilities of a novel optical neural network pattern recognition architecture. We have presented detailed analyses and experimental results and have constructed experimental systems utilizing state-of-the-art optical elements and devices which are either currently available or can be obtained in a reasonable time and manner. The major elements of this investigation include the theoretical design and explanation of the proposed system architecture and operation, a computer simulation investigation into the implementation of desired learning rules and performance of the system in various applications, and an investigation of the hardware elements required for the implementation of the breadboard system. The individual elements required for implementation of such a system are currently available. A spatial light modulator which is currently capable of demonstrating frame rates of over 1000 per second has been documented, and standard dichromated gelatin holographic technology is shown to allow the resolution necessary
for a prototype implementation such as is proposed. The phase conjugate mirrors show great promise, and experimental application of various crystals as phase conjugate mirrors is investigated and referenced. However, the body of experimental data on this new field of optics is as yet not sufficient for a definite conclusion to be drawn on the exact performance of such materials in the proposed MOFENAR system. Some important questions regarding these materials are raised in the preceding section. A mathematical model of the MOFENAR principle of operation was derived. This formalism was subsequently translated to computer code, and implemented to provide simulation results of the performance of such a system. These results were documented and presented, and they support the conclusion that the MOFENAR system does indeed offer the capability of adaptive, fault tolerant operation. Using both the angularly multiplexed and spatially multiplexed MOFENARs described above, a large number (on the order of thousands) of objects may be memorized and sensed in parallel. The designed MOFENAR with electronic feedback offers large information capacity, good discrimination capability, and high image quality. As shown above, the MOFENAR architecture offers significant potential capability to perform parallel multi-sensory recognition of input patterns or vector sets originating from multiple sensors, which measure either identical or different types of signals. This capability has direct potential applications in several fields, including parallel database search, image and signal understanding and synthesis, and robotic manipulation and locomotion, in addition to real-time multi-channel pattern recognition. In the following sections, we examine some of the possible applications that may utilize the capability of the MOFENAR neural net, and indicate how they might be investigated using the proposed prototype system.
4.1. Real Images
The most obvious application of the MOFENAR architecture is as a neural pattern recognition tool, for use in NASA and/or industrial environments. The input patterns may be directly fed to a spatial light modulator from a standard or nonstandard video camera. This input may be one of a known set of potential inputs. The set of known potential inputs may be stored in matrix form in the Interconnect Matrix Hologram. Thus, the input pattern, which may contain noise or variations, will create a closest "match" with the most similar stored reference image, and the thresholding, iterative optical processing of the MOFENAR will distinguish and recreate this reference pattern at the output plane of the neural network. This output may either be presented for direct visual confirmation, or may be used as an input to an optical post-processor, such as a standard optical correlator.
One example of such a utilization of the MOFENAR is robotic vision. The proposed neural network offers the capability of a relatively compact optical processing system to handle imperfect visual input and draw conclusions based upon its own reference library of patterns. This may allow real-time vision analysis to be performed, which could be translated to independent robotic locomotion and environmental interaction.

4.2. Related Vector Sets
The second major area in which the MOFENAR neural network might be well utilized is the processing of large and complex data sets, either from a single detector or source, or from more than one detector simultaneously. Known data patterns which are to be searched for may first be encoded in the Interconnect Matrix Hologram (IMH) of the architecture, as described in Section 4.1. The input SLM may then be fed continuous information from detectors aimed at the field(s) of interest, and the MOFENAR architecture will compare, in parallel, each input frame with all of the stored reference data patterns in the IMH. Any input data frame which contains a pattern sufficiently similar to one of those stored in the IMH will result in the ideal reconstruction of only that pattern at the output plane of the neural network. Not only does this allow direct large-scale comparison and recognition of data to be accomplished in parallel, but it also presents the capability of inferring information from the input and establishing new recognition rules to be sought and found. This is due to the fact that each input pattern will result in some output that indicates how the input data pattern compares with, and is similar or dissimilar to, the data patterns stored in the IMH. Thus one may attempt to "teach" the neural network to recognize any and all levels of data patterns present in the incoming signals, both on the level of the discrete data itself and on higher levels of the data field. Documentation of the MOFENAR output in a training mode may allow the repeated ability to infer information from the input of previously unusable or unseen data patterns. One specific application is a common problem which currently requires massive electronic computing capability: weather condition analysis. Individual sensors measuring, for example, humidity, temperature, wind velocity, and pressure at one or several locations, and representing a single time or a set of consecutive temporally separated measurements, may be translated to a spatial binary representation, relayed to the input spatial light modulator of the MOFENAR, and processed. The Interconnect Matrix Hologram in this case could consist of experimentally recorded data from the same sensors which is known to precede specific weather patterns or conditions. The output from the MOFENAR in this case may be evaluated on different levels as described above, and statistical or empirical relationships between the output weather condition vector set and the actual measured weather condition may be drawn.
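The translation of sensor readings into a "spatial binary representation" mentioned above can be prototyped in software. The thermometer-style encoding below is only one plausible choice made for illustration; the function name, ranges and number of levels are our assumptions, not part of the chapter.

    import numpy as np

    def sensors_to_binary_pattern(readings, lows, highs, levels=16):
        """Encode each scalar sensor reading as a thermometer code of `levels` bits.

        readings, lows, highs: sequences of equal length (one entry per sensor).
        Returns a (num_sensors, levels) binary array suitable for display on an SLM.
        """
        readings = np.asarray(readings, dtype=float)
        lows = np.asarray(lows, dtype=float)
        highs = np.asarray(highs, dtype=float)
        # Normalize each reading to [0, 1] within its sensor's expected range.
        frac = np.clip((readings - lows) / (highs - lows), 0.0, 1.0)
        # Thermometer code: the first round(frac * levels) cells of a row are set to 1.
        counts = np.round(frac * levels).astype(int)
        return (np.arange(levels)[None, :] < counts[:, None]).astype(np.uint8)

    # Humidity (%), temperature (C), wind speed (m/s), pressure (hPa) at one site.
    pattern = sensors_to_binary_pattern([65.0, 21.5, 4.2, 1012.0],
                                        lows=[0, -40, 0, 950],
                                        highs=[100, 50, 40, 1050])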
In summary, a few potential NASA and commercial applications are listed below:
NASA applications
(a) Planetary exploration in-situ data analysis and information screening
(b) Space surveillance, specific object identification and navigation guidance
(c) Space image understanding and classification
(d) Space station automated rendezvous and docking
(e) Space habitation and utilization environmental evaluation and assessment
(f) Navigation collision avoidance on the Moon and Mars
(g) Satellite repair, maintenance and sensing

Commercial applications
(a) Criminal fingerprint random access memory and retrieval for police
(b) Security checks at commercial building entrances
(c) Automobile plate identification
(d) Large-capacity free-space interconnection for future computers
(e) Border patrol and illegal drug traffic prevention
Acknowledgement

The authors would like to acknowledge the support of this research work by the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration. The work of Standard International Corporation and Professor Pochi Yeh of the University of California at Santa Barbara was funded under NASA-SBIR contract No. NAS7-1307.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 567-578
Eds. C. H. Chen, L. F. Pau and P. S. P. Wang
© 1998 World Scientific Publishing Company

CHAPTER 3.5

CLASSIFICATION OF HUMAN CHROMOSOMES - A STUDY OF CORRELATED BEHAVIOR IN MAJORITY VOTE
LOUISA LAM
Hong Kong Institute of Education, Northcote Campus, 21 Sassoon Road, Hong Kong
E-mail: llam@nc.ied.edu.hk
and
CHING Y. SUEN
Centre for Pattern Recognition and Machine Intelligence, Concordia University, Montreal, Quebec H3G 1M8, Canada

Methods for combining multiple classifiers have been developed for improved performance in pattern recognition. This paper examines nine correlated classifiers from the perspective of majority voting. It demonstrates that relationships between the classifiers can be observed from the voting results, that the error reduction ability of a combination varies inversely with the correlation among the classifiers to be combined, and that the correlation coefficient is an effective measure for selecting a subset of classifiers for combination to achieve the best results.

Keywords: Combination of Classifiers; Majority Vote; Classification of Chromosomes.
1. Introduction

In pattern recognition, there has been a recent trend towards combining the decisions of several classifiers in order to arrive at improved recognition results. This trend arises from a number of reasons, among which are the demands for highly reliable performance imposed by real-life applications, and the difficulty for a single algorithm to achieve such performance. One area in which combinations of classifiers have been widely applied is that of word and character recognition, in which many different combination methods have been developed and studied. For example, when each of the classifiers outputs a single class, the decisions can be combined by majority vote [1,2]. If the individual classifiers output ranked lists of decisions, these rankings can be used to derive combined decisions by the highest rank, Borda count, logistic regression and other methods [3,4]. Further developments in obtaining a combined decision include statistical approaches [5,6], formulations based on Bayesian and Dempster-Shafer theories of evidence [7,5], the use of neural networks [8] and fuzzy theory [9]. Some very recent theoretical studies and evaluations of classifier combinations
are presented in [10,11], while comparisons of results from different combination methods can be found in [12-14]. In general, the methods applicable to combining multiple classifier decisions depend on the types of information produced by the individual classifiers. For abstract-level classifiers that output only a unique label or class for each input pattern, suitable combination methods may be a simple or weighted majority vote, the Bayesian formulation, or the Behavior-Knowledge Space (BKS) method [6]. Among these methods, the simple majority vote is the only one that does not require prior knowledge of the behavior of the classifiers. For the other methods, this knowledge should be based on the results obtained from a large set of representative data, and the appropriate size of the database increases sharply with the number of pattern classes considered [15]. For example, to combine k classifiers in an n-class problem, a Bayesian formulation would require the estimation of the n² entries of the confusion matrix for each classifier, while the BKS method would need the determination of O(n^(k+1)) such entries. Consequently, to model these behaviors on a problem such as the classification of human chromosomes (a 24-class problem), or to use neural networks to combine the classification results, would demand huge databases. This being the case, the simple majority vote is a reasonable method to use for combining the decisions of chromosome classifiers. In this work, we study the results of combining nine such classifiers. The perspective will be different from that of previous publications [16,17], in which the majority vote problem has been theoretically analyzed based generally on the assumption of independence of opinions. While independence cannot generally be assumed in practice, it is a closer approximation when the algorithms are developed separately, using different features and classification methods. This happened in some previous applications of majority vote to OCR [1,14]. In the present context, however, all the classifiers in the study make use of the same extracted features, and they differ only in the classification phase. This creates a dependency in their decisions which can be observed in the combined results. We will first examine the effects of this dependence on the combined results, after which some aspects of correlated voting will be studied. Based on a measure of correlation, we will establish a criterion for selecting a subset of classifiers that would be effective in combination.
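As a concrete illustration of the simple (unweighted) majority vote used throughout this study, the short routine below combines abstract-level decisions and rejects a pattern when no class obtains more than half of the votes. It is a generic sketch written for this edition, not the code used in the experiments.

    from collections import Counter

    def majority_vote(labels):
        """Combine one label per classifier; return the winning class or None (rejection)."""
        counts = Counter(labels)
        label, votes = counts.most_common(1)[0]
        # A class must receive more than half of the votes, otherwise the pattern is rejected.
        return label if votes > len(labels) / 2 else None

    # Three of five classifiers agree, so the combined decision is class 7.
    print(majority_vote([7, 7, 3, 7, 12]))   # -> 7
    # No class reaches a majority, so the pattern is rejected.
    print(majority_vote([1, 2, 1, 2]))       # -> None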
2. Classification of Human Chromosomes

Many different techniques have been used in the automatic classification of human chromosomes into 24 classes (1-22, X, Y). Researchers have reported their results on some standard databases, among which are the Copenhagen, Edinburgh, and Philadelphia datasets. In addition, feature sets have been extracted from these databases and made available by Dr. Jim Piper at Vysis, Inc. (formerly Imagenetics) in Naperville, Illinois. Different classification methods have been used by
various researchers on these feature sets. These databases and features are summarized below.
2.1. Databases

The first database was collected at Rigshospitalet, Copenhagen in 1976-1978 and consists of 180 peripheral blood cells. This database contains the "cleanest" images. The second database was obtained at the Medical Research Council of Edinburgh in 1984, and it contains 125 male peripheral blood cells. The third database was obtained at the Jefferson Medical College, Philadelphia in 1987 and contains 130 chorionic villus cells. This last database is widely recognized as being difficult to analyze. Due to the differences in quality of the datasets, it was found that classifiers should be trained and tested within each database [18]. For this reason, each database is divided into sets A and B, so that each classifier can be trained on one set and tested on the other, after which the roles of the two sets are reversed. The numbers of chromosomes contained in the databases are shown in Table 1. In this paper, test results from both sets are used together for experimentation.

Table 1. Summary of databases.
           Copenhagen    Edinburgh    Philadelphia
A             3416          2617          2947
B             4690          2931          3000
Total         8106          5548          5947
For each chromosome, a 30-dimensional feature vector is obtained [18]. The features include normalized area, size, density, convex hull perimeter and length, weighted density distributions, global shape features, as well as centromeric indices.

2.2. Classifiers and Results

Classification results are provided by the following nine methods:
C1: Constrained classification by the transportation method applied to negative log-likelihoods, using 16 out of the 30 features.
C2: Same as C1, but all 30 features are used.
C3: Constrained classification by the transportation method applied to Mahalanobis distance, using 16 out of the 30 features.
C4: Same as C3, but all 30 features are used.
C5: Constrained classification by the transportation method applied to the logarithm of the Mahalanobis distance after weighting off-diagonal covariance terms by 0.8, using 16 out of the 30 features.
C6: Same as C5, but all 30 features are used.
C7: A maximum likelihood classifier applied independently to each chromosome.
Table 2. Error rates of individual chromosome classifiers.

Classifier    Copenhagen    Edinburgh    Philadelphia
C1               3.36         14.85          19.00
C2               4.89         14.62          23.68
C3               2.65         13.82          17.79
C4               3.97         14.11          21.89
C5               2.05         11.88          14.53
C6               2.04         11.19          14.71
C7               4.80         16.93          23.09
C8               2.62         12.44          16.53
C9               3.92         16.76          22.84
C8: Rearrangement classifier.
C9: A probabilistic neural network.
Of these methods, C1-C6 are described in [19-21], discussions of C7 and C8 can be found in [22] and [23] respectively, and C9 is presented in [24]. Each of the above classifiers outputs a class label for every chromosome without rejections, and their error rates for each database (sets A and B combined) are shown in Table 2. These results clearly reflect the differences in quality between the databases, and they also show that the Copenhagen database contains images that are probably cleaner than can reasonably be obtained under normal circumstances.

3. Majority Vote Results

As stated in Section 1, the simple majority vote is an effective method for combining the decisions of the classifiers in this problem. In particular, it is of interest to note that relationships between the classifiers can be deduced from the majority vote results. The pattern of majority votes has been studied in [17], where the following results have been established:
1. Combinations of even numbers of classifiers tend to produce both lower recognition and error rates than combinations of odd numbers. Adding one classifier to an even number would increase both the recognition and error rates, while adding one to an odd number would decrease both rates. These conclusions hold regardless of the performances of the individual classifiers and whether they are independent or not.
2. Assuming that classifiers make independent decisions, adding two classifiers to an even number tends to increase the recognition rate, while the effect on the error rate would depend on the individual performances and cannot be determined a priori. On the other hand, adding two classifiers to an odd number tends to reduce the error rate, while the effect on the recognition rate depends on the individual results.
When multiple classifiers are used in pattern recognition, the different classifiers consider the same pattern images, and hence independence of decisions cannot be assumed. In the present context, the classifiers make use of the same feature sets, and are even more strongly dependent than usual. However, the combination results do show the tendencies established in (1) and (2) above. This is illustrated in Fig. 1, which contains the classification results from combining 2 to 6 (out of 7) classifiers on the Copenhagen database. Only combinations among 7 classifiers are shown in this figure, because 2 pairs of classifiers have been found to be highly correlated (as described below), and therefore one classifier has been selected from each such pair. Besides, combining more classifiers would produce a denser distribution of points, which makes the pattern less clear. From Fig. 1, it can be seen that the addition of one classifier creates the zigzag effect stated in (1) above. The behavior of (2) is discernible in that the addition of two classifiers to an even number does cause a general increase in the recognition rate, while such an addition to an odd number produces a movement to the left caused by a lower error rate. This behavior is less prominent, in part due to the dependence among the decisions. To clarify the behavior of the majority voting results in Fig. 1, the average rates and ranges of values are shown in Table 3. These values show that adding 2 classifiers to an even number increases both the correct and error rates, while adding 2 classifiers to an odd number mainly decreases the error rate.
[Figure 1: scatter plot of recognition rate (vertical axis) versus error rate (horizontal axis) for majority-vote combinations of 2, 3, 4, 5 and 6 classifiers on the Copenhagen database.]

Fig. 1. Results from combinations of classifiers on Copenhagen database.
Table 3. Range of values for combinations on Copenhagen database.

Number of classifiers   Average       Range of          Average      Range of
in combination          recog. rate   recog. rates      error rate   error rates
2                       95.36         [93.45, 97.25]    1.06         [0.69, 1.58]
3                       97.52         [96.68, 98.04]    1.86         [1.57, 2.22]
4                       96.89         [96.05, 97.43]    1.19         [0.96, 1.44]
5                       97.67         [97.41, 97.94]    1.65         [1.42, 1.80]
6                       97.26         [97.10, 97.46]    1.22         [1.12, 1.37]
It is noted that combinations of classifiers in this problem are not as effective in reducing errors as in a previous application to OCR [14]. Since this is postulated to be a consequence of correlated behavior, the relation between correlation and error reduction is examined in the next section.
3.1. Error Reduction and Correlation

Majority voting results can be effective indicators of correlations between pairs of classifiers. For example, Table 4 contains the error rates obtained from the majority (also unanimous) decisions of all pairs of classifiers on the Edinburgh database. (In the rest of this paper, results are usually shown for one database when the same patterns of behavior are observed in all three databases.) In Table 4, the high error rate resulting from the combination of C1 and C3 (or C2 and C4), when compared to the other entries, shows that these classifiers are highly correlated. Since these four classifiers are actually based on two classifiers using 16 and 30 features respectively, it implies that the transportation method, whether applied to negative log-likelihoods or Mahalanobis distance, tends to make the same misclassifications. The results also indicate that the probabilistic neural network makes different erroneous classifications from all the other methods, and is therefore very effective for reducing the error rates in combination. This is found to be true for combinations of any number of classifiers; in other words, among all combinations composed of the same number of classifiers, C9 is almost always present in the combination that produces the lowest error rate. In order to quantify the reduction in error rates, we define the error reduction ratio E_r in the following way. Suppose E_ave is the average error rate of two classifiers, and E_c is the error rate of their combination. Then E_r = (E_ave - E_c)/E_ave becomes a "normalized" measure of the effectiveness of the majority vote in error reduction, and E_r tends to vary inversely with the value of E_c. This can be seen by comparing the values of E_c shown in Table 4 with those of E_r in Table 5. From Table 5, the negative effect of combining C1 and C3 (or C2 and C4) is similarly evident, as opposed to the positive effect of combining C9 with any other classifier. Furthermore, it is found that for the three databases, the average E_r (averaged over all different pairs of classifiers) takes on values 0.6548, 0.4917,
Table 4. Error rates from pairs of classifiers on Edinburgh database.

        C2      C3      C4      C5      C6      C7      C8      C9
C1     7.71   12.02    7.66    8.22    6.72    6.47    6.74    5.21
C2             7.43   12.11    6.11    7.30    7.14    7.48    4.87
C3                     7.35    8.98    7.16    6.52    7.12    5.25
C4                             6.27    7.59    6.92    7.70    4.85
C5                                     7.44    6.00    7.34    5.68
C6                                             6.90    9.52    5.32
C7                                                     7.39    5.23
C8                                                             5.43
Table 5. Values of error reduction ratio E_r.

        C2      C3      C4      C5      C6      C7      C8      C9
C1     0.48    0.16    0.47    0.38    0.48    0.59    0.51    0.67
C2             0.48    0.16    0.54    0.43    0.55    0.45    0.69
C3                     0.47    0.30    0.43    0.58    0.46    0.66
C4                             0.52    0.40    0.55    0.42    0.69
C5                                     0.35    0.58    0.40    0.60
C6                                             0.51    0.19    0.62
C7                                                     0.50    0.69
C8                                                             0.63
and 0.6233 respectively, whereas the average for the OCR algorithms cited in [14] is 0.9261. This supports the conclusion that the majority vote is not as effective in reducing misclassifications in the current problem. Another way of testing the validity of E_r as a measure of correlation is by comparison with the correlation coefficient ρ between two classifiers. Suppose these classifiers are C_i and C_j with probabilities p_i and p_j of being correct, r_ij is the probability that they are both correct, and σ_i² is the variance of C_i (so that σ_i = (p_i(1 − p_i))^(1/2)). Then

    ρ_ij = (r_ij − p_i p_j) / (σ_i σ_j)

is the coefficient of correlation between C_i and C_j. It is expected that classifiers having high ρ_ij would have low E_r, and vice versa, which is the case for the databases and classifiers studied here. For example, a plot of ρ versus E_r for the Edinburgh database in Fig. 2 shows a strong linear relationship between the two quantities (with coefficient −0.9944). For the Copenhagen and Philadelphia databases, the coefficients are −0.9635 and −0.9863 respectively. For more than two classifiers, the ρ in the above discussion can be replaced by the average correlation coefficient ρ_ave, taken over all pairs of classifiers in the combination.
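Both quantities are easy to obtain from per-pattern correctness records. The helper below is a sketch written for this edition (the variable names are ours) and simply follows the definitions of E_r and ρ_ij given above.

    import numpy as np

    def error_reduction_and_correlation(correct_i, correct_j, combined_error):
        """correct_i, correct_j: boolean arrays, True where each classifier is right.
        combined_error: error rate E_c of the pair's combined decision."""
        ci = np.asarray(correct_i, dtype=float)
        cj = np.asarray(correct_j, dtype=float)
        e_ave = 0.5 * ((1.0 - ci.mean()) + (1.0 - cj.mean()))
        e_r = (e_ave - combined_error) / e_ave              # error reduction ratio
        p_i, p_j = ci.mean(), cj.mean()
        r_ij = (ci * cj).mean()                             # probability both are correct
        sigma_i = np.sqrt(p_i * (1.0 - p_i))
        sigma_j = np.sqrt(p_j * (1.0 - p_j))
        rho_ij = (r_ij - p_i * p_j) / (sigma_i * sigma_j)   # correlation coefficient
        return e_r, rho_ij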
[Figure 2: scatter plot of the correlation coefficient ρ against the error reduction ratio E_r for all pairs of classifiers on the Edinburgh database, showing a strong negative linear relationship.]
Fig. 2. A plot of correlation coefficient versus error reduction ratio.
In the next section, we study the relationship between the combined performance of classifiers and the value of ρ_ave, from which we establish a criterion for selecting a subset of classifiers whose combined decision would be most accurate.

4. Behavior of Correlated Voting

One of the most cited theorems about voting is Condorcet's jury theorem (CJT), which shows that, under suitable conditions, a majority of a group is more likely to choose the "better" of two alternatives than any one member of the group. Among the conditions are independence of voting and equal competence among the group members, which prove to be severe limitations in real situations. An interesting generalization of this theorem is presented in [25]. Translated into the present context, this recent publication shows the following: Suppose a group of n (possibly correlated) classifiers has average correct recognition rate p_ave > 0.5 and combined correct rate P_n. Then

    P_n > p_ave   if   r_ave ≤ p_ave − [n / ((n − 1) p_ave)] (p_ave − 0.25)(1 − p_ave).      (3)
Since r_ave is also a measure of independent voting, and requiring r_ave to be small is asking the same of the average correlation coefficient ρ_ave, the above theorem in fact stipulates that if the classifiers are not highly correlated, then a majority vote of the classifiers will produce a higher correct rate than the average.
The above condition, while sufficient, is far from necessary. It has been found that, for the three databases and all combinations of classifiers, the hypothesis is never satisfied even though the conclusion is true in 2/3 of all cases. Almost all of the exceptions (when P_n ≤ p_ave) occur in combinations of 2, 4, and 6 (i.e. even numbers of) classifiers, when the combined recognition rate is expected to be lower [17]. Obviously, the hypothesis is far more restrictive than needed, in the sense that even when the classifiers are highly correlated, P_n can still be greater than p_ave.
5. Experimental Results

Given that P_n > p_ave in most cases, the selection of an effective subset of classifiers for combination cannot depend on the average correct rate. Consequently, we wish to consider the validity of making a selection based on the average correlation coefficient ρ_ave between the classifiers, and to verify this criterion by experimentation. In order to do this, we examine the performances of all combination pairs containing the same number n of classifiers, where n = 2, 3, ..., 8. Suppose S1 and S2 are two such combinations. They are compared provided that:

    (i)  |P_ave^1 − P_ave^2| < 0.001, and
    (ii) |ρ_ave^1 − ρ_ave^2| > 0.02,
where P_ave^s is the average correct rate of the classifiers in S_s, and ρ_ave^s is the average correlation coefficient between these classifiers. Condition (i) is imposed so that the comparison can be based on factors other than different levels of performance, and condition (ii) establishes a threshold for significant differences. Under these conditions, and supposing that ρ_ave^1 > ρ_ave^2, we wish to determine whether S2 produces better performance than S1, where better may mean one of the following cases:
(a) P2 > P1 and E2 < E1, where P_s and E_s are respectively the correct and error rates of the combination S_s. This is a clear case of superior performance by S2.
(b) Condition (a) does not hold but R2 > R1, where R_s is the reliability or accuracy of S_s, defined by R_s = P_s / (P_s + E_s); in other words, rejections (created by lack of a majority decision) are ignored in the determination of reliability.
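A brute-force version of this selection procedure can be written in a few lines. The sketch below is our own illustration: it assumes the per-pattern decisions and correctness records of all classifiers are available as matrices, and it simply enumerates subsets, which is feasible for nine classifiers.

    from itertools import combinations
    import numpy as np

    def reliability(decisions, truth):
        """Majority-vote reliability R = P / (P + E) over a set of classifiers.
        decisions: (num_classifiers, num_patterns) array of predicted labels.
        truth:     (num_patterns,) array of true labels."""
        correct = errors = 0
        for col, target in zip(decisions.T, truth):
            labels, counts = np.unique(col, return_counts=True)
            winner, votes = labels[np.argmax(counts)], counts.max()
            if votes > len(col) / 2:          # otherwise: rejection, ignored here
                correct += int(winner == target)
                errors += int(winner != target)
        return correct / max(correct + errors, 1)

    def average_rho(correct_matrix, subset):
        """Mean pairwise correlation coefficient over the chosen subset of classifiers."""
        rhos = []
        for i, j in combinations(subset, 2):
            ci, cj = correct_matrix[i], correct_matrix[j]
            p_i, p_j, r_ij = ci.mean(), cj.mean(), (ci * cj).mean()
            rhos.append((r_ij - p_i * p_j) /
                        np.sqrt(p_i * (1 - p_i) * p_j * (1 - p_j)))
        return float(np.mean(rhos))

    # Among all subsets of a fixed size, prefer the one with the smallest average rho:
    # best = min(combinations(range(9), 5), key=lambda s: average_rho(correct_matrix, s))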
These will be referred to as Cases (a) and (b) respectively. Case (c) occurs when
R1 > R2, or a lower ρ_ave coincides with a lower reliability, which is the opposite of our hypothesis. The statistics of all comparisons made on the 3 databases are shown in Table 6. From the results of Table 6, the following observations can be made:
1. The number of comparisons that can be made depends on the correct rates, which is a consequence of making comparisons only when the average correct rates are commensurate, since this occurs more often when the individual correct rates are closer to each other. This is reflected in the databases, where the qualities of
the data also create different ranges for the correct rates, in the increasing order of the Copenhagen, Edinburgh, and Philadelphia databases.
2. When only two classifiers are combined, ρ_ave is their correlation coefficient, and the correct and error rates both increase or decrease with it in general. Consequently, Case (a) should not happen, and this has been found to be true.
3. The low incidence of Case (c) clearly supports our hypothesis. This shows the effectiveness of the average correlation coefficient as an indicator of reliable performance in the majority vote when the votes may be correlated, and it provides a valid basis for selecting a subset of classifiers for optimal combination.

Table 6. Results of all comparisons on 3 databases.

(a) Copenhagen database
                  Number of classifiers in combination
                2     3     4     5     6     7     8
Case (a)        0   189   465   623   312    58     3
Case (b)       43    47   181    48    43     1     0
Case (c)        6    40    40    63     8     3     0
Total # pairs  49   276   686   734   363    62     3

(b) Edinburgh database

                  Number of classifiers in combination
                2     3     4     5     6     7     8
Case (a)        0   145   308   405   178    29     0
Case (b)       17     9    88    10    10     0     0
Case (c)        0     6    14     4     1     0     0
Total # pairs  17   160   410   419   189    29     0

(c) Philadelphia database

                  Number of classifiers in combination
                2     3     4     5     6     7     8
Case (a)        0    65   146   227   100    15     1
Case (b)        9     8    63    12     8     0     0
Case (c)        1     1     3     6     0     0     0
Total # pairs  10    74   212   245   108    15     1
6. Concluding Summary
In this paper, we have studied the majority vote patterns of 9 chromosome classifiers and compared the results to established theoretical ones. It is found that the theoretical behaviors not based on independence are reflected in these voting results, whereas those based on the independence assumption are less clearly observed. The
dependence between these classifiers is partly a result of using the same feature sets, but some classifiers are also more highly correlated than others. From the voting results, we can identify pairs of classifiers that are highly correlated (with resulting lower error reduction capability in combination), as well as the classifier that is most effective in contributing to error reduction. Further studies show the strong linear relationship between the average correlation coefficient of classifiers and the error reduction capability of their combination. The extensive experimental results indicate that this coefficient is an effective measure of reliability for combinations of any number of classifiers.
Acknowledgements

The authors are deeply grateful to Drs. J. Piper, I. Mitterreiter and W. P. Sweeney for taking considerable trouble to provide their classification results for this study. This research was supported by the Natural Sciences and Engineering Research Council of Canada, the National Networks of Centres of Excellence program of Canada, and the FCAR program of the Ministry of Education of the province of Quebec. The first author is also supported by a research grant from the Hong Kong Institute of Education.
References
[1] C. Y. Suen, C. Nadal, R. Legault, T. A. Mai and L. Lam, Computer recognition of unconstrained handwritten numerals, Proc. IEEE 80 (1992) 1162-1180.
[2] L. Xu, A. Krzyzak and C. Y. Suen, Methods of combining multiple classifiers and their application to handwritten numeral recognition, IEEE Trans. Syst. Man Cybern. 22 (1992) 418-435.
[3] T. K. Ho, J. J. Hull and S. N. Srihari, Decision combination in multiple classifier systems, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) 66-75.
[4] S. Yamaguchi, K. Nagata, T. Tsutsumida, F. Kawamata and T. Wakahara, The third IPTP character recognition competition and study on multi-expert system for handwritten Kanji recognition, Fifth Int. Workshop on Frontiers in Handwriting Recognition, Colchester, UK, Sept. 1996, 479-482, World Scientific, 1997.
[5] J. Franke and E. Mandler, A comparison of two approaches for combining the votes of cooperating classifiers, Proc. 11th Int. Conf. Pattern Recognition, The Hague, Netherlands, Sept. 1992, Vol. 2, 611-641.
[6] Y. S. Huang and C. Y. Suen, Combination of multiple experts for the recognition of unconstrained handwritten numerals, IEEE Trans. Pattern Anal. Mach. Intell. 17 (1995) 90-94.
[7] E. Mandler and J. Schuermann, Combining the classification results of independent classifiers based on the Dempster/Shafer theory of evidence, in Pattern Recognition and Artificial Intelligence, E. S. Gelsema and L. N. Kanal (eds.) (North Holland, Amsterdam, 1988) 381-393.
[8] D. S. Lee and S. N. Srihari, Handprinted digit recognition: A comparison of algorithms, Pre-Proc. 3rd Int. Workshop on Frontiers in Handwriting Recognition, Buffalo, USA, May 1993, 153-162.
[9] F. Yamaoka, Y. Lu, A. Shaout and M. Shridhar, Fuzzy integration of classification results in a handwritten digit recognition system, Proc. 4th Int. Workshop on Frontiers in Handwriting Recognition, Taipei, Taiwan, Dec. 1994, 255-264.
[10] J. Kittler, Improving recognition rates by classifier combination, Fifth Int. Workshop on Frontiers in Handwriting Recognition, Colchester, UK, Sept. 1996, 81-101.
[11] G. Pirlo, G. Dimauro, S. Impedovo and S. Rizzo, Multiple experts: a new methodology for the evaluation of the combination processes, Fifth Int. Workshop on Frontiers in Handwriting Recognition, Colchester, UK, Sept. 1996, 131-136.
[12] N. Gorski, Practical combinations of multiple classifiers, Fifth Int. Workshop on Frontiers in Handwriting Recognition, Colchester, UK, Sept. 1996, 115-118.
[13] T. Tsutsumida, F. Kimura, S. Yamaguchi, K. Nagata and A. Iwata, Study on multi-expert system for handprinted numeral recognition, Fifth Int. Workshop on Frontiers in Handwriting Recognition, Colchester, UK, Sept. 1996, 119-124.
[14] L. Lam and C. Y. Suen, Optimal combinations of pattern classifiers, Pattern Recogn. Lett. 16 (1994) 945-954.
[15] L. Lam, Y.-S. Huang and C. Y. Suen, Combination of multiple classifier decisions for optical character recognition, in Handbook on Character Recognition and Digital Image Analysis, H. Bunke and P. S. P. Wang (eds.), World Scientific (to appear).
[16] L. Lam and C. Y. Suen, A theoretical analysis of the application of majority voting to pattern recognition, Proc. 12th Int. Conf. on Pattern Recognition, Jerusalem, Israel, Oct. 1994, 418-420.
[17] L. Lam and C. Y. Suen, Increasing experts for majority vote in OCR: theoretical considerations and strategies, Proc. 4th Int. Workshop on Frontiers in Handwriting Recognition, Taipei, Taiwan, Dec. 1994, 245-254.
[18] J. Piper and E. Granum, On fully automatic feature measurement for banded chromosome classification, Cytometry 10 (1989) 242-255.
[19] P. Kleinschmidt, I. Mitterreiter and J. Piper, Improved chromosome classification using monotonic functions of Mahalanobis distance and the transportation method, Mathematical Methods of Operations Research 40 (1994) 305-323.
[20] P. Kleinschmidt, I. Mitterreiter and C. Rank, A hybrid method for automatic chromosome karyotyping, Pattern Recogn. Lett. 15 (1994) 87-96.
[21] M. Tso, P. Kleinschmidt, I. Mitterreiter and J. Graham, An efficient transportation algorithm for automatic chromosome karyotyping, Pattern Recogn. Lett. 12 (1991) 117-126.
[22] J. Piper, The effects of zero feature correlation assumption on maximum likelihood based classification of chromosomes, Signal Processing 12 (1987) 49-57.
[23] J. Piper, Classification of chromosomes constrained by expected class size, Pattern Recogn. Lett. 4 (1986) 391-395.
[24] W. P. Sweeney, Jr., M. T. Musavi and J. N. Guidi, Classification of chromosomes using a probabilistic neural network, Cytometry 16 (1994) 17-24.
[25] K. K. Ladha, The Condorcet jury theorem, free speech, and correlated votes, Amer. J. Polit. Sci. 36 (1992) 617-634.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 579-612
Eds. C. H. Chen, L. F. Pau and P. S. P. Wang
© 1998 World Scientific Publishing Company

CHAPTER 3.6

DOCUMENT ANALYSIS AND RECOGNITION BY COMPUTERS
YUAN Y. TANG*, M. CHERIET†, JIMING LIU*, J. N. SAID* and CHING Y. SUEN†
*Department of Computing Studies, Hong Kong Baptist University, 224 Waterloo Road, Kowloon Tong, Hong Kong
†Centre for Pattern Recognition and Machine Intelligence, Concordia University, 1455 de Maisonneuve Blvd. West, Montreal, Quebec H3G 1M8, Canada

Surveys of the basic concepts and underlying techniques are presented in this chapter. A basic model for document processing is described. In this model, document processing can be divided into two phases: document analysis and document understanding. A document has two structures: the geometric (layout) structure and the logical structure. Extraction of the geometric structure from a document refers to document analysis; mapping the geometric structure into the logical structure deals with document understanding. Both types of document structures and the two areas of document processing are discussed in this chapter. Two categories of methods have been used in document analysis, namely, (1) hierarchical methods, including top-down and bottom-up approaches, and (2) non-hierarchical methods, including modified fractal signatures. Tree transform, formatting knowledge and description language approaches have been used in document understanding. All the above approaches are presented in this chapter. A particular case, form document processing, is discussed. Form description and form registration approaches are presented. A form processing system is also introduced. Finally, many techniques, such as skew detection, Hough transform, Gabor filters, projection, crossing counts, form definition language, etc., which have been used in these approaches are discussed in this chapter.

Keywords: Document processing, document analysis and understanding, geometric and logical structures, hierarchical and non-hierarchical methods, tree transform, formatting knowledge, description languages, texture analysis.
1. Introduction
Documents contain knowledge. Precisely, they are the medium for transferring knowledge. In fact, much knowledge is acquired from documents such as technical reports, government files, newspapers, books, journals, magazines, letters and bank cheques, to name a few. The acquisition of knowledge from such documents by an information system can involve an extensive amount of hand-crafting. Such hand-crafting is time-consuming and can severely limit the application of information systems. Actually, it is a bottleneck of information systems. Thus, automatic knowledge acquisition from documents has become an important subject. Since
the 1960's, much research on document processing has been done based on Optical Character Recognition (OCR) [71,5]. Some OCR machines which are used in specific domains have appeared in the commercial market. Surveys of the underlying techniques have been made by several researchers [89,70,18,66,29,40,69,83,33]. The study of automatic text segmentation and discrimination started about two decades ago [71,49]. With the rapid development of modern computers and the increasing need to acquire large volumes of data, automatic text segmentation and discrimination have been widely studied since the early 1980's [1,98,105]. To date, a lot of methods have been proposed, and many document processing systems have been described [36,75,17,24,72,100,10]. About 500 papers have been presented in the International Conferences on Document Analysis and Recognition ICDAR'91, ICDAR'93 and ICDAR'95 [37-39], and nine articles in the special issue of the journal Machine Vision and Applications [50] were concerned with document analysis and understanding, where a lot of papers deal with new achievements of the research in these areas, such as [11,13,42,52,54,59,85,106,48,97]. Several books which deal with these topics have been published [9,86,79]. What is document processing? Different definitions have caused a bit of confusion. In this chapter, the definition is chosen from a basic document processing model proposed by [90,95,93], and its principal ideas will be seen throughout the entire chapter. According to this model, this chapter is organized into the following sections:
(b) Geometric Structure (c) Logical Structure (iii) Document Analysis (a) Hierarchical Methods Top-down Approach Bottom-up Approach (b) No-hierarchical Methods (iv) Document Understanding (a) Tree Transform Approach (b) Formatting Knowledge Approach (c) Description Language Approach
(v) Form Document Processing (a) (b) (c) (d)
Characteristics of Form Documents Form Description Language Approach Form Registration Approach A Form Document Processing System
3.6 Document Analysis and Recognition by Computers 581 (vi) Major Techniques Hough Transform, Skew Detection, Projection Profile Cuts, Run-Length Smoothing Algorithm (RLSA), Neighborhood Line Density (NLD), Connected Components Algorithm, Crossing Counting, Form Description Language (FDL), Texture Analysis, Local Approach, Other Segmentation Techniques. 2. A Basic Model for Document Processing A basic model for processing the concrete document was first proposed in our early work, which was presented at the First International Conference on Document Analysis and Recognition [95],and also appeared in the Handbook of Pattern Recognition and Computer Vzsion [94]. A graphic illustration can be found in Fig. 1, where the relationships among the geometric structure, logical structure, document analysis and document understanding are depicted.
Fig. 1. Basic Model for Document Processing.
The following principal concepts .were proposed in this model: A concrete document is considered to have two structures: the geometric (or layout) structure and the logical structure. Document processing is divided into two phases: document analysis and document
understanding. Extraction of the geometric structure from a document is defined as document analysis; mapping the geometric structure into a logical structure is defined as
Y. Y. Tang e t al.
582
0
document understanding. Once the logical structure has been captured, its meaning can be decoded by A1 or other techniques. But in some cases, the boundary between the two phases just described is not clear. For example, the logical structure of bank cheques may also be found during an analysis by knowledge rules. The basic model of document processing can be formally described below [94]:
Definition 1. A document R is specified by a quintuple
such that 3 = {Ol,O2,. . . ,Oi,. . . , O m } ,
where
and
where 0 0 0
0
0 0
3 is a finite set of document objects which are sets of blocks 0 2 ( i = 1 , 2 , . . . , m). {Of}* denotes repeated sub-division. a is a finite set of linking factors. cpl and cpr stand for leading linking and repetition linking respectively. S is a finite set of logical linking functions which indicate logical linking of the document objects. CY is a finite set of heading objects. p is a finite set of ending objects.
Definition 2. Document processing is a process t o construct the quintuple represented by Eqs. (1)-(3). Document analysis refers t o extracting elements 3, O i and Of in Eq. (2), i.e., extraction of the geometric structure of 0. Document understanding deals with finding a, 6 , a, and p in Eq. (3), considering the logical structure of R.
3.6 Document Analysis and Recognition by Computers 583
Fig. 2. A simple example of document processing described by the basic model.
Example - A simple example is illustrated in Fig. 2, we have
cs = { 0 1 , 0 2 , 0 3 , 0 4 , 0 5 } 04=
{04}*= {Of, @}
0 5 =
{0j5}* = {0;,0;,0;}
a = {01,02}
p
=
{04, 05)
From the above definition, it is obvious that there is a nondeterministic mapping from the geometric structure into the logical structure. However, as the geometric structure is extracted, a deterministic mapping can be achieved. It can be formally described below:
Theorem 1. Let R be a document defined by a quintuple (3,@, 6, a , p) having nondeterministic mapping from geometric structure into logical structure, then there exists a quintuple (V, a’,d’, a’, p’) ’which contains a deterministic mapping from the geometric structure of R into a logical structure. 3. Document Structures The key concept in document processing is that of structure. Document structure is the division and repeated subdivision of the content of a document into
increasingly smaller parts, which are called objects. An object which cannot be subdivided into smaller objects is called a basic object. All other objects are called composite objects. Structure can be realized as a geometric (layout) structure in terms of its geometric characteristics, or as a logical structure due to its semantic properties.
3.1. Strength of Structure

To measure a document structure, a Strength of Structure S_s has been introduced [104].
Definition 3. Suppose a document is divided into n objects associated with n variables. H_i stands for the partial entropy of the i-th variable, and H for the entropy of the whole document. The strength of structure is

    S_s = Σ_{i=1}^{n} H_i − H.                                               (4)

For instance, if the entire document consists of four composite objects associated with the variables x_1-x_4, the strength will be

    S_s = Σ_{i=1}^{4} H_i − H.
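For discrete variables, S_s can be computed directly from sample data. The following small sketch is our own illustration, assuming each object is described by a discrete variable observed over the same set of samples.

    import numpy as np
    from collections import Counter

    def entropy(samples):
        """Shannon entropy (bits) of a sequence of discrete symbols."""
        counts = np.array(list(Counter(samples).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def strength_of_structure(columns):
        """S_s = sum of partial entropies H_i minus the entropy H of the joint variable.
        columns: list of equal-length sequences, one per object/variable."""
        partial = sum(entropy(col) for col in columns)
        joint = entropy(list(zip(*columns)))
        return partial - joint

    # Two strongly dependent variables give a large S_s, independent ones give ~0.
    x = [0, 1, 0, 1, 0, 1, 0, 1]
    print(strength_of_structure([x, x]))                           # ~1.0 bit
    print(strength_of_structure([x, [0, 0, 1, 1, 0, 0, 1, 1]]))    # ~0 bits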
3.2. Geometric Structure

Geometric structure represents the objects of a document based on the presentation, and the connection among these objects. According to the International Standard ISO 8613-1:1989(E) [44], the geometric or layout structure can be defined below:
Definition 4. Geometric or layout structure is the result of dividing and subdividing the content of a document into increasingly smaller parts, on the basis of the presentation.

A Geometric (Layout) Object is an element of the specific geometric structure of a document. The following types of geometric objects are defined:
• Block is a basic geometric object corresponding to a rectangular area on the presentation medium containing a portion of the document content;
• Frame is a composite geometric object corresponding to a rectangular area on the presentation medium containing either one or more blocks or other frames;
• Page is a basic or composite geometric object corresponding to a rectangular area; if it is a composite object, it contains either one or more frames or one or more blocks;
• Page Set is a set of one or more pages;
• Document Geometric (Layout) Root is the object at the highest level in the hierarchy of the specific geometric structure. The root node in the above example represents a page.
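The object types of Definition 4 map naturally onto a small tree data structure. The sketch below is our own illustrative encoding, not the ODA/ISO 8613 data format itself; the field names and bounding-box convention are assumptions.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class LayoutObject:
        """A node of the specific geometric (layout) structure."""
        kind: str                      # "root", "page_set", "page", "frame" or "block"
        bbox: Tuple[int, int, int, int] = (0, 0, 0, 0)   # x, y, width, height on the page
        children: List["LayoutObject"] = field(default_factory=list)

        def is_basic(self) -> bool:
            # Blocks are basic objects; everything with children is composite.
            return self.kind == "block"

    # A newspaper page with a headline block and a two-column text frame.
    page = LayoutObject("page", (0, 0, 2100, 2970), [
        LayoutObject("block", (100, 100, 1900, 200)),             # headline
        LayoutObject("frame", (100, 350, 1900, 2400), [
            LayoutObject("block", (100, 350, 930, 2400)),         # left text column
            LayoutObject("block", (1070, 350, 930, 2400)),        # right text column
        ]),
    ])
    root = LayoutObject("root", children=[page])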
The geometric structure can be formally described by the following definition according to the basic model given by Eqs. (1) and (2).
Definition 5. The geometric structure is described by the element S in the document space Ω = (S, Φ, δ, α, β) shown in Eqs. (1) and (2), together with P_u, a set of operations performed on S, such that

    S = {S_B, S_C}                                                           (5)

where S_B represents a set of Basic objects and S_C stands for a set of Composite objects,

    S_C = {O_1, O_2, ..., O_m}.                                              (6)
This is a general definition of the geometric structure. Different types of specific documents have their specific forms. For example, the specific document shown in Fig. 3(a), which is a page extracted from a newspaper, is divided into several blocks as illustrated in Fig. 3(b). According to the above general model, its specific document geometric structure can be presented graphically as in Fig. 3(c). In this page, the document is divided into several composite objects - text areas and graphic areas - which are broken into headline blocks, text line blocks, graphic blocks, etc.

3.2.1. Geometric Complexity

The geometric complexity of a document can be measured by a complexity function ρ which is defined below:
Definition 6. Let |S_T| and |S_G| be the numbers of elements in sets S_T and S_G respectively. The complexity function ρ can be presented as

    ρ = |S_T| + |S_G|.                                                       (7)
In terms of this complexity, documents can be classified into four categories:
• Documents without graphics (e.g. editorials): S_G = ∅;
• Document forms (e.g. bank cheques and other business forms): S_G = {O_F};
• Documents with graphics (e.g. general newspaper articles): S_G ≠ ∅;
• Documents with graphics as the main elements (e.g. advertisements, the front page of a magazine): |S_T| ≤ |S_G|.
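With the block counts |S_T| and |S_G| at hand, the four categories reduce to a few comparisons. The sketch below is our own simplification (it treats the form case as a single graphic object), with invented names.

    def document_category(num_text_blocks: int, num_graphic_blocks: int) -> str:
        """Classify a page using the counts behind the complexity measure of Eq. (7)."""
        if num_graphic_blocks == 0:
            return "document without graphics (e.g. editorial)"
        if num_graphic_blocks == 1:
            return "form document (e.g. bank cheque or business form)"
        if num_text_blocks <= num_graphic_blocks:
            return "document with graphics as the main elements"
        return "document with graphics"

    print(document_category(12, 0))   # editorial-like page
    print(document_category(3, 7))    # advertisement-like page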
Fig. 3. Geometric structure and logical structure of a page in a newspaper.
3.3. Logical Structure
Document understanding emphasizes the finding of logical relations between the objects of a document. To facilitate this process, a logical structure and its model have been developed in our early work [95] which can be summarized as shown below.
Logical structure represents the objects of a document based on the human-perceptible meaning, and the connection among these objects. According to the International Standard ISO 8613-1:1989(E), the logical structure can be defined as follows [44]:
Definition 7. Logical structure is the result of dividing and subdividing the content of a document into increasingly smaller parts, on the basis of the human-perceptible meaning of the content, for example, into chapters, sections, subsections, and paragraphs.

A Logical Object is an element of the specific logical structure of a document. For logical objects, no classification other than Basic logical object, Composite logical object and Document logical root is defined. Logical object categories such as Chapter, Section and Paragraph are application-dependent and can be defined using the Object class mechanism [44]. The document understanding process finds the logical relations between the objects of a document. According to the basic model represented by Eqs. (1) and (2), a formal description of the logical structure is presented as follows:
Definition 8. The logical structure is described by the elements Φ, δ, α, and β in the document space Ω = (S, Φ, δ, α, β) of Eqs. (1) and (2).
For a specific document shown in Fig. 3, its logical structure can be represented graphically by Fig. 3(d).
4. Document Analysis

Document analysis is defined as the extraction of the geometric structure of a document. In this way, a document image is broken down into several blocks, which represent coherent components of a document, such as text lines, headlines, graphics, etc., with or without knowledge of the specific format [100,95]. This structure can be represented as a geometric tree, as shown in Fig. 3(c). To build such a tree, there are many methods, which can be classified into two categories:
• Hierarchical Methods: When we break a page of a document into blocks, we consider the geometric relationships among the blocks. In this way, we have three approaches, i.e., the top-down approach, the bottom-up approach, and the adaptive split-and-merge approach.
• Non-hierarchical Methods: When we break a page of a document into blocks, we do not consider the geometric relationships among the blocks.
4.1. Hierarchical Methods
In the hierarchical methods, we have two ways: (1) from parents to children, or (2) from children to parents. Corresponding to these two ways, there are two approaches: the top-down and bottom-up approaches. Both have been used in document analysis, and each has its advantages and disadvantages. The top-down approach is fast and very effective for processing documents that have a specific format. On the other hand, the bottom-up approach is time-consuming, but it is possible to develop algorithms which are applicable to a variety of documents. A better result may be achieved by combining the two approaches [73].

4.1.1. Top-Down Approach

The top-down (knowledge-based) approach proceeds with an expectation of the nature of the document. It divides the document into major regions which are further divided into sub-regions, etc. [28,32,42,55-57,60,75,77]. The geometric structure of a document can be represented by a tree. Suppose this tree contains K levels. Figure 4 indicates the i-th and (i + 1)-th levels. Suppose the upper layer has nodes N_1^i, N_2^i, ..., N_u^i, and the lower layer has nodes N_1^{i+1}, N_2^{i+1}, ..., N_v^{i+1}. The relations between these two layers are expressed by edges between the nodes. They can also be represented in the form of
    [ 1 1 ... 1   0 0 ... 0   ...   0 0 ... 0 ]
    [ 0 0 ... 0   1 1 ... 1   ...   0 0 ... 0 ]
    [ ..................................... ]
    [ 0 0 ... 0   0 0 ... 0   ...   1 1 ... 1 ]                              (9)
Values of 1 in Eq. (9) correspond to the edges in Fig. 4, meaning that the corresponding upper-layer node is linked to (is the parent of) the corresponding lower-layer node.

Fig. 4. The i-th and (i + 1)-th levels of a structure tree.
Equation (9) gives two ways: "→" from left to right, corresponding to "from top to bottom" in the tree structure (Fig. 4), and "←" from right to left, corresponding to "from bottom to top" in the same structure. In the top-down approach, the former way is used, and a document is divided into several regions, each of which can be further divided into smaller sub-regions. Let S be the set of objects, which can be split into w disjoint subsets O^1, O^2, ..., O^p, ..., O^w, with O^p ⊆ S, p = 1, 2, ..., w.
A C-function [104] has been defined as

    C(O^p) ≥ 0,   p = 1, 2, ..., w

such that

    C(O^p ∪ O^q) ≥ C(O^p) + C(O^q).
= C(OP u 0 9 ) - C(OP)- C ( 0 Q 2) 0.
(10)
The criterion of topdown splitting is that we should divide 0‘ = OP U OQ into two subsets OP and 0 4 such that the strength of structure S, becomes minimum. This policy will maximize the intra-subset cohesion and minimize the inter-subset cohesion.
590
Y. Y. Tang et al.
For multiple splitting, the strength of structure S_s(O^p, O^q, ..., O^y) can be derived by repeating Eq. (10).
To achieve a good splitting, S_s(O^p, O^q, ..., O^y) should be minimized. Many methods have been employed in the top-down approach, e.g. smearing [49,105,51], projection profile cuts [74,46,4,56,60], Fourier transform detection [35], templates [21], and the form definition language (FDL) [32,28].

4.1.2. Bottom-Up Approach

The bottom-up (data-driven) approach progressively refines the data by layered grouping operations. It is time consuming, but it is possible to develop algorithms which can be applied to a variety of documents [105,22,41,36,46,17,27,4,78]. The bottom-up approach corresponds to the right-to-left reading of Eq. (9). Basic geometric components are extracted and connected into different groups in terms of their characteristics; the groups are then combined into larger groups, and so on. An analysis of this approach based on entropy theory is given in terms of the dynamic coalescence model [104]. In this model, we start with N(0) objects of equal "mass"; if a region is formed by m original objects, this region has mass m. N(t) stands for the number of regions at time t. X^(α), R^(α) and M^(α) represent the position, size and mass of the α-th region respectively. We have
N(0) > N(t) > N(2t) > ... > N(nt),

and the size R^(α) of a region grows with its mass M^(α), scaled by a constant R_0 called the coalescence parameter. The dynamic equation of the coalescence process can be represented in the form of Eq. (12).
If we want to include the second-order effect in the equation in order to enhance the chain effect, then Eq. (12) can be replaced by a modified form, Eq. (13), where ε is a constant to be adjusted. Two blocks α and β coalesce into a new block γ when they satisfy the following condition:

|X^(β) − X^(α)| ≤ R^(β) + R^(α).
There are two practical bottom-up methods: (1) neighborhood line density (NLD), indicating the complexity of characters and graphics [57,45,46]; and (2) connected components analysis, indicating the component properties of the document blocks [64,88,8,27,78].

4.1.3. Adaptive Split-and-Merge Approach

Liu, Tang and Suen have developed an adaptive split-and-merge approach [62,63], which draws on the conventional split-and-merge advantage of spontaneously separating inhomogeneous regions and merging homogeneous ones, and furthermore empowers such a process with an adaptive thresholding operation that computes the segmentation borders. The novelty of their approach is that the tree-like data structure resulting from block identification can readily be utilized in reasoning about the geometric relationships among those blocks. As an integral step of this approach, the relative spatial relationships are inferred at the same time as the block convergence takes place. The proposed approach has been implemented and tested with real-life documents.

4.2. Non-hierarchical Methods
Traditionally, two approaches have been used in document analysis, namely, the top-down and bottom-up approaches [93]. Both approaches have their weaknesses: they are not effective for processing documents with high geometrical complexity. Specifically, the top-down approach can process only simple documents which have a specific format or contain some a priori information; it fails on documents which have complicated geometric structures. To extract the geometric (layout) structure of a document, the top-down approach needs iterative operations to break the document into several blocks, while the
bottom-up approach needs to merge small components into larger ones iteratively. Consequently, both approaches are time consuming. Tang et al. [91] presented a new approach based on modified fractal signatures for document analysis. It does not need iterative breaking or merging, and can divide a document into blocks in only one step. This approach can be used to process various types of documents, including some with high geometrical complexity. The algorithm developed in [91] is briefly presented as follows:
Algorithm 1 (fractal signature).
Input: a page of document image;
Output: the geometric structure of the document.

Step-1. For x = 1 to X_max do, for y = 1 to Y_max do: the image F is mapped onto a gray-level function g_k(x, y).

Step-2. For x = 1 to X_max do, for y = 1 to Y_max do:

Substep-1. Initially, taking δ = 0, the upper layer u_0^k(x, y) and lower layer b_0^k(x, y) of the blanket are chosen to be the same as the gray-level function g_k(x, y), namely

u_0^k(x, y) = b_0^k(x, y) = g_k(x, y).

Substep-2. Taking δ = δ_1:
(a) u_{δ_1}(x, y) is computed by the blanket dilation formula,
u_δ(x, y) = max{ u_{δ−1}(x, y) + 1,  max_{|(m,n)−(x,y)|≤1} u_{δ−1}(m, n) };
(b) b_{δ_1}(x, y) is computed by the blanket erosion formula,
b_δ(x, y) = min{ b_{δ−1}(x, y) − 1,  min_{|(m,n)−(x,y)|≤1} b_{δ−1}(m, n) };
(c) the volume Vol_{δ_1} of the blanket is computed by the formula
Vol_{δ_1} = Σ_{x,y} ( u_{δ_1}(x, y) − b_{δ_1}(x, y) ).

Substep-3. Taking δ = δ_2:
(a) u_{δ_2}(x, y) is computed according to the same formula;
(b) b_{δ_2}(x, y) is computed according to the same formula;
(c) the volume Vol_{δ_2} of the blanket is computed in the same way.

Step-3. The sub fractal signature A_δ^k is computed by the formula

A_δ^k = ( Vol_{δ_2} − Vol_{δ_1} ) / 2.

Step-4. Combine the sub fractal signatures A_δ^k, k = 1, 2, ..., n, into the whole fractal signature

A_δ = ⋃_{k=1}^{n} A_δ^k.
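As an illustration of Steps 2 and 3, the following sketch (our own, under the assumption that the upper and lower blanket layers are built by the standard unit dilation/erosion recursion, with δ_1 and δ_2 as free parameters) computes the blanket volumes and the resulting sub fractal signature for one gray-level window:

```python
import numpy as np
from scipy.ndimage import grey_dilation, grey_erosion

def blanket_volume(gray, delta):
    """Volume between the upper and lower blanket surfaces after `delta`
    unit dilations/erosions of the gray-level function (Peleg-style blankets)."""
    u = gray.astype(np.float64)
    b = gray.astype(np.float64)
    for _ in range(delta):
        # upper layer: grow by 1 and by the 3x3 neighbourhood maximum
        u = np.maximum(u + 1, grey_dilation(u, size=(3, 3)))
        # lower layer: shrink by 1 and by the 3x3 neighbourhood minimum
        b = np.minimum(b - 1, grey_erosion(b, size=(3, 3)))
    return float((u - b).sum())

def sub_fractal_signature(gray, delta1=1, delta2=2):
    """A_delta^k = (Vol_{delta2} - Vol_{delta1}) / 2 for one sub-image."""
    return (blanket_volume(gray, delta2) - blanket_volume(gray, delta1)) / 2.0
```

Evaluating this signature over the sub-windows of the page and thresholding it is what lets the method of [91] separate text, graphics and background blocks in a single pass.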
5. Document Understanding
As document analysis extracts geometric structures from a document image by using knowledge about general documents and/or the specific document format, document understanding maps the geometric structures into logical structures by considering the logical relationships between the objects in specific documents. There are several kinds of mapping methods in document understanding: [100] proposed a tree transformation method for understanding multi-article documents; [98] discussed the extraction of Japanese newspaper articles using domain-specific knowledge; [41] constructed a special-purpose machine for understanding Japanese documents; [32] proposed a flexible format understanding method using a form definition language. Our research [92,107,94] has led to the development of a form description language for understanding financial documents. These mapping methods are based on specific rules applied to different documents with different formats. A series of document formatting rules is explicitly or implicitly used in all these understanding techniques. In this section, document understanding based on tree transformation, document formatting knowledge and document description language will be discussed.

5.1. Document Understanding Based on Tree Transformation
This method defines document understanding as the transformation of a geometric structure tree into a logical structure tree [100]. A document has an obvious hierarchical geometric structure, represented by a tree, and the logical structure of a document is also represented by a tree. Three
kinds of blocks are defined: H (head), B (body) and S (either body or head). During the transformation, a label is attached to each node. Labels include title, abstract, sub-title, paragraph, header, footnote, page number, and caption. The transformation, which moves the nodes in the tree, is based on four transformation rules. These rules are created according to layout conventions that follow the manner in which humans read. Rules 1 and 2 are based on the observation that a title should have a single set of paragraphs as a child in the logical structure; the paragraph body in another node is moved to the node under the corresponding title by these rules. Rule 3 is mainly for the extraction of characters or sections headed by a sub-title. By Rule 4, a unique class is attached to each node. This method was implemented on a SUN-3 workstation. Pilot experiments were carried out using 106 documents taken from magazines, journals, newspapers, books, manuals, letters, scientific papers, and so on. The results show that only 12 of the 106 tested documents were not interpreted correctly.

5.2. Document Understanding Based on Formatting Knowledge

Since a logical structure can correspond to a variety of geometric structures, the generation of the logical structure from the geometric structure is difficult. One of the promising solutions to this problem is the use of formatting knowledge. The formatting rules may differ from each other because of the type of document and the language used in it. However, for a specific kind of document, once the formatting knowledge is acquired, its logical structure can be deduced. An example can be found in [98], where a method of extracting articles from Japanese newspapers has been proposed. In this method, six formatting rules of Japanese newspaper layout are summarized, and an algorithm for extracting articles from Japanese newspapers has been designed based on this formatting knowledge. Another example can be found in [20], where a business letter processing approach has been developed. Because business letters are normally set in a single-column representation, letter understanding is mainly the identification of the logical objects, such as sender, receiver, date, etc. In this approach, the logical objects of the letter are identified according to a Statistical Database (SDB). As the author reported, the SDB consists of about 71 rule packages derived from the statistical evaluation of a few hundred business letters. Other knowledge, such as the shape, size and pixel density of an image block, can also be used for document understanding. References [108,23] use statistical features of connected components to identify the address blocks on envelopes.
5.3. Document Understanding Based on Description Language

One of the most effective ways to describe the structure of a document is the use of a description language. [32] detects the logical structure of a document and makes use of knowledge rules represented by a form definition language (FDL). The basic concept of the form definition language is that both the geometric and logical structures of a document can be described in terms of a set of rectangular regions. For example, part of a program in the form definition language coded for the United Nations' (UN) documents is listed below:

(defform UN-DOC#
  (width 210) (height 297)
  (if (box (? ? ? ?)
           (mode IN Y LESS)
           (area (0 210 60 100))
           (include (160 210 1 5)))
      (form UN-DOC-A (0 210 0 297))
      (form UN-DOC-B (0 210 0 297))))
(defform UN-DOC-A ...)
(defform UN-DOC-B ...)

It means that the UN documents have a width of 210 mm and a height of 297 mm. The if predicate is one of the control structures. If the box predicate succeeds, the document named UN-DOC# is matched against UN-DOC-A and UN-DOC-B and analyzed as UN-DOC-A; otherwise, it is analyzed as UN-DOC-B. The box predicate states that a ruled line should exist inside the region (0 210 60 100) and satisfy the conditions that the width of the ruled line is between 160 mm and 210 mm and its height is between 1 mm and 5 mm. (defform UN-DOC-A ...) and (defform UN-DOC-B ...) give the definitions of the UN documents with and without a ruled line having the properties stated above. According to the definition, a form dividing engine analyzes the document and produces the images of logical objects, such as the issuing organization, the document number, and the sections. More details about this method can be found in [32].

6. Form Document Processing
Form documents are a type of special-purpose document commonly used in daily life. For example, millions of financial transactions take place every day, and associated with them are form documents such as bank cheques, payment slips and bills. For this specific type of document, it is possible, according to its specific characteristics, to use specific methods to acquire knowledge from it.

6.1. Characteristics of Form Documents
Specific characteristics of form documents have been identified and analyzed in our early work [96,92,107,94]; they are listed below:

- In general, a form document may consist of straight lines which are oriented mostly in the horizontal and vertical directions.
- The information that should be acquired from a form is usually the filled-in data. The filling positions can be determined by using the above lines as references.
- Texts in form documents often contain a small set of known machine-printed, hand-printed and handwritten characters, such as legal and numeric amounts. They can be recognized with current character recognition techniques.
6.2. Form Document Processing Based on Form Description Language
According to the above analysis, a form document processing method based on a form description language has been proposed in [96,92,107,94]. A block diagram of this method is illustrated in Fig. 5. The goal of this method is to extract pieces of information, called items, from the form documents.
Fig. 5. Diagram of form processing based on the FDPL.
To acquire the items from the form documents, the item description (IDP) has been developed [107,94]. Suppose there exists a finite set of relations r = {r_1, r_2, ..., r_h} between the finite set of items a = {a_1, a_2, ..., a_n} and the finite set of graphs L = {L_1, L_2, ..., L_m}; it can be represented by a 0-r_i matrix. We call it an Item Description Matrix M_ID, such that
satisfying the following condition:
where R, L, A and B represent Right, Left, Above and Below respectively. For example, let the finite set of items and the finite set of graphs be a = {a_1, a_2, a_3, a_4} and L = {L_1, L_2, L_3, L_4, L_5, L_6} respectively, and let r = {R, L, A, B}. M_ID is then represented by the following matrix:

             L_1  L_2  L_3  L_4  L_5  L_6
      a_1  [  A    0    0    L    0    0  ]
      a_2  [  B    A    0    R    L    0  ]        (15)
      a_3  [  0    0    B    0    R    L  ]
      a_4  [  0    0    B    0    0    R  ]
Equation (15) means that
(a) a_1 is located above line L_1 and also on the left of line L_4;
(b) a_2 is located below line L_1 and above line L_2, and also on the right of line L_4 and on the left of line L_5;
(c) a_3 is located below line L_3 and also on the right of line L_5 and on the left of line L_6;
(d) a_4 is located below line L_3 and also on the right of line L_6.
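As a small illustration (our own sketch, not the authors' notation), the example matrix (15) can be held in a dictionary and, once the reference lines have been located on a filled-in form, each item can be resolved to a rectangular search region:

```python
# Item description matrix stored as a dictionary: each item maps to its
# spatial relations ('R', 'L', 'A', 'B') with respect to detected lines.
M_ID = {
    "a1": {"L1": "A", "L4": "L"},
    "a2": {"L1": "B", "L2": "A", "L4": "R", "L5": "L"},
    "a3": {"L3": "B", "L5": "R", "L6": "L"},
    "a4": {"L3": "B", "L6": "R"},
}

def item_region(relations, h_lines, v_lines, page_w, page_h):
    """Turn the relations of one item into a bounding box (x0, y0, x1, y1).
    `h_lines` / `v_lines` give the y / x coordinate of each detected line;
    y grows downward, as in image coordinates."""
    x0, y0, x1, y1 = 0, 0, page_w, page_h
    for line, rel in relations.items():
        if rel == "A":
            y1 = min(y1, h_lines[line])   # item lies above the line
        elif rel == "B":
            y0 = max(y0, h_lines[line])   # item lies below the line
        elif rel == "L":
            x1 = min(x1, v_lines[line])   # item lies left of the line
        elif rel == "R":
            x0 = max(x0, v_lines[line])   # item lies right of the line
    return x0, y0, x1, y1
```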
6.3. Form Document Processing Based on Form Registration
A form document processing system based on the pre-registered empty forms has been developed in [75]. The process includes two steps: (1) empty form registration, and (2) data filled form recognition. During the registration step, a form sample without any data is first scanned and registered with the computer. Through line enhancement, contour extraction and square detection, both the label and data fields are extracted. The relationships among these fields are then determined. Man-machine conversation is required during this registration process. The result of registration is stored as the format data of the form sample. During the recognition step, only the data fields are extracted according to the locations indicated by the format data. 6.4. Form Document Processing System
An intelligent form processing system (IFPS) has been described in [10]. It provides capabilities for automatically indexing form documents for storage in and retrieval from a document library, and for capturing information from scanned form images using OCR software. The IFPS also provides capabilities for efficiently storing form images. The overall organization of the IFPS is shown in Fig. 6, which contains
Fig. 6. An intelligent document form processing system.
two parallel paths: one for image applications such as retrieval, display and printing of a form document, the other for data processing applications that deal with the information contained on a form. The IFPS consists of six major processing components:

- Defining the form model;
- Storing the form model in a form library;
- Matching the input form against the models stored in the form library;
- Registering the selected model to the input form;
- Converting the extracted image data to symbol codes for input to a database;
- Removing the fixed part of a form, and retaining only the filled-in data for storage.
7. Major Techniques

To implement the above approaches, many practical techniques have been developed. In this section, the major techniques are presented:

- Hough Transform,
- Skew Detection,
- Projection Profile Cuts,
- Run-Length Smoothing Algorithm (RLSA),
- Neighborhood Line Density (NLD),
- Connected Components Algorithm,
- Crossing Counting,
- Form Definition Language (FDL),
- Texture Analysis,
- Local Approach,
- Other Segmentation Techniques.
7.1. Hough Transform

The Hough transform maps points of the Cartesian space (x, y) into sinusoidal curves in a (ρ, θ) space via the transformation

ρ = x cos θ + y sin θ.
Each time a sinusoidal curve intersects another at particular values of ρ and θ, the likelihood that a line corresponding to these (ρ, θ) coordinates is present in the original image increases. An accumulator array (consisting of R rows and T columns) is used to count the number of intersections at the various ρ and θ values. The cells in the accumulator array with the highest counts correspond to lines in the original image. Because text lines are effectively thick lines of sparse density, the Hough transform can be used to detect them and their orientation. Three major applications of the Hough transform in document analysis are listed below.

- Skew detection: An important application of the Hough transform is skew detection. A typical method can be found in [34]. It detects the document skew by applying the Hough transform to a "burst image". First, the resolution of the document image is reduced from 300 dpi (dots per inch) to 75 dpi. Next, a vertical and a horizontal burst image are produced from the reduced document image. The Hough transform is then applied to either the vertical or the horizontal burst image, according to the orientation of the document. The number of black pixels in the burst image is significantly smaller than in the original image, which speeds up the skew detection procedure. In order to eliminate the negative effects of the large run-lengths contributed by figures and black margins, only small run-lengths between 1 and 25 pixels are mapped into the (ρ, θ) space. The skew angle can then be calculated from the accumulator array. In [34], all skews were detected correctly for the thirteen test images of five different types of documents.
- Text block identification: The accumulator array produced by the transform has different properties corresponding to the different contents of the document image. High peaks in the array correspond to graphics in the document, while cells with regular values and uniform width correspond to text [82,87]. Thus, the different document contents can be identified according to these properties.
- Grouping the characters in a line for text/graphics separation: The Hough transform can also be used to detect the text lines by grouping the characters together and separating them from the graphics [27].
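The voting scheme just described can be sketched directly in NumPy (an illustration only; real systems such as [34] first build reduced-resolution burst images and filter the run lengths):

```python
import numpy as np

def hough_accumulator(binary_img, n_theta=180):
    """Vote each black pixel into a (rho, theta) accumulator array."""
    ys, xs = np.nonzero(binary_img)                 # coordinates of black pixels
    thetas = np.deg2rad(np.arange(n_theta))         # 0 .. 179 degrees
    diag = int(np.ceil(np.hypot(*binary_img.shape)))
    acc = np.zeros((2 * diag + 1, n_theta), dtype=np.int64)
    for t, theta in enumerate(thetas):
        rho = np.round(xs * np.cos(theta) + ys * np.sin(theta)).astype(int) + diag
        acc[:, t] += np.bincount(rho, minlength=2 * diag + 1)
    return acc

# The dominant text-line orientation (hence the skew) is the theta column
# holding the strongest peaks, e.g. np.unravel_index(acc.argmax(), acc.shape).
```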
7.2. Techniques for Skew Detection

Many techniques have been applied to skew detection, and many algorithms have been developed [79]. For example, Akiyama and Hagita [4] developed an automated
entry system for skewed documents, but it fails on documents containing a mixture of text blocks, photographs, figures, charts, and tables. Hinds, Fisher and D'Amato [34] developed a document skew detection method using run-length encoding and the Hough transform. Nakano, Shima, Fujisawa, Higashino and Fujiwara [76] proposed an algorithm for the skew normalization of a document image based on the Hough transform. These methods can handle documents in which the non-text regions are limited in size. Ishitani [43] proposed a method to detect skew for document images containing a mixture of text areas, photographs, figures, charts, and tables. To handle multi-skew problems, [109] developed a method using least squares; the basic idea is presented below. Given a set of N data points, i.e. the reference points of a text line, a linear function is assumed to relate the dependent variable f(x) to the independent variable x,

f(x) = a_1 + a_2 x.

Minimizing the sum of squared errors over the N points yields two normal equations, which can be solved to give a_1 and a_2. Consequently, from a_1 and a_2, the slope of the text block, and hence the skew angle, can be calculated, and the document can then be rotated to the correct position.
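A minimal rendering of this least-squares fit (our sketch; the reference points are assumed to be, for example, centroids of the connected components along one text line):

```python
import numpy as np

def skew_angle_from_points(xs, ys):
    """Fit y = a1 + a2*x to the text-line reference points by least squares
    and return the skew angle in degrees."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    A = np.vstack([np.ones_like(xs), xs]).T
    (a1, a2), *_ = np.linalg.lstsq(A, ys, rcond=None)
    return np.degrees(np.arctan(a2))
```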
7.3. Projection Profile Cuts

Projection refers to the mapping of a two-dimensional region of an image into a waveform whose values are the sums of the image values along some specified direction. A projection profile is obtained by determining the number of black pixels that fall onto a projection axis. Projection profiles represent a global feature of a document; they play a very important role in document element extraction, character segmentation and skew normalization. Let f(x, y) be a document image, and let R stand for an area of the document image. Assume that f(x, y) = 0 outside the image, and let δ[...] denote a delta function. The quantity t = x sin φ − y cos φ gives the Euclidean distance of a line from the origin [80]. If the projection angle from the x-axis is φ, the projection can be defined as

p(φ, t) = ∫∫_R f(x, y) δ[x sin φ − y cos φ − t] dx dy.        (19)

For a digitized image, the integral over R is replaced by a sum over R.
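For the usual horizontal and vertical cuts (φ = 0° or 90°), the profile reduces to row or column sums of the binary image. The following sketch (an illustration; the valley-width threshold is a free parameter) locates cut positions at sufficiently wide empty valleys:

```python
import numpy as np

def horizontal_profile(binary_img):
    """Number of black pixels in each row."""
    return binary_img.sum(axis=1)

def cut_positions(profile, min_gap=10):
    """Return (start, end) index pairs of valleys (runs of empty rows)
    wider than `min_gap`; blocks are separated at these valleys."""
    empty = profile == 0
    cuts, start = [], None
    for i, e in enumerate(empty):
        if e and start is None:
            start = i
        elif not e and start is not None:
            if i - start >= min_gap:
                cuts.append((start, i))
            start = None
    if start is not None and len(empty) - start >= min_gap:
        cuts.append((start, len(empty)))
    return cuts
```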
All objects in a document are contained in rectangular blocks, and blanks are placed between these rectangles. Thus, the projection profile of a document is a waveform whose deep valleys correspond to the blank areas of the document. A deep valley with a width greater than an established threshold can be cut at the position corresponding to the edge of an object or block. Because a document generally consists of several blocks, the projection process should be applied recursively until all of the blocks have been located. More details about various applications of this technique in document analysis can be found in Refs. [74,46,99,103,4,101].

7.4. Run-Length Smoothing Algorithm (RLSA)
The basic RLSA is applied to a binary sequence in which white pixels are represented by 0’s and black pixels by 1’s. It transforms a binary sequence x into an output sequence y according to the following rules: (a) 0’s in x are changed to 1’s in y, if the number of adjacent 0’s is less than or equal to a predefined limit C. (b) 1’s in x are unchanged in y.
For example, with C = 4 the sequence x is mapped into y as follows:
x : 00010000010100001000000011000
y : 11110000011111111000000011111
When applied to pattern arrays, the RLSA has the effect of linking together neighboring black areas that are separated by less than C pixels. With an appropriate choice of C, the linked areas will be regions of a common data type. The degree of linkage depends on the following factors: (a) the threshold value C, (b) the distribution of white and black pixels in the document, and (c) the scanning resolution. On the other hand, the RLSA may also be applied to the background. It has the effect of eliminating black pixels that are less than C in length [49]. The choice of the smoothing threshold C is very important. Very small horizontal C values simply close individual characters. Slightly larger values of C merge together individual characters in a word, but are not large enough to bridge the space between two words. Too large values of C often cause sentences to join to non-text regions, or to connect to adjacent columns. In general, the threshold C is set according to the character height, gap between words and interline spacing [49,26]. The RLSA was first proposed by Johnston [49] to separate text blocks from graphics. It has also been used to detect long vertical and horizontal white lines [1,102]. This algorithm was extended to obtain a bit-map of white and black areas representing blocks which contain various types of data [105]. Run-length smoothed document images can also be used as basic features for document analysis [24,26,28].
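A one-dimensional RLSA pass takes only a few lines; combining a horizontal and a vertical pass with a logical AND, as in [105], produces the block bit-map (the thresholds below are placeholders, not recommended values):

```python
import numpy as np

def rlsa_1d(bits, C):
    """Run-length smoothing of a 1-D binary array: fill runs of 0s whose
    length is <= C with 1s; 1s are left unchanged (matches the example above)."""
    out = bits.copy()
    n = len(bits)
    i = 0
    while i < n:
        if bits[i] == 0:
            j = i
            while j < n and bits[j] == 0:
                j += 1                      # j is one past the run of zeros
            if j - i <= C:
                out[i:j] = 1
            i = j
        else:
            i += 1
    return out

def rlsa_2d(img, C_h=30, C_v=60):
    """Horizontal and vertical smoothing combined by logical AND."""
    h = np.array([rlsa_1d(row, C_h) for row in img])
    v = np.array([rlsa_1d(col, C_v) for col in img.T]).T
    return h & v
```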
7.5. Neighborhood Line Density (NLD)

For every pixel of the document, its NLD is the sum of the complexities in its four directions,

NLD = Σ_{i∈N} C_i ,   C_i = 1 / L_{ij} ,   N = (L, R, U, D),

where L, R, U and D stand for the four directions, i.e. left, right, up and down, respectively; C_i indicates the complexity of a pixel for direction i, and L_{ij} represents the distance from the given pixel to its surrounding stroke j in direction i. Based on the following features, the NLD can be used to separate characters from graphics, even when some characters touch the graphics: (1) the NLD is higher for character fields than for graphic fields, and (2) there are high peaks of NLD in the character fields, whose height is affected by the character size and pitch [58]. NLD-based segmentation consists of three processing steps. First, the NLD of every black pixel of the input document is calculated using the formula above. Second, an NLD emphasis processing is carried out in order to enlarge the NLD difference between the graphic fields and the character fields. Third, a thresholding step classifies pixels with an NLD value greater than a threshold θ as character fields; otherwise they are classified as graphic fields.
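A direct, unoptimized sketch of the measure for a single pixel (our illustration, assuming L_{ij} is taken as the distance to the nearest black pixel in each of the four directions and that the search is truncated at a maximum distance):

```python
import numpy as np

def nld(img, y, x, max_dist=50):
    """Neighborhood line density of pixel (y, x): sum of 1/distance to the
    nearest stroke in the left, right, up and down directions."""
    h, w = img.shape
    steps = {"L": (0, -1), "R": (0, 1), "U": (-1, 0), "D": (1, 0)}
    total = 0.0
    for dy, dx in steps.values():
        for d in range(1, max_dist + 1):
            yy, xx = y + d * dy, x + d * dx
            if not (0 <= yy < h and 0 <= xx < w):
                break                       # reached the image border
            if img[yy, xx]:                 # hit a stroke pixel
                total += 1.0 / d
                break
    return total
```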
7.6. Connected Components Analysis (CCA)
A connected component is a set of connected black or white pixels such that an 8-connected path exists between any two of its pixels. Different contents of a document tend to produce connected components with different properties. Generally, graphics consist of large connected components, while text consists of connected components of regular and relatively smaller size. By analyzing these connected components, graphics and text in the document can be identified, grouped into different blocks, and separated from each other. The size and location of a connected component can be represented by a four-tuple [98], and the analysis of a document can be regarded as the process of merging these four-tuples. Taking a newspaper as an example, its content is classified into several regions such as index, abstract, article body, picture and figure. During image analysis, the four-tuples are merged and classified into these regions using the features found in the regions. In [98], 13 features of the six regions of a Japanese newspaper are summarized. According to these features, a table is created summarizing the properties of the four-tuples in each region. All the four-tuples can be classified and merged following the rules described in this table. Since the four-tuples contain information about the location of the components, all the regions can be classified and located at the end of the four-tuple merging process. Two typical applications of the CCA in document processing are illustrated below.
Envelope processing: An important application is automatic envelope processing [108,6,7,19,23,61]. By placing the connected components into several groups and further analyzing the components within them, CCA has been used to locate address blocks on envelopes [108].

Mixed text/graphics document processing: [27] describes the development and implementation of a robust algorithm in which CCA is successfully used to separate text strings from a mixed text/graphics document image. This algorithm consists of five steps: (a) connected component generation, (b) area/ratio filtering, (c) collinear component grouping, (d) logical grouping of strings into words and phrases, and (e) text string separation.

7.7. Crossing Counts
A crossing count is the number of times the pixel value changes from 0 (white pixel) to 1 (black pixel) along a horizontal or vertical raster scan line. It can be expressed as a vector whose components are defined as follows.

(1) Horizontal crossing counts: for each row y, the number of positions x at which f(x − 1, y) = 0 and f(x, y) = 1.
(2) Vertical crossing counts: for each column x, the number of positions y at which f(x, y − 1) = 0 and f(x, y) = 1.

Crossing counts can be used to measure document complexity. In [4], crossing counts have been used as one of the basic features to separate and identify the document blocks.
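These two vectors have an immediate NumPy rendering (our sketch):

```python
import numpy as np

def crossing_counts(img):
    """0-to-1 transition counts along each row (horizontal) and
    each column (vertical) of a binary image."""
    horiz = ((img[:, 1:] == 1) & (img[:, :-1] == 0)).sum(axis=1)
    vert = ((img[1:, :] == 1) & (img[:-1, :] == 0)).sum(axis=0)
    return horiz, vert
```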
7.8. Form Definition Language (FDL)

[32] proposed a top-down knowledge representation called the Form Definition Language (FDL) to describe the generic layout structure of a document. The structure is represented in terms of rectangular regions, each of which can be recursively defined in terms of smaller regions. An example is given in Fig. 7. These generic descriptions are then matched against the preprocessed input document images. This method is powerful, but complicated to implement; [28] developed a simplified version of the FDL so that it may be implemented more easily.
(defform F
  (form F1 (10, 90, 10, 40))
  (form ... )
  (form F3 ... ))
(defform F1
  (form F11 ... )
  (form F1.2 ... ))
Fig. 7. Representation of structure using the FDL.
7.9. Texture Analysis - Gabor Filters
A text segmentation algorithm using Gabor filters for document processing has been proposed by [48]. The main steps of this algorithm are described below:
Step 1. Filter the input image through a bank of n even-symmetric Gabor filters [65] to obtain n filtered images.
Step 2. Compute the feature image consisting of the "local energy" estimates over windows of appropriate size around every pixel in each of the filtered images.
Step 3. Cluster the feature vectors corresponding to each pixel using a squared-error clustering algorithm to obtain a segmentation of the original input image into K clusters or segments.

This algorithm has been used for locating candidate regions of the destination address block (DAB) on images of letters and envelopes in [47]. It treats the document image as a multi-textured region in which the text on the envelope defines a specific texture, while other non-text contents, including blank regions, produce different textures. Thus, the problem of locating the text in the envelope image is posed as a texture segmentation problem.

A great variety of texture analysis methods have been developed for image processing and pattern recognition. Although many of them have not been used in document processing directly, a lot of texture classification and segmentation techniques can be useful. Some of those published since 1990 are listed below. A new set of textural measures derived from the texture spectrum has been presented in [31]; the proposed features capture more complete texture characteristics of the image. Based on the texture spectrum, a texture edge detection method has been developed in [30]; the basic concept of the method is to use the texture spectrum as the texture measure of the image and combine it with conventional edge detectors. In [81], two new methods have been described which use geometric proximity to reference points in region growing: the first is based on Voronoi tessellation and mathematical morphology, while the second is based on the "radiation model" for region growing and image segmentation. In [67], a multiresolution simultaneous autoregressive (MR-SAR) model has been presented for texture classification and segmentation.
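The three Gabor-filter steps above can be prototyped with standard library routines. The following sketch is an illustration only: the filter frequencies, orientations, window size and number of clusters K are assumptions, not the settings used in [48].

```python
import numpy as np
from skimage.filters import gabor
from scipy.ndimage import uniform_filter
from sklearn.cluster import KMeans

def gabor_texture_segmentation(gray, frequencies=(0.1, 0.2, 0.4),
                               thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4),
                               window=15, K=2):
    """Step 1: filter bank; Step 2: local energy; Step 3: k-means clustering."""
    features = []
    for f in frequencies:
        for t in thetas:
            real, _ = gabor(gray, frequency=f, theta=t)      # even-symmetric part
            energy = uniform_filter(real ** 2, size=window)  # local energy estimate
            features.append(energy)
    X = np.stack(features, axis=-1).reshape(-1, len(features))
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(X)
    return labels.reshape(gray.shape)
```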
7.10. Local Approach

In [15,16], a formal model has been proposed which seeks specific regions inside a document and applies appropriate local, instead of global, thresholding techniques, leading to the extraction of valuable information from the discovered regions while preserving their topological properties. The proposed local approach can be summarized in the following key steps: (1) image enhancement, (2) image segmentation, (3) guide-line elimination, (4) morphological processing, (5) restoring lost information, (6) topological processing, (7) edge detection, (8) gap identification and filling, and (9) information extraction. This approach allows for the visualization and understanding of the finest granularity of data written or printed on documents such as bank cheques.
7.11. Other Segmentation Techniques

Segmentation techniques can be roughly categorized as [3,84,2,68,12,25,53]: projection-based, pitch-based, recognition-based, and region-based. The first two techniques are suitable for typewritten texts where characters are equally spaced and there is a significant gap between adjacent characters. In the recognition-based methods, segmentation is performed by recognizing a character in a sequential scan. For handwritten or hand-printed texts, where variations in handwriting are unpredictable, the performance of these methods is dubious. The region-based method is the only alternative for the segmentation of totally unconstrained handwritten characters. This category of techniques consists of finding and analyzing the input image components, as well as how these components are related, in order to detect suitable regions for segmentation. Segmentation techniques can also be categorized by the sort of pixels they work on: methods working on foreground pixels (black pixels) [3,84,68,12,25,53], and methods working on background pixels (white pixels) [14].
8. Conclusions
Every day, millions of documents, including technical reports, government files, newspapers, books, magazines, letters and bank cheques, have to be processed. A great deal of time, effort and money will be saved if this can be done automatically. However, in spite of major advances in computer technology, the degree of automation in acquiring data from such documents is still very limited, and a great deal of manual labour is still needed in this area. Thus, any method which can speed up this process will make a significant contribution.

This chapter deals with the essential concepts of document analysis and understanding. It begins with a key concept, document structure. The importance of this concept can be seen throughout the chapter: constructing a geometric structure model and a logical structure model; considering document analysis as a technique for extracting the geometric structure; and regarding document understanding as a mapping from the geometric structure into the logical structure. The chapter also attempts to analyze document structure, and the top-down and bottom-up approaches commonly used in document analysis, theoretically in terms of an entropy function.

Some open questions and problems still exist, especially in document understanding. Any practical document can be viewed in terms of its geometric structure space and its logical structure space. Because there is no one-to-one mapping between these two spaces, it is difficult to find a correct mapping to transform a geometric structure into a logical one. For example, knowledge-based rules may vary between documents; how to find the correct rules is a profound subject for future research.

References

[1] L. Abele, F. Wahl and W. Scherl, Procedures for an automatic segmentation of text, graphic and halftone regions in documents, Proc. 2nd Scandinavian Conf. on Image Analysis (1981) 177-182.
[2] P. Ahmed and C. Y. Suen, Computer recognition of totally unconstrained handwritten zipcodes, Int. J. Pattern Recognition and Artificial Intelligence 1, 1 (1987) 1-15.
[3] P. Ahmed and C. Y. Suen, Segmentation of unconstrained handwritten postal zipcodes, Proc. 6th Int. Conf. on Pattern Recognition (1982) 545-547.
[4] T. Akiyama and N. Hagita, Automated entry system for printed documents, Pattern Recognition 23, 11 (1990) 1141-1154.
[5] R. N. Ascher, G. M. Koppelman, M. J. Miller, G. Nagy and G. L. Shelton Jr, An interactive system for reading unformatted printed text, IEEE Trans. on Computers C-20, 12 (1971) 1527-1543.
[6] N. Bartneck, Knowledge based address block finding using hybrid knowledge representation schemes, Proc. 3rd USPS Advanced Technology Conf. (1988) 249-263.
[7] A. Bergman, E. Bracha, P. G. Mulgaonkar and T. Shaham, Advanced research in address block location, Proc. 3rd USPS Advanced Technology Conf. (1988) 218-232.
[8] J. P. Bixler, Tracking text in mixed-mode document, Proc. ACM Conf. Document Processing Systems (1988) 177-185.
[9] H. Bunke, P. S. P. Wang and H. S. Baird (Eds.), Document Image Analysis, Singapore: World Scientific Publishing Co. Pte. Ltd., 1994.
[lo] R. G. Casey, D. R. Ferguson, K. M. Mohiuddin and E. Walach, An intelligent forms processing system, Machine Vision and Applications 5, 3 (1992) 143-155. [ll] R. G. Casey and G. Nagy, Document analysis - a broader view, Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, France, Sept. 3COct. 2 (1991) 839-850. [12] M. Cesar and R. Shinghal, An algorithm for segmentation handwritten postal codes, Man Machine Studies 33 (1990) 63-80. [13] Y. Chenevoy and A. Belaid, Hypothesis management for structured document recog-
nition, Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, France, Sept. 3CkOct. 2 (1991) 121-129. [14] M. Cheriet, Y. S. Huang and C. Y. Suen, Background region-based algorithm for the segmentation of connected digits, Technical Report, Centre for Pattern Recognition and Machine Intelligence, Concordia University (1991). [15] M. Cheriet, J. N. Said and C. Y. Suen, A formal model for document processing of bank cheques, Proc. 3-nd Int. Conf. on Document Analysis and Recognition, Montreal, Canada, Oct. 14-16, (1995) 21C213. [16] M. Cheriet, J. N. Said and C. Y. Suen, A recursive approach for image segmentation, IEEE Rans. Image Processing, (submitted 1995). [17] G. Ciardiello, M. T. Degrandi, M. P. Poccotelli, G. Scafuro and M. R. Spada, An experimental system for office document handling and text recognition Proc. 9th Int. Conf. on Pattern Recognition (1988) 739-743. [18] R. H. Davis and J. Lyall, Recognition of handwritten characters - a review, Image and Vision Computing 4, 4 (1986) 208-218. [19] V. Demjanenko, Y. C. Shin, R. Sridhar, P. Palumbo and S. Srihari, Real-time connected component analysis for address block location, Proc. 4th USPS Advanced Technology Conf. (1990) 1059-1071. [20] A. Dengel, Document image analysis - expectation-driven text recognition, Proc. Syntactic and Structural Pattern Recognition (SSPRSO) (1990) 78-87. [21] A. Dengel and G. Barth, Document description and analysis by cuts, Proc. RIAO, MIT, 1988. (221 W. Doster, Different states of a document’s content on its way from the gutenbergian world to the electronic world, Proc. 7th Int. Conf. on Pattern Recognition (1984) 872-874. [23] A. C. Downton and C. G. Leedham, Preprocessing and presorting of envelope images for automatic sorting using OCR, Pattern Recognition 23, No. 3/4 (1990) 347-362. [24] F. Esposito, D. Malerba, G. Semeraro, E. Annese and G. Scafuro, An experimen-
tal page layout recognition system for office document automatic classification: an integrated approach for inductive generalization, Proc. 10th Int. Conf. on Pattern Recognition (1990) 557-562. [25] R. Fenrich, Segmentation of automatically located handwritten words, Proc. 3rd International Workshop on Frontiers in Handurnding Recognition, Chateau de Bonas, France (1991) 33-44. [26] J. L. Fisher, S. C. Hinds and D. P. D’Amato, A rule-based system for document image segmentation, Proc. 10th Int. Conf. on Pattern Recognition (1990) 567-572. [27] L. A. Fletcher and R. Kasturi, A robust algorithm for text string separation from mixed textlgraphics images, IEEE Rans. on Pattern Analysis and Machine Intelligence 10, 6 (1988) 910-918.
[28] H. Fujisawa and Y. Nakano, A top-down approach for the analysis of document images, Proc. SSPRSO (1990) 113-122. [29] V. K. Govindan and A. P. Shivaprasad, Character recognition - a review, Pattern Recognition 23, 7 (1990) 671-683. [30] D. C. He and L. Wang, Detecting texture edges from image, Pattern Recognition 25, 6 (1992) 595600. [31] D. C. He and L. Wang, Texture features based on texture spectrum, Pattern Recognition 24, 5 (1991) 391-399. [32] J. Higashino, H. Fujisawa, Y. Nakano and M. Ejiri, A knowledge-based segmentation method for document understanding, Pmc. 8th Int. Conf. on Pattern Recognition (1986) 745-748. [33] T. H. Hilderbrandt and W. Liu, Optical recognition of handwritten Chinese characters: advances since 1980, Pattern Recognition 26, 2 (1993) 205-225. [34] S. C. Hinds, J. L. Fisher and D. P. D'Amato, A document skew detection method using run-length encoding and the Hough transform, Proc. 10th Int. Conf. on Pattern Recognition (1990) 464-468. [35] M. Hose and Y. Hoshino, Segmentation method of document images by twodimensional Fourier transformation, System and Computers in Japan 16, 3 (1985) 38-47. [36] N. Hagita I. Masuda, T. Akiyama, T. Takahashi and S. Naito, Approach to smart document reader system, Proc. CVPR'85 (1985) 550-557. [37] ICDAR'91, Proc. First Int. Conf. on Document Analysis and Recognition, SaintMalo, France, Sept. 30-Oct. 2, 1991. [38] ICDAR'93, Proc. Second Int. Conf. on Document Analysis and Recognition, Tsukuba Science City, Japan, Oct. 20-22, 1993. [39] ICDAR'95, Proc. Third Int. Conf. on Document Analysis and Recognition, Montreal, Canada, August 14-16, 1995. [40] S. Impedovo, L. Ottaviano and S. Occhinegro, Optical character recognition - A survey, Znt. J. Pattern Recognition and Artificial Intelligence 5, 1 (1991) 1-24. [41] K. Inagaki, T. Kato, T. hiroshima and T. Sakai, MACSYM: A hierarchical parallel image processing system for event-driven pattern understanding of documents, Pattern Recognition 17, 1 (1984) 85-108. [42] R. Ingold and D. Armangil, A top-down document analysis method for logical structure recognition, Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, France, Sept. 30-Oct. 2 (1991) 41-49. [43] Y. Ishitani, Document Skew Detection Based on Local Region Complexity, Proc. Second Int. Conf. on Document Analysis and Recognition, Tsukuba Science City, Japan, Oct. 20-22 (1993) 49-52. [44] ISO, 8613: Information Processing- Text and Ofice Systems-Ofice, Document Architecture (ODA) and Interchange Format, International Organization for Standardization, 1989. [45] 0. Iwaki, H. Kida and H. Arakawa, A character/graphic segmentation method using neighbourhood line density, Dans. of the Institute of Electronics and Communication Engineers of Japan, Part IV, J&D, 4 (1985) 821-828. [46] 0.Iwaki, H. Kida and H. Arakawa, A Segmentation method based on office document hierarchical structure, Proc. IEEE Int. Conf. Syst. Man. Cybern. Alexandria, VA, Oct. (1987) 759-763. [47] A. K. Jain and S. K. Bhattacharjee, Address block location on envelopes using Gabor filters: supervised method, Proc. 11th Int. Conf. on Pattern Recognition (1992) 264-266.
3.6 Document Analysis and Recognition by Computers 609 [48] A. K. Jain and S. K. Bhattacharjee, Text segmentation using Gabor filters for autcmatic document processing, Machine Vision and Applications 5, 3 (1992) 169-184. [49] E. G. Johnston, Short note: printed text discrimination, Computer Graphics and Image Processing 3, 1 (1974) 83-89. [50] Journal, Machine Vision and Applications, (Special Issue: Document Image Analysis Techniques) 5,3 (1992). (511 J. Kanai, M. S. Krishnamoorthy and T. Spencer, Algorithms for manipulating nested block represented images, Advance Printing of Paper Summaries, SPSE’s 26th Fall Symposium, Arlington, Virginia, Oct (1986) 190-193. [52] S. M. Kerpedjiev, Automatic extraction of information structures from documents, Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, France, Sept. 30-Oct. 2 (1991) 32-40. [53] F. Kimura and M. Shridhar, Recognition of connected numerals, 1st Znt. Conf. on Document Analysis and Recognition, Saint-Malo, France, Sept. 30-Oct. 2 (1991) 731-739. [54] J. Kreich, A. Luhn and G. Maderlechner, An experimental environment for model based document analysis, Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, France, Sept. 30-Oct. 2 (1991) 50-58. [55] J. Kreich, A. Luhn and G. Maderlechner, Knowledge based interpretation of scanned business letters, IAPR Workshop on C V (1988) 417-420. [56] M. Krishnamoorthy, G. Nagy, S. Seth and M. Viswanathan, Syntactic segmentation and labeling of digitized pages from technical journal, IEEE Trans. on Pattern Analysis and Machine Intelligence 15, 7 (1993) 737-747. [57] K. Kubota, 0. Iwaki and H. Arakawa, Document understanding system, Proc. 7th Int. Conf. on Pattern Recognition (1984) 612-614. [58] K. Kubota, 0. Iwaki and H. Arakawa, Image segmentation techniques for document processing, Proc. 1983 Int. Conf. on Text Processing with a Large Character Set (1983) 73-78. [59] S. W. Lam and S. N. Srihari, Multi-domain document layout understanding, Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, France, Sept. 30-Oct. 2 (1991) 112-120. [60] K. K. Lau and C. H. Leung, Layout analysis and segmentation of Chinese newspaper articles, Computer Processing of Chinese and Oriental Languages 8 , 8 (1994) 97-114. [61] S. W. Lee and K. C. Kim, Address block location on handwritten Korean envelope by the merging and splitting method, Pattern Recognition 27, 12 (1994) 1641-1651. [62] J. Liu, Y . Y . Tang and C. Y . Suen, Adaptive rectangle-shaped document segmentation and geometric relation labeling, Proceedings of the 13th International Conference on Pattern Recognition, Vienna, Austria, Aug. 25-30 (1996) 763-767. [63] J. Liu, Y . Y . Tang and C. Y . Suen, Chinese document layout analysis based on adaptive split-and-merge and qualitative spatial reasoning, Pattern Recognition (in press). [64] H. Makino, Representation and segmentation of document images, Proc. IEEE Comput. SOC.Conf. Pattern Recognition and Image Processing (1983) 291-296. [65] J. Malik and P. Perona, Preattentive texture discrimination with early vision mechanisms, Journal Opt. SOC.Amer. A . 7, 5 (1990) 923-932. [66] J. Mantas, An overview of character recognition methodologies, Pattern Recognition 19,6 (1986) 425-430. [67] J. C. Mao and A. K. Jain, Texture classification and segmentation using multiresolution simultaneous autoregressive models, Pattern Recognition 25, 2 (1992) 173-188.
[68] B. T. Mitchell and A. M. Gillies, A Model-based computer vision system for recognizing handwritten ZIP codes, Machine Vision and Applications 2 (1989) 231-243. [69] S. Mori, C. Y. Suen and K. Yamamoto, Historical review of OCR research and development, Proceeding of the IEEE 80, 7 (1992) 1029-1058. [70] S. Mori, K. Yamamoto and M. Yasuda, Research on machine recognition of handprinted characters, IEEE h n s . on Pattern Analysis and Machine Intelligence 6 , 4 (1984) 38G405. [71] G. Nagy, A preliminary investigation of techniques for the automated reading of unformatted text, Comm, ACM 11, 7 (1968) 480-487. [72] G . Nagy, Towards a structured-document-image utility, Proc. SSPRSO (1990) 293-309. [73] G. Nagy, J. Kanai and M. Krishnamoorthy, Two complementary techniques for digitized document analysis, Proc. ACM Conf. on Document Processing Systems (1988) 169-176. [74] G. Nagy, S. C. Seth and S. D. Stoddard, Document analysis with an expert system, In E. S. Gelsema and L. N. Kanal eds., Pattern Recognition Practice II, pp. 149-159, Elsevier Science Publishers B. V. (North-Holland), 1986. [75] Y. Nakano, H. Fujisawa, 0. Kunisaki, K. Okada and T. Hananoi, A document understanding system incorporating with character recognition, Proc. 8th Int. Conf. on Pattern Recognition (1986) 801-803. [76] Y. Nakano, Y. Shima, H. Fujisawa, J. Higashino and M. Fujiwara, An algorithm for the skew normalization of document image, Proc. 10th Int. Conf. on Pattern Recognition 2 (1990) 8-13. (771 D. Niyogi and S. N. Srihari, A rule-based system for document understanding, Proc. AAAI’86 (1986) 789-793. [78] L. O’Gorman, The document spectrum for structural page layout analysis, IEEE f i n s . on Pattern Analysis and Machine Intelligence 15, 11 (1993) 1162-1173. [79] L. O’Gorman and R. Kasturi (eds.) Document Image Analysis, New York: IEEE Computer Society Press, 1995. [80] T. Pavlidis, Algorithm for Graphics and Image Processing, Maryland: Computer Science Press, 1982. [81] I. Pitas and C. Kotropoulos, A texture-based approach to the segmentation of sesmic image, Pattern Recognition 25, 9 (1992) 929-945. [82] A. Rastogi and S. N. Srihari, Recognizing textual blocks in document images using the Hough transform, T R 86-01,Dept. of Computer Science, SUNY Buffalo, NY, 1986. [83] M. Sabourin, Optical character recognition by a neural network, Neural Networks 5 , 5 (1992) 843-852. [84] M. Shridhar and A. Badreldin, Recognition of isolated and simply connected handwritten numerals, Pattern Recognition 19, 1 (1986) 1-12. [85] J. C. Simon and K. Zerhouni, Robust description of a line image, Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, France, Sept. 30-Oct. 2 (1991) 3-14. [86] A. L. Spitz and A. Dengel (e’ds.) Document Analysis Systems, Singapore: World Scientific Publishing Co. Pte, Ltd., 1995. [87] S. N. Srihari and V. Govindaraju, Analysis of textual images using the Hough transform, Machine Vision and Application 2 (1989) 141-153. [88] S. N. Srihari, C. H. Wang, P. W. Palumbo and J. J. Hull, Recognizing address blocks on mail pieces: specialized tools and problem-solving architecture, AI Mag. 8 , 4 (1987) 25-40.
3.6 Document Analysis and Recognition by Computers 611 [89] C. Y. Suen, M. Berthod and S. Mori, Automatic recognition of handprinted characters - The state of the art, Proc. IEEE 68, 4 (1980) 469-487. [go] C. Y. Suen, Y. Y. Tang and C. D. Yan, Document layout and logical model: A general Analysis for document processing, Technical Report, Centre for Pat tern Recognition and Machine Intelligence (CENPARMI), Concordia University, 1989. [91] Y. Y. Tang, Hong Ma, Dihua Xi, Yi Cheng and C. Y. Suen, A new Approach to document analysis based on modified fractal signature, Proc. 3-nd Int. Conf. on Document Analysis and Recognition, Montreal, Canada, Oct. 14-16 (1995) 567-570. (921 Y. Y. Tang, C. Y. Suen and C. D. Yan, Chinese form pre-processing for automatic data entry, Proc. Int. Conf. on Computer Processing of Chinese and Oriental Languages, August 13-16 (1991) Taipei, Taiwan, 313-318. [93] Y. Y. Tang, C. Y. Suen and C. D. Yan, Document processing for automatic knowledge acquisition, ZEEE Trans. on Knowledge and Data Engineering 6, 1 (1994) 3-2 1. [94] Y. Y. Tang, C. D. Yan, M. Cheriet and C. Y. Suen, Automatic analysis and understanding of documents, Handbook of Pattern Recognition and Computer Vision, pp. 625-654, edited by Patrick S. P. Wang, C. H. Chen and L. F. Pau, Singapore: World Scientific Publishing Co. Pte, Ltd., 1993. [95] Y. Y. Tang, C. D. Yan, M. Cheriet and C. Y. Suen, Document analysis and understanding: a brief survey, Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, fiance, Sept. 30-Oct. 2 (1991) 17-31. [96] Y. Y. Tang, C. D. Yan and C. Y. Suen, Form description language and its mapping onto form structure, Technical Report, Centre for Pattern Recognition and Machine Intelligence (CENPARMI), Concordia University, 1990. [97] S. L. Taylor, R. fiitzson and J. A. Pastor, Extraction of data from preprinted forms, Machine Vision and Applications 5, 3 (1992) 211-222. [98] J. Toyoda, Y. Noguchi and Y. Nishimura, Study of extracting Japanese newspaper article, Proc. 6th Int. Conf. on Pattern Recognition (1982) 1113-1115. [99] Y. Tsuji, Document image analysis for generating syntactic structure description, Proc. 9th Int. Conf. on Pattern Recognition (1988) 744-747. [loo] S. Tsujimoto and H. Asada, Understanding multi-articled documents, Proc. 10th Int. Conf. on Pattern Recognition (1990) 551-556. [loll M. Viswanathan, Analysis of scanned documents - a syntactic approach, Proc. SSPRSO (1990) 450-459. [lo21 F. Wahl, L. Abele and W. Scheri, Merkmale fuer die segmentation von dokumenten zur automatischen textverarbeitung, Proc. 4th DAGM-Symposium, 1981. (1031 D. Wang and S. N. Srihari, Classification of newspaper image blocks using texture analysis, CVGIP 47 (1989) 327-352. [lo41 S. Watanabe, Pattern Recognition: Human and Mechanical, Wiley-Interscience Publication, 1985. [lo51 K. Y. Wong, R. G . Casey and F. M. Wahl, Document analysis system, IBM J. Research Develop, 26, 6 (1982) t47-656. [lo61 A. Yamashita, T. Amano, H. Takahashi and K. Toyokawa, A model based laout understanding method for document recognition system (DRS), Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, France, Sept. 30-Oct. 2 (1991) 130-138. [lo71 C. D. Yan, Y. Y. Tang and C. Y. Suen, Form understanding system based on form description language, Proc. First Int. Conf. on Document Analysis and Recognition, Saint-Malo, France, Sept. 30-Oct. 2 (1991) 283-293.
[lo81 P. S. Yeh, S. Antoy, A. Litcher and A. Rosenfeld, Address location on envelopes, Pattern Recognition 20, 2 (1987) 213-227. [log] C. L. Yu, Y. Y. Tang and C. Y. Suen, Document skew detection based on the fractal and least squares method, Proc. 3-nd Int. Conf. on Document Analysis and Recognition, Montreal, Canada,Oct. 14-16 (1995) 1149-1152.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 613-624. Eds. C. H. Chen, L. F. Pau and P. S. P. Wang. © 1998 World Scientific Publishing Company.
CHAPTER 3.7
PATTERN RECOGNITION AND VISUALIZATION OF SPARSELY SAMPLED BIOMEDICAL SIGNALS
CHING-CHUNG LI, T. P. WANG
Department of Electrical Engineering, University of Pittsburgh, Pittsburgh, PA 15261, USA

and

A. H. VAGNUCCI, M.D.
Department of Medicine, University of Pittsburgh, Pittsburgh, PA 15261, USA

A variety of biomedical signals, such as hormonal concentrations in peripheral blood, can only be sampled and measured infrequently over a limited period of time; hence, they are considered sparsely sampled, non-stationary short time series. The discrete pseudo Wigner distribution is a transform which can be applied to such signals to provide time-dependent spectral information at an improved frequency resolution in comparison with the short-time Fourier transform. When appropriately clipped and scaled, it can be visualized as an image showing the characteristic pattern of the signal. Spectral features can be extracted from the Wigner distribution for use in automatic pattern recognition. The basic technique is described in this article along with an example of its application to cortisol time series.

Keywords: Biomedical signal; cortisol; pattern recognition; pattern visualization; short time series; Wigner distribution.
1. Introduction
Various biological signals are often measured in a clinical setting to provide information that aids medical diagnosis. Some signals, such as the EEG and ECG, can be measured continuously with relative ease; it is well known that their spectra and other analyses have been successfully applied in characterizing the state of health [1,2]. Other types of biological signals, such as chemical signals in blood samples, can be measured only infrequently over a very limited period of time; such sparsely sampled data constitute short time series of a non-stationary nature [3]. The short-time Fourier transform gives only crude spectral information at a coarse resolution. However, the pseudo Wigner distribution can be applied to provide a better estimate of the time-dependent spectral information. This time-frequency domain analysis and its use in biomedical pattern recognition are discussed in the following sections.
The Wigner distribution was introduced by E. P. Wigner [4] in 1932 in the context of quantum mechanics, and then applied to signal theory by J. Ville [5] in 1948. During the past ten years, the methods and applications of the Wigner distribution for non-stationary signals have developed rapidly [6-13]. An exposition of the important mathematical background of the Wigner distribution can be found in a series of three papers by Claasen and Mecklenbrauker [6], and a recent review is contained in a paper by Hlawatsch and Boudreaux-Bartels [14]. We summarize below some of the most useful properties of the Wigner distribution and the techniques for applying the discrete pseudo Wigner distribution to sparsely sampled biomedical signals. The plasma cortisol time series is taken as an example to illustrate its application to the recognition and visualization of normal and abnormal patterns.

2. Wigner Distribution
2.1. Continuous Wigner Distribution

Let f(t) be a continuous function of the time variable t; f(t) may be either real or complex, and f*(t) is the complex conjugate of f(t). The Wigner distribution of f(t) is defined by

W_f(t, ω) = ∫_{−∞}^{∞} f(t + τ/2) f*(t − τ/2) e^{−jωτ} dτ,        (2.1)

where τ is the correlation variable, and +τ/2 and −τ/2 denote the time advance and time delay respectively. The product f(t + τ/2) f*(t − τ/2) forms a kernel function of the time variable t and the correlation variable τ. The Fourier transform of this kernel function with respect to τ gives the Wigner distribution W_f(t, ω), which is a real-valued continuous function of both time t and frequency ω. If F(ω) is the Fourier transform of f(t) and F*(ω) is its complex conjugate, the Wigner distribution W_F(ω, t) can be defined as
w ~ ( wt ), =
2.rr
F (w -m
+
g)
F* ( w -
g)
eitcd<
It can be shown that W f ( tw, ) = W F ( W t ), . The reconstruction of f ( t ) from W f ( tw, ) is given by
2J* wf (i,w ) ejWtdw
f(t) = 2.rrf*(O)
-m
and the reconstruction of F ( w ) from W f ( t w , ) is given by
/
l o F(w)= F*@)
o
-m
W f (t,
i)
e-jwtdt.
The Wigner distribution is a bilinear transformation; if f(t) = \sum_{k=1}^{n} f_k(t), then

W_f(t, \omega) = \sum_{k=1}^{n} W_{f_k}(t, \omega) + 2\,\mathrm{Re}\!\left[\sum_{k=1}^{n-1} \sum_{l=k+1}^{n} W_{f_k f_l}(t, \omega)\right]

where W_{f_k f_l}(t, ω) is the cross Wigner distribution of f_k(t) and f_l(t),

W_{f_k f_l}(t, \omega) = \int_{-\infty}^{\infty} f_k\!\left(t + \frac{\tau}{2}\right) f_l^{*}\!\left(t - \frac{\tau}{2}\right) e^{-j\omega\tau}\, d\tau .

Furthermore, integration of W_f(t, ω) with respect to t gives the energy density of f(t) at frequency ω,

\int_{-\infty}^{\infty} W_f(t, \omega)\, dt = |F(\omega)|^{2} ,

and integration of W_f(t, ω) with respect to ω gives the instantaneous power at time t,

\frac{1}{2\pi} \int_{-\infty}^{\infty} W_f(t, \omega)\, d\omega = |f(t)|^{2} .
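To make these properties concrete, consider the standard worked example of a two-component signal of pure tones, f(t) = e^{j\omega_1 t} + e^{j\omega_2 t}. Since the Wigner distribution of a single tone is

W_{e^{j\omega_0 t}}(t, \omega) = \int_{-\infty}^{\infty} e^{j\omega_0 \tau} e^{-j\omega\tau}\, d\tau = 2\pi\,\delta(\omega - \omega_0),

the bilinearity relation above gives

W_f(t, \omega) = 2\pi\,\delta(\omega - \omega_1) + 2\pi\,\delta(\omega - \omega_2) + 4\pi \cos\!\big[(\omega_1 - \omega_2)t\big]\, \delta\!\left(\omega - \frac{\omega_1 + \omega_2}{2}\right).

The two auto-components are perfectly localized at ω_1 and ω_2, while the cross-component sits midway between them and oscillates in time at the difference frequency; this is exactly the interference pattern whose suppression is discussed in Section 3.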
2.2. Discrete-time Wigner Distribution

Consider a discrete-time signal f(nT) which is sampled from f(t) with a sampling period T, where t = nT and n is an integer. If T is equal to one time unit, the discrete-time signal can be simply denoted by a sequence f(n). The discrete-time Wigner distribution is then given by

W_f(n, \omega) = 2 \sum_{k=-\infty}^{\infty} f(n + k)\, f^{*}(n - k)\, e^{-j2k\omega} .

W_f(n, ω) is a real-valued function of the discrete variable n and the continuous variable ω; it is periodic in ω with its period equal to π. The sum of W_f(n, ω) over the time index n gives

\sum_{n=-\infty}^{\infty} W_f(n, \omega) = |F(\omega)|^{2} + |F(\omega + \pi)|^{2} ,   (2.10)

where F(ω) is the discrete-time Fourier transform of f(n). The instantaneous signal power is given by

|f(n)|^{2} = \frac{1}{2\pi} \int_{-\pi/2}^{\pi/2} W_f(n, \omega)\, d\omega .   (2.11)
To compute the Wigner distribution, a symmetric window function h(k) with a finite interval [-N+1, N-1] is applied to the discrete-time signal f(n), with its origin (k = 0) being placed at the time instant n,

h(k) = \begin{cases} g(k), & |k| \le N - 1 \\ 0, & \text{elsewhere} \end{cases}   (2.12)
where g(k) can be any symmetric function, for example, g(k) = 1. The kernel function used for computing the pseudo Wigner distribution is then equal to h(k) h*(-k) f(n + k) f*(n - k) within the time window of length 2(N - 1).
2.3. Discrete Pseudo Wigner Distribution
If the frequency variable ω is also discretized with ω = mΔω, where the frequency quantization Δω is equal to π/(2N - 1), then the discrete pseudo Wigner distribution W(n, m) is given by

W(n, m) = W_f(n, m\Delta\omega) = 2 \sum_{k=-N+1}^{N-1} |g(k)|^{2}\, f(n + k)\, f^{*}(n - k)\, e^{-j2km\Delta\omega} .   (2.13)
W(n, m) is a function of the discrete time n and the discrete frequency mΔω. The frequency resolution is increased by a factor of two in comparison to that of the discrete Fourier transform. In practical applications, most signals are real-valued and, with g(k) = 1, Eq. (2.13) can be simply rewritten as

W(n, m) = 2 \sum_{k=-N+1}^{N-1} f(n + k)\, f(n - k)\, e^{-j2km\Delta\omega} .   (2.14)
The discrete pseudo Wigner distribution has many useful properties, among which six are listed below [6]:

(i) W(n, m) is real-valued.

(ii) W_f(n, mΔω) is periodic in frequency with period π, i.e.

W_f(n, m\Delta\omega) = W_f(n, m\Delta\omega + \pi) .   (2.15)

This is different from the discrete-time Fourier spectrum, which has periodicity with period equal to 2π.

(iii) W_f(n, mΔω) has higher frequency resolution by a factor of two as compared to the discrete Fourier transform.

(iv) W(n, m) is a bilinear transformation with respect to f(n). If

f(n) = \sum_{k=1}^{p} f_k(n) ,   (2.16)

then

W(n, m) = \sum_{k=1}^{p} W_{f_k}(n, m) + 2\,\mathrm{Re}\!\left[\sum_{k=1}^{p-1} \sum_{l=k+1}^{p} W_{f_k f_l}(n, m)\right] .   (2.17)
(v) The sum of W(n, m) over its discrete frequency index m for one period gives the instantaneous signal power:

|f(n)|^{2} = \frac{1}{2(2N - 1)} \sum_{m=-(N-1)}^{N-1} W(n, m\Delta\omega) .   (2.18)

(vi) The sum of W_f(n, mΔω) over the time index n gives the energy density at the discrete frequency mΔω,

E(m\Delta\omega) = \sum_{n=-N+1}^{N-1} W_f(n, m\Delta\omega) = |F(m\Delta\omega)|^{2} + |F(m\Delta\omega + \pi)|^{2} ,   (2.19)

where F(mΔω) is the discrete Fourier transform of f(n). This implies that if we want to evaluate the energy density from W_f(n, mΔω), the signal should be sampled at a Nyquist frequency larger than twice the bandwidth so that there will be no aliasing problem.
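A minimal numerical sketch of Eq. (2.14) is given below; it is meant only to make the indexing explicit. The function name, the rectangular window g(k) = 1 and the synthetic test series are illustrative choices, and no attempt is made here at the efficient computation described in [11].

```python
import numpy as np

def dpwd(f, N):
    """Discrete pseudo Wigner distribution of a real sequence f (Eq. 2.14).

    Uses a rectangular window g(k) = 1 on [-N+1, N-1]; W[n, j] is returned for
    discrete frequencies m * dw, m = -(N-1)..(N-1), with dw = pi / (2N - 1).
    """
    f = np.asarray(f, dtype=float)
    L = len(f)
    ks = np.arange(-N + 1, N)        # correlation lags k
    ms = np.arange(-N + 1, N)        # discrete frequency indices m
    dw = np.pi / (2 * N - 1)         # frequency quantization
    W = np.zeros((L, len(ms)))
    for n in range(L):
        # products f(n+k) f(n-k); samples outside the record are treated as zero
        prod = np.array([f[n + k] * f[n - k]
                         if 0 <= n + k < L and 0 <= n - k < L else 0.0
                         for k in ks])
        for j, m in enumerate(ms):
            # for real f the product sequence is even in k, so the cosine sum suffices
            W[n, j] = 2.0 * np.sum(prod * np.cos(2 * ks * m * dw))
    return W, ms * dw

# Example: a short "sparsely sampled" series of 50 points, as for cortisol data
t = np.arange(50)
x = np.sin(2 * np.pi * t / 12.0) + 0.5 * np.sin(2 * np.pi * t / 5.0)
W, freqs = dpwd(x, N=25)
```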
3. Recognition and Visualization of Characteristic Patterns
A signal generally has multiple components with distinct individual time and frequency characteristics. Because the kernel function of the Wigner distribution contains the multiplication of shifted signals, this multiplication produces two types of products: auto-products resulting from the individual signal components, and cross-products resulting from interaction between different signal components. These two types of products are then transformed into the frequency domain, giving the so-called auto-components and cross-components, respectively, of the Wigner distribution. The auto-components are mainly positive, while the cross-components are oscillatory, have both positive and negative values, and each is located at the midpoint between two corresponding auto-components. Cross-components of large magnitude contribute peculiar patterns that obscure the auto-components. One would like to remove or suppress those cross-components in order to obtain a better measurement of the auto-components in the Wigner spectrum. Several methods have been developed to achieve this purpose, among which is the auto-component selection (ACS) method discussed below [10].

In the ACS method, W(n, m) is processed by two different filters. One is an averaging filter with a large support (P x Q) to filter out the oscillatory cross-components and give output G(n, m). The other is a pre-processing filter of small support (U x V) to appropriately smooth out the original discrete pseudo Wigner distribution and give output R(n, m). The ratio G(n, m)/R(n, m) is compared with a threshold value t, where t < 1. If the ratio is greater than t and, at the same time, the value of W(n, m) is positive, then the original W(n, m) is accepted as an auto-component value and is designated by S(n, m); otherwise, S(n, m) is set to zero. The resulting
distribution S(n, m) is considered to represent only the positive auto-components in the original discrete pseudo Wigner distribution. If both filters are simple averaging filters with different support sizes (P > U, Q > V), the combined action of these two filters in this selection process can be represented by a single mask of size P x Q in which all elements are equal to one, except the central U x V elements, each of which is given by D = 1 - (PQ/UV)t. For example, we may use P = Q = 7 and U = V = 3, so the filter mask is

1 1 1 1 1 1 1
1 1 1 1 1 1 1
1 1 D D D 1 1
1 1 D D D 1 1
1 1 D D D 1 1
1 1 1 1 1 1 1
1 1 1 1 1 1 1
where the value of the threshold parameter t is selected empirically. If the filter output is positive, then S(n, m) = W(n, m); otherwise, S(n, m) = 0. With its negative values clipped to zero and its positive values appropriately scaled, S(n, m) may be presented as an image for visualization of the characteristic pattern of the signal in the time-frequency plane. For quantitative analysis, however, the energy density

E(m\Delta\omega) = \sum_{n=-N+1}^{N-1} W(n, m\Delta\omega)

at various frequencies mΔω can be examined and selected as discriminatory features to be used in automatic pattern recognition. Both aspects will be illustrated in the next section.
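The selection rule can be prototyped in a few lines. The sketch below assumes simple box-average filters for both supports (as in the 7 x 7 / 3 x 3 example above) and uses the positive-output form of the criterion, G(n, m) - t R(n, m) > 0; function names and defaults are illustrative only.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def acs(W, P=7, U=3, t=0.75):
    """Auto-component selection: keep W(n, m) only where the large-support
    average G exceeds t times the small-support average R (positive filter
    output, i.e. G - t*R > 0) and W itself is positive."""
    G = uniform_filter(W, size=P, mode="nearest")   # averaging filter, large support (P x P)
    R = uniform_filter(W, size=U, mode="nearest")   # pre-processing filter, small support (U x U)
    S = np.where((G - t * R > 0) & (W > 0), W, 0.0)
    return S

def to_8bit_image(S):
    """Clip negative values to zero and scale to 8 bits for display."""
    S = np.clip(S, 0.0, None)
    return np.uint8(255.0 * S / S.max()) if S.max() > 0 else S.astype(np.uint8)
```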
4. Pattern Recognition of Plasma Cortisol Signal

As an example, let us consider the application of the method described above to the problem of pattern recognition of plasma cortisol data. The circadian variation of cortisol concentration in peripheral blood is believed to manifest normality or abnormality in regard to a disease called Cushing's syndrome. There are three categories of the disease, each associated with a different tumor location: in the pituitary, in the adrenal, or elsewhere. They are denoted by "pituitary", "adrenal", and "ectopic", respectively. Although CT or MRI scans are routinely used to detect such tumors, they can be missed in the examination due to their small size, especially during the early stages. It would therefore be desirable to detect the disease and recognize the disease class from the circadian cortisol pattern, so as to infer the tumor location prior to confirmation by CT or MRI examination and surgical operation. Blood samples can be drawn and the cortisol concentration measured every half hour over a period of 25 to 28 hours, providing 50 to 56 data points in each measured cortisol signal. Such signals are sparsely sampled short time series. The short-time Fourier transform and Karhunen-Loeve expansion were applied to these cortisol time series, of both normal subjects and patients with Cushing's syndrome, in order to extract discriminatory features, and an automatic pattern recognition system was developed to recognize a cortisol pattern as normal or abnormal and, in the case of the latter, to identify the category of the disease [3].
Recently, we also applied the discrete pseudo Wigner distribution to the cortisol time series for their pattern recognition [15,16]. Altogether a set of 90 cortisol time series, including 41 normal subjects, 28 "pituitary", 12 "adrenal" and 9 "ectopic", were processed. The results are summarized below as an illustration. W(n, m) was computed from each cortisol time series. We chose N = 25 and, hence, the observation window length was 48 and the frequency quantization was Δω = π/49. The auto-component selection was performed with the threshold t empirically set at 0.75. After clipping negative values to zero and scaling the magnitude to within 8 bits, the resulting auto-component of the discrete pseudo Wigner distribution, S(n, m), can be presented as images in the time-frequency plane. For the eight example cortisol time series given in Fig. 1, the corresponding Wigner distributions S(n, m), clipped and scaled, are shown in Fig. 2. In each image, the horizontal axis represents the time index n (n = 0, 1, 2, ..., 60), the vertical axis represents the frequency mΔω (m = -18, -17, ..., -1, 0, 1, ..., 17, 18), and a darker region indicates a larger magnitude of S(n, m). In Fig. 2, from top to bottom, each pair of images is respectively a normal, "pituitary", "adrenal" and "ectopic" spectral pattern. It is interesting to note that they show similar patterns for cortisol time series of the same category, and distinct patterns for different categories. They provide a good visualization potential for physicians to consider.
Fig. 1. Eight cortisol time series (cortisol concentration versus clock time) of normal subjects and patients with Cushing's syndrome; from top to bottom, two in each category: normal, "pituitary", "adrenal" and "ectopic". (From Li et al. [16], Copyright © 1990 New York University, reprinted by permission of New York University Press.)
Fig. 2. Wigner distributions of the eight cortisol time series shown in Fig. 1, presented here as images in the time-frequency plane with negative values clipped to zero and positive values scaled to within 255; from top to bottom, two in each category: normal, "pituitary", "adrenal" and "ectopic". (From Li et al. [16], Copyright © 1990 New York University, reprinted by permission of New York University Press.)
Examining these time-frequency characteristics, one can find that the major differences appear in the central portion of the time-frequency domain (n = 13 to 37), where W(n, m) is most reliably computed from the summation of all 49 non-zero products of data points. This supports the observation that W(n, m) carries the most significant intensity information in the time interval from n = 13 to n = 37. Let us examine the energy density profile along the frequency axis and compute the essential energy density at mΔω by summing up W(n, m) over the time index n from 13 to 37,

E_n(m) = \sum_{n=13}^{37} W(n, m) .   (4.1)
These En(m)’sare examined for selection of discriminating features. The normal patterns and Cushing’s syndrome patterns can be distinguished by using two features: E,(O) and En(4).Their distributions are shown in Fig. 3. Among the Cushing’s syndrome patterns, “adrenal” and “ectopic” categories can also be differentiated by using these two features. E,(O), En(2)and En(3) were selected for discriminating “pituitary” from ‘Ladrenay’.“Pituitary” and “ectopic” categories can Altogether, six be classified by using four features: E,(O), En(3), En(7) and En(8). spectral features were selected for automatic pattern recognition of cortisol signals. By using a similar structure as the one used in [3], another pattern recognition system shown in Fig. 4 was trained with 100% accuracy. The weight vectors W l ,
Fig. 3. Distributions of 41 normal patterns and 49 Cushing's syndrome patterns in the E_n(0)-E_n(4) feature space (triangle: normal; circle: Cushing's syndrome).
W_2, W_3 and W_4 of the component classifiers in the system are given in Table 1, where the last component in each weight vector is the threshold weight. Linear decision functions d_i = (y_i, 1) W_i, (i = 1, 2, 3, 4), are used in the system. Joint decisions assign an abnormal category; for example, "pituitary" is classified when d_1 < 0, d_2 > 0 and d_3 > 0.
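For concreteness, the decision stage can be written out as follows; the weight vectors are taken from Table 1 and the composition of each feature vector from the text above, while the array indexing, the function name and any sign conventions other than the stated "pituitary" rule are illustrative assumptions only.

```python
import numpy as np

# Augmented weight vectors from Table 1 (last component = threshold weight).
W1 = np.array([-0.3345, -5.5846, 9994.5571])                    # normal / patient
W2 = np.array([0.0745, 3.1639, -2.5136, -9401.63])              # "pituitary" / "adrenal"
W3 = np.array([-0.0680, -0.6749, -1.5909, 2.3334, 41225.1457])  # "pituitary" / "ectopic"
W4 = np.array([-0.0095, 0.1780, 9206.8272])                     # "adrenal" / "ectopic"

def decision_functions(En):
    """Linear decision functions d_i = (y_i, 1) . W_i, where En[m] is the
    essential energy density E_n(m) of Eq. (4.1) for m >= 0."""
    d1 = np.dot(np.append([En[0], En[4]], 1.0), W1)
    d2 = np.dot(np.append([En[0], En[2], En[3]], 1.0), W2)
    d3 = np.dot(np.append([En[0], En[3], En[7], En[8]], 1.0), W3)
    d4 = np.dot(np.append([En[0], En[4]], 1.0), W4)
    return d1, d2, d3, d4

# Joint decision: the text specifies that "pituitary" is assigned when
# d1 < 0, d2 > 0 and d3 > 0; the remaining categories follow analogously.
```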
5. Summary
In summary, we have shown that the discrete pseudo Wigner distribution provides an effective method for analyzing sparsely sampled biomedical signals. On the one hand, its presentation in the time-frequency domain, after appropriate post-processing, may provide a means of pattern visualization for physicians. On the other hand, spectral information can be obtained with increased frequency resolution, and thus contributes to effective and efficient feature extraction for automatic pattern recognition.
Table 1. Augmented weight vectors for the cortisol pattern recognition system shown in Fig. 4.

Classifier                     Augmented weight vector
Normal / Patient               (-0.3345, -5.5846, 9994.5571)
"pituitary" / "adrenal"        (0.0745, 3.1639, -2.5136, -9401.63)
"pituitary" / "ectopic"        (-0.0680, -0.6749, -1.5909, 2.3334, 41225.1457)
"adrenal" / "ectopic"          (-0.0095, 0.1780, 9206.8272)
Fig. 4. Block diagram of a pattern recognition system for cortisol time series using spectral features extracted from the discrete pseudo Wigner distribution.
During the past several years, new distributions have been developed for further reducing the interference of cross-components, for example, the reduced interference distribution (RID) of Jeong and Williams [17,18], which has been tested on various electrophysiological and bioacoustic signals. Alternatively, the wavelet transform provides a new approach to signal analysis with both time and frequency localization properties [19-22]. One may choose compactly supported wavelets and their corresponding scaling functions which are suitable for representing the sparsely sampled, short time series encountered in biomedicine. The characteristic patterns
of signals can be visualized in the time-scale plane, and wavelet features can be used in an artificial neural network for pattern recognition.

References

[1] R. G. Shiavi and J. R. Bourne, Methods of biological signal processing, in T. Y. Young and K. S. Fu (eds.), Handbook of Pattern Recognition and Image Processing (Academic Press, New York, 1986) 545-568.
[2] N. V. Thakor (guest ed.), Biomedical Signal Processing, IEEE Engineering in Medicine and Biology Magazine 9, March (1990).
[3] A. H. Vagnucci, T. P. Wang, V. Pratt and C. C. Li, Classification of plasma cortisol patterns in normal subjects and in Cushing's syndrome, IEEE Trans. Biomed. Eng. 38 (1991) 113-125.
[4] E. P. Wigner, On the quantum correction for thermodynamic equilibrium, Phys. Rev. 40 (1932) 749-759.
[5] J. Ville, Theorie et applications de la notion de signal analytique, Cables et Transmission 2A (1948) 61-74.
[6] T. A. C. M. Claasen and W. F. G. Mecklenbrauker, The Wigner distribution - A tool for time-frequency signal analysis, Part I, Part II, Part III, Philips J. Res. 35 (1980) 217-250, 276-300, 372-389.
[7] G. F. Boudreaux-Bartels, Time-frequency Signal Processing Algorithms: Analysis and Synthesis Using Wigner Distributions, Ph.D. Thesis, Rice University, 1984.
[8] W. Martin and P. Flandrin, Wigner-Ville spectral analysis of nonstationary processes, IEEE Trans. Acoust. Speech Signal Process. 33 (1985) 1461-1470.
[9] J. C. Andrieux, M. R. Feix, G. Mourgues, P. Bertrand, B. Izrar and V. T. Nguyen, Optimum smoothing of the Wigner-Ville distribution, IEEE Trans. Acoust. Speech Signal Process. 35 (1987) 764-769.
[10] M. Sun, The Discrete Pseudo Wigner Distribution: Efficient Computation and Cross-Component Elimination, Ph.D. Thesis, University of Pittsburgh, 1989.
[11] M. Sun, C. C. Li, L. N. Sekhar and R. J. Sclabassi, Efficient computation of discrete pseudo Wigner distribution, IEEE Trans. Acoust. Speech Signal Process. 37 (1989) 1735-1742.
[12] R. M. S. S. Abeysekera, Time-frequency domain features of ECG signals: An interpretation and their application in computer aided diagnoses, Ph.D. Thesis, University of Queensland, Australia, 1989.
[13] S. Usui and H. Araki, Wigner distribution analysis of BSPM for optimal sampling, IEEE Engineering in Medicine and Biology Magazine 9, March (1990) 29-32.
[14] F. Hlawatsch and G. F. Boudreaux-Bartels, Linear and quadratic time-frequency signal representations, IEEE Signal Processing Magazine 9, April (1992) 21-68.
[15] T. P. Wang, M. Sun, C. C. Li and A. H. Vagnucci, Classification of abnormal cortisol patterns by features from Wigner spectra, in Proc. 10th Int. Conf. on Pattern Recognition, Atlantic City, NJ, June 1990, 228-230.
[16] C. C. Li, A. H. Vagnucci, T. P. Wang and M. Sun, Pseudo Wigner distribution for processing short-time biological signals, in D. C. Mikulecky and A. M. Clarke (eds.), Biomedical Engineering: Opening New Doors, Proc. 1990 Annual Fall Meeting of the Biomedical Engineering Soc. (New York University Press, New York, 1990) 191-200.
[17] J. Jeong and W. J. Williams, Kernel design for reduced interference distributions, IEEE Trans. Signal Process. 40 (1992) 402-412.
[18] W. J. Williams, Reduced interference distributions: biological applications and interpretations, Proc. IEEE 84 (1996) 1264-1280.
[19] Y. T. Chan, Wavelet Basics (Kluwer Academic Publishers, Boston, MA, 1995).
[20] M. Akay (guest ed.), Special Issue: Wavelet Transforms in Biomedical Engineering, Annals of Biomedical Engineering 23, September/October (1995).
[21] A. Aldroubi and M. Unser (eds.), Wavelets in Medicine and Biology (CRC Press, Boca Raton, FL, 1996).
[22] M. Unser and A. Aldroubi, A review of wavelets in biomedical applications, Proc. IEEE 84 (1996) 626-638.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 625-666 Eds. C. H. Chen, L. F. Pau and P. S. P. Wang © 1998 World Scientific Publishing Company
CHAPTER 3.8
PATTERN RECOGNITION AND COMPUTER VISION FOR GEOGRAPHIC DATA ANALYSIS
F. CAVAYAS
Department of Geography, Université de Montréal, Montreal, P.O. Box 6128, Station Centre-Ville, Quebec, Canada, H3C 3J7

Y. BAUDOUIN
Department of Geography, Université du Québec à Montréal, Montreal, P.O. Box 8888, Station Centre-Ville, Quebec, Canada, H3C 3P8

This chapter provides a summary of recent developments in two connected fields: remotely sensed image analysis and geographic (map) data analysis, with emphasis on spatial pattern extraction and description as well as on segmentation and categorisation of the geographic space. Basic functions of Geographic Information Systems (GIS) and Remotely Sensed Image Analysis Systems (RSIAS) are first reviewed. Then the application of pattern recognition, computer vision and spatial analysis principles to image data analysis and map data analysis is examined. Finally, examples are given to illustrate the possibilities of achieving greater synergy in the use of both image and map data sets. Such use will be essential to fully exploit the images which will be provided by new satellite sensors with spatial resolution ranging from a meter to a kilometer. The described analytic tools could constitute the core of integrated RSIAS and GIS systems.

Keywords: Geographic Data, Gathering of Geographic Data, Geographic Information Systems, Remotely Sensed Image Analysis Systems, Geographic and Image Databases, Models for Geographic Data Representation, Geographic Knowledge Bases, Models for Knowledge Representation, Information Extraction Approaches from Remotely Sensed Data, Image Attributes and Extraction, Information Extraction Approaches from Map Data, Map Attributes and Extraction, Map-Guided Analysis of Remotely Sensed Image Data.
1. Introduction
Geographic data are facts about objects and phenomena occurring or manifested at various locations on or near the surface of the earth at a given point in time. Thus, assuming time as a constant, geographic data comprise two distinct components: (a) data referring to the measured categories or magnitudes of geographic features, and (b) data describing the geographic location where the measurements have been collected. The former are qualified as the nonlocational or aspatial attributes, the latter as the locational or spatial attributes of a geographic datum. This particularity of geographic data permits a number of analyses involving: (a) only locational attributes, for example, distance measures
for proximity analyses; (b) only nonlocational attributes, for example, identification of geographic patterns using multivariate analyses; (c) locational and nonlocational attributes together, for example, inference of the magnitude of a phenomenon at a given location using measurements made at sampling locations; and (d) combined analysis of locational and various nonlocational attributes, for example, studying the spatial correlation between the education level and the criminality rate in various city districts. Geographic data are gathered using various methods and techniques (Table 1). Topographic mapping and earth resources inventory and mapping constitute one of the major activities in various geosciences and engineering disciplines. Combined analysis of map data sets at different scales is essential in modeling and understanding processes related to the Earth's physical environment, managing earth resources, responding to urgent situations caused by natural catastrophes or human activities, etc. Spatial analyses and modeling based on socioeconomic and population data are major study subjects in disciplines related to human geography, such as geodemographics and geoepidemics. Combined analyses of data related to the physical and human components of the environment are essential for environmental health diagnosis, studying the capacity of the environment to support socioeconomic activities, making spatial decisions, etc. As many geographic phenomena change rapidly in space and time, data collection and map updating have to be carried out on a regular basis. All these activities involving geographic data sets are qualified as the four Ms: measurement, mapping, monitoring, and modeling [1]. The term Geographic Information Systems (GIS) is usually employed, especially since the 90's, to describe computerized systems dedicated to the analysis of various geographic data sets, stored in the form of digital maps, and the extraction and distribution of geographic information [2]. Various tools have been developed allowing localized searches in geographic databases and information extraction, cartometric operations, map overlay, and combined data set analysis. Modern GIS, equipped with powerful computer graphics routines, permit geographic information
Table 1. Sources of geographic data.

Type of geographic data                                               Acquisition methods and techniques
Location data (X, Y, Z) and geometric attributes of spatial objects  Geodetic surveys, Global Positioning Systems (GPS), photogrammetry, remote sensing
Biophysical data                                                      Ground surveys, laboratory measurements, photointerpretation, remote sensing, geophysical surveys and sensors, sonars, climatologic and meteorological stations, air pollution stations
Socioeconomic and population data                                     Census, interviews, origin-destination surveys, land use inventories, cadastre records
visualization and map edition almost instantaneously. Spatial data analysis methods developed in geography, the geosciences and engineering disciplines [2-5] are being progressively introduced into GIS, allowing spatial pattern detection, analysis and description [6]. More specialized GIS have also been developed allowing full 3-D representation of the geographic space [7], and new approaches are being explored for representing and analyzing the evolution of phenomena in space and time [8]. GIS are now considered standard technology for geographic data analysis and mapping [9]. Pattern recognition and computer vision were introduced in various geosciences and engineering disciplines in the early 70's as a means of automatically extracting pertinent geographic information from remote sensing imagery [10]. A large body of literature exists on this subject, and many chapters in this handbook cover various topics on image analysis in general, and remotely sensed image analysis in particular. Powerful remotely sensed image analysis systems (RSIAS), equipped with image processing and primarily statistical pattern recognition capabilities, are now available. Remotely sensed image understanding systems are, however, still in the experimental phase. Since the 70's several authors have been pleading for the integration of RSIAS and cartographic systems or GIS for versatile automatic cartography and digital map updating [11-14]. Other authors emphasize the importance of integrated RSIAS and GIS systems in application areas such as image classification, development of calibration models for transforming satellite-measured radiances to biophysical terrain parameters at various image resolutions, and development of environment models taking into account space-time interactions and scaling properties of terrain variables [15]. The aim of this chapter is to review concepts and methods in pattern recognition and computer vision for image analysis, as well as in spatial pattern analysis as developed with conventional geographic data. In our view, these could constitute the core of analysis methods in integrated RSIAS/GIS oriented to landscape understanding and monitoring. Our point of view is that pattern recognition and computer vision principles, enriched with spatial analysis principles, can be extended to conventional geographic data sets. It is therefore possible to build fast and efficient integrated image and map analysis systems for the extraction of pertinent and useful geographic information. The need for such systems was already established in the 80's. For instance, Davis [4] (p. 449), in discussing methods for the comparison of different maps for the extraction of geologic information, wrote in 1986: "The subject of map comparisons will become increasingly important in the future, because interpreting the voluminous data for Earth-sensing satellites will require development of automatic pattern recognizers and map analyzers". After exposing his ideas for the development of such tools, essentially those of knowledge-based image understanding systems, he concluded (p. 450): "If this is not done, we will literally be buried under the reams of charts, maps, and photographs returned from the resources survey satellites, orbiting geophysical platforms, and other exotic tools of the future". This data glut will become
a reality in the very near future with the launching of many different satellites with various types of imaging sensors and spatial resolutions, ranging from a meter to a kilometer [15,16]. For example, the NASA Earth Observing System (EOS) of orbiting satellites alone is projected to generate some 50 gigabytes/hour of remotely sensed image data when it becomes operational at the end of the century [17]. Following a brief examination in the next section of the various approaches proposed for the integration of RSIAS and GIS technologies in terms of analysis methods, we will review: (1) the structures of databases and knowledge bases used in the context of GIS and RSIAS (Part I); (2) the major elements of image and spatial data analysis used for the segmentation and categorization of geographic space (Part II); and (3) the methods used for combined image and map data analysis (Part III).

1.1. Convergence of IAS and GIS Technologies in Terms of Analysis Methods
Remotely sensed data in their various forms are a major source of geographic information on the Earth's natural and cultural environment. For many third-world countries remote sensing imagery constitutes the key data source to complete the reference map cover or to update map cover dating often from the Second World War and even from the period of colonization. The following table presents a synthesis from the United Nations cartographic inventory (1991) [18], clearly revealing the disparities around the world in reference map cover, costs associated with map production, and total investment in map data production.
Table 2. Cover of topographic mapping (%) and associated costs (1991) [18].

Continent               Area (Mkm²)   Class I        Class II       Class III       Class IV        Cost*        Total**
                                      (≈1:25 000)    (≈1:50 000)    (≈1:100 000)    (≈1:250 000)    ($US/km²)    (M$US)
Africa                  30.319        2.5            34.5           19.5            86.6            0.158        83.985
Asia                    27.693        13.9           68.4           62.1            83.7            0.118        374.465
Australia and Oceania   8.505         18.3           22.8           54.4            82.9            2.830        57.355
Europe                  4.939         83.4           96.2           78.5            90.9            2.692        691.922
North America           24.256        37.0           71.1           37.1            99.2            0.638        249.991
South America           17.837        6.7            29.8           53.4            77.6            0.148        40.830
(ex-) USSR              22.402        100.0          100.0          100.0           100.0           0.528        150.000

*Cost of map production. **Total investment in cartography.
For countries with a long tradition in cartography, GIS were developed as an extension of the computerized cartographic systems whose development dates from the early 60's. Data sets in such systems were obtained from sources other than remote sensing imagery and very often by digitizing existing analog maps. Therefore, in the GIS literature the most widespread idea concerning the role of remote sensing imagery is that of an external source, sometimes considered as "ancillary" [19], providing specific data sets for input to an autonomous GIS (e.g. land cover, natural vegetation mapping). Conversely, in the RSIAS literature, digital cartographic and geographic data are often treated as "ancillary" data sets. However, their use in practice over the past 30 years has been extensive, from almost all the steps of image processing and analysis for increasing the performance of statistical pattern recognition methods to building experimental image understanding systems (Table 3) [15].
Table 3. Use of geographic data in image processing and analysis (modified from Davis and Simonett [15]).

Type of map data           Image processing operation               Scope
Digital elevation model    Radiometric correction                   Surface illumination and reflectance directionality effects
Maps                       Radiometric correction                   Selection of surface invariant targets for atmospheric correction
Maps                       Training a classifier                    Selection of sites for supervised spectral classification
Digital elevation model    Per pixel classification                 Creation of elevation masks to account for elevation zonation of vegetation types, etc., for spectral classification
Maps                       Per region statistical classification    Stratification of satellite images into more homogeneous and statistically stationary subregions
Digital elevation model    Per pixel image classification           Introduction of geomorphometric variables in spectral classification
Maps                       Validation of image classification       Location of field sites for classification accuracy assessment
Maps                       Image understanding                      Map-guided segmentation, classification and change detection
Map-guided image analysis is one of the techniques listed in Table 3. This approach, which requires a more synergetic use of map and image data than the others, is representative of what an integrated RSIAS/GIS should be in practice. Employing map data as a guide permits a fast and efficient pattern search in the image space, change detection, and consequently, database updating [11,12,20-22]. Another advantage of this approach is the improvement of geographic database accuracy concerning spatial features visible in both maps and images [20,23]. By extending the principles of map-guided analysis to dynamic situations, continuous monitoring of geographic features could thus be greatly facilitated. For the more technologically advanced countries, where basic reference map cover has almost been accomplished (Table 2), geographic database updating, earth surface monitoring, and the modeling of spatially dynamic processes will be the principal focus in GIS. However, for these activities remote sensing imagery becomes the key data source. Only computerized systems endowed with analytic tools permitting the synergetic exploitation of various geographic data sets and remotely sensed image data will be able to provide information and knowledge on the spatiotemporal evolution of the Earth's environment [15]. Geographic knowledge represented in knowledge bases will be important in building image and map understanding systems. The same or similar approaches have been adopted with respect to the use of analytic tools in the fields of RSIAS and GIS. For instance, multivariate analyses of geographic data such as discriminant functions and clustering, long used by natural taxonomists and geographers [6,15], are also applied in statistical pattern recognition, which in fact introduced a new concept - the capacity of a machine to learn how to assign unknown patterns to different classes. Thus, discriminant functions are used by supervised-learning machines, and clustering is commonly employed by unsupervised-learning machines. Many techniques used in image understanding systems to characterize the shape and boundaries of objects are similar to those employed in a GIS context with map data, especially for the automatic generalization of the geometry of spatial entities [24]. Finally, knowledge-based approaches to image interpretation share many similarities with knowledge-based interpretation of geographic data sets. Such approaches are applied in GIS, for example, in the automatic extraction of information from geographic data [25], or in the interpretation of geographic data sets for prescriptive or predictive spatial modeling (site selection, environmental susceptibility, mineral potential evaluation, etc.) [6]. However, examples can be found in the literature where image analysis principles are introduced in the analysis of geographic data, and inversely, spatial data analysis principles are introduced in image analysis. Typical examples are the identification of various geomorphic classes using digital elevation models and syntactic pattern recognition [26]; the categorization of the geographic space according to ecological units using various geographic data sets and neural networks [27]; the identification of the type of drainage network (dendritic, trellis, radial, etc.) as depicted in maps using knowledge-based pattern recognition approaches [28]; the introduction of nearest neighbor analysis of point patterns in order to automatically detect complex types
of forest cuts on satellite imagery [21]; and the introduction of variograms [5] in order to extract useful structural parameters of forest stands [29] or ocean patterns [30] from airborne or spaceborne images. Other concepts such as fractals and scale are increasingly used to understand image texture [31], relations between topographic relief and image tone variability [32], or for multi-resolution analysis of complex scenes [33]. The combined analyses of image data and geographic data (Table 3), and the use of similar approaches for the analysis of either image data or geographic data, may be considered representative examples of an emerging collection of methodologies, technologies and systems for the analysis of multi-source geographic data. This tendency is also observed in other scientific and engineering disciplines dealing with the analysis of large databases; this emerging field is referred to as "data mining and knowledge discovery" [34].
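As a pointer to how such spatial-analysis tools look in practice, the sketch below computes an empirical semivariogram along a one-dimensional image transect; the function name, the lag range and the toy transect are illustrative choices, and structural parameters such as the range and sill used to characterize forest-stand or ocean texture would be estimated from this kind of curve.

```python
import numpy as np

def empirical_semivariogram(z, max_lag):
    """Empirical semivariogram gamma(h) = 0.5 * mean[(z(x + h) - z(x))^2]
    along a 1-D transect of pixel values z, for lags h = 1..max_lag."""
    z = np.asarray(z, dtype=float)
    lags = np.arange(1, max_lag + 1)
    gamma = np.array([0.5 * np.mean((z[h:] - z[:-h]) ** 2) for h in lags])
    return lags, gamma

# Example: a transect of gray levels extracted from one image line
transect = np.array([52, 55, 61, 60, 58, 63, 70, 72, 69, 66, 64, 60], dtype=float)
lags, gamma = empirical_semivariogram(transect, max_lag=5)
```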
PART I. GEOGRAPHIC DATABASES AND KNOWLEDGE BASES

Various proposals exist for the realization of integrated RSIAS and GIS systems [12,35,36]. For some authors this could be achieved by defining "transparent" links for data transfer from one system to another or by designing entirely new system architectures. However, in order for these integrated systems to be realized, they will have to preserve the advantages offered by the two basic models of geographic data representation - the vector and the raster models [37]. Current GIS and RSIAS database structures are reviewed next, along with the models used to represent geographic knowledge in advanced RSIAS and GIS.
2.1. Geographic and Image Databases
A GIS database can be considered a computer-readable realization of a multifaceted model of the geographic reality as conceived in various disciplines. Special data models were developed in order to accommodate the particularity of geographic data with their two indissociable components, locational and nonlocational attributes. These models differ primarily in the way the geographic space is considered, and consequently, in the manner in which spatial features and their inter-relationships are represented. The vector model is the most widely used. The geographic space is viewed as a continuum of points with arbitrary-precision coordinates according to a cartographic system. As mentioned, only some specialized GIS afford full 3-D representation of the earth's surface. Following the standard cartographic model, geographic features are considered according to the map scale as zero-extension entities (point data), 1-D entities (lines), or 2-D entities (areas). Linear and areal entities are therefore represented by a set of points in order to reproduce to scale their extent and shape. Because GIS are developed in order to respond to specific needs, a variety of vector database structures is presently in use. As a detailed examination is beyond the scope of this chapter, only a brief review is presented here. The interested reader is referred to the various books on GIS listed in the references section.
Roughly, there are two types of vector databases: those following the CAD (Computer-Aided Design) model and those following the topologic model. CAD systems handle geographic data sets as separate data layers according to the kind of spatial entity (road network, hydrographic network, buildings, land use, etc.). Relations between the layers and between the features in each layer are not represented; such systems are therefore merely graphic systems with limited capabilities in spatial data analysis. Topological models, as their name implies, focus on the relations between spatial entities using network representations (nodes and arcs). Nonlocational and locational attributes can be found physically in the same file (as in the CAD model) or in separate files (as with some topological models). In the latter case each spatial entity has its own identifier, used as the link between spatial and aspatial attributes. When separate files are used, nonlocational attributes can be logically organized as in standard computer data models. The relational model is usually employed in modern GIS (Fig. 1). With the advent of the object-oriented database model, the possibilities of creating unified geographic data representations, while preserving the properties of standard computer models such as the relational one, are now being explored [38].

Fig. 1. Schematic representation of relational geographic databases.
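A minimal sketch of such an identifier-based link between locational and aspatial attributes is shown below; the table and column names are invented for illustration, and an in-memory SQLite database stands in for the GIS back end.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Locational and aspatial attributes stored in separate tables, linked by the
# spatial entity identifier, in the spirit of the relational model of Fig. 1.
con.execute("CREATE TABLE geometry (entity_id INTEGER PRIMARY KEY, vertices TEXT)")
con.execute("CREATE TABLE attributes (entity_id INTEGER, land_use TEXT, area_ha REAL)")
con.execute("INSERT INTO geometry VALUES (1, '(302110 5040220), (302180 5040260), ...')")
con.execute("INSERT INTO attributes VALUES (1, 'residential', 12.4)")

rows = con.execute(
    "SELECT g.vertices, a.land_use, a.area_ha "
    "FROM geometry AS g JOIN attributes AS a ON g.entity_id = a.entity_id"
).fetchall()
```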
Geographic space may also be viewed as a discrete set of points called a two-dimensional lattice. Each point in the lattice can be considered as the center of a uniform plane geometric figure, for example, a square or a hexagon. The pattern of these elementary figures repeated over space is referred to as tessellation. Square tessellation is most commonly used. The size on the ground of this elementary figure defines the spatial resolution of the raster model. Unlike the vector model, the representation of the geographic space by a regular array of atomic units eliminates the need to explicitly define the location attributes. Only the coordinates of the unit of origin and the size of each unit are required to define the coordinates of any unit within the array. However, topological relationships between spatial entities cannot be represented in such databases. With respect to the representation of geographic entities, the form of the lattice and the spatial resolution play a role analogous to that of scale in vector models. At small scales for the vector model and low spatial resolution in the raster model, the problems of abstraction, generalization and aggregation of spatial entities constitute major issues in cartography and GIS [38]. It is evident, however, that the raster model is far less effective than the vector model for the representation of punctual and linear entities, but it is well adapted to the representation of areal entities and spatially continuous phenomena, such as surface elevation and temperature. The representation of these phenomena using vector models is done by isolines or Triangulated Irregular Networks (TIN) [39]. Contrary to the vector model, where an aspatial attribute characterizes an entire spatial entity and each entity may have many aspatial attributes, the raster model requires the definition of the aspatial attribute for each point in the lattice, and there exist as many arrays as there are aspatial attributes in the geographic data set. The multiplication of arrays to accommodate the representation of large geographic data sets underscores the problem of storage memory. Various techniques have been proposed in order to compress raster data. The most commonly used are the run-length encoding technique and the variable spatial resolution array, termed the quad-tree [1,38], which is more appropriate for areal thematic data sets (land cover, vegetation maps, etc.).
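For orientation, run-length encoding of one raster row can be sketched as follows; the function and the toy land-cover row are illustrative only. Large homogeneous areas, which are typical of thematic rasters, compress into a handful of (value, run length) pairs.

```python
def run_length_encode(row):
    """Run-length encoding of one raster row as (value, run length) pairs."""
    runs, count = [], 1
    for prev, cur in zip(row, row[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append((prev, count))
            count = 1
    runs.append((row[-1], count))
    return runs

row = ["water"] * 6 + ["forest"] * 10 + ["urban"] * 4
print(run_length_encode(row))   # [('water', 6), ('forest', 10), ('urban', 4)]
```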
Image databases follow the structure of raster databases. To preserve the image data variability, compression methods such as run-length encoding or the DPCM (Differential Pulse Code Modulation) technique are usually employed [38]. The intelligent indexing and automatic retrieval of images according to a specific query in large image databases remains an open question [40]. This problem is far less acute in large vector databases, given that geographic data have a definite meaning and can be directly treated as symbols [41]. The hierarchical representation of the geographic space at various scales, either as a pyramid raster structure (multiple spatial resolution model) or a quad-tree raster structure (variable spatial resolution model) [1], gave rise to new ideas with respect to the organization of global geographic databases. For example, a modified quad-tree structure was proposed [42] with a single node at the top of the hierarchy representing the entire earth, and subsequent levels of the hierarchy representing earthly features at progressively finer spatial resolution. The finest spatial resolution, at the 30th level of the hierarchy, is at a centimeter scale. Such structures are of particular interest for integrated RSIAS/GIS systems dedicated to the modeling of spatial processes based on multi-resolution imagery (Fig. 2).

Fig. 2. Multi-resolution representation of the earthly features (panels: Province of Quebec, Montreal, Zone 34, Block 22).

However, the integration of geographic data from various sources into a common database, either vector or raster, poses important problems related to data quality and coherence. These problems are more apparent in GIS applications based on the overlay or combined analysis of various thematic data layers, created by digitizing analog maps of variable quality, and compiled at different times and cartographic scales. The propagation of errors and map-scale incompatibilities give rise to uncertain and sometimes inconsistent results [27], indicating the need for new methods of evaluating geographic database quality [38]. The systematic use of metadata files describing the characteristics of geographic digital data distributed by various organizations (data sources, acquisition date, cartographic projection, file format, expected accuracy, etc.) [38] is an important requirement to ensure data quality, coherence and consistency in GIS. Similar data incompatibility and variable quality problems have also been noted in experiments seeking to develop integrated GIS and RSIAS. It was observed that satellite imagery of relatively high resolution, such as Landsat and SPOT, is geometrically more stable when corrected to match a cartographic projection than corresponding digital maps created by aerotriangulation and restitution of aerial stereo photographs [14,20,43]. A thorough discussion and illustration of such problems, with particular emphasis on integrated systems for forest resources management, can be found in Goodenough [14]. Part III of this chapter discusses methods for matching map and image data.

2.2. Geographic Knowledge Bases
As mentioned in Section 1.1, attempts have been made to introduce geographic knowledge about earthly features and phenomena into the analysis of image data. In the majority of cases this knowledge is used indirectly in the form of geographic data (Table 4). Thus, thematic maps and digital elevation models have been used: (a) to stratify the image space into meaningful zones for more accurate image classification; (b) as additional image channels in multispectral classification; and (c) to facilitate the location on images of meaningful geographic features, or to generate hypotheses about their possible location (map-guided segmentation [44], road network tracking and new road detection [20,43], etc.). Attempts have also been made to directly apply geographic knowledge, explicitly represented in knowledge bases, to problem solving, using Artificial Intelligence (AI) methods. Various knowledge-based vision
Table 4. Examples of geographic knowledge introduced in knowledge-based vision systems.

Type of geographic knowledge                                    Modeled properties
Physiographic sections, landforms and geographic features [46]  Typical landforms; expected features in each landform; likelihood of occurrence of features in each landform
Forest change characteristics [46]
Forest stand architecture [21]                                   Description of the stand structure as seen from the vertical
Co-occurrence of land uses in urban areas [22]                   Repulsion-attraction relationships (see Fig. 4)
Urban features properties [22,47]                                Table of properties (object making shadow, expected height, natural color)
Knowledge on maps and databases [46]
Table 5. Models for knowledge representation (primary source: Patterson [48]).

Knowledge representation        Knowledge modeling                                                Structure

Formalized symbolic logic
  Proposition logic             Symbolic representation of verbalized concepts, relationships,    Facts, rules
                                heuristics, etc. as pieces of independent knowledge which,
                                when appropriately chained, permit inferences
  Predicate calculus                                                                               Predicates, functions, variables, quantifiers

Structured representation
  Semantic networks             Graphical representation of knowledge                              Nodes (concepts), directed links (IS-A, AKO, ..., relations)
  Frames                        Stereotyped representation of knowledge                            Class frames, object frames, slots, facets, values, frame networks
  Object-oriented               Extended frame-like representation                                 Objects, classes, messages, methods
systems have been proposed to accomplish tasks such as land cover mapping, terrain analysis, and change detection [45]. Table 4 presents examples of geographic knowledge which have been included in proposed knowledge-based vision systems.

Fig. 3. Typical information tree for earth resources classes (nodes include "Earth surface features", "Grassland" and "Brushland"): only a natural vegetation "tree" is spanned to a level of the hierarchy compatible with regional-scale studies (adapted from Swain and Davis [49]).

This knowledge could be applied to
problem solving in various ways: (a) to translate the analyst's objective in terms of meaningful geographic features to be searched for on an image and into interpretation strategies to achieve that objective; (b) to generate hypotheses about the features possibly present in a particular image and, associated with knowledge in remote sensing, to determine expectations regarding the image properties of particular geographic features; and (c) to help resolve conflicts relating to image interpretation. Among the knowledge representation models (Table 5), formal logic representation is the most frequently used [45]. Heuristic rules can adequately capture geographic knowledge as used to support human image interpretation, as well as knowledge specific to image characteristics and computer image analysis. However, as Argialas and Harlow (page 878) [45] stress, "... representing knowledge as an unordered and unstructured set of rules has certain disadvantages. For example, one cannot easily express the structure of the domain in terms of taxonomic, or part-whole relations that hold between objects and between classes of objects". Frames or object-oriented representations better exploit the inherent characteristics of geographic knowledge for image understanding purposes [45], given that in any geoscientific domain the earth features of interest are usually grouped according to various taxonomic schemes (land cover/land use, forest species, rocks, etc.). An example of a taxonomic system often employed in remote sensing studies is presented in Fig. 3.
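A frame-like (or object-oriented) encoding of such a taxonomy can be sketched in a few lines; the slots and values below are illustrative only, combining IS-A links of the kind shown in Fig. 3 with property slots of the kind listed in Table 4.

```python
# Minimal frame-like sketch: class frames with IS-A links and property slots.
frames = {
    "earth surface features": {"is_a": None},
    "natural vegetation":     {"is_a": "earth surface features"},
    "grassland":              {"is_a": "natural vegetation"},
    "brushland":              {"is_a": "natural vegetation"},
    "building":               {"is_a": "earth surface features",
                               "makes_shadow": True, "expected_height_m": (3, 100)},
}

def ancestors(name):
    """Follow IS-A links up the information tree."""
    chain = []
    while frames[name]["is_a"] is not None:
        name = frames[name]["is_a"]
        chain.append(name)
    return chain

print(ancestors("grassland"))   # ['natural vegetation', 'earth surface features']
```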
Fig. 4. Repulsion-attraction relationships used as explicit knowledge in urban land use image classification (see Sec. 6).
In the context of GIS, knowledge-based systems proposed for the classification or extraction of geographic features from maps and digital elevation models (Section 1.1) apply geographic knowledge in a fashion similar to that of image interpretation problems. Versions of the well-known Prospector expert system [6] are among the first examples of knowledge-based systems applied in multiple-map analysis for the regional assessment of site favorability, for instance, for mining. Since the advent of Prospector, other systems, frequently based on formal logic knowledge or frame-like representations, have been proposed for the analysis of multiple geographic data sets for modeling purposes. Given the particular nature of the modeling problem, usually only disciplined knowledge of how to combine multiple geographic data sets (corresponding to the assumed model parameters) is represented in such systems.
PART II. SEGMENTATION AND CATEGORIZATION OF GEOGRAPHIC SPACE

3. Remote Sensing Image Analysis
Remote sensing image analysis can be applied to one image (either monochannel or multichannel) or to a sequence of images in space and time. Problems frequently encountered concern: (a) the establishment of locational and nonlocational attributes of spatial features directly visible on the image, such as buildings and their use, roads and their pavement type, forests and their biomass, drainage and its form, and geomorphic features and their origin; (b) the establishment of locational and nonlocational attributes of geographic features partially or totally hidden by others (soils, rocks, aquifers, sediment loading of a river, etc.); (c) the reconstruction of the 3-D shape of geographic features (surface and ocean topography, heights of objects, bathymetry); and (d) the detection and identification of land cover/use changes, tracking of moving features such as icebergs or forest fires, monitoring of flooded areas, etc. Over the past 30 years, significant progress has been made in modeling the relation of image radiances to biophysical parameters under controlled conditions. However, when applied to real scenes, these models often encounter difficulties related to the spatial heterogeneity of terrestrial objects and pervasive noise (either sensor artifacts or environmental noise) in remote sensing images. Significant progress has also been made in automating the photogrammetric work for 3-D shape extraction. The massive availability of radar satellite imagery in the early 90's has sparked renewed interest in satellite interferometry as an alternative method for 3-D shape extraction. The basic problem of any image analysis, however, that is, the detection of meaningful spots, lines and areas by fast and efficient automatic methods and techniques, remains unresolved, especially in the case of high spatial resolution imagery. Developments in pattern recognition and computer vision will be discussed next by examining the related fields of feature extraction, segmentation and classification. Part III of this chapter presents developments in change detection and geographic database updating.
3.1. Image Feature Extraction An image feature is a property of a pixel or an image segment and is represented by scalar values (metrics) such as gray levels, local density of edges, and compactness of an image segment. By analogy to visual stimuli, which permits perception and recognition of visual patterns by a photo-interpreter, image metrics are often considered attributes of such stimuli as tone, color, texture, and shape. In image analysis, metrics can be used to identify meaningful (according to the analysis goal) geographic features. In more complex image analysis systems, image metrics may be used as descriptors, permitting the declarative representation of image feature properties such as “the gray level is lower than a threshold T”, “the segment is convex”, or “the segment includes holes”. The instantiation of rules included in the knowledge base using these properties results in new image features in the form of binary predicates or some form of likelihood supporting a hypothesis concerning the identity of an image pattern. Prevalent in remote sensing literature, scalar features, which are used either directly in image analysis or as descriptors, are reviewed next. For a detailed analysis of these features and others proposed in the image analysis domain in general, the reader is referred to Pratt [51] and Jain [52], among others. Gray level: The gray level of a pixel corresponds to the visual stimuli of tone and is proportional to the flux of electromagnetic radiation reflected, emitted or scattered by objects within the instantaneous field of view (IFOV) of the sensor. Modern sensors quantize the signal in 8 bits, or in the case of radars, 16 bits. The same sensor can record pixel gray levels in various spectral zones or bands and with active sensors under different polarization states. The same territory can be viewed at different instances in time under the same or different view directions. RSIAS are equipped with suitable algorithms for geometric corrections, permitting the accurate registration and formation of multisensor data sets. The use of multiple measurements for the discrimination of various classes of geographic features is the most important concept in remote sensing. First order gray-level distribution statistics: Statistics such as the mean, the variance and other central moments, are used to characterize gray-level distribution within a region of an image. These features can be calculated using a small fixed-size moving window (usually between 3 x 3 and 11 x 11 pixels) and then associated to the central pixel of that window. Central tendency statistics (mean, median, mode) are often used for smoothing out punctual random noise in the original images [51] while dispersion statistics (variance, coefficient of variation, entropy, etc.) can be used as metrics of the local texture coarseness. First-order statistics can also be computed from gray-level histograms taken as an approximation of the first-order probability distribution of image amplitude, and subsequently employed as descriptors of an image segment. Second-order gray-level distribution statistics: Second-order statistics are extracted from second-order histograms. The latter are presented as arrays of L x L elements, where L is the number of image quantization levels, and are considered as
an approximation of the joint probability distribution of pairs of pixel values. Each element of the array thus represents the relative frequency of co-occurrence of two specific gray levels, established by examining all possible pairs of pixels separated by a given distance and in a given direction within the image space. A set of such histograms, often referred to as gray-level co-occurrence matrices (GLCM), formed by changing the relationship between pairs of pixels (interpixel distance and angle), may be studied in order to gain a better understanding of the texture properties of various geographic objects. Given the raster structure of images, second-order statistics are usually defined for four different angles: 0°, 45°, 90° and 135°. Using, as before, a fixed-size moving window, second-order statistics can be computed and associated with the central pixel of the window, or averaged over an image segment and used as segment features or descriptors. However, in practice, the use of the GLCM to extract image features presents many difficulties. "To obtain statistical confidence in estimation of the joint probability distribution, the histogram must contain a reasonably large average occupancy level" [51]. As such, one is forced to significantly reduce the number of quantization levels from 256 or more to 32 or less while maintaining the measurement window relatively large, especially when interpixel distances are greater than two. Furthermore, small-amplitude texture variations could be lost and errors caused if the texture changes over the window [51]. Authors proposing the use of second-order statistics in image classification fix the interpixel distance at 1 or 2 and, assuming angularly invariant object textures, consider as features their averages over all the measurement angles [53]. In the case of an interpixel distance of 1, a window size of more than 15 x 15 pixels has been found to be optimal in statistical classification of land covers with multispectral satellite images (20 m spatial resolution) [53].

Alternatives to GLCM features: Edge density, the number of local extrema, and features extracted from the texture spectrum are examples of spatial features that to a certain degree express local texture characteristics. Edge density can be computed by counting the number of significant edge pixels (Section 3.2) over a regular window centered on each pixel or over image segments [54,55]. The MAX-MIN measure [56] is derived by counting local extrema along pixel profiles following one of the four standard directions within a fixed-size moving window. Whether or not a local extremum is counted depends on an a priori specified threshold value. The texture spectrum (TS) [57] represents in the form of a histogram the distribution characteristics of local texture coarseness and directionality over an image or a portion of it. These local texture properties are extracted on a pixel basis using atomic measurement units (3 x 3 pixel windows) called texture units (TU). Each of the eight outlying pixels in the TU can take one of three possible values in relation to the center pixel: 0 (less than), 1 (equal to), or 2 (greater than). Accordingly, a TU may be represented by one of the 6,561 (3^8) possible combinations of these three numbers, and each combination is further transformed into a single number referred to as the texture unit number (NTU). Counting the pixels' NTU results in the TS, which is used to extract features such as black-white symmetry, geometric
symmetry and degree of direction. The principal advantages of this approach over the GLCM are as follows: (1) texture directionality is implicitly taken into account, thereby eliminating the need to compute the texture spectrum for various directions; (2) the size of the measurement window need not be adjusted; and (3) the texture spectrum, as a uni-dimensional histogram, greatly simplifies subsequent analyses. However, unlike the GLCM, the TS approach cannot easily be extended to include neighbors at various distances from the center pixel and thus capture various scales of texture coarseness. Furthermore, the transformation of the original values to the relative scale (0, 1, 2) reduces the sensitivity of the extracted features to texture amplitude variations.

All the above-mentioned features, including first-order dispersion statistics and the GLCM, are often used in multivariate classification experiments. The term "spatial signatures" is used to distinguish these features from the commonly used spectral signatures. In such experiments it is established that the introduction of spatial signatures results in greater classification accuracy than spectral signatures alone [58]. Spatial features play a more significant role in the classification of high spatial resolution imagery, where object meso- and micro-texture greatly influence the image content. However, the Achilles' heel of all these methods of extracting spatial information is the numerous parameters the analyst has to set a priori, which, as previously stated, may include the window size, interpixel distance, directions of measurement, number of quantization levels, and various thresholds. The analyst must also make important decisions regarding the spectral bands to be used for the extraction of these features and the selection of the most appropriate features for the classification problem at hand. For instance, a dozen features may be extracted from only one co-occurrence matrix. Research on tools to evaluate the spatial content of a particular image and on rules to guide the selection of the parameters for spatial feature extraction may provide valuable solutions to the above-mentioned problems. For example, the use of the range of semi-variograms evaluated over different portions of the image has proven helpful in fixing window size for spatial feature extraction in experiments related to forest stand discrimination using high resolution multispectral imagery [59].

Geometric features: In remote sensing, 2-D geometric characteristics of image objects as established by a segmentation procedure (Section 3.3) are of interest for pattern recognition purposes. The third dimension, if automatically extracted, for example, from stereo-images, is regarded as a measurement used to resolve conflicts in the interpretation of high spatial resolution imagery [12]. The object size and shape are important geometric features for object discrimination, especially in man-made environments [22]. Abundant literature exists on the use of shape description and shape measurements as analysis and classification tools in diverse geoscientific disciplines, image analysis and computer graphics. A list of shape features is provided in Fig. 11 and a survey may be found in Davis [4], Pratt [51], and Jain [52], among others. The search for invariant features under a linear geometric transformation (translation, rotation, scaling) is an important
requirement in pattern recognition. One commonly used invariant feature is the (standardized) compactness, defined as the ratio of the squared perimeter length to 4π times the segment size. Other invariant features may be extracted by taking into account the size of regular geometric figures (rectangles, circles, ellipses) inscribed in or bounding the image segment compared to the actual size of the segment. The ratio of length to width of minimum bounding rectangles defined from the least moment of inertia of the image segments (assumed without holes) is used to measure the elongatedness of segments [47], a useful feature in distinguishing linear-like objects from area-extended objects of similar sizes. Shape features may also be provided by information-preserving shape representations such as various periodic contour functions (contour polar signature, tangent angle function, etc.) [60]. Other properties such as moments, number of holes, singular contour points, number of straight boundary segments, and skeleton properties may be extracted from image segments and used in pattern recognition problems [51,52]. Interesting examples of the use of shape features in high resolution image interpretation may be found in Nagao and Matsuyama [47]. However, the problem of shape feature selection in object classification is seldom examined. It was established, for example, in recognizing urban land uses in SPOT panchromatic images, that information-nonpreserving shape features exhibit a high degree of correlation [22], indicating the importance of this selection to avoid redundancy in shape description. Another largely unresolved problem is the dependence of an object's shape on image spatial resolution [60]. Figure 5 partially illustrates this problem with an example of isolated objects of circular form located on a rather uniform background
(oil tanks in an industrial area). Individual objects exhibit change in their shape as the resolution decreases; they progressively lose their individuality and become increasingly confounded with neighboring objects, with which they may form new distinct patterns, until finally they vanish. Parallel changes also occur in the background in terms of texture coarseness and amplitude. With respect to a range of image spatial resolutions, fractal analysis of geographic features may offer some insight into the problem [61]. Additionally, the more recently introduced wavelet transform theory is another method that holds promise for the understanding of the scale-space properties of imaged objects' shape [62].

Fig. 5. Change in image structure and object shape as a function of image spatial resolution.
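The window-based statistics and GLCM features described above are straightforward to prototype. The sketch below is illustrative only and not taken from the chapter; the window size, the quantization to 32 levels and the single (distance 1, 0°) offset are arbitrary choices of the kind the analyst must normally tune.

```python
# Illustrative sketch (not from the chapter) of two texture features from
# Section 3.1: local first-order statistics in a moving window, and the
# contrast feature of a single gray-level co-occurrence matrix (GLCM).
import numpy as np
from scipy.ndimage import uniform_filter

def local_mean_variance(image, size=7):
    """First-order statistics (mean, variance) in a size x size moving window."""
    img = image.astype(float)
    mean = uniform_filter(img, size)
    mean_sq = uniform_filter(img * img, size)
    return mean, mean_sq - mean * mean

def glcm_contrast(window, levels=32, offset=(0, 1)):
    """Contrast of the co-occurrence matrix for one offset (here distance 1, 0 degrees)."""
    q = np.clip((window.astype(float) / 256.0 * levels).astype(int), 0, levels - 1)
    dy, dx = offset
    a = q[max(0, -dy):q.shape[0] - max(0, dy), max(0, -dx):q.shape[1] - max(0, dx)]
    b = q[max(0, dy):, max(0, dx):][:a.shape[0], :a.shape[1]]
    glcm = np.zeros((levels, levels))
    np.add.at(glcm, (a.ravel(), b.ravel()), 1.0)   # co-occurrence counts
    glcm /= glcm.sum()
    i, j = np.indices(glcm.shape)
    return float(np.sum(glcm * (i - j) ** 2))

rng = np.random.default_rng(0)
band = rng.integers(0, 256, (64, 64)).astype(np.uint8)   # stand-in for one spectral band
mean, var = local_mean_variance(band)
print(glcm_contrast(band[:15, :15]))
```

In a real RSIAS such features would be computed per moving window or per segment and supplied to the classifier as "spatial signatures" alongside the spectral ones.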
3.2. Image Structural Features Detection

Edges, lines and spots are all crucial elements revealing the structure of an image. Edges are local discontinuities in some image property such as tone, color or texture. Edges usually constitute either boundaries between contrasted homogeneous regions on an image (not necessarily coinciding with geographically meaningful objects) or interfaces between two different environments, such as a sea shoreline. Lines are more or less long segments of connected pixels between two edges in physical proximity, and spots are small-extent isolated image objects. Depending on the image scale, examples of spot- and line-like geographic features are buildings, ponds, roads and streams. In some circumstances, detected individual spots and lines could be considered parts of complex patterns whose characteristics are essential in image interpretation (Table 5, Section 4.1). The detection of object boundaries by linking detected edges, and the formation of patterns of spots and lines, are the principal issues in the field of structural image analysis [47,63-65]. The latest developments in this field, with emphasis on man-made structures, are outlined in Leberl et al. [66]. Spatial pattern analysis developed with map data may provide interesting statistical and mathematical tools for image pattern analysis (Section 4.1).
Table 5. Structural measures for linear pattern recognition (source: Wang and Howarth [108]).

Structural measure | Drainage networks               | Road networks            | Lineaments
Network density    | hydrology                       | urban-rural segmentation | mineral deposits
Length             | rock-type discrimination        | road pattern description | structural analysis
Curvature          | drainage pattern classification | road pattern description | geological structures
Orientation        |                                 |                          | rock structure
Angle              | rock-type discrimination        |                          |
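As an illustration of the simplest measures in Table 5, the following sketch (not from the cited source) computes the length of a digitized line and its sinuosity, a basic curvature-related descriptor, from vertex coordinates.

```python
# Illustrative sketch: length and sinuosity (path length divided by the
# straight-line distance between endpoints) of a digitized line.
import numpy as np

def line_length(points):
    pts = np.asarray(points, dtype=float)          # (n, 2) array of x, y vertices
    return float(np.sum(np.hypot(*np.diff(pts, axis=0).T)))

def sinuosity(points):
    pts = np.asarray(points, dtype=float)
    straight = float(np.hypot(*(pts[-1] - pts[0])))
    return line_length(pts) / straight if straight > 0 else float('inf')

stream = [(0, 0), (1, 2), (2, 1), (4, 3), (5, 5)]   # hypothetical stream segment
print(line_length(stream), sinuosity(stream))
```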
Edge and line detection may be accomplished by global or local operators. Global operators, such as masks applied in the Fourier transform space, are more suited to images that include objects which exhibit periodic edge or line patterns, such as a sea wave field. Research currently being conducted in wavelet transform and multiresolution analysis should eventually provide a general solution to the structural feature detection problem (see Chapter 5). Local operators are usually applied to gray-level images and may be classified in three general categories: differential detectors, surface modeling and morphological operators [67]. The Sobel gradient operator and the zero-crossings of the Laplacian convolved with a smoothing Gaussian filter (the LoG operator) are useful differential detectors to locate edges [51,67]. The combination of the LoG and Ratio operators yielded interesting results with respect to radar images corrupted by speckle [68]. Model edge fitting techniques such as the Hueckel [70] or Haralick [67] operators are not frequently mentioned in the remote sensing literature. Finally, morphological gradients and watershed basin techniques are suited to images with low-textured objects and somewhat simple structures [67].

Local operators for line detection assume that lines on an image follow a top-hat or spike-like model. In other words, a line segment is a small region where the gray level suddenly increases (or decreases), or may remain constant within a distance of a few pixels and then suddenly decreases (or increases). Fitting one of the above-mentioned models may result in the elimination of false linear features created by the presence of edges on the image [69]. Lastly, depending on their expected size, spots may be detected either by using templates or by comparing pixel values with an average or median of neighboring pixels within a fixed-size window [51,52].

Detected boundaries and lines are usually thick and fragmented. Non-maximum suppression [69] and morphological operators (erosion, dilation) are examples of methods used to thin lines. The Hough transform [51] and morphological operators [51,52] are useful procedures for edge and line linking. Alternative methods to detect boundaries are presented in the next section.

In structural feature detection, the presence of noise and highly textured objects is a source of false edges. Operators with complete noise immunity are impossible to construct. Some authors recommend the use of smoothing edge-preserving operators, such as the Nagao operator [47], before the application of a structural feature detection operator. Many studies focus on the delicate problem of edge-preserving operators in speckled radar images, and adaptive filtering techniques are usually proposed as a solution [71]. Filtering out the wavelet coefficients only at scales where the speckle structure is dominant is an alternative to adaptive filtering of speckle [72]. Edge thresholding is necessary to eliminate false edges. Other strategies have also been proposed; for example, in the case of multispectral imagery, edges are sometimes detected independently in the various spectral bands, and pixels presenting high edge intensity in at least two different bands are retained as edge pixels. The use of multitemporal images to provide clues for the detection of edges and lines of cartographic interest is discussed in Guindon [73] and is based on the idea that edge and line features represented in digital map bases are expected
to exhibit a level of invariance when viewed in a temporal sequence of scenes. This is not usually the case with many non-cartographically significant edges, which are transient in nature (e.g. edges due to agricultural surface cover).
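A minimal sketch, not from the chapter, of the local detectors named above: a Sobel gradient magnitude with an ad hoc threshold, a LoG response whose zero-crossings locate edges, and a one-dimensional spike (top-hat) response of the kind used by local line detectors. The threshold and smoothing scale are illustrative values.

```python
# Illustrative sketch of local structural-feature detectors (Section 3.2):
# Sobel gradient edges, LoG zero-crossings, and a 1-D spike response for lines.
import numpy as np
from scipy.ndimage import sobel, gaussian_laplace

def sobel_edges(image, threshold=50.0):
    img = image.astype(float)
    magnitude = np.hypot(sobel(img, axis=1), sobel(img, axis=0))
    return magnitude > threshold          # boolean edge map (still thick and fragmented)

def log_zero_crossings(image, sigma=2.0):
    response = gaussian_laplace(image.astype(float), sigma)
    sign = response > 0
    zc = np.zeros_like(sign)
    zc[:, :-1] |= sign[:, :-1] != sign[:, 1:]   # sign change towards the right neighbour
    zc[:-1, :] |= sign[:-1, :] != sign[1:, :]   # sign change towards the lower neighbour
    return zc

def spike_response(profile, distance=1):
    """Second-difference response along a pixel profile; large values flag line pixels."""
    p = np.asarray(profile, dtype=float)
    resp = np.zeros_like(p)
    resp[distance:-distance] = 2 * p[distance:-distance] - p[:-2 * distance] - p[2 * distance:]
    return resp
```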
3.3. Image Segmentation

Segmentation of the image domain $R$ is the determination of a finite set of regions $\{R_1, R_2, \ldots, R_I\}$ such that

$$R = \bigcup_{i=1}^{I} R_i \qquad \text{and} \qquad R_j \cap R_i = \varnothing \quad \text{for } j \neq i.$$
Segmentation can be considered a transformation of raw numeric data into symbolic data. As such, it is the basis of any image understanding system (Section 3.5). However, image features extracted by analyzing the whole set of pixels within an image region may be used as attributes for region classification using pattern recognition systems (Section 3.4). Segmentation algorithms are usually iterative multi-step procedures and may be based on pixel, regional, or border properties.

Pixel-oriented approaches: Image segmentation based on pixel properties is carried out in two steps: (a) pixel classification, and (b) region formation by aggregation of connected pixels of the same class. Classes are usually defined in terms of pixel property similarity without necessarily taking into account their correspondence to geographically meaningful classes (Section 3.5). For monochannel images, gray-level histogram thresholding is usually applied. Automatic threshold selection is the major issue in this field [70,74]. A common approach is to consider local histogram minima as the bounds of gray-level classes. Some local operators, for example a monodimensional Laplace operator of the form [1 -2 1] [21], or statistical methods [70] may then be applied to detect such histogram minima. The thresholding of histograms computed over small image portions was proposed to reduce the incidence of gray-level local variations due to exogenous factors such as surface illumination conditions and variable atmospheric effects [51]. In addition, methods have been proposed for computing histograms with deepened "valleys", allowing easier threshold selection [70]. Thresholding is far more problematic in the case of radar images, which, with their characteristic speckle noise, frequently exhibit unimodal histograms [75]. An iterative application of speckle smoothing filters is thus proposed in order to permit the transformation of such a histogram into a bimodal or multimodal one [75]. Recursive thresholding techniques have also been proposed for multichannel images [51], for which clustering [51] or relaxation labeling methods are commonly used [67,76]. Introducing the notion of the neighborhood of a pixel, the latter methods offer a somewhat region-oriented approach.

Region-oriented approaches: Such approaches extend pixel-oriented methods by simultaneously taking into account the similarity of the pixel properties and their spatial proximity to form regions under iterative processes. Regional properties such as gray level, multispectral signature [77] or texture [78] may be used.
Region-oriented approaches follow two basic strategies: (1) bottom-up or region-growing, and (2) top-down or region-splitting. Hybrid strategies such as split-and-merge have been proposed [51]. Among these methods, and often proposed in the remote sensing literature, region growing starts with a number of seeding pixels distributed throughout the image space [70] or with some initial partition of the image space [77]. Beginning from these initial pixels or partition, regions are quasi-simultaneously created by progressively merging connected pixels depending on similarity criteria (cost functions) [70,77]. Created regions may be further merged using criteria such as size, strength of the gray-level difference between adjacent regions along their common boundaries, and merging costs [51,77]. To be effective, all the strategies in region-oriented approaches must be based on models representing the image structure, such as quadtrees and pyramids [51,67]. Once the various regions have been obtained, contour-following operators are applied to extract information on region boundaries [51].

Boundary-oriented approaches: Such approaches are based on edge detection and edge thresholding using one of the techniques mentioned in Section 3.2. Contrary to the region-oriented techniques, the segmentation is often not complete and edge-linking operations are needed to form boundaries (Section 3.2). Figure 6 presents the results of SPOT panchromatic image segmentation in an urban area using a boundary-oriented approach. Region filling operators are finally applied in order to uniquely identify each detected segment. Alternative methods for boundary detection have been proposed, such as those based on Markov random field theory [67], but the evaluation of their real potential with remotely sensed images is still in the early stages.
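A minimal sketch, illustrative only, of the pixel-oriented strategy described above: a crude gray-level threshold followed by aggregation of connected pixels of the same class into uniquely labeled regions. A real system would place the threshold at a histogram valley rather than at the mean.

```python
# Illustrative sketch of pixel-oriented segmentation (Section 3.3): threshold the
# gray levels into two classes, then label connected regions of each class.
import numpy as np
from scipy.ndimage import label

def threshold_segmentation(image):
    img = image.astype(float)
    t = img.mean()                        # crude stand-in for a valley-based threshold
    classes = (img > t).astype(int)       # two gray-level classes
    bright_regions, n_bright = label(classes)
    dark_regions, n_dark = label(1 - classes)
    # offset dark-region ids so every region of the partition gets a unique label
    regions = np.where(classes == 1, bright_regions, dark_regions + n_bright)
    return regions, n_bright + n_dark
```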
Fig. 6. Image segmentation using a boundary-oriented approach [79].
Generally, image segmentation methods are not available in commercial RSIAS. However, the inherent limitations of per-pixel classifications (Section 3.4) of remotely sensed images, especially those of high resolution, underscore the need for image segmentation as an essential first step in any image analysis process. Segmentation methods based on pixel properties are attractive for their computational simplicity, but they are inherently limited to images with few and relatively large well-contrasted homogeneous areas. Region-growing segmentation is more suited to remotely sensed images but has difficulty detecting thin regions and accurately locating regional boundaries. Boundary-oriented approaches cope well with thin regions and could more accurately locate regional boundaries provided that adjacent regions are well contrasted. Problems of under-segmentation or over-segmentation, especially in highly textured zones, are common to all these approaches. Such problems are evident in the segmentation example in Fig. 6.

3.4. Image Classification
Remotely sensed image classification involves categorizing image pixels into one of several geographically meaningful classes that frequently correspond to a particular level of a hierarchical taxonomic system provided by a given geoscientific discipline (Fig. 3). Classification algorithms are the core of pattern recognition systems (Fig. 7). An image pixel may be considered an autonomous geographic entity or part of an image region as specified by image segmentation. Conversely, as discussed, per-pixel classification algorithms may be applied as an initial step for image segmentation. Likewise, the labeling of image segments by classification algorithms may be considered an essential operation in computer vision systems (Section 3.5).
Fig. 7. Pattern recognition systems: from the image, feature extraction and feature selection lead to classification or syntactic/structural recognition, producing a raster map.
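To make the feature-vector view of per-pixel classification concrete, the following sketch (not from the chapter) implements the supervised maximum-likelihood rule under the Gaussian assumption discussed below, with equal a priori class probabilities and the mean vector and variance-covariance matrix of each class estimated from training pixels.

```python
# Illustrative sketch of a supervised per-pixel classifier (Section 3.4):
# maximum-likelihood decision under the Gaussian assumption, equal priors.
import numpy as np

def train_gaussian_ml(samples_per_class):
    """samples_per_class: list of (n_i, d) arrays of training feature vectors."""
    params = []
    for x in samples_per_class:
        mu = x.mean(axis=0)
        cov = np.cov(x, rowvar=False)
        params.append((mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1]))
    return params

def classify_gaussian_ml(pixels, params):
    """pixels: (n, d) array of feature vectors; returns the most likely class index per pixel."""
    scores = []
    for mu, inv_cov, logdet in params:
        diff = pixels - mu
        mahal = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)   # squared Mahalanobis distance
        scores.append(-0.5 * (mahal + logdet))                   # log-likelihood up to a constant
    return np.argmax(np.stack(scores, axis=1), axis=1)
```

Assuming equal variance-covariance matrices reduces the rule to a linear (Mahalanobis distance) classifier, and assuming identity matrices reduces it to the minimum Euclidean distance classifier mentioned below.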
Using spectral-spatial signatures, per-pixel classification was the dominant approach in the past. Furthermore, classifications based on spectral-temporal signatures were also proposed, particularly in the context of land cover change identification [80]. Pattern recognition theory is discussed in Chapters 1.1 to 1.3 of this handbook, while Chapter 4.2 presents statistical pattern recognition and neural network applications in remote sensing. Only a brief review is presented here as a reference for the subsequent sections of this chapter.

The standard approach in classification is to consider a pixel (or a segment) as a vector of measurements in a multi-dimensional space where the axes represent the various measured image features. This feature space is segmented into mutually exclusive class domains using either probabilistic or geometric criteria. Searching for the domain to which the unknown pattern belongs may be carried out using distance or similarity measures, or discriminant functions. Classification algorithms are usually distinguished according to the way the class domains are specified, either using class prototypes (supervised learning) or not (unsupervised learning), and according to whether a parametric model is employed to describe the class distribution characteristics. The Bayesian decision rule (under symmetric losses) is the basis of the most popular image classification techniques. While its simplicity is appealing, this rule as such is not easily applied in practice due to our ignorance of class a priori probabilities and class conditional probability density functions under variable conditions of image acquisition (atmosphere, surface illumination, viewing angles, etc.). A simplified initial hypothesis postulates that all classes are equally probable, and the resulting classifier is the Maximum Likelihood (ML). The often postulated second hypothesis is that class conditional probabilities follow the Gaussian distribution, the one most easily adapted to multivariate classification problems. Gaussianity is, however, a questionable hypothesis, especially with variables created using spatial features or with geographic objects exhibiting high spatial tone variation. For example, it was demonstrated that spatial features such as the variance or some features extracted from the GLCM do not follow a Gaussian distribution [81]. For some types of images, such as those provided by radar sensors, the presence of coherent noise precludes the application of such a parametric classifier unless intensive smoothing has been applied.

The ML under the Gaussian assumption is a quadratic classifier using information on the mean vector and variance-covariance matrix per class. Supervised learning of these parameters is the most popular approach. The Bayesian rule may be further simplified assuming equal variance-covariance or identity variance-covariance matrices. In the first case a linear classifier is specified using the Mahalanobis distance; in the second a minimum Euclidean distance classifier is obtained. Unsupervised learning algorithms such as the K-means or multivariate histogram thresholding are often applied to define data clusters used as training areas or to examine the spectral purity of training areas in a supervised approach [49]. Problems related to feature space dimensionality and feature selection gain importance as new sensors with a high number of spectral bands are becoming or will become increasingly available. For example, Wharton [82] established that
a Landsat-TM image of 512 x 512 pixels and seven spectral bands contains about 2 x 10^5 to 2.5 x 10^5 distinct spectral vectors or, otherwise stated, about 75% to 99% of the pixels convey a distinct spectral vector. This contrasts with a Landsat-MSS image with four bands and lower radiometric and spatial resolutions, where an image of 512 x 512 pixels contains about 30,000 distinct vectors, or 11% of the pixels.

Classification accuracy is usually assessed empirically using confusion matrices. A sample of classified pixels or regions throughout the examined scene is taken and their truth classes established on the ground or with the use of available maps and photographs. The optimum number of pixels or regions to be sampled as well as the sampling strategy are discussed by several authors [80]. The number of pixels correctly assigned to each truth class and to other classes are then reported in the confusion matrix. It is thus possible to assess the quality of the classification globally (overall accuracy) and by class. The Kappa coefficient of agreement [83] has become the standard method of assessing classification accuracy based on a confusion matrix and of testing for a statistically significant difference between two confusion matrices obtained for the same data set by two different classification methods [84]. However, not one of these classification techniques has been generally accepted by the user community as an effective numerical counterpart of image classification by an experienced human interpreter. This is because an interpreter possesses knowledge of images, geographic context and relationships between earthly features, and a stereotyped image memory, as well as the ability to rapidly perceive image structure. Furthermore, his perception is not limited to local tone/color and spatial patterns. Still, all these experiments with classification methods have been necessary and essential to arriving at a better understanding of the problem of remotely sensed image interpretation. Lastly, such methods continue to be valuable for the classification of somewhat low resolution imagery, where image tone/color are the basic stimuli for image interpretation.

Since the early 80's, alternative approaches to image classification, originating from the domain of artificial intelligence, have been introduced, including Neural Networks (NN) (see Chapter 3.3), formal logic rule-based classifiers [85], fuzzy classifiers [86,87], Bayesian rule-based classifiers [84], and Dempster-Shafer evidential calculus [88]. Research in these fields is under way, but some tentative conclusions may be drawn from all these experiments: (1) the major problem in many classification applications based on individual pixels is the inability of multispectral and spatial signatures to adequately capture the general characteristics of geographically meaningful classes. For instance, although NNs have definite advantages over standard classifiers, given that they are better suited to nonlinear discrimination problems and to complex partitions of the feature space, the gains in classification accuracy were rather minimal in many experiments; (2) there is the problem of how to establish class probabilities, class fuzzy membership functions, mass functions committed to a class, or IF-THEN rules with the various image features. Another important problem related to the application of such classifiers is the lack of methods for feature selection based on criteria other than the Gaussian assumption of
class distribution; (3) the integration of the various classification algorithms into a common system exploiting image segmentation principles and geographic knowledge is possibly the way to develop effective and versatile systems for image classification and interpretation.

3.5. Image Understanding Systems
The goal of image understanding as applied to remotely sensed images is to establish computerized systems with capabilities of image interpretation comparable to those of an experienced human interpreter. Using the human eye-brain system as an analogy, image understanding systems are established with a view to emulating such operations as recognition, classification, reasoning, feedback, and evidence accumulation [89], often deployed in human image interpretation [10]. Computer vision involves a number of data transformations starting with raw pixel numeric values and progressing, through symbolic data processing and analysis, to a description of the scene content [48,89]. The various data transformation stages are usually referred to as low-level processing (smoothing, enhancement, feature extraction, segmentation), intermediate-level processing (region description, application of low-level knowledge models for region labeling) and high-level processing (application of high-level knowledge models for region relationship description and scene interpretation). Various image understanding systems have been proposed, adopting various strategies to analyze the image and to recognize and label image objects. Hierarchical systems based either on a bottom-up (or data-driven), top-down (or goal-driven) or hybrid strategy, as well as systems adopting a blackboard architecture, are commonly used [48]. Figure 8 presents an idealized schema of a goal-driven system.
Fig. 8. Goal-driven image understanding system (adapted from Steinberg [SS]); the original diagram links remote sensing and other data with geographic feature and temporal models through storage/retrieval.
Knowledge representation models are briefly reviewed in Section 2.2 of this chapter, and specific concepts concerning image understanding systems (control and planning mechanisms, inference and uncertainty management mechanisms, evidence accumulation and conflict resolution mechanisms, etc.) can be found in Whiston [90], Ballard and Brown [89], and Patterson [48], among others. Argialas and Harlow [45] present a survey of image interpretation models and give several examples of proposed natural scene vision systems. Few studies, however, have been conducted in the field of image understanding in conjunction with remotely sensed imagery [46,47,91-93]. Among these studies, experiments in developing knowledge bases and testing inference mechanisms are generally the only ones presented. SHERI (System of Hierarchical Experts for Resource Inventory) is a typical example of a new approach integrating concepts of RSIAS, GIS, image understanding systems and decision support systems. This system, developed by the Canada Centre for Remote Sensing and the British Columbia Ministry of Forests, permits forest inventory map updating using remotely sensed images [46,94,95].

4. Map Data Analysis
Map data analysis may involve one or several geographic data sets covering the same or different territories and collected at a particular time interval or at different intervals. The following problems are frequently encountered: (a) the characterization of a spatial pattern (punctual, linear, areal) using statistical, geometric or qualitative criteria, for example, the characterization of the distribution of human settlements within a territory as clustered, or the classification of a drainage pattern as dendritic; (b) the relationships in terms of locational and/or nonlocational attributes of two or more geographic data sets, for example, the local density of cases of lung disease in terms of distance from air pollution sources; (c) the combination of two or more data sets using physical models to rate particular locations according to their potential, susceptibility, favorability and the like, for example, the soil erosion potential of agricultural fields, susceptibility to landslides, favorability of a region to a particular economic activity, etc.; and (d) generating from existing geographic data sets a new data set describing a geographic phenomenon indirectly measured, for example, the establishment of watershed basins using a digital elevation model, or the definition of ecological land units by analyzing geology, soil, and vegetation patterns.

According to the analysis goal, various transformations may be envisaged concerning the object's geometry as well as its content, such as object aggregation, cartometry (area, perimeter, elongation, convexity, etc.), and attribute processing (entropy index, density, etc.). Once this data preprocessing has been carried out, it is then possible to proceed to certain spatial analyses that permit the object's characteristics to be combined while taking into account its location and any of the attached attributes (gravity models, proximity, etc.). The next step is usually the thematic illustration of the results through a cartographic document
produced according to graphic semiology rules [38,97-99]. Contrary to vector data, raster data provided by remote sensing image analysis or digital elevation models are often considered raw data in terms of spatial analysis. Consequently, further processing is required to adequately identify and localize the chosen thematic categories, group the basic elements (pixels) to compose more explicit objects, derive new attributes using logical and mathematical operations, etc. Developments in map analysis, with particular emphasis on pattern description and the generation of new data sets, are reviewed in the following sections.

4.1. Characterization of Spatial Patterns
Point patterns: Such patterns may represent: (a) zero-dimensional natural objects, such as, depending on the map scale, human settlements, lakes, sinkholes in karst topography, etc.; (b) events at specific locations, such as accidents in a road network; (c) intersections of lines composing a linear pattern, such as geological faults and joints; (d) sampling locations selected for the measurement of some continuous or zonal geographic phenomenon, such as temperature or soil categories; and (e) central locations within polygonal entities. Figure 9 presents the various notions used in point pattern analysis [100-104]. The notions of dispersion and concentration are used to characterize a whole pattern, while the notion of distance applies to individual points within a pattern.

Dispersion measures may be simply counts of punctual objects within spatial sub-units defined by the user (administrative unit, watershed, etc.) and are obtained by applying Monmonier's point-in-polygon method (plumb-line) [105]. However, the data dispersion is often evaluated without applying any boundary notion. Statistics generated from quadrat and nearest neighbor analyses are employed to characterize patterns as regular, random, or clustered. Scattered or contiguous quadrats may be used. Scattered quadrats (of triangular, rectangular, hexagonal form, etc.) are superimposed on the point map and different parameters are then extracted (Fig. 9). When contiguous quadrats are used, a grid is applied and a detailed count of points is carried out over the whole map. The size and shape of the quadrat directly influence the nature of the results. In the case of nearest neighbor analysis, the distances between the closest pairs of points are evaluated instead of the number of points within subareas. The average distance between pairs of closest points is compared to a theoretical distance corresponding to a random placement of points. Different techniques have been proposed in order to correct map edge effects in nearest neighbor computations, which are based on the assumption that point patterns extend to infinity in all directions [4].

The concentration may be expressed as point density and dispersion from a hypothetical central place. These measurements could be calculated by taking into account only locational attributes such as gravity, mean center and standard distance, and distance from preferential axes. Nonlocational attributes may be used as weights, influencing relative distances between points (mean distance, variance,
etc.). With respect to density measures, the simplest method to generate density maps is to use a grid and count the number of points per grid cell [6]. The principal issues in this field involve cell size selection and problems related to density computations near the boundaries of the examined area [6]. Statistical techniques for density computations have also been proposed (kernel functions) [106]. Concentration measures are used for point patterns representing center locations of polygonal entities in order to find spatial relations between categories of the same phenomenon (attraction, repulsion, combinatory effect, etc.). However, concentration measures could be based solely on aspatial attributes. For instance, we may try to identify the preponderant urban function within an administrative area using a land-use map. Different indexes permit evaluation of the mutual influence of each component without regard to the location of their attachment point [107]. The notion of distance is applied in cases where the neighborhood or zone of influence of a point is sought. This neighborhood could be a circle of specified diameter (buffer) centered on a point, or may be defined using distances between that point and its closest neighbors, as, for example, in constructing Thiessen (or Voronoi) polygons and Delaunay triangles. Such figures are usually employed in conjunction with point patterns representing sampling locations to generate maps of continuous or zonal phenomena. Alternatively, and for quantitative data only, interpolation methods such as trend surfaces [4] and Kriging [5] are applied.

Line pattern analysis: Given the nature of linear elements, certain intrinsic characteristics provide information on their geometry, direction (general tendency), mutual arrangement (trees) and movement inside a network (circuit). This diversified information, once synthesized, allows the description of the general pattern and of the relationship between each line within the pattern. Figure 10 presents the various measurements extracted from a linear pattern, while Table 5 shows possible applications for such measures. Many geometric parameters can be extracted from a linear pattern (connected or unconnected), such as sinuosity, density and frequency. Further analysis may be performed to identify the direction using traditional techniques (rose diagram [4]) or other parameters presented in Fig. 10 (e.g. tangent direction). Special measurements may be extracted to characterize lines forming networks or loops (drainage networks, road networks, etc.). The circuit incorporates all the characteristics related to the segments of a network.

Areal pattern analysis: Areal pattern analysis is often limited to autocorrelation measurements [6,100] permitting the characterization of binary or multiple attribute areal patterns. Examples of such measurements, which could be applied to either vector or raster data, are joint-count statistics, and the Geary and Moran coefficients (Fig. 11). Measurements of contiguity, such as the number of joints between adjacent polygons, the common boundary length, or the distances between centroids of adjacent polygons, may be introduced in such autocorrelation measurements. Moreover, statistical tests have been developed to test the similarity between a given pattern and one organized at random and, in the case of multivalued attributes, between a
Fig. 12. Examples of point, line and areal parameters.
pattern and attributes drawn independently from a given normal population [6,103]. Such measurements are usually extracted as a starting point for an analysis of the origin of such patterns. Chou et al. [109] present an interesting example of the use of such measurements in order to take into account spatial autocorrelation phenomena in a logistic regression for the prediction of the spatial occurrence of wildfires as a function of various regional biophysical factors. Local texture measurements similar to those used in remote sensing may also be applied in order to characterize continuous phenomena represented in raster format. Finally, various shape measures may also be extracted from polygon data representing natural discrete objects (Fig. 11). As in the case of remotely sensed images, such measurements may be used to distinguish categories of objects. Figure 12 illustrates various point, line and area features extracted from map data.

4.2. Map Data Comparison and Combination
Map data comparison may involve map pairs representing two different phenomena and covering the same territory, or the same phenomenon but in different territories. The primary goal of such comparisons is to discover spatial associations and differences or similarities in the distribution of mapped variables. Davis [4] discusses the comparison of map pairs representing continuous variables using statistical
similarity criteria, while Bonham-Carter [6] extends the discussion to categorical variables. The combination of two or several maps seeks primarily to use models to analyze and predict spatial phenomena (Section 1). Map overlay using Boolean or arithmetic operations is commonly employed for combining map pairs, while theoretical models or empirical models based on statistical or heuristic relationships are used for the combination of multiple map data sets. Bonham-Carter [6] presents an excellent overview of spatial modeling using data- or knowledge-driven models such as Boolean logic, index overlay, fuzzy logic models and Bayesian methods, while Zou and Civco [110] propose genetic learning neural networks for suitability analysis.

4.3. New Data Set Extraction
Classification of quantitative attributes, reclassification of categorical data according to various criteria (Fig. 11), generalization and aggregation of spatial data [111], and transformation of one data set into another (raster-to-vector or vice versa, point to area or vice versa) are the basic methods used to produce new data sets from one input geographic data set [6]. Of particular interest in the context of combined RSIAS/GIS are methods for producing new data sets from existing ones, which could then be used as site-specific geographic knowledge or new geographic information to facilitate landscape understanding and the monitoring of spatial phenomena. The terms structural and pattern recognition approaches have been coined by the authors in this handbook to describe these methods.

Structural approaches: This category includes algorithmically intensive procedures applied primarily to digital elevation models (DEM) to automatically extract various geomorphologic information such as watershed basins [112], slope and other geomorphometric variables [113], as well as qualitative landform parameters such as crests, troughs, and flats [114]. Watershed basins are basic units for terrestrial environmental studies, while other geomorphometric variables are used to improve image classification results (Section 1). Landform information is a basic element in terrain analysis and may constitute important geographic knowledge in interpreting remotely sensed images.

Pattern recognition approaches: Examples of pattern recognition approaches applied to a single map data set or to multiple data sets reported in the literature are cited in Section 1 of this chapter. In particular, Gong et al. [27] provide an interesting example of the use of neural networks for the mapping of ecological land units using multiple map layers in either vector or raster format. In addition, they propose an index of certainty which permits an appraisal of the final map quality. Neural networks constitute one of the possible methods which may be used with map data representing quantitative and/or qualitative data. In fact, the application of standard pattern recognition approaches such as the Maximum Likelihood classifier under the Gaussian assumption (Section 3.4) is hampered by quantitative map data which usually violates this assumption, as well as by quantitative data following
angular distributions (e.g. slopes), or qualitative data. Classification algorithms based on the Dempster-Shafer evidence theory [88], fuzzy sets, decision-tree classifiers [115,116], or algorithms especially designed for categorical data [117,118] are different approaches that may be used for pattern recognition in the context of a RSIAS/GIS.
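The "structural" DEM-based extraction mentioned in Section 4.3 can be sketched as follows. The sketch is illustrative only, assumes a regularly gridded DEM with a known cell size, and uses simple finite differences to derive slope and aspect, two geomorphometric variables commonly fed into image classification.

```python
# Illustrative sketch: slope and aspect from a gridded digital elevation model.
import numpy as np

def slope_aspect(dem, cell_size=30.0):
    """dem: 2-D array of elevations; cell_size: ground distance between cells (same units)."""
    dz_dy, dz_dx = np.gradient(dem.astype(float), cell_size)
    slope_deg = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
    # 0 = north, measured clockwise; one common convention among several
    aspect_deg = np.degrees(np.arctan2(-dz_dx, dz_dy)) % 360.0
    return slope_deg, aspect_deg
```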
PART III: COMBINED MAP-IMAGE ANALYSIS

5. Pattern Recognition Approaches
As mentioned in the introduction of this chapter, combined map-image analyses are often undertaken to enhance the results of image classification. The Digital Elevation Model (DEM) is the most important geographic data source used in such undertakings (Table 1). This is understandable since topography is an important parameter constraining the occurrence of geographic phenomena such as vegetation types, surface cover and land use types; it reflects to some extent the underlying geology and is directly associated with geomorphic processes. Data sets derived from structural analysis of a DEM (Section 4.3) may be used in image classification [113]. As with map data, alternative approaches to the standard maximum-likelihood classification of remotely sensed images must be used [88,116].

6. Map-Guided Approaches for Image Analysis

The term map-guided (M-G) approach was introduced to describe various processes using existing map data as a model of the situation on the ground (Section 1). Thus maps were used in the initial step of an image segmentation by imposing lines, for example roads, onto a region segmentation of the image [44,118]. This initial segmentation was further refined in subsequent steps using standard segmentation techniques constrained by rules concerning the expected regional characteristics (knowledge-based image segmentation). Problems related to such imposition of lines are mostly related to misregistration between the map and the image (e.g. double region boundaries) [118]. In order to eliminate ambiguities due to misregistration, various semi-automatic and automatic methods have been proposed. For example, Goldberg et al. [23] describe an expert system that can recognize the relative positions of corresponding features (lakes, roads, etc.) on the map and image and locally compute displacement vectors for rubber-sheeting operations. M-G techniques offer interesting potential, especially in automatic change detection and identification problems. Some examples of such approaches are reviewed next.

Maillard and Cavayas [20] discuss an M-G procedure for updating topographic maps using SPOT panchromatic images, with particular emphasis on the road network. To correct map errors, the locations of the road network segments as depicted on the map are first searched for locally on the image and geometrically corrected to match the map projection characteristics. At the same time, hypotheses are generated concerning the existence of new road segments. These hypotheses are
based on the fact that new roads always intercept existing road networks. Once the map road segments are located on the image, road pixels identified as probable intersections with new road segments are used as starting points of a road segment tracking algorithm. All local operators are essentially based on line detection operators assuming that the road follows a spike-like model. Although the procedure works well in general, due to the rigidity of such line detection operators, whose parameters are fixed once and for all, road segments may be difficult to locate because of faint contrast between the road and its environment or because of the inversion of the spike-like model (the road becomes darker than the environment). More recently, Fiset and Cavayas [43] replaced the line detection operators by the activation values of an NN trained to recognize only roads on the image. It was thus found that even if the detection of the road network by the NN is incomplete (Fig. 13), the road network is far more accurately located on the image [119].

Fig. 13. Extraction of the urban road network from a SPOT panchromatic image using a neural network.

A goal-driven, rule-based system was proposed by Cavayas and Francoeur [21] for the updating of forestry inventory maps, taking into account disturbance due to logging activities. The existing forestry map is used to segment the red spectral band of the satellite image. A number of histogram features (modes, entropy, occupancy of specific gray-level ranges) are extracted for each segment and matched with expected features. Once the type of forest cut (partial or total) is detected, the system uses a data-driven approach to identify the age of the cut and other useful characteristics (selective partial cut, successive partial cut, etc.) for forestry map-updating purposes. Applied to a SPOT multispectral image, the system attains an overall accuracy of 87%.

A rule-based image understanding system for urban land-use identification and change detection using map data and SPOT panchromatic images was proposed
by Baudouin [79]. The system comprises an iconographic database (maps, satellite images) and a knowledge base, including geographic knowledge on urban areas (see Table 4 and Fig. 4). The map data classification unit is a city block, and the interpretation rules are based on four features: mean gray-level value, coefficient of variation, size, and compactness index (Fig. 14). The latter is defined as the ratio of the actual size of a polygon to the size of a circle with the same perimeter length as the actual polygon. The system permits the identification of 20 land use taxa corresponding to the city land use map at the scale of 1:50,000. The system arrives at a classification accuracy per city block ranging from 80% to 87% depending upon the HRV-SPOT sensor view angle, the best accuracy being obtained with the sensor pointing in the solar direction. A special rule permits recognition of big city blocks which have been subdivided since the date of the map compilation. The system then selects an appropriate mask for the detection of the new road pattern within that block using knowledge of the street patterns in the neighboring area.
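The compactness index defined above is easy to state in code. The sketch below follows that definition exactly (block area divided by the area of a circle with the same perimeter); the toy interpretation rule that follows is purely hypothetical, since the actual thresholds and land-use taxa of Baudouin's system are not given in the text.

```python
# Compactness index as defined above: block area divided by the area of a
# circle having the same perimeter (that circle's area is P^2 / (4*pi)).
import math

def compactness_index(area, perimeter):
    return area / (perimeter ** 2 / (4.0 * math.pi))

def toy_block_rule(mean_gray, coeff_var, size_m2, compactness):
    # Hypothetical IF-THEN rule in the spirit of the four per-block features;
    # the thresholds and labels are NOT those of the system described in the text.
    if compactness > 0.7 and size_m2 > 50_000 and coeff_var < 0.15:
        return "open area (hypothetical label)"
    return "built-up (hypothetical label)"

# a 100 m x 100 m square block: area 10,000 m^2, perimeter 400 m
print(compactness_index(10_000.0, 400.0))   # about 0.785
```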
Fig. 14. Variation of four parameters for different city blocks; on the left, histograms extracted from SPOT panchromatic images; and on the right, the contents of city blocks in high-resolution aerial photographs.
Conclusion

This chapter provided a summary of developments in two connected fields, remotely sensed image analysis and map data analysis, with emphasis on pattern extraction and description, and on the segmentation and categorization of geographic space. Finally, examples were given to illustrate the possibilities of achieving greater synergy in the use of both image and geographic data sets. Such use will be essential to fully exploit the images that will be provided by new satellite sensors with resolutions ranging from a meter to a kilometer. The described analytic tools could constitute the core of integrated RSIAS and GIS systems.
References

[1] J. Star and J. Estes, Geographic Information Systems: An Introduction (Prentice-Hall, 1990).
[2] D. J. Maguire et al. (eds.), Geographic Information Systems: Principles and Applications, 2 volumes (Longman Scientific & Technical, 1991).
[3] S. Fotheringham and P. Rogerson (eds.), Spatial Analysis and GIS (Taylor & Francis, 1994).
[4] J. C. Davis, Statistics and Data Analysis in Geology, 2nd edn. (John Wiley & Sons, 1986).
[5] E. H. Isaaks and R. M. Srivastava, An Introduction to Applied Geostatistics (Oxford University Press, 1989).
[6] G. F. Bonham-Carter, Geographic information systems for geoscientists: modelling with GIS, Computer Methods in the Geosciences, 13, Pergamon (1994).
[7] J. Raper (ed.), Three Dimensional Application in Geographic Information Systems (Taylor & Francis, 1989).
[8] G. Langran, Time in Geographic Information Systems (Taylor & Francis, 1992).
[9] J. C. Muller, Latest developments in GIS/LIS, Int. J. GIS 7 (1993) 293.
[10] J. E. Estes et al., Fundamentals of image analysis: analysis of visible and thermal infrared data, in Manual of Remote Sensing, Colwell (ed.), ASPRS (1983) 987-1124.
[11] Tenenbaum et al., Map-guided interpretation of remotely-sensed imagery, Proc. IEEE Computer Soc. Conf. Pattern Rec. Image Proc., Chicago, Illinois (1979) 610-627.
[12] D. M. McKeown Jr, Knowledge-based aerial photo interpretation, Photogrammetria 39 (1984) 91.
[13] M. J. Jackson and D. C. Mason, The development of integrated geo-information systems, Int. J. Remote Sens. 7 (1986) 723.
[14] D. G. Goodenough, Thematic Mapper and SPOT integration with a GIS, Phot. Eng. Rem. Sens. 54 (1988) 167.
[15] F. W. Davis and D. S. Simonett, GIS and remote sensing, in Maguire et al. (eds.), ch. 14, vol. 1 (1991) 191-213.
[16] M. J. Wagner, Through the looking glass, Earth Obs. Magaz. (1996) 24.
[17] J. Way and E. A. Smith, The evolution of SAR systems and their progression to the EOS SAR, IEEE Trans. Geosci. Rem. Sens. GE 29 (1991) 962.
[18] United Nations, Rapport Mondial de la Cartographie, vol. XX (1991).
[19] H. D. Parker, The unique qualities of a GIS: a commentary, Phot. Eng. Rem. Sens. 54 (1988) 1547.
[20] P. Maillard and F. Cavayas, Automatic map-guided extraction of roads from SPOT imagery for cartographic database updating, Int. J. Rem. Sens. 10 (1989) 1775.
[21] F. Cavayas and A. Francoeur, Système expert pour la mise à jour des cartes forestières à partir des images satellites, in Télédétection et Gestion des Ressources, P. Gagnon (ed.), Assoc. Québécoise de Télédétection, Vol. VII (1991) 169-178.
[22] Y. Baudouin et al., Vers une nouvelle méthode d'inventaire et de mise à jour de l'occupation/utilisation du sol en milieu urbain, Can. J. Rem. Sens. 21 (1995) 28.
[23] M. Goldberg et al., A knowledge-based approach for evaluating forestry-map congruence with remotely sensed imagery, Trans. Royal Soc. Lond. A234:447 (1988).
[24] K. Thapa, Automatic line generalization using zero-crossings, Phot. Eng. Rem. Sens. 54 (1988) 511.
[25] J. Quian et al., An expert system for automatic extraction of drainage networks from digital elevation data, IEEE Trans. Geosc. Rem. Sens. GE 28 (1990) 29.
[26] J. Chorowicz et al., A new technique for recognition of geological and geomorphological patterns in digital terrain models, Remote Sens. Environ. 29 (1989) 229.
[27] P. Gong et al., Mapping ecological land systems and classification uncertainties from digital elevation and forest-cover data using neural networks, Phot. Eng. Rem. Sens. 62 (1996) 1249.
[28] D. P. Argialas et al., Quantitative description and classification of drainage patterns, Phot. Eng. Rem. Sens. 54 (1988) 505.
[29] B. A. St-Onge and F. Cavayas, Estimating forest stand structure from high resolution imagery using the directional variogram, Int. J. Rem. Sens. 16 (1995) 1999.
[30] L. Ward, Some examples of the use of structure functions in the analysis of satellite images of the ocean, Phot. Eng. Rem. Sens. 55 (1989) 1487.
[31] G. M. Henebry and H. J. H. Kux, Lacunarity as a texture measure for SAR imagery, Int. J. Rem. Sens. 16 (1995) 565.
[32] D. Roach and M. Lasserre, Topographic roughness exponent estimates from simulated remote sensing images, Proc. 16th Can. Symp. Rem. Sens. (1993) 793-798.
[33] N. Lam, Description and measurements of Landsat TM images using fractals, Phot. Eng. Rem. Sens. 56 (1990) 187.
[34] U. M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining (AAAI Press/The MIT Press, 1996).
[35] M. Ehlers et al., Integration of remote sensing with GIS: a necessary evolution, Phot. Eng. Rem. Sens. 55 (1989) 1619.
[36] J. M. Piwowar and E. F. LeDrew, Integrating spatial data: a user's perspective, Phot. Eng. Rem. Sens. 56 (1990) 1497.
[37] P. A. Burrough, Principles of GIS for Land Resources Assessment (Clarendon Press, 1986).
[38] A. H. Robinson et al., Elements of Cartography, 6th edn. (John Wiley & Sons, 1995).
[39] J. R. Carter, Digital representation of topographic surfaces, Phot. Eng. Rem. Sens. 54 (1988) 1577.
[40] A. Pentland et al., Photobook: content-based manipulation of image databases, Proc. SPIE 2185, Storage and Retrieval of Image and Video Databases II (1994).
[41] S. Menon and T. R. Smith, A declarative spatial query processor for GIS, Phot. Eng. Rem. Sens. 55 (1989) 1593.
[42] W. R. Tobler and Z. Chen, A quadtree for global information storage, Geog. Analysis 18 (1986) 360.
[43] R. Fiset and F. Cavayas, Automatic comparison of a topographic map with remotely sensed images in a map updating perspective: the road network case, Int. J. Remote Sensing 18 (1997) 991.
[44] D. C. Masson et al., The use of digital map data in the segmentation and classification of remotely-sensed images, Int. J. GIS 2 (1988) 195.
664
F. Cavayas B Y. Baudouin
[45] D. P. Argialas and C. A. Harlow, Computational image interpretation models: an overview and a perspective, Phot. Eng. Rem. Sens. 56 (1990) 871. (461 M. Goldberg et al., A hierarchical expert system for updating forestry maps with Landsat data, IEEE Proc. 73 (1985) 1054. [47] M. Nagao and T. Matsuyama, A Structural Analysis of Complex Aerial Photographs (Plenum Press, 1980). [48] D. W. Patterson, Introduction to Artificial Intelligence and Expert Systems (PrenticeHall, 1990). [49] P. H. Swain and S. M. Davis (eds.), Remote Sensing: The Quantitative Approach (McGraw-Hill, 1978). [50] V. B. Robinson and A. U. Frank, Expert systems for GIs, Phot. Eng. Rem. Sew. 53 (1987) 1435. [51] W. K. Pratt, Digital Image Processing, 2nd edn. (John Wiley & Sons, 1991). [52] A. K. Jain, Fundamentals of Digital Image Processing (Prentice-Hall, 1989). [53] D. J. Marceau et al., Evaluation of the grey-level co-occurrence matrix method for land-cover classification using SPOT imagery, IEEE Trans. Geosc. Rem. Sens. GE 28 (1990) 513. [54] C. A. Hlavka, Land-use mapping using edge density texture measures on TM simulator data, IEEE ’Ilans. Geosc. Rem. Sens. GE 25 (1987) 104. [55] P. Gong and P. Howarth, The use of structural information for improving land-cover classification accuracies at the rural-urban fring, Phot. Eng. Rem. Sens. 56 (1990) 67. [56] 0. R. Mitchell et al., A MAX-MIN measure for image texture analysis, IEEE Trans. Computers C26 (1977) 408. [57] L. Wang and D. C. He, A new statistical approach for texture analysis, Phot. Eng. Rem. Sens. 56 (1990) 61. [58] P. Gong et al., A comparison of spatial texture extraction algorithms for land-use classification with SPOT HRV data. Rem. Sens. Environ. 40 (1992) 137. [59] S. E. Franklin and G. J. McDermid, Empirical relations between digital SPOT HRV and CASI spectral response and lodgepole pine forest stand parameters, Int. J . Rem. Sew. 14 (1993) 2331. [60] P. 3. Van Otterloo, A Contour-oriented Approach to Shape Analysis (Prentice-Hall, 1991). [61] L. De Cola, Fractal analysis of classified Landsat scene, Phot. Eng. Rem. Sens. 55 (1989) 601. [62] D. Roach and K. Fung, Scale-space quantification of boundary measures for remotely-sensed objects, Proc. 16th Can. Symp. Rem. Sens. Sherbrooke, 7-10 June 1993, 693-700. [63] S. Wang et al., Spatial reasoning in remotely sensed data, IEEE Trans. Geosc. Rem. Sens. GE 21 (1983) 94. [64] R. M. Haralick et al., Extraction of drainage network by using the consistent labeling technique, Rem. Sens. Enuir. 18 (1985) 163. [65] V. S. S. Hwang et al., Evidence accumulation for spatial reasoning in aerial image understanding, Center for Automation Research, Univ. of Maryland, Report Number CAR-TR-28; CS-TR-1336 (1983). (661 F. Leberl et al. (eds.), Mapping buildings, roads and other man-made structures from images, Proc. IAPR TC-7 Workshop, Graz, AU, 2-3 September, 1996. [67] J.-P. Cocquerez and S. Philipp, Analyse d’images: Filtrage et Segmentation (Masson, 1995). [68] A. C. Bovik, On detecting edges in speckled images, IEEE Trans. Acoustics, Speech, Signal Proc. 36 (1988) 1618.
3.8 Pattern Recognition and Computer Vision for . . . 665 [69] M. James, Pattern Recogn. (BSP Professional Books, Oxford, UK, 1987). (701 F. M. Wahl, Digital Image Signal Processing (Artech House, 1987). [71] A. Lopes et al., Adaptative speckle filters and scene heterogeneity, ZEEE Trans. Geosc. Rem. Sens. GE 28 (1990) 992. [72] T. Ranchin and L. Wald, The wavelet transform for the analysis of remotely sensed images, Int. J. Rem. Sew . 14 (1993) 615. [73] B. Guindon, Multi-temporal scene analysis: a tool to aid in the identification of cartographicaly significant edge features on satelllite imagery, Can. J. Rem. Sens. 14 (1988) 38. [74] J. C. Weszka et al., A comparative study of texture measures for terrain classification, ZEEE R a m . Syst. Man Cybern. SMC 6 (1985) 269. [75] J.-S. Lee and I. Jurkevich, Segmentation of SAR images. ZEEE Trans. Geosc. Rem. Sens. GE 27 (1989) 674. [76] R.M. Hord, Remote Sensing: Methods and Applications (John Wiley & Sons, 1986). [77] G. B. Benie and K. P. B. Thompson, Hierarchical segmentation using local and adaptive similarity rules, Znt. J . Rem. Sens. 13 (1992) 1559. [78] B. St-Onge and F. Cavayas, Automated forest structure mapping from high resolution imagery based on directional semiovariogram estimates, Rem. Sensing of Envir. in press (1997). [79] Y. Baudouin, Dkveloppement d’un syst&med’andyse d’images satellites pour la cartographie de l’occupation du sol e n milieu urbain, thesis, Montreal University (1992). (801 J. R. Jensen et al., Urban/suburban land use analysis, in Manual of Remote Sensing, R. N. Colwell (ed.),ASPRS (1983) 1571-1666. (81) K. Arai, Multi-temporal texture analysis in TM classification, Can. J. Rem. Sens. 17 (1991) 263. [82] S. W. Wharton, Algorithm for computing the number of distinct spectral vectors in thematic mapper data, IEEE %ns. Geosc. Rem. Sens. GE 23 (1985) 67. (831 J. Cohen, A coefficient of agreement for nominal scales, Educational and Phycological Measurement 20 (1960) 37. [84] A. K. Skidmore, An expert system classifies eucalypt forest types using thematic mapper data and a digital terrain model, Phot. Eng. Rem. Sens. 55 (1989) 1449. [85] S. W. Wharton, A spectral-knowledge based approach for urban land-cover discrimination, IEEE %ns. Geosc. Rem. Sens. GE 25 (1987) 272. [86] E. Concole and M. C. Mouchot, Comparison between conventional and fuzzy classification methods for urban area and road network characterization, in Leberl et al. (eds.) (1996) 3%49. [87] I. A. Leiss et al., Use of expert knowledge and possibility theory in land use classification, in Progress in Rem. Sens. Research and Applications, E. Parlow (ed.) (A. A. Bakema, Rotterdam, 1996). [88] D. K. Peddle, Knowledge formulation for supervised evidential classification, Phot. Eng. Rem. Sens. 61 (1995) 409. (891 D. H. Ballard and C. M. Brown, Computer Vision (Prentice-Hall, 1982). [go] P. Whiston, Artificial Intelligence, 2nd edn. (Addison Wesley, 1984). (911 M. Goldberg et al., A production rule-based expert system for interpreting multitemporal Landsat imagery, Proc. IEEE Comp. SOC. Conf. Comp. Vision and Patt. Rec., Washington, D. C. (1983) 77-82. (921 F. Gaugeon, A forestry expert package - the Lake Traverse study, Petawawa Nat. Forestry Inst., Forestry Canada, Information Report PI-X-108 (1991). [93] F. Wang and R. Newkirk, A knowledge-based system for highway network extraction, IEEE Bans. Geosc. Rem. Sens. GE 26 (1988) 525.
666
F. Cavayas & Y. Baudouin
[94] D. G. Goodenough et al., An expert system for remote sensing, IEEE Trans. Geosc. Rem. Sens. GE 25 (1987) 349. [95] K. Fung et al., The system of hierarchical experts for resource inventory (SHERI), Proc. 16th Can. Symp. Rem. S e w . (1993) 793-798. [96] A. N. Steinberg, Sensor and data fusion, in Emerging Systems and Technologies, S . R. Robinson (ed.), The Infrared and ElectreOptical Systems Handbook, Vol. 8 (ERIM and SPIE Optical Eng. Press, 1993) 239. [97] J. Bertin, Se'miologie Graphique - Les diagrammes - les re'seaux - les cartes (Gauthier Villars, 1967). [98] D. J. Cuff and M. T. Mattson, Thematic maps Their design and production (Methuen, 1982). (991 B. D. Dent, Cartography Thematic Design, 2nd edn. (Wm C. Brown Publishers, 1990). [loo] P. Lewis, Maps and Statistics (Methuen & Co Ltd 1977). [loll D. Unwin, Introductory Spatial Analysis (Methuen, 1982). [lo21 B. Boots and A. Getis, Point Pattern Analysis, Sage Scientific Geography Series, Number 8 (Sage Publications, London, 1988). [lo31 R. G. Cromley, Digital Cartography (Prentice-Hall, 1992). [lo41 M. N. Demers, Fundamentals of: Geographic Information Systems (John Wiley & Sons, 1997). [lo51 M. S. Monmonier, Computer-Assisted Cartography Principles and Prospects (Prentice-Hall, 1982). [lo61 Gatrell and B. Rowlingson, Spatial point process modelling in a GIS environment, in Spatial Analysis and GZS (1994) 147-163. [107] H. Beguin, Me'thodes d'analyse Ge'ographique Quantitative (Litec, 1988). [lo81 J. Wang and P. J. Howarth, Structural measures for linear feature pattern recognition from satellite imagery, Can. J. Rem. Sens. 17 (1991) 294. [lo91 Y.-H. Chou et al., Spatial autocorrelation of wildfire distribution in the Idyllwird Quandrangle, Phot. Eng. Rem. Sens. 56 (1990) 1507. [110] J. Zhou and D. L. Civco, Using genetic learning neural networks for spatial decision making in GIs, Phot. Eng. Rem. Sens. 62 (1996) 1287. [lll] M. S. Monmonier, Raster-mode area generalization for land use and land cover maps, Cartographica 20 (1983) 65. [112] S. K. Jenson and 0. J. Domingue, Extracting topographic structure from digital elevation data for GIS analysis, Phot. Eng. Rem. Sens. 54 (1988) 1563. [113] G. J. McDermid and S. E. Franklin, Spectral, spatial and geomorpometric variables for the remote sensing of slope process, Rem. Sens. Envir. 49 (1994) 57. [114] Blaszczynski, Landform characterization with GIs, Phot. Eng. Rem. S e w . 63 (1997) 183. [115] R. K. T. Reddy and G. F. Bonham-Carter, A decision-tree approach to mineral potential mapping in Snow Lake area, Manitoba, Can. J. Rem. Sens. 17 (1991) 191. [I161 A. T. Cialella et al., Predicting soil drainage class using remotely sensed and digital elevation data, Phot. Eng. Rem. Sens. 63 (1997) 171-178. (1171 M. James, Classification Algorithms (Collins, 1985). [118] L. Sanders, L'analyse Statistique des Donne'es e n Ge'ographie,RECLUS (Montpellier, 1989). [119] A. M. Tailor et al., Development of a knowledge-based segmentor for remotely sensed images, Phil. Trans. R. SOC.Lond. A324 (1988) 437. [120] Fiset et al., Map-guiding and neural networks: a new approach for high accuracy automatic road extraction, in Leberl et al. (eds.) (1996) 293-308.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 667-683 Eds. C. H. Chen, L. F. Pau and P. S. P. Wang © 1998 World Scientific Publishing Company
CHAPTER 3.9
FACE RECOGNITION TECHNOLOGY*

*Can Machines Ken?
MARTIN LADES
Institute for Scientific Computing Research, L-416, Lawrence Livermore National Laboratory, Livermore, California 94551, USA
E-mail:
[email protected]

This chapter surveys the progress in the application area of face recognition, a task posing a challenging mix of problems in object recognition. Face recognition promises to contribute to solutions in such diverse application areas as multimedia (e.g. image databases, human-computer interaction), model-based compression, and automated security and surveillance. Contributing to the growing interest in face recognition is the fact that humans readily relate to the results ("... but I am terrible with names."). The Encyclopedia Britannica defines recognition as a "form of remembering characterized by a feeling of familiarity" under the usage of face recognition: "... Recognizing a familiar face without being able to recall the person's name is a common example ...". The chapter starts with a task statement distinguishing classes of applications, continues with pointers supporting the competitive testing of face recognition, and provides examples of image standards and publicly available databases. The remainder of the chapter describes as an example the paradigm based on elastic matching of attributed graphs.

Keywords: Face recognition, object recognition, image database, automated surveillance, wavelets, elastic matching.
1. Introduction

Automated face recognition poses an attractive mix of challenges in computer vision and, due to the human interest in the application, often serves as an example for the approaches described in this handbook and its first edition [1]. Several excellent reviews on generic and specific aspects of face recognition have been published, e.g. [2-5]. More information on the neurophysiological background of how face recognition works in humans as well as in technical systems is found, for example, in books dedicated to the processing of facial information (e.g. [6]), the proceedings of recent conferences (e.g. [7-9]), and on the world wide web (WWW) (e.g. [10-12]). This handbook chapter complements the wide variety of available information with a description of some practical aspects of automated face recognition research and
pointers to available online resources. The chapter is divided into three parts. In the first part several subclasses of face recognition applications are identified and pointers are provided in a survey of existing solutions and ongoing research. The second part contains examples of image standards and publicly available databases for the competitive evaluation of face recognition systems. The third part of the chapter concisely describes one of the leading face recognition paradigms, the elastic matching of attributed graphs, and some of its potential extensions.
2. Approaches to the Problem

2.1. Task Definition

The Encyclopedia Britannica defines recognition as "a feeling of familiarity when something previously experienced is again encountered" and Webster's Dictionary describes it as "the knowledge or feeling that someone or something present has been met before". The implementation of Automated Face Recognition (AFR) requires the specification of system components that achieve the technical interpretation of recognition: recall of stored data from memory. This may occur by constructing and evaluating the similarity between a database of "known" faces and the current input data. Many AFR systems have the following typical components: the finding of a face, the estimation of posture/affine transforms, and the identification of the face, corresponding to the segmentation, preprocessing/feature extraction and classification stages of a classical pattern recognition system. The application areas for AFR systems are very diverse, reaching from multimedia (e.g. associative image databases and digital libraries [13-16] and many more) to law enforcement and security applications (e.g. automated surveillance, credit card verification, and other soft security applications; see also the review in [17]). The automated determination of such qualities as approximate age, gender, race, or emotional state is also an application of face recognition technology and may be interesting for both known and unknown faces. A flurry of research activity has been spawned by the recent advances in available computer power for the desktop. Real-time matching of video (dynamic matching) and the matching of still images (static matching) to large databases pose different challenges and constraints and lead to a plausible subdivision of the AFR application field. A further distinction is possible between systems which use normal image data and systems which use different sensor data, e.g. thermograms, range data and 3-D modeling techniques, or data collected by a sensor which includes processing. Table 1 lists some candidates of application areas for face recognition technology. Natural choices of application areas for face recognition appear to exploit unique strengths:

- Recognition from a photographic image. It may be the only source available.
- Noncooperative security/recognition. Face recognition works at a distance without knowledge of the subject under investigation.
- Interpretation of facial expressions/reconstruction. It uses underlying physiological knowledge.
Other application areas often face strong competition from cheaper solutions than computer vision.

Table 1. Examples of application areas of automated face recognition.

1. Mug shot matching. Challenges: search of large databases. Strengths: works with photographs; controlled environment (CE); controlled segmentation and image quality. Weaknesses: efficiency of database access.
2. ID validation. Challenges: rejection of false positive ID. Strengths: low complexity of match; conditions similar to 1. Weaknesses: distribution of faces required.
3. Matching of drawings. Challenges: feature types; distortions. Strengths: avoidance of viewer fatigue; conditions similar to 1. Weaknesses: competition from composited images.
4. Reconstruction from partials/remains. Challenges: accurate extrapolation of physiology. Strengths: integration of physiological knowledge. Weaknesses: relies partially on guessing.
5. Artificial aging. Challenges: accurate prediction of development. Strengths: integration of physiological knowledge. Weaknesses: background work required for reliable guess.
6. Automated surveillance. Challenges: segmentation; detection of intruder and activity. Strengths: avoidance of viewer fatigue. Weaknesses: poor image quality.
7. ID of moving face in crowd. Challenges: segmentation; real-time constraints. Strengths: avoidance of viewer fatigue. Weaknesses: poor image quality.
8. Interpretation of facial expressions. Challenges: motion analysis. Strengths: human-computer interaction. Weaknesses: high computational effort.
9. Model-based compression. Challenges: recognition and reconstruction; motion analysis; real-time constraints. Strengths: high compression ratio attainable. Weaknesses: high computational effort.
2.2. Static Matching

Mug shot matching is the most typical application in the class of static matching applications (see 1-5 in Table 1). Current research is geared towards efficient access of databases and robustness against noise, distortions and posture changes under rather controlled conditions (straightforward segmentation, limited postures, good illumination and image quality). The face finding and posture estimation stage of the system can then be simplified. The main challenge is then robust operation on sizeable databases (e.g. several million people in a DMV database of drivers' licenses), as they are already operational for fingerprints. Current research systems and industrial face recognition products operate well on most of the available public test databases, which at present contain around 1200 faces (see the NIST database and the FERET database listed in Section 3). Larger proprietary databases (with around 8000 images) are claimed for proprietary systems (MIT) but are not accessible for testing purposes. Once larger databases become accessible, the fruits of face recognition research will also benefit associative image databases and digital libraries, because the technology will permit, for example, the indexing/searching of the Internet with an associative paradigm in addition to the currently available textual metadescriptors. Early face recognition systems used a variety of approaches, such as matching of simple template regions, holistic measures such as principal component analysis (PCA), backpropagation or associative memory. The variety is greater than this brief survey can do justice to; the reader is directed to the named surveys for references. The motivations justifying certain technologies reach from, e.g. information theory (minimum data size, fast feedforward recognition) to the demonstration of concepts recognized as plausible from over 20 years of cognitive psychology and neuroscience research. Representative of the state of the art in mug shot systems are the systems under investigation in a project sponsored by the Army Research Lab on the FERET database. Projects ran, e.g. at Rockefeller University [18], at MIT extending on [19,20] and at USC based originally on [21,22]. These systems respectively use PCA or the Karhunen-Loève transform (see Chap. 1.1), local feature analysis (LFA), and elastic matching with wavelet-derived features (see Section 4 for a description) as recognition technology. The currently successful systems resemble each other in construction, explicitly providing the ability to cope with distortions and perspective changes by using localized feature descriptors. Commercial mug shot systems (some associated with researchers) are currently offered by: (1) Visionics Corporation [18] as FaceIt [23], developed by Identification and Verification International Inc. (IVI) [24] based on proprietary hardware/software technology and FaciaReco patents with patent protection for a number of applications, by the MIT/Media Lab as FaciaReco [25], and by Berninger Software as Visec-FIRE [26]; (2) Zentrum fuer Neuroinformatik as Phantomas [27] (including the ambition to match drawings to images, as first demonstrated in [28] for the algorithm); (3) Miros Inc. as TrueFace [29];
3.9 Face Recognition Technology 671 and (4) Technology Recognition Systems (TRS) Inc. as FRlOOO which uses infrared technology and is further discussed in a following subsection. Free demonstration systems are available on request or from the respective Web sites of Visionics, Miros and MIT for recognition with eigenfaces, and from Carnegie-Mellon University [30] for a face detector or face finder stage. Other excellent starting points for research related to mug shot recognition include [31-331 for another face finder approach. The operators of mug shot systems desire robustness as much as excellent discrimination on large databases. If a system can cope with an extended range of perspectives (e.g. a posture estimation stage can help to extend the range of in depth rotation, scaling and rotation, and distortions of a system) it becomes more applicable and reliable. Research is therefore geared to extend these capabilities in current systems as described in [5,34-381. Extended robustness requires, besides an integration of geometrical transforms, the integration of physiological knowledge: Which distortions are plausible, and what are the underlying bones imposing as constraints to the shape of potential facial expressions? Approaches similar to backprojection in tomography extend the range of 2-D face recognition without requiring an explicit 3-D representation. Another major thrust are entry/access control systems with cooperative ID verification. These systems are suggested for purposes ranging from door openers in high security areas to credit card verification. The conditions are similarly controlled as in mug shot systems but the time constraints may be more stringent. In this case face recognition is used for convenience and should only be considered an additional biometrical quantity that is measured since other simple and reliable measures such as hand geometry, finger prints, iris etc. are available and may require less processing or are harder to fake than a 2-D facial image. Caution applies also since the currently validated error rates of around 5 1%may not be sufficient for stringent security needs. Commercial systems are available (in addition to the academic research named in the listed reviews) from Visionics as FaceIt PC Access [39], from IVI [24], from Berninger Software as BS-Control [40], from TRS, from the Zentrum fuer Neuroinformatik as ZN-Face [41], and from Siemens/Nixdorf as FaceVACS [42]. The gathering of sufficient statistics across a representative part of the population is required to reduce the current error rates.
2.3. Dynamic Matching

The criterion for dynamic matching is that motion can serve as an additional cue for recognition. According to this, the application examples 6-9 in Table 1 are dynamic matching applications. They have a wide variety of requirements, and no consistent benchmarks exist so far except for compression. In compression, motion cues help to attain sparse encoding and high compression ratios. The motion cues in dynamic matching applications can serve to improve face recognition results by integrating the evidence and the stable features over an image sequence (e.g. [43-46], or 3-D shape recovery and motion analysis in [47]). Some interest with
respect to dynamic matching is directed ultimately towards the automated surveillance of crowds. Public surveillance with video cameras has, for example, led to a drastic reduction in crime rate in a field experiment in the UK (1996). Automating such public eyes could avoid viewer fatigue while retaining the benefits and even possibly identifying the criminals directly. For approaches, see the commercial vendors of security applications also listed under static matching. Crowd surveillance is a difficult problem because of the usually poor image quality delivered by the video cameras available in law enforcement/surveillance environments and the difficulty of segmenting a face from a moving background, e.g. in a casino. The segmentation problem can be alleviated by additional constraints, e.g. physical separation of the crowd into lines. What remains is then the real-time constraint for the access of large databases, already faced in milder form in static matching applications. Simpler problems with motion cues are the identification of simple changes (e.g. driver fatigue detection in an automobile) or the detection and identification of a single person (e.g. intruder detection under exclusion of false alerts by noise). They contribute to the continuing research interest in an extension of face recognition to dynamic matching and motion understanding (e.g. for facial expression analysis [48], for the annotation of video sequences with Multiple Perspective Interactive Video [49,50], or for human-computer interaction in [51]). The applications often require only minimal database sizes, as in the case of driver fatigue assessment (one person), e.g. [52], or advanced Nielsen rating (~10 persons) for television.

2.4. Alternate Sensors and Paradigms
In addition to the work with plain 2-D images, researchers attempt to extend face recognition to infrared images, range data, and an enhanced sensory dynamic range. For example, the company TRS Inc. [53] conducts research into the use of thermograms for recognition. TRS Inc. implemented a prototype access control system including a state-of-the-art IR imager. They proved that thermograms support robust recognition if the changes for external body parts with weak circulation (e.g. nose, ears) are segmented out and disregarded for recognition. Thermograms have at least two distinct advantages as image sensors for face recognition. They allow TRS Inc. to
- differentiate through high resolution imagery even between identical twins, since the fine structure of their blood vessels is different due to development;
- nonintrusively monitor a person unawares, extending on an advantage of visual recognition. Infrared can be used to monitor a person without his knowledge in a sensitive surveillance situation in a fully illuminated scene.
TRS claims patent rights on all face recognition research based on thermal imagery. High contrast variance caused by illumination changes is a major problem in face recognition. However, our eyes, with their remarkable dynamic range and
amplification, separate illumination changes from face-inherent variations with ease. Advanced sensors like the silicon retinas developed at the California Institute of Technology in C. Mead's group may include processing in the sensor and so extend, for example, the range and the stability of data available for a face recognition system (see, e.g. [54]). At LLNL we compared the performance of face recognition with an off-the-shelf CCD camera and a silicon retina in an experiment under identical environmental conditions and verified the retina's advantages under difficult lighting conditions (see, e.g. [55] (real-time face recognition), [56] and [57]). The retina chip models the biological example of the outer plexiform layer in a biological retina at a device level in silicon. It has functional equivalents of cell structures, e.g. syncytia consisting of linked CMOS devices that implement the membrane dynamics instead of the membrane-enclosed electrolytic fluids found in biological cells. The silicon retina functions in a face recognition system as an image sensor with a bandpass filter with local contrast adaptation. Our comparison proved, as expected, that the tested face recognition algorithms (global PCA and elastic matching) retained significantly better recognition rates with the chip under difficult lighting conditions (one-sided illumination). See Fig. 1 for a comparison of images taken with a CCD camera and the retina under identical conditions with two-sided and one-sided illumination. The images captured by the retina retain significantly more information in the dark image portion. The retina offers enhanced dynamic range, collecting better information, while this information is lost
Fig. 1. CCD camera vs. silicon retina. Top-to-bottom: two-sided and one-sided illumination; left-to-right: CCD, CCD matched to silicon retina (band-pass and distortion), silicon retina.
with a CCD sensor. Other research for integrated processing in silicon retinas targets motion-based segmentation and motion analysis, with the goal of extracting more robust features than is achievable by software alone. It is possible to avoid the inherent difficulty of processing under varying illumination by using a 3-D representation. The 3-D information is inherently illumination independent. It can be gathered either as range data [58] with a scanner or integrated from multiple/stereo images under controlled conditions (e.g. supported by structured lighting; see the chapter on 3-D modeling in this handbook). The controlled conditions avoid possible unreliability introduced into the calculation of the 3-D structures, e.g. uncertainties caused by variations in illumination or shadows. The work is done once the 3-D data set is computed. At this point 3-D representations appear to be computationally costly in comparison to the matching of 2-D images and therefore not practical for the law enforcement community. 3-D techniques should however prove invaluable for offline improvement of the projection techniques under investigation for the extension of robustness against in-depth rotation. Advanced associative memory designs (e.g. at the California Institute of Technology) use optical networks for face recognition [59,60]. Optics shows great potential for pattern processing, due to improved communication and interconnections and inherent parallelism, satisfying real-time constraints with ease.
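The role the silicon retina plays in the recognition chain, a band-pass filter with local contrast adaptation as described above, can be loosely emulated in software for comparison experiments. The sketch below uses a difference of Gaussians followed by a local energy normalization; the filter widths, the use of SciPy, and the overall formulation are illustrative assumptions made for this example, not a model of the actual chip.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retina_like(image, center_sigma=1.0, surround_sigma=4.0, adapt_sigma=8.0):
    """Rough software stand-in for a band-pass sensor with local contrast adaptation."""
    img = np.asarray(image, dtype=float)
    # Band-pass: difference of a narrow (center) and a wide (surround) Gaussian.
    bandpass = gaussian_filter(img, center_sigma) - gaussian_filter(img, surround_sigma)
    # Local contrast adaptation: divide by a smoothed estimate of local signal energy.
    local_energy = np.sqrt(gaussian_filter(bandpass ** 2, adapt_sigma)) + 1e-6
    return bandpass / local_energy
```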
3. Testing

An important aspect for researchers, developers and customers is the competitive evaluation of different face recognition systems. A common testground, in the form of a common, public database and a common image format, is essential for the evaluation of new algorithms and the fair comparison between different approaches to face recognition.
3.1. Image Standard for Mug Shots

The question of a common image standard for facial data exchange was recently addressed by the National Institute of Standards (NIST) in the United States, in collaboration with the Federal Bureau of Investigation (FBI), in a proposed addendum dated 1996-09-16 to the standard ANSI/NIST-CSL 1-1993. This addendum describes the standard data format for the interchange of fingerprint and signature information. The proposed exchange format for mug shots is a colored, 24-bit RGB, 480 x 600 pixel (portrait format) Joint Photographic Experts Group (JPEG) image in the so-called JPEG File Interchange Format (JFIF), version 1.02. This image format permits, within the framework of the standard, the specification of a different number of pixels per line or a different number of lines than quoted above, as well as a different horizontal and vertical resolution or pixel aspect ratio. The full record of a mug shot also contains additional information on specific marks such as scars, and optional pose encoding including an angle for a nonfrontal or profile
3.9 Face Recognition Technology 675 pose. The proposal is available from NIST (by request for the dated proposal). The chosen JFIF format offers ease-of-use and results in compact files (it is also among one of two standardized image formats for data exchange on the World Wide Web). The good color information is an important additional recognition cue for face recognition. However, artifacts may be introduced by the cosine transform in the compression process. Even at low compression ratios noticeable grid-like texture may show up in blocks related to the tiling during compression. This false texture information adds to the challenge for the automated recognition process. GIF or TIFF image formats are another choice found in current face recognition systems. These formats are also widely used and the latter offers enhanced flexibility in the representation and a variety of choices for internal compression.
3.2. Face Databases

The available datasets are limited in size and distribution, mostly by privacy concerns. Until recently only small databases were publicly available, such as the Massachusetts Institute of Technology (MIT) database [61] with 27 greylevel images per person of 16 persons in raw format (1 byte/pixel, 120 x 128 pixels, second level of a pyramid decomposition) under different angles, scales and illumination; or the Olivetti Research Laboratory (ORL) database [62] with 400 images of 40 persons in Portable Grey Map (PGM) format (1 byte/pixel, greylevel, 92 x 112 pixels). Larger, proprietary data sets could not be accessed for competitive testing. By the end of 1994, NIST had published a very challenging face database, Special Database No. 18, on a set of three compact disks. The database contains over 3000 high resolution greyscale images with approximately 1000 x 1000 pixels per picture. The images are frontal and profile views of over 200 individuals. For a subset of over 130 individuals there exist multiple views which show the person sometimes at a significantly different age. The maximum age difference is over 60 years. The database is very hard to analyze for the existing computerized face recognition systems. Beginning in 1993 the U.S. Army Research Laboratory sponsored a research program investigating face recognition systems for mug shots. Simultaneously with this research the Laboratory collected data for what is currently the largest database in use for competitive evaluation of mug shot recognition. The database currently contains around 1200 persons and represents the best existing standard for evaluations. Further inquiries regarding the availability of the data set and necessary qualifications should be addressed to the Army Research Laboratory [63].
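For experiments with the small public databases mentioned above, images in the two stated formats can be read as sketched below. The file names are placeholders, the row/column order of the headerless raw layout is an assumption, and only the pixel geometries quoted in the text (120 x 128 and 92 x 112, 1 byte/pixel) are taken from the chapter.

```python
import numpy as np
from PIL import Image

def load_raw_face(path, height=120, width=128):
    """Read a headerless 1 byte/pixel greylevel image (layout as quoted for the MIT set)."""
    data = np.fromfile(path, dtype=np.uint8)
    return data.reshape(height, width)   # assumes row-major storage

def load_pgm_face(path):
    """Read a PGM greylevel image (format used by the ORL set) into a numpy array."""
    return np.array(Image.open(path))

# Hypothetical file names, for illustration only:
# mit_face = load_raw_face("mit_face_0001.raw")
# orl_face = load_pgm_face("orl_s1_1.pgm")
```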
4. Face Recognition with Wavelets and Elastic Matching: An Example

This section presents the basic algorithm behind a successful face recognition paradigm in more detail. Variations of this algorithm are found in a selection of current face recognition systems (e.g., in chronological order, [27,64-66] and many others). The paradigm was originally inspired by a philosophy for the modeling of neural systems, the Dynamic Link Architecture (DLA) [67], but has evolved into an application more closely related to the elastic matching of deformable templates [22], the elastic net algorithm [68] and modern character recognition schemes, e.g. [69].
Fig. 2. Schematic of face recognition with elastic matching: (a) preprocessing, wavelet transform, (b) face finding with “Global Move” template match, (c) elastic distortion of face graph, (d) evaluation of similarity distribution and match assignment.
4.1. Basic System

Figure 2 shows a schematic summarizing the components behind the basic system for face recognition with elastic matching, starting from a grey level image to the identification of the closest match:

(a) Preprocessing: Wavelet-based features are extracted from the input image by a linear filter operation with a set of complex wavelets covering a range of frequencies and orientations. The set of wavelets consists, e.g. of complex Gabor wavelets reflecting the interesting range in the power spectrum of the objects (faces) under investigation. Instead of the Gabor function, a modified, DC-free version of it or a quadrature filter pair can be used with some advantage. The vectors x and k in Eq. (4.1) represent the coordinates in the spatial and frequency domain. The responses to the filter operation are arranged in a feature image with a feature vector at each pixel position. The feature vectors represent local texture descriptors closely related to local power spectra. Since the filter operation enlarges the data volume of the input image, a sparse attributed graph (see Fig. 3 for its structure) is extracted from the feature image.
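A minimal sketch of this preprocessing step follows. The kernel shape used here (the widely employed complex Gabor form exp(i k·x) exp(-|k|^2 |x|^2 / (2 sigma^2))), the sampling of five frequencies and eight orientations, the kernel size and sigma are all illustrative assumptions rather than the chapter's exact choices.

```python
import numpy as np

def gabor_kernel(k, sigma=2.0 * np.pi, size=33):
    """Complex Gabor kernel for wave vector k = (kx, ky) on a size x size grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    ksq = k[0] ** 2 + k[1] ** 2
    envelope = np.exp(-ksq * (x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(1j * (k[0] * x + k[1] * y))
    return envelope * carrier

def jet(image, pos, n_orient=8, n_freq=5, k_max=np.pi / 2, f=np.sqrt(2)):
    """Feature vector (jet) of complex filter responses at pixel position pos.

    pos must lie at least half a kernel width away from the image border.
    """
    image = np.asarray(image, dtype=float)
    responses = []
    for nu in range(n_freq):                       # frequency levels
        k_len = k_max / f ** nu
        for mu in range(n_orient):                 # orientations
            phi = np.pi * mu / n_orient
            kern = gabor_kernel((k_len * np.cos(phi), k_len * np.sin(phi)))
            half = kern.shape[0] // 2
            r, c = pos
            patch = image[r - half:r + half + 1, c - half:c + half + 1]
            responses.append(np.sum(patch * kern))  # filter response at pos
    return np.array(responses)
```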
Fig. 3. Attributed graph representing a sparse face model: each vertex at position x is labeled with the vector J(x), the wavelet transform W of the image I at x (W is parametrized by the set of wavelets); each edge (i, j) ∈ E of the graph is labeled with the distance vector between the vertices i, j at its endpoints.
(b) Face finder/segmentation: A generic face graph template is used to locate the face in the input feature image. The template is at first projected at a random position on the input feature image. This defines an input graph in the input image corresponding to the template graph. The position of the input graph is now varied in a Monte Carlo-based optimization procedure which maximizes the average similarity between the input graph and the template graph. During each step a new random position of the input graph is tested for an improvement of the average feature similarity between the template graph and the input graph. In the case of an improvement the position of the input graph is updated and the next position is tested. This "Global Move" shifting of the input graph continues until an optimal position with maximal similarity is reached.

(c) Now the process can continue with two cases:

Storage - The sparse input graph (feature vectors and edge vectors) is stored in the database of persons known to the system. Feature quality can be improved in an additional step by selecting better features instead of the ones derived from the input graph, e.g. by analyzing local image properties.

Matching - All memorized face graphs in the database are compared to the input feature image and for each a similarity measure is calculated. The input graph now starts as the projection of the memorized graph positioned at the coordinates found in step (b). For a single memorized graph this comparison minimizes a cost function
which is a linear combination of the vertex costs, expressed as the sum over the similarities S_v between all corresponding feature vectors of the memorized graph and of the input graph, and the edge costs, expressed as the sum over the dissimilarities D_e of corresponding edges in the memorized graph and the input graph. A possible choice for the similarity S_v is the normalized dot product between two corresponding feature vectors, and for D_e the square of the length of the difference vector between corresponding edges. The optimization procedure minimizes the combined total costs of the comparison between a face graph model from the database and the input graph by updating the positions of single vertices in the input graph. Each step is analyzed for an improvement in combined vertex and edge costs. This combination favors improved feature similarity while also penalizing the distortion of the memorized face graph. The result of (c) is a distribution of costs for the comparison between all memorized faces and the input mug shot.

(d) Match Assignment: The cost distribution is evaluated to find out whether the match was significant or has to be rejected. The face behind the graph with the lowest comparison costs is then assigned as the recognition result of the query.

This basic system performs with an over 98% recognition rate on a test gallery of approximately 100 faces. It still exceeds an 85% correct recognition rate if all false positive candidates are excluded through statistically motivated significance criteria. These numbers were calculated in validation tests on several 100-person databases, where a face gallery with a slight in-depth rotation of around 15° and a gallery with facial expressions were matched against a database of straight frontal views as the knowledge base of the system. Many different sets of wavelets and optimization criteria were investigated (see, e.g. [64]).
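To make the two cost terms concrete, the sketch below evaluates them for one pair of graphs: a normalized-dot-product similarity for jets and a squared-difference dissimilarity for edges, combined with a weighting factor lambda. The array layout, the value of lambda, and the use of the magnitude of the complex dot product are assumptions made for this example, not the chapter's exact formulation.

```python
import numpy as np

def jet_similarity(j_model, j_input):
    """Normalized dot product between two (possibly complex) feature vectors (jets)."""
    num = np.abs(np.vdot(j_model, j_input))
    return num / (np.linalg.norm(j_model) * np.linalg.norm(j_input) + 1e-12)

def graph_cost(jets_model, jets_input, pos_model, pos_input, edges, lam=0.1):
    """Total matching cost: negative summed vertex similarity plus weighted edge distortion."""
    vertex_cost = -sum(jet_similarity(jm, ji)
                       for jm, ji in zip(jets_model, jets_input))
    edge_cost = 0.0
    for i, j in edges:                          # edges given as index pairs (i, j)
        d_model = pos_model[i] - pos_model[j]   # edge vector in the memorized graph
        d_input = pos_input[i] - pos_input[j]   # corresponding edge in the input graph
        edge_cost += float(np.sum((d_model - d_input) ** 2))
    return vertex_cost + lam * edge_cost
```

A diffusion-like local optimization would repeatedly perturb single entries of pos_input and keep the change whenever graph_cost decreases, which favors feature similarity while penalizing distortion of the memorized graph, as described above.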
4.2. Extended System

The basic configuration can easily be extended to include scaling and orientation within the image plane.
Fig. 4. System schematic for distortion, orientation, and size-invariant matching: The extended "Global Move" scales (bi) and rotates (bii) the generic face graph (or a representative subset of the database) to calculate an estimated size and orientation of the face in the mug shot within 10%. The local distortion stage (c) copes with the remaining deviations and the similarity is evaluated as before (d).
3.9 Face Recognition Technology 679
Fig. 5. Examples of matched graphs for the extended system: face graph model extracted from (a) matched to a facial expression (b), rotated (c) and scaled input at 70% (d) and 50% (e) size of the original.
face graphs in size and orientation as well as translating them. In an iterative procedure with refined search steps the system finds good approximations for size and orientation of a face. Examples of graphs matched by the extended system are shown in Fig. 5. Galleries with 100 face images at 70% and 50% of the originally stored version, face images rotated by 30°, and combinations of them were validated against the original straight views. The extended system shown here produced recognition rates of over 90%, an order of magnitude better than global PCA on the same data. Although rectangular grids are shown here for visualization purposes the extended system uses face specific graph structures. Further extensions for elastic matching with hierarchical, heuristic matching schemes were successfully investigated which reduce the complexity from linear in the number of stored graphs (relaxing every single graph to equilibrium) to a logarithmic complexity [56]. The simultaneous strength and weakness of the elastic matching paradigm is its complexity for feedforward recognition. The system can manipulate the stored face models explicitly and concurrently to the matching process. However the penalty is a high computational demand on systems. It is arguable that projectionbased, holistic systems such as PCA are faster for feedforward calculations but have trouble coping with the variations encountered in the real world because they cannot capitalize on the multiplicative decomposition of objects and have to store multiple perspectives. Research in face recognition with paradigms related to elastic matching continues to investigate how to cope with the matching of partial faces [70], strong in-depth head turns [38], to extract robust face information from image
sequences [71], and to aid in reconstruction [72]. LLNL investigates related systems with respect to real-time systems based on embedded processing and silicon retinas, and with respect to mug shot systems, large image databases and online systems [73], to extend the ken of machines under the label of KEN technology.
Acknowledgements

This work was performed by the Lawrence Livermore National Laboratory under the auspices of the U.S. Department of Energy, contract No. W-7405-ENG-48. My thanks go to Prof. J. Buhmann and Dr. F. Eeckmann for their collaboration on the face recognition project with a silicon retina and to Dr. K. Boahen for the silicon retina hardware.
References

[1] C. H. Chen, L. F. Pau and P. S. P. Wang (eds.), Handbook of Pattern Recognition and Computer Vision, 1st edn. (World Scientific, 1993).
[2] R. Chellappa, C. L. Wilson and S. Sirohey, Human and machine recognition of faces: a survey, IEEE Proc. 83 (1995) 704-740.
[3] D. Valentin, H. Abdi, A. J. O'Toole and G. W. Cottrell, Connectionist models of face processing: a survey, Pattern Recogn. 27 (1994) 1208-1230.
[4] A. Samal and P. A. Iyengar, Automatic recognition and analysis of human faces and facial expressions: a survey, Pattern Recogn. 25 (1992) 65-77.
[5] D. Beymer and T. Poggio, Image representations for visual learning, Science 272 (1996) 1905-1909.
[6] A. W. Young and H. D. Ellis (eds.), Handbook of Research on Face Processing (Elsevier, 1989).
[7] Proc. Second Int. Conf. Automatic Face & Gesture Recognition, World Wide Web (WWW) (1996) http://fg96.www.media.mit.edu/conferences/fg96.
[8] Proc. Face Recognition: From Theory to Applications, Stirling, Scotland, Organizers: H. Wechsler, J. Phillips, V. Bruce, F. Fogelman, WWW (1997) http://chagall.gmu.edu/faces97/natoasi.
[9] Proc. ATR Symp. Face and Object Recognition, WWW (1996) http://www.hip.atr.co.jp/departments/Dept2/ATRSymposium~96.html.
[10] B. Moghaddam, VISMOD Face Recognition Home Page, WWW (1996) http://www-white.media.mit.edu/vismod.
[11] P. Kruizinga, The Face Recognition Home Page, WWW (1996) http://www.cs.rug.nl/~peterkr/FACE/face.html.
[12] J. J. Atick, P. A. Griffin and A. N. Redlich, The FaceIt Homepage, WWW (1996) http://venezia.rockefeller.edu/faceit.
[13] The Alexandria digital library, The Alexandria Web Team (T. Smith, Alexandria (project director), J. Frew (web team leader)), WWW (1996) http://alexandria.sdc.ucsb.edu.
[14] The Informedia project, H. Wactlar (project director), T. Kanade (image processing), WWW (1996) http://www.informedia.cs.cmu.edu.
[15] IBM Corporation, Query By Image Content (QBIC) Home Page, WWW (1996) http://wwwqbic.almaden.ibm.com/qbic/qbic.html.
[16] J. R. Bach, C. Fuller, A. Gupta et al., The Virage image search engine: an open framework for image management, WWW (1996) http://www.virage.com/literature/spie.pdf.
[17] C. L. Wilson, Barnes, R. Chellappa and S. Sirohey, Human face recognition: problems, progress and prospects ("Face recognition technology for law enforcement applications"), National Institute of Standards (NIST) NISTIR 5465 (1996) http://www.itl.nist.gov/div894/894.03/pubs.html#face.
[18] P. S. Penev and J. Atick, Local feature analysis: a general statistical theory for object recognition, Network: Comput. Neural Syst. 3 (1996) 477-500.
[19] M. Turk and A. Pentland, Eigenfaces for recognition, J. Cogn. Neuroscience 3, 1 (1991) 71-86.
[20] M. Kirby and L. Sirovich, Application of the Karhunen-Loève procedure for the characterisation of human faces, IEEE Tr. PAMI 12, 1, 103-108.
[21] J. Buhmann, J. Lange and C. v. d. Malsburg, Distortion invariant object recognition by matching hierarchically labeled graphs, in Proc. Int. Joint Conf. Neural Networks (IJCNN), Washington I (1989) 155-159.
[22] A. Yuille, D. Cohen and P. Hallinan, Facial feature extraction by deformable templates, Technical Report, Center for Intelligent Control Systems CICS-P-124 (1988).
[23] J. J. Atick, P. A. Griffin and A. N. Redlich, FaceIt DB, WWW, Visionics (1996) http://venezia.rockefeller.edu/faceit/faceitdb.html.
[24] C. Arndt (Vice President), IVSface, WWW, IVS Inc. (1996) http://www.wp.com/IVS-face.
[25] B. Moghaddam, C. Nastar and A. Pentland, A Bayesian similarity measure for direct image matching, MIT Media Laboratory Technical Reports (1996) TR-393.
[26] V. Berninger, Visec-FIRE, WWW, Berninger Software (1996) http://members.aol.com/vberninger/fire.html.
[27] W. Konen, Phantomas, WWW, Zentrum fuer Neuroinformatik (1996) http://www.zn.ruhr-uni-bochum.de/work/kl/slle.htm.
[28] J. Buhmann, M. Lades and C. von der Malsburg, Size and distortion invariant object recognition by hierarchical graph matching, in Proc. IJCNN Int. Conf. Neural Networks, San Diego II (1990) 411-416.
[29] TrueFace, WWW, Miros Inc. (1996) http://www.miros.com/TrueFace-engine.htm.
[30] H. A. Rowley, S. Baluja and T. Kanade, Neural human face detection in visual scenes, in Advances in Neural Information Processing Systems (NIPS) 8 (1995), also: http://www.cs.cmu.edu/~har/faces.html.
[31] I. Craw, Machine coding of human faces, Technical Report, Department of Mathematical Sciences, University of Aberdeen (1996), also: http://www.maths.abdn.ac.uk/maths/department/preprints/96126.ps.
[32] S. Gutta, J. Huang, D. Singh et al., Benchmark studies on face recognition, Proc. Int. Workshop on Automatic Face and Gesture Recognition (IWAFGR), Switzerland (1995).
[33] T. K. Leung, M. C. Burl and P. Perona, Finding faces in cluttered scenes using random labeled graph matching, in Proc. Fifth Int. Conf. Comp. Vision (Cambridge, MA, 1995).
[34] D. J. Beymer, Face recognition under varying pose, Technical Report, MIT AI Lab 1461 (1993).
[35] A. J. O'Toole and S. Edelman, Face distinctiveness in recognition across viewpoint: an analysis of the statistical structure of face spaces, in Proc. IWAFGR (IEEE Computer Society Press, 1996).
[36] A. J. O'Toole, H. H. Bulthoff, N. F. Troje et al., Face recognition across large changes in viewpoint, in Proc. IWAFGR, M. Bischel (ed.) (1995) 326-331.
[37] M. Stewart-Bartlett and T. J. Sejnowski, Viewpoint invariant face recognition using independent component analysis and attractor networks, Advances in Neural Information Processing Systems 9 (1996).
[38] T. Maurer and C. v. d. Malsburg, Learning feature transformations to recognize faces rotated in depth, in Proc. Int. Conf. Artificial Neural Networks, Paris (1995).
[39] J. J. Atick, P. A. Griffin and A. N. Redlich, FaceIt PC Access, WWW (1996) http://venezia.rockefeller.edu/faceit/pcaccess/pcaccess.html.
[40] V. Berninger, BS-Control, WWW (1996) http://members.aol.com/vberninger/control.html.
[41] W. Konen, ZN-Face, WWW (1996)
http://www.zn.ruhr-uni-bochum.de/work/kl/slSe.htm.
[42] J. Pampus, FaceVACS, WWW (1996) http://www.snat.de/nc6/face.htm.
[43] K. Aizawa et al., Human facial motion analysis and synthesis with application to model-based coding, in Motion Analysis and Image Sequence Processing, M. I. Sezan and R. I. Lagendijk (eds.) (Kluwer, 1993) 317-348.
[44] M. Buck and N. Diehl, Model-based image sequence coding, in Motion Analysis and Image Sequence Processing, M. I. Sezan and R. I. Lagendijk (eds.) (Kluwer, 1993) 285-315.
[45] H. Li, P. Roivainen and R. Forchheimer, 3D motion estimation in model-based facial image coding, IEEE Tr. PAMI 15 (1993) 545-555.
[46] T. Maurer and C. v. d. Malsburg, Tracking and learning graphs on image sequences of faces, in Proc. Int. Conf. Artificial Neural Networks (Bochum, 1996).
[47] T. Morita and T. Kanade, Sequential factorization method for recovering shape and motion from image streams, in Proc. ARPA Image Understanding Workshop, Monterey II (1994) 1177-1188.
[48] Ekman, T. Huang and T. Sejnowski et al., Workshop on facial expression understanding, Technical Report, National Science Foundation (Human Interaction Lab./UCSF, 1993).
[49] A. Katkere, J. Schlenzig, A. Gupta and Ramesh Jain, Interactive video on WWW: beyond VCR-like interfaces, in Proc. Fifth Int. World Wide Web Conf. (WWW5) (1996).
[50] D. A. White and R. Jain, Similarity indexing: algorithms and performance, in Storage and Retrieval for Image and Video Databases IV, SPIE 2670 (1996).
[51] E. Vatikiotis-Bateson, K. G. Munhall and M. Hirayama et al., The dynamics of audiovisual behavior in speech, Technical Report, ATR Human Information Processing Research Labor. TR-H-174 (1995).
[52] K. Swingler and L. S. Smith, Producing a neural network for monitoring driver alertness from steering actions, Neural Comp. and Appl. 4 (1996) 96-104.
[53] D. Evans, Positive identification using infrared facial imagery, Technology Recognition Systems (TRS), WWW (1996) http://www.betac.com/trs/aipr.htm.
[54] K. A. Boahen and A. G. Andreou, A contrast sensitive silicon retina with reciprocal synapses, in Proc. NIPS 91 (IEEE, 1992).
[55] J. Buhmann, M. Lades and F. Eeckman, Illumination-invariant face recognition with a contrast sensitive silicon retina, in Proc. NIPS 93 (Morgan-Kaufman, 1994), also: LLNL UCRL-JC-115988.
[56] M. Lades, Invariant object recognition with dynamical links, robust to variations in illumination, Ph.D. Thesis, ISCR/Lawrence Livermore National Laboratory (1995).
[57] M. Lades, KEN Face Recognition with silicon retina preprocessing, WWW (1996)
http://www-iscr.llnl.gov/KEN/KEN-SR.
[58] G. Gordon, Face recognition based on depth maps and surface curvature, in Proc. Geometric Methods in Computer Vision 1570 (SPIE, 1991) 234-247.
[59] L. Hys, Y. Qiao and D. Psaltis, Optical network for real-time face recognition, Applied Optics 32, 26 (1993) 5026-5035.
[60] D. Psaltis and F. Mok, Holographic Memories, Scientific American 273, 5 (1995) 70-76.
[61] MIT Face Database, WWW (1990) ftp://whitechapel.media.mit.edu/pub/images/faceimages.tar.Z.
[62] Olivetti Research Laboratory Face Database, World Wide Web (WWW) (1996) http://www.cam-orl.co.uk/facedatabase.html.
[63] Army Research Laboratory, FERET database, WWW (1996) http://www.arl.mil.
[64] M. Lades, J. C. Vorbruggen and J. Buhmann et al., Distortion invariant object recognition in the dynamic link architecture, IEEE T. on Computers 42 (1993) 300-311.
[65] B. S. Manjunath, R. Chellappa and C. v. d. Malsburg, A feature-based approach to face recognition, in Proc. IEEE Comp. Soc. Conf. Computer Vision and Pattern Recog. (1992) 373-378.
[66] C. v. d. Malsburg et al., EIDOS, WWW, University of Southern California (1995) http://www.usc.edu/dept/News~Service/chroniclehtml/1995.04.10.html/heres.html.
[67] C. v. d. Malsburg, The correlation theory of brain function, Internal Report 81-2, Max-Planck-Institut fur Biophysikalische Chemie, Gottingen, Germany (1981).
[68] R. Durbin and D. Willshaw, An analogue approach to the travelling salesman problem using an elastic net method, Nature 326 (1987) 689-691.
[69] G. E. Hinton, C. K. I. Williams and M. Revow, Adaptive elastic models for character recognition, in Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson and R. P. Lippman (eds.) (Morgan-Kauffman, 1992).
[70] L. Wiskott, J.-M. Fellous and N. Kruger et al., Face recognition and gender determination, in Proc. IWAFGR, M. Bichsel (ed.) (1995) 92-97.
[71] T. Maurer and C. v. d. Malsburg, Tracking and learning graphs on image sequences of faces, in Proc. Int. Conf. Artificial Neural Networks (Bochum, 1996).
[72] M. Potzsch, T. Maurer and C. v. d. Malsburg, Reconstruction from graphs labeled with responses of Gabor filters, in Proc. Int. Conf. Artificial Neural Networks (Bochum, 1996).
[73] M. Lades and J. Sharp, KEN Online, WWW (1995)
http://www-iscr.llnl.gov/KEN/KENOnline.
PART 4 INSPECTION AND ROBOTIC APPLICATIONS
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 687-709 Eds. C. H. Chen, L. F. Pau and P. S. P. Wang © 1998 World Scientific Publishing Company
CHAPTER 4.1
COMPUTER VISION IN FOOD HANDLING AND SORTING
HORDUR ARNARSON and MAGNUS ASMUNDSSON
Marel hf., Hofdabakki 9, 112 Reykjavik, Iceland

The need for automation in the food industry is growing. Some industries such as the poultry industry are now highly automated, whereas others such as the fishing industry are still highly dependent on human operators. At the same time consumers are demanding increased quality of the products. In the food industry the objects are often of varying size and shape, and often flexible and randomly oriented when presented to the automation system. To automate handling of these objects, an intelligent system such as a vision system is needed to control the mechanical operations to ensure optimum performance and quality. This chapter describes vision techniques that can be used to detect and measure shape and quality of food products. It stresses the specific implementation context, needed performance, sensors, optics, illumination as well as vision algorithms. Algorithms include those for the size measurement of flexible objects and for the colour measurement of objects with nonuniform colour. Some results are given.

Keywords: Industrial computer vision, image acquisition, image processing, size sorting, colour measurements.
1. Introduction

1.1. Motivation

The food industry is still highly dependent on the manual operation and manual feeding of machinery. The operations that are performed by humans are often very repetitive and the working conditions are difficult. The industry in many places is facing difficulties in getting skilled people to work, and therefore it is important to increase automation. Increased automation can also improve quality, increase the speed of production and simplify registration of production information. To increase automation in the food industry, intelligent sensing through computer vision will play a major role, as mechanical solutions are not able to automate handling of products of varying size and shape without guidance from an intelligent system. Computer vision is today used in several industries to sort and control handling of products. Most of these applications deal with objects of fixed size, often also at a fixed place, with a known orientation. Examples of this are found in the electronic industry and the pharmaceutical industry, where computer vision techniques are
used for quality control in production. Examples are also found in the food industry, but not many successful applications exist where the operation involves handling of objects of varying size and shape, and where there is little a priori knowledge of an accurate position of the object when it is fed to the automation system.
1.2. Survey

There have been several successful applications on agricultural products. These include guiding a robot to pick fruit from trees [1], quality inspection of surface defects of fruits [2], quality control and length measurements of french fries [2], and quality control of beans and small food products^a. In the meat industry recent work includes guiding of robots to cut meat, evaluating the amount of fat in a piece of meat [4,5], portioning of boneless meat products^b [5,11] and fat trimming of meat products using high pressured waterjets [11]. In the poultry industry recent work includes the sorting of poultry pieces based on shape into, for example, drumsticks, wings, breast and thighs using computer vision [5], quality evaluation of chicken carcasses [6], portioning of boneless chicken portions^b [5,11] and fat trimming of deboned chicken breasts^b [11]. Other applications include measuring the thickness of chewing gum [2], evaluating the shape and surface of pizza crusts [2] and inspection of packing quality. In the fish processing industry several applications [8] have been reported. These include sorting whole fish by length independent of its orientation [9], species sorting of dead fish [9,10], vision guidance for automatic portioning of fish fillets^b [5,11], separating shells from shrimp^c [17] in a continuous flow, quality evaluation of salmon fillets based on overall colour, spots and zebra stripes [5], feeding of deheading machines using a vision controlled robot^d, biomass estimation of live fish [8] and guidance of a robot portioning fish fillets in the optimum way [11,12]. Some of the above applications have been very successful, but others rely too much on a manual operation, which limits the operation speed and the economical impact of the automation.
1.3. Organisation of the Chapter

This chapter deals with the use of computer vision for food handling and sorting. Section 2 describes the main implementation aspects for these applications. In Section 2.1 we discuss the very important issue of image acquisition; this includes definition of object characteristics, selection of sensors, lenses, filters and viewing
and lighting techniques. In Section 2.2 the general characteristics of the harsh environment often encountered in the food industry are described. Section 2.3 highlights the main characteristics of algorithms used in real-time applications in the food industry. In Section 2.4 the criteria for selecting hardware for industrial applications are discussed. In Section 3 we give two examples which show real-time implementation of computer vision in the food industry. The first one deals with size sorting of whole fish, where the measurements are done independent of the fish orientation and its skewness. The other example is fish sorting by quality, based on the evaluation of flesh colour and surface defects of fish fillets, where the fillets are classified based on size, shape, and position of the defect.

^a Key Technology, Product Information, Walla Walla, WA, USA.
^b Design Systems, Seattle, WA, USA.
^c Elbicon, Product Information, Industrieterrein Nieuwland, B-3200 Aarschot, Belgium.
^d R. O. Buckingham et al., "This robot's gone fishing", Int. Industrial Robot 22, 5 (1995) 12-14.
2. Implementation Aspects

2.1. Image Acquisition

One of the most important tasks in any machine vision application is to obtain a good image of the object under investigation. This rather obvious point cannot be over emphasised. Sometimes a little effort spent on improving the quality of the raw image can be worth any amount of signal or image processing.

2.1.1. Object characteristics

Before the image acquisition part of a system is defined, the optical and physical characteristics of the object have to be studied carefully. It is very important to regard the object as an integrated part of the optical system. Optical characteristics differ from one food product to another. The most important features that have to be identified before the acquisition part of the vision system can be defined are:

- Transparency of the object. Transparency can be a major obstacle especially when using backlighting techniques.
- Uniformity of the surface colour of the object. When using front lighting techniques it can be difficult to obtain good contrast between the object and the background if the surface colour is non-uniform.
- Reflectance from the object. High reflectance causes a mirror-like effect which reduces the contrast when inspecting the surface of the object.

The physical characteristics of the object that are of special importance are:

- The size of the object, which defines the size of the needed field of view. In the food industry the size is often varying, e.g. some pieces are a quarter of the size of others. When dealing with large objects (> 50 cm) special care has to be taken to get good image quality over the whole field of view.
- The shape of the object, to guide the selection of features to be identified. This can be a difficult task because food products are often non-rigid and easily damaged.
2.1.2. Sensors
There are several important characteristics one has to take into consideration when selecting cameras for vision systems. In this section the most important ones, when selecting between Charge Transfer Device (CTD) [13] cameras and tube cameras, will be described.

(i) Shape of the sensor. When using CTD cameras it is possible to select between different shapes of sensors. The three most common types are: array, line and disk shaped sensors. Tube cameras are limited to array shaped sensors. The shape of the sensor is important, especially if the objects are moving. When the object is stationary or slow moving an array shaped sensor can be used. For rapidly moving objects improved performance is obtained by using linear or disk shaped sensors [14] coupled to a motion synchroniser.

(ii) Sensor resolution. In array sensors the resolution is normally expressed as the number of lines per picture height. In CTD cameras the limit is set by the pixels available in the image area. In tube cameras resolution is influenced by the type and size of the photoconductive layer, the image size on the layer, beam and signal levels, and the spot size of the scanning beam.

(iii) Spectral sensitivity. Spectral sensitivity is the sensor's relative response at different wavelengths of light. Usually CTD cameras cover wavelengths in the range 0.3-1.2 µm, while tube cameras cover 0.2-14 µm (not by one tube).

(iv) Sensitivity. Sensitivity is the efficiency of the light to charge conversion of the sensor. There are several ways of measuring this sensitivity. The two most frequently used are luminous sensitivity and radiant sensitivity.

- Luminous sensitivity is measured at a specific colour of light, usually 2856 K. The output is expressed in mA/(lumen mm²) or V/(mW mm²).
- Radiant sensitivity is measured over a range of wavelengths, usually from 400 to 1000 nm. The output is expressed in mA/(W mm²).

In CTD cameras, sensitivity is influenced by variables such as quantum efficiency, the length of integration time and the dominant source of noise in the device. Tube camera sensitivity is dependent on the type and size of the photoconductive layer. It also varies with the target voltage level in certain types of tube cameras.
In CTD cameras, sensitivity is influenced by variables such as quantum efficiency, the length of integration time and the dominant source of noise in the device. Tube camera sensitivity is dependent on the type and size of the photoconductive layer. It also varies with the target voltage level in certain types of tube cameras. (v) Dynamic range. This represents the overall usable range of the sensor. It is usually measured as the ratio between the output signal at saturation and the RMS value of the noise of the sensor (sometimes peak to peak noise). In CTD cameras this RMS noise does not take into account dark signal nonuniformities [13]. In CTD cameras the saturation voltage is proportional to the pixel area. Factors influencing the dynamic range of tube cameras include photoconductive characteristics of the faceplate, as well as scanning rate and electronic gun characteristics.
Table 1. Performance of CTD and tube cameras: typical and maximum values of resolution (lines), dynamic range (peak signal/RMS noise), maximum sensitivity (lux), geometric distortion (%), nonuniformity (%), lag (%), spectral sensitivity (nm), mean time between failure (hours), frame rate (frames/s), damage by overlighting, price (USD) and supply voltage (V), for the two camera types.
Among other important characteristics of cameras are: signal to noise ratio, geometric distortion, lag, nonuniformities, readout speed, camera synchronisation, mean time between failure, operating temperature, damage by overlighting, operating power, operating voltage, size, weight and price. Table 1 lists typical performances of CTD and tube (Vidicon) cameras currently available on the market.

2.1.3. Lighting and viewing techniques

Selection of illumination equipment and viewing geometry is an important step in the development of the acquisition part of a vision system [13,14]. Based on the application (inspection, handling, sorting) to be implemented and the characteristics of the object, both physical and optical, the optimum lighting and viewing technique is defined. In one- and two-dimensional size and shape measurements, diffused or direct backlighting is most likely to give good image quality, although special care has to be taken with some food products, e.g. fish, where the fins can be partly transparent. In three-dimensional size and shape measurements, structured light [15] is often used, but the use of two or more sensors can also give robust and accurate three-dimensional measurements. When doing surface inspection the most appropriate set-up is diffused front lighting, where the contrast is often enhanced using coloured light and colour filters (Section 2.1) in front of the camera. Using front lighting techniques it can be difficult to cope with variations in the colour of the object.
Inspection inside the food object includes search for parasites and bones. In these applications it is generally very difficult to develop the lighting and viewing part of the vision system, because of the optical characteristics of the food [16]. In the fish industry different lighting techniques have been tested. These include X-rays for bone detection [17], laser scanning for bones and parasites [18], ultrasound for bones and parasites [19], and fluorescence of fish bones [20]. Today it is possible to detect bones inside meat and fish flesh using soft X-rays, while the problem of parasites in fish still remains unsolved [18].

2.1.4. Optics

The optical front end of a vision system must be designed with the same care as that applied to the electronics, otherwise there is a risk that an apparently precise measurement will hide a significant error caused by limitations in the optics. Special care has also to be taken because applications in the food industry involve sensing of large images outside the optical axis of the lens system. In this section important characteristics of lenses for food handling and sorting will be discussed (see Fig. 1).
Fig. 1. Image forming basics (object distance u, image distance v, field of view FOV).
(i) Magnification m. The optical magnification is defined as the Field-Of-View (FOV) over the sensor size d, or equivalently the object distance u over the image distance v:

m = FOV/d = u/v.    (2.1)
The FOV should be large enough to see the object, and because object movement is very common in the food industry an FOV 30% larger than the largest object is recommended.

(ii) Focal length f. The optimum focal length for each application is related to u and v through the well known lens equation:

1/f = 1/u + 1/v.    (2.2)
Equation (2.2) assumes the light is perpendicular to the optical plane of the lens. In practice, for many applications in handling and sorting in the food industry, u >> v, so a good approximation of Eq. (2.2) is:

1/f ≈ 1/v.    (2.3)
(iii) Lens quality. There are two main factors that influence lens quality. Special care has to be taken when working on applications involving large objects.

- Resolution r. Because of the diffraction effect of the light going through the lens, there is a theoretical limitation on the resolution of the lens. For lenses working at high demagnification, this theoretical value is given by [13]:

  r = 1.22 λ f / A    (2.4)

  where
  λ : wavelength of light
  f : focal length of lens
  A : diameter of the aperture of the lens.

- Aberration. There are two kinds of aberrations: monochromatic and chromatic [21]. Monochromatic aberrations are divided into five subclasses and can be calculated theoretically using a lens formula assuming oblique ray directions [22]. Chromatic aberrations are caused by the refractive index of the lens material changing with the wavelength of light. All the aberrations get worse as the lens aperture is increased, and all of them except one subclass of monochromatic aberrations get worse with increased field angle. Monochromatic aberrations need special attention when doing accurate measurements outside the optical axis of the lens system. Typically, no information on resolution or aberrations is provided by the lens producer; instead a measure of the Modulation Transfer Function (MTF) is given. The MTF describes the ability of a lens to image a particular sine wave pattern. The MTF is determined by measuring, through the lens, the contrast of such a sine wave pattern image, while changing the aperture and the off-axis position of the object.
(iv) Depth of View (DOV). This is defined as the distance along the optical axis over which the object can be located and still be properly imaged:

DOV = cu/(A - c) + cu/(A + c)    (2.5)

where
A : lens aperture diameter
u : object distance
c : blur circle diameter at the object.

The blur circle is the amount of blur which can be tolerated, often set at one pixel. From Eq. (2.5) it can be seen that there is a maximum A for a given DOV.
2.1.5. Filtering

Filtering is used to improve image quality, reduce noise, and enhance features of interest. Three types of filtering are described in this section: neutral density filtering, polarisation filtering, and colour filtering.

(i) Neutral density filtering. Neutral density filters [23] are used to attenuate the intensity of a beam of light over a broad spectral region, without altering its spectral distribution. A neutral density filter can thereby, for example, be used to decrease the light intensity incident on a photodetector. Because of the optical resolution limit (Section 2.1) it is important to allow a large enough aperture of the lens; using a neutral density filter allows a larger aperture of the lens. A neutral density filter is characterised by its optical density D:

D = log10(I0/IT) = -log10(T)    (2.6)

where
I0 : incident power
IT : transmitted power
T : transmittance.

(ii) Polarisation filtering. Light travels as a transverse electromagnetic wave, the electric and magnetic fields being perpendicular to each other as well as to the direction of propagation. A light beam is said to be linearly polarised if its electric field vectors are oriented in the same direction. A substance can affect the polarisation of light, reflected or transmitted, giving a significant feature for that same substance. The polarisation state of the resulting light beam can be detected with the aid of polarising filters, and by comparing it to the polarisation of the incident light, information regarding the substance can be obtained. A polarisation filter can also be used to reduce glinting in an image. Dichroic film polarisers, fabricated from sheets made of long grain organic molecules, are probably the most convenient type of polarisers for image processing purposes. They are inexpensive and have a convenient shape.

(iii) Colour filtering. Colour filtering [23] is perhaps of the most obvious importance in image processing for the food industry. With colour filtering it is possible to extract information from well defined bands of the spectrum or to increase the amount of information in the image by examining more than one band (wavelength region). Colour images, as we know them (for example, TV
images), are often based on the combination of three images in separate bands which together cover the visible spectrum. These bands are referred to as Red, Green and Blue (RGB) [23]. Such images are most often acquired using three well defined colour filters. A three-band colour image can sometimes carry excess information and in that way slow down the processing. It is therefore essential to analyse the optical characteristics of the subject with the purpose of choosing a spectral band of interest. Accordingly the right colour filtering can be applied. Often a much narrower band than R, G or B is more effective, where the wave bands are selected using a spectroradiographic study of the product to be investigated. Different types of colour filters are available. Coloured glass filters operate through ionic absorption or via absorptive and scattering crystallites formed within the glass. They are available as Long Wave Pass (LWP) filters with a relatively sharp cut and a variety of bandpass filters which are not so sharp. Gelatin filters have similar characteristics to glass filters. They operate through absorption as well. They are commonly used in photography and are inexpensive. Gelatin is a "plastic-like" material and therefore gelatin filters are vulnerable to scratches. Interference filters operate through interference to select a range of wavelengths. Wavelengths not falling within this range are reflected. They are fabricated as thin coatings of various dielectric materials on a glass plate. Interference filters are available as LWP, SWP (short wave pass) and BP (bandpass) filters with very sharp cut characteristics. By tilting an interference filter its characteristics are changed, the cut wavelength(s) being displaced. This effect can also occur when observing an object which is not lying on the system's optical axis through such a filter. This should be noted or taken into account when used with cameras.
2.2. Environment
The environment in the food industry is generally harsh. The main characteristics are:

- The humidity is often high (95%-100%). This is caused by continuous washing of the machinery for sanitary reasons.
- It is frequently recommended to keep the temperature in the processing plants between 5-10°C.
- There are strict limitations on what types of waste are allowed from machines in the food industry.
- There are regulations on what types of materials are allowed in the food processing plants. For example, it is often forbidden to use conventional glass in direct contact with the food.
2.3. Image Processing for Food Products

All algorithms used in sorting and handling food products use a priori knowledge, although at different levels. This a priori knowledge is used to build up a model of the process and provide strategies for the algorithms to be designed. Some of this knowledge is imposed on the process by selecting the colour and texture of the background and by the viewing angle of the light source and the camera. Another part is controlled by the feeding system, which determines the direction of motion of the object and whether or not the objects are overlapping. The object to be sorted also gives information on what kind of algorithm should be used; for example, a fish has a head, a tail and some fins that can be used for classification. It is desirable to use as much a priori information on the object and the process as possible. In that way the algorithms can be simplified, the hardware requirements reduced, and the possibility of satisfying the needs of the industry at a cost and speed it accepts is increased. Algorithms used in the food industry are made of the same basic elements as in most other industries, i.e.

- preprocessing
- feature extraction
- classification.
However the emphasis on these basic elements can be quite different compared to other types of applications.

2.3.1. Pre-processing
Although in real-time industrial applications special care is taken in designing the optimum image acquisition part, there is often a need to improve the image quality before extracting the feature of interest from the image. In applications where the results are presented to the user in an image format, image enhancement techniques such as histogram equalisation and look-up table operations are used to improve the contrast of the feature of interest in the image. In automatic control systems, on the other hand, the computer controls some actions based on results from the image processing. Here this type of enhancement technique is of no use and can in fact degrade the quality of the image because of quantization effects. The nature of noise in applications in the food industry is different from what most textbooks discuss, where the focus is on random noise or spot noise. In applications in the food industry the noise usually has some physical explanation, e.g. dirt on the background or shadows because the camera has a different viewing angle than the light source. Reference [9] describes methods based on mathematical morphology for filtering out noise, where the noise has some predefined form and some maximum size. In Section 3.2 on colour inspection, results are shown on how noise can be filtered out using this method when detecting surface defects on fish fillets.
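As a rough illustration of this kind of morphology-based noise removal (the actual method is described in [9]), the sketch below deletes binary noise blobs up to a predefined maximum size with a morphological opening. The image, the noise size limit and the use of SciPy are assumptions made for the example.

```python
import numpy as np
from scipy import ndimage

def remove_small_noise(binary_img, max_noise_size=3):
    """Suppress foreground blobs narrower than max_noise_size pixels by a
    morphological opening (erosion followed by dilation) with a square
    structuring element; larger structures keep their overall shape."""
    structure = np.ones((max_noise_size, max_noise_size), dtype=bool)
    return ndimage.binary_opening(binary_img, structure=structure)

# Example: a fillet-like region plus a few isolated noise pixels (assumed data).
img = np.zeros((20, 30), dtype=bool)
img[5:15, 8:25] = True          # the object
img[2, 3] = img[17, 2] = True   # dirt on the background
cleaned = remove_small_noise(img, max_noise_size=3)
print(int(img.sum()), "->", int(cleaned.sum()), "foreground pixels")
```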
The primary goal in pre-processing images is to reduce the amount of data in the image, for example by binarizing [24] the image. Global thresholding is used when possible. Otsu [25] describes a method for calculating the optimum threshold between classes. His method is theoretically well based but requires too much computation to be used on-line in a real-time application, especially when using more than two classes. It can however be very useful in a training phase of an automatic system. When the Field Of View (FOV) is large it is difficult to get even lighting in the whole FOV. In these cases it is necessary to use local thresholding, especially when the performance of the application is dependent on accurate thresholding.
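Otsu's method [25] picks the threshold that maximises the between-class variance of the grey-level histogram. A minimal sketch, assuming an 8-bit single-channel image stored in a NumPy array, is given below; in line with the remark above it is intended for a training phase rather than per-frame on-line use.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the Otsu threshold for an 8-bit grayscale image [25]."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                   # grey-level probabilities
    omega = np.cumsum(p)                    # class 0 probability up to each level
    mu = np.cumsum(p * np.arange(256))      # cumulative mean
    mu_total = mu[-1]
    # Between-class variance for every candidate threshold t
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b = np.nan_to_num(sigma_b)
    return int(np.argmax(sigma_b))

# Usage: binarize an image acquired with backlighting (array values assumed).
# binary = gray > otsu_threshold(gray)
```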
2.3.2. Feature extraction

When selecting features to be measured it is important to select features that can be measured with good accuracy and good repeatability. Generally there are two kinds of errors that affect the feature extraction:

- Measurement error, because of limited accuracy of the sensing equipment or the sensing process. This includes limited resolution of the sensor and optics, blur caused by the movement of the object, and quantization error in the A/D conversion process. The measurement error is controlled by selecting the appropriate sensing equipment.
- Presentation error, because of variations in the way the object is presented to the vision system. This error needs special attention when dealing with food products of non-uniform shape, varying size and flexibility, and with imperfect operation of feeding systems dealing with these kinds of products.

It is important, based on knowledge of the products to be processed and, very importantly, on the nature of the feeding system, to select the features to be extracted. Generally features such as location, dimension and orientation are common to most handling and sorting problems in the food industry. These features are used to localise the parts to be classified. Further feature extraction includes identification of corners, lines, holes, and curves. It is very useful to use information on object location to reduce the amount of data to be processed, on dimensions of the object to get size invariance, and on orientation of the object to reduce dependency on orientation. In this way it is possible to focus the attention of the vision system on Areas of Interest (AOI) and thereby speed up processing. The algorithms used to extract the selected features have to be able to work on objects of random orientation and in real time. A number of algorithms are used for feature extraction; a good overview of these is given in [26]. One example of algorithms used is based on identifying the contour of the object. Edge detectors are used to enhance the edge pixels which are then connected in a chain code and further connected into shape primitives which are used to describe the object. These algorithms are very time consuming and of
limited use in real-time applications in the food industry. Another type of feature extraction algorithm is a space domain technique such as skeleton algorithms, where the object is characterised by its skeleton. These algorithms provide useful feature descriptions of the object using a limited amount of data. Of special interest is the distance labelled skeleton [27], which can give a complete description of binary images. Another example of feature extraction algorithms is scalar transform techniques, such as moment invariants [28] and Fourier transform techniques. Mathematical morphology [29] is a technique well suited for real-time applications in the food industry. This is due to the parallel nature of its basic operations. It is a set-theoretical approach to image analysis and its purpose is the quantitative description of geometrical structures. Mathematical morphology extracts information about the geometrical structures of image objects by transforming the object through its interaction with another object (structuring element) which is of simpler shape and size than the original image object. Information about size, spatial distribution, shape, connectivity, smoothness, and orientation can be obtained by transforming the image object using different structuring elements. In [30] the Hit or Miss transform [29] from mathematical morphology is used to extract information on the presence and position of shape primitives (line, corner, convex, concave) in a fast and reliable way.

2.3.3. Classification

Based on the features extracted from the image, the object is classified into one of the possible classes. There exist numerous methods for classification based on features extracted from an image. Statistical pattern classification [31] is a theoretically very sound method for classification of patterns. It is well suited to applications where a limited number of features is used. However, when using many features it can be difficult and time consuming to design the classifier. Graph matching [32] is a method where the presence and position of features in relation to other features is used to classify the object. The features used could for example be corners and lines used to recognise fish species. Neural networks [30] are a suitable method when a large number of features are available but it is difficult to identify which are the most important features for classification. Of special interest is the possibility of training the classifier in such a way that the classification rules are determined automatically.
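To connect the two preceding subsections, the sketch below extracts two simple, position- and orientation-independent features (area and elongation, from low-order image moments) from a binary object mask and assigns the object to the class with the nearest mean feature vector. It is a generic illustration rather than the specific procedures of [28], [30] or [31]; the class names and mean vectors are assumed values, and feature scaling is omitted for brevity.

```python
import numpy as np

def blob_features(mask):
    """Area and elongation of a binary object mask, from image moments.
    Both are independent of the object's position and orientation."""
    ys, xs = np.nonzero(mask)
    area = float(xs.size)
    cx, cy = xs.mean(), ys.mean()
    mu20 = ((xs - cx) ** 2).mean()
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    # Eigenvalues of the 2x2 covariance matrix: squared principal axis lengths
    common = np.sqrt(((mu20 - mu02) / 2.0) ** 2 + mu11 ** 2)
    lam1 = (mu20 + mu02) / 2.0 + common
    lam2 = (mu20 + mu02) / 2.0 - common
    elongation = float(np.sqrt(lam1 / max(lam2, 1e-9)))
    return np.array([area, elongation])

def nearest_mean_class(features, class_means):
    """Assign to the class whose (pre-trained, assumed) mean vector is closest."""
    names = list(class_means)
    dists = [np.linalg.norm(features - class_means[name]) for name in names]
    return names[int(np.argmin(dists))]

# Hypothetical class statistics obtained in a training phase:
class_means = {"whole fish": np.array([5200.0, 4.5]),
               "fillet": np.array([3100.0, 2.2])}
```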
2.4. Hardware

The performance of conventional sequential computer architectures is inadequate for the majority of machine vision applications in the food industry. The problem arises from the sheer amount of data presented in the image. Simple real time neighbourhood operations require 20 million operations per second (MOPS). The typical computational power of a sequential computer is less (e.g. 1-5 MOPS),
and therefore such computers are too slow. Confronted with such computational problems numerous researchers and manufacturers have sought to develop new computer architectures to provide the necessary computational power required by real time machine vision applications [33,34]. One approach is to develop more powerful processors to handle the workload, typified by the newest Digital Signal Processor (DSP) chips. Another approach is to create architectures which allow many processors to work on the image data in parallel at the same time. It is important to note that algorithms differ greatly in how readily they can be implemented in parallel, and right from the beginning it is important to focus, on the software side, on algorithms that are parallel in nature.

2.4.1. Image processing hardware
There is no single architecture that is the optimum one for all vision algorithms or industrial applications [35]. Therefore it is important that there is flexibility in the selection of arithmetic units that can be installed in an industrial vision system. Figure 2 shows an example of a modular hardware structure, available from several companies (ITI, Vicom, Datacube, Eltec) today at a price of less than 25K USD.
Fig. 2. Example of a modular hardware structure in an industrial vision system (host bus, video bus, pipeline bus).
The basic blocks of this kind of industrial vision system are:

(1) Host computer, which controls the system. This is typically an Intel 80x86 or a Motorola (MC 680x0) based computer.
(2) Analogue/digital interface, to provide an interface to cameras. This is typically an 8-bit flash A/D converter with a 10 MHz sampling rate.
(3) Frame buffer, to store images. Often it is possible to install up to four frame buffers, where each frame buffer stores a 512 x 512 x 8 image. It is important that there is more than a single port access to the frame buffer.
(4) Special purpose arithmetic units. It is very important that it is possible to select between different types of architectures depending on the algorithms to be performed at each time. Examples of the arithmetic units available are:

(i) Pipeline processors, well suited for simple arithmetic, logical, conditional and bit-plane operations that can be performed in real time.
(ii) Rank value filters, which perform real time median filtering and grey scale morphological operations.
(iii) Binary correlators, which perform real time binary operations including convolution, correlation, erosion and dilation.
(iv) Signal processors, for general purpose vision algorithms.
(v) RISC processors, for general purpose vision algorithms.

Which of the modules described above are used in each application is highly dependent on both the application and the way the object is presented to the vision system.
2.4.2. Complexity of the vision system

The level of complexity of a vision system for food products is largely determined by the feeding part of the system. The levels are mainly determined by:

(i) Distance between the objects. Is there a minimum distance between the objects? Can they be side by side, or overlapping?
(ii) Orientation of the object. Is the object oriented, or not?
If objects are fed to the system on a conveyor and there is no minimum distance between the objects, then there is a need for at least two processes: one which constantly searches for the object while the other measures the features of interest of the object. The orientation, on the other hand, directly influences the complexity of the algorithm. Table 2 lists the different levels of complexity in a vision based sorting system. In the food industry the practical levels are mainly levels 2-4. Level 1 is excluded because of the orientation demand, which is difficult to obtain for an elastic object like food. Level 5 is excluded because it normally results in high sorting errors, and although the food products can be measured accurately, it is difficult to direct them mechanically to different places. The compromise which has to be made when selecting a working level for a vision based sorting system lies between the requirements (cost) of the feeding system and the complexity and speed of the algorithm. The level of complexity selected when specifying a vision application does not only determine the cost of the system, but also the possible accuracy of the features to be measured.

Table 2. Levels of complexity in the vision part of a sorting system.

Level   Feeding system
1       Objects oriented, and there is a minimum distance between objects.
2       Objects not oriented, and there is a minimum distance between objects.
3       Objects oriented, no minimum distance between objects, not overlapping.
4       Objects random, not overlapping.
5       Objects random, and overlapping.
3. Applications

3.1. Size Sorting of Fish

3.1.1. Motivation
Sorting dead or dying fish is required on board fishing boats for packaging and storage purposes. Typically, the catch must be sorted by species, length or weight before going into boxes, the contents of which are compatible with the auctioning process. At the processing level, the machines (filleting and head cutters) still cannot handle in one batch diverse fish types or sizes without reduction in yield. Any set-up due to such variations is both time consuming and costly. By sorting the fish by size the fishing industry is able to get both increased yield and added production control [36].

3.1.2. Image acquisition
There are different size parameters that can be measured on a fish using computer vision; these include volume, area, length, thickness and width [37,38]. Much depends on the feeding system as to how difficult it is to measure these features. Here we will assume that the feeding system is working on level 4 (Section 2.4), that is, the fish is lying randomly but not overlapping when fed to the vision system. Further we assume that the fish is round (e.g. cod, haddock). When measuring the volume and the area, a high (±15%) presentation error (Section 2.3) is observed due to the irregular position of fins and belly flaps [9], whereas the length can be defined and measured with a low presentation error. Therefore the length is selected as the size feature when sorting round fish by size, where the length of the fish is defined as the length of a line starting at the middle of the tail and ending at the top of the head, following the bone structure of the fish. A fish is a highly reflecting object when under direct illumination. Fish colour varies for many species. Usually the fish is dark on the back, and white or greyish on the belly. Because of this it is very difficult to get a good contrast in an image using front lighting techniques. Diffused backlighting is used instead, but special care has to be taken because of the transparency of fins, especially the tail fin. This application has to be able to work on board rolling ships. This fact excludes the use of line-scan and circular-scan sensors, because of their dependency on object
motion. A frame camera which is able to sense the whole fish is therefore selected. A CTD camera is selected here mainly because of its robustness.
3.1.3. Image processing
A block diagram of the algorithm is shown in Fig. 3. At start-up the system goes through a training phase where the thresholds for the background and for the fish are automatically determined using the method in [25]. In the training phase the optical magnification of the system is determined by measuring n objects of a known size. This training phase can also be entered interactively by the user. The algorithm starts by searching for a fish in binary images that are snapped continuously. As soon as a whole fish is detected inside the field of view, its position is registered and a rectangular area of interest encircling the fish is determined. Based on a priori knowledge of the shape of the fish, namely that the fish is an elongated object and that the fish is thicker close to the head than close to the tail, the positions
Fig. 3. Block diagram of an algorithm for length sorting of fish (initialize system; search for fish; estimate an AOI encircling the fish; reduce blur; register length in database).
of the head, tail, fins and belly flaps are determined. Then the position of the length estimation line can be determined accurately. The length is then measured using piecewise linear approximation and knowledge of the optical magnification of the system. Based on sorting criteria programmable by the user, the fish is then classified as belonging to one of several groups based on the measured length. The measured length is also registered in a database.
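A minimal sketch of this final measurement step, assuming a few points have already been sampled along the length estimation line from tail to head: the length is the sum of the straight segments between the points, scaled by the pixel-to-millimetre factor obtained from the calibrated optical magnification. The point coordinates and the scale factor below are illustrative values only.

```python
import numpy as np

def piecewise_length(points_px, mm_per_pixel):
    """Length of a polyline given as (x, y) pixel coordinates, in millimetres."""
    pts = np.asarray(points_px, dtype=float)
    seg = np.diff(pts, axis=0)
    return float(np.hypot(seg[:, 0], seg[:, 1]).sum()) * mm_per_pixel

# Example: five points along the mid-line of a fish (assumed coordinates),
# with the pixel-to-mm scale obtained in the training phase.
midline = [(30, 110), (120, 104), (210, 100), (300, 108), (385, 115)]
print(round(piecewise_length(midline, mm_per_pixel=1.2), 1), "mm")
```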
3.1.4. Results

The length estimation system is now a commercial system [5]. See Fig. 4.
Fig. 4. Prototype for length estimation of whole fish.
Fish are fed to the vision system on a conveyor running at 1.2 m/s. The system is able to length estimate whole fish with an accuracy of ±0.3 cm (one standard deviation), independent of fish orientation (Fig. 5). The processing time for each fish is 0.2-0.3 seconds, depending on fish size.

3.2. Colour Inspection

3.2.1. Motivation
Fish flesh (for example, that of cod fish) is graded in quality groups according to: colour, coloured spots (blood spots and liver spots), gaping, and shape. The most important factor in quality control is the colour of the fish flesh. Briefly, one can say that the lighter the flesh the better the quality. Today quality control is
Fig. 5. Length estimation of cod fish.
done manually in all processing plants and under different circumstances in each place. Therefore the manual quality control is bound to be very dependent upon the individual performing it and the circumstances it is performed under. The benefits of automation in quality control in the fish industry are evident. Coordination of control and standardisation will benefit both sellers and buyers.
3.2.2. Image acquisition
A first proposal for sensing equipment for a "colour grader" would presumably be a colour camera or a colorimeter. A wide variety of colorimeters are available on the market [39]; some of them have been successfully applied in the food industry, for example in inspection of fruit. Most colorimeters feature three sensors as the sensing equipment: these sensors are light sensitive in the visible (VIS) range but filtered with red, green, and blue filters respectively. Furthermore most colorimeters are point measurement devices. This is not a feasible alternative for the purpose of grading fish as none of the R, G, and B ranges fit the narrow range of wavelengths representing the difference between quality groups. Point measurement is not attractive either since the measurement must be more "intelligent". The colour must be measured locally in certain areas of the fish and other areas or picture elements must be avoided if they do not represent the colour of the healthy fish flesh, for example blood spots, bones and skin. A colour camera could be an alternative, but it gives excessive and unwanted information in spectral ranges we do not want to measure. Because no theoretical colour standard is available for the different quality groups of the fish, the most important issue is to define the colour of the fish and the colour difference between quality groups. The most accurate way to represent colour is by its reflectance or transmittance of different wavelengths of light.
In an effort to characterise the colour of fish, a considerable number of fish were chosen, from each of the quality groups used, as samples for the measurement. The fish were graded by five quality control personnel. Three samples were cut from each fish and the spectrum of each sample was measured with a spectrophotometer. The measurement covered all of the VIS spectrum and stretched into the NIR. All samples were measured in the range 350-1050 nm, with a 5 nm resolution, and some also in the range 1050-1600 nm. The results from the measurement showed that the spectral difference between groups was high in a certain narrow range of wavelengths in the VIS spectrum while there was little or no difference outside this range. From the results of the spectrum measurements the optimum lighting and sensing equipment was selected. For lighting we chose diffused front lighting, with light sources whose spectral characteristics are strong in the range where the difference between groups was most evident and weaker outside this range, thereby exaggerating the difference. A sensing equipment that fits our purpose is a black and white CTD frame camera with an appropriate bandpass filter in the specific wavelength range. Another colour feature that has to be taken into account when estimating the quality of fish is reddish bloodspots [40] that can occur in the fish flesh and which decrease the quality of the fish. These bloodspots must be detected locally in the fish flesh in some well defined areas, since the position of the spot plays a role concerning the weight of the defect (a blood spot on the more expensive loin piece of the fish is a more serious defect than a spot on the tail). A camera is therefore also suitable for detecting bloodspots. When using a black and white CTD camera to detect reddish bloodspots on the light fish flesh, the use of a bluish filter is appropriate to enhance the difference between healthy flesh and blood, making the discrimination easier.

3.2.3. Image processing
As the fish arrives on a conveyor under the camera, an image of it is acquired. The fish position and orientation in the image are determined and then the image is segmented into predefined areas based on the form and the aspect ratio of the fish. The segmentation is necessary because of the different weights the fish pieces have in the quality evaluation. The colour of the fish flesh is computed individually for each area. A grey scale histogram is computed for the area and, after smoothing the histogram with a moving average, two thresholds are computed deciding the interval of grey values belonging to the healthy fish flesh. The colour of the area is characterised by the average grey value of the healthy fish flesh pixels (in this case, however, the "grey values" of the image represent a very narrow range in the spectrum since the image is filtered). Bloodspots are also tackled locally in predefined areas. The area is thresholded with a local thresholding operator, and a binary image of blood spots on healthy
fish flesh is produced. Morphological operators are then used to evaluate the size and shape of the bloodspots (Fig. 6).

Fig. 6. Noise reduction, and detection and classification of surface defects of fish fillets. (a) Original image of fillet, (b) binary image of fillet, (c) results from classification of spots by size and shape.
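The per-area colour computation described above can be sketched roughly as follows: the grey-level histogram of one predefined area is smoothed with a moving average, two thresholds are placed around its main mode, and only the pixels between them contribute to the average flesh colour, so that dark bloodspots and bright bones or skin are excluded. The window size and the rule used to place the two thresholds are assumptions made for this example, not the authors' exact procedure.

```python
import numpy as np

def flesh_colour(area_gray, win=5, drop=0.5):
    """Average grey value of the 'healthy flesh' pixels in one predefined area.
    The histogram is smoothed with a moving average; the accepted grey-level
    interval is where the smoothed histogram stays above a fraction `drop`
    of its peak, so dark bloodspots and bright bones/skin are left out."""
    hist = np.bincount(area_gray.ravel(), minlength=256).astype(float)
    kernel = np.ones(win) / win
    smooth = np.convolve(hist, kernel, mode="same")
    peak = int(smooth.argmax())
    above = smooth >= drop * smooth[peak]
    lo = peak
    while lo > 0 and above[lo - 1]:
        lo -= 1
    hi = peak
    while hi < 255 and above[hi + 1]:
        hi += 1
    flesh = area_gray[(area_gray >= lo) & (area_gray <= hi)]
    return lo, hi, float(flesh.mean())
```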
4. Concluding Remarks

By studying the work effort that people perform in the food industry, it is clear that one of the main obstacles to automation is the need for intelligent, human-like operation of the machines. That is, the machine has to be able to sense the food product to adjust and optimise the handling of the food. If increased automation in the food industry is to come, it has to rely on intelligent sensing, where computer vision will play a major role. In this chapter we have discussed the use of computer vision in the food industry. To be successful in this type of application special attention has to be put into the development of the image acquisition part of the vision system. This includes study of the object characteristics, lighting and viewing techniques, sensors and optics. The image processing algorithms have to be able to work in real time on randomly oriented objects of varying size and shape. This processing demand limits what type of algorithms can be used, and imposes the need for special purpose arithmetic units to perform the processing. There exist several commercial applications today where computer vision is used to guide food handling and sorting. Nevertheless it is a fact that this field has
been growing more slowly than expected in the past five years. The main reason is that people underestimated the difficulties of applying this technique to objects of varying size and shape, as food products are. These characteristics in fact demanded a processing power that was not available at a price acceptable to the food industry. The evolution of the computer industry is also clear, as the price is still going down and the performance of the systems is increasing. In recent years there has been significant development in sensors, for example the Charge Injection Cameras [41] offering random scanning of the sensor. Another interesting development is in intelligent cameras^e, where, using CMOS technology, the sensor and driving circuit are integrated on one chip. This makes it possible to access the sensor faster than in conventional CCD cameras. Taking this a step further, a processor and I/O chips can be added to the system, resulting in a vision system capable of fast and effective preprocessing. An example of this is the 3-D range system^f which can process up to 70,000 range values/s on low cost hardware. This kind of intelligent system will open up a range of new applications for vision systems where the price/performance ratios were too high before. Because of the continuous evolution of the technology we believe that it is only a question of time before computer vision plays a major role in controlling and handling of food products.

^e VISI Vision Ltd. (1996), Product Information, Aviation House, 31 Pinkhill, Edinburgh, UK.
^f IVP AB (1996), Product Information, Teknikringen 2C, S-58330 Linkoping, Sweden.
References

[1] R. C. Harrell, D. C. Slaughter and P. D. Adsit, A fruit-tracking system for robotic harvesting, Machine Vision and Applications 2 (1989) 69-80.
[2] C. Pellerin, CRE cross the pond with a DAB hand for food inspection, Sensor Review 11, 4 (1991) 17-19.
[3] K. Khodabandeloo, Getting down to the bare bones, The Industrial Robot 16, 3 (1989) 160-165.
[4] A. MacAndrew and C. Harris, Sensors detect no food contamination, Sensor Review 11, 4 (1991) 23-26.
[5] Marel H/F, 1989. Product Information, Reykjavik, Iceland.
[6] W. V. D. Sluis, A camera and PC can now replace the quality inspector, Misset-World Poultry 7, 10 (1991) 29-31.
[7] R. K. Dyche, INEX, 100 per cent on-line visual inspection of consumer products, Sensor Review 11, 4 (1991) 14-17.
[8] L. F. Pau and R. Olafsson (eds.), Fish Quality Control by Computer Vision (Marcel Dekker, New York, 1991).
[9] H. Arnarson, Fish Sorting Using Computer Vision, Ph.D. report LD 78, EMI, Technical University of Denmark, 1990.
[10] N. J. C. Strachan and C. K. Murray, Image analysis in the fish and food industries, in [5].
[11] Lumitech, 1988. Product Information, Copenhagen, Denmark.
[12] Baader, 1990. Product Information, Lubeck, Germany.
[13] B. G. Batchelor, D. A. Hill and D. C. Hodgson, Automated Visual Inspection (IFS, Bedford, UK, 1985).
[14] A. Novini, Fundamentals of machine vision lighting, Proc. SPIE, Vol. 728, 1987, 84-92.
[15] D. Poussard and D. Laurendeau, 3-D sensing for industrial computer vision, in J. L. C. Sanz (ed.), Advances in Machine Vision (Springer, New York, 1988).
[16] J. Petursson, Optical spectra of fish flesh and quality defects in fish, in L. F. Pau and R. Olafsson (eds.), Fish Quality Control by Computer Vision (Marcel Dekker, 1991) 45-70.
[17] Pulsar, 1990. Product Information, Eindhoven, Holland.
[18] D. L. Hawley, Final Report: Fish Parasite Research, Federal grant No. NA-85-ABH-00057, USA, 1988.
[19] H. Hafsteinsson and S. S. H. Rizvi, Journal of Food Protection 50, 1 (1987) 70-84.
[20] H. H. Huss, P. Sigsgaard and S. A. Jensen, Fluorescence of fish bones, Journal of Food Protection 48, 5 (1984) 393-396.
[21] K. Harding, Lighting & Optics Tutorial, VISION'87, SME, Detroit, Jun. 1987.
[22] W. T. Welford, Aberrations of the Symmetrical Optical System (Academic Press, New York, 1974).
[23] Oriel Corporation, Optics and Filters, Vol. III, Stratford, CT, 1990.
[24] J. S. Weszka, A survey of threshold selection techniques, Comput. Graph. Image Process. 7 (1978) 259-265.
[25] N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. Syst. Man Cybern. 9, 1 (1979) 62-66.
[26] M. D. Levine, Vision in Man and Machine (McGraw-Hill, New York, 1985).
[27] P. Maragos and R. W. Schafer, Morphological skeleton representation and coding of binary images, IEEE Trans. Acoust. Speech Signal Process. 34 (1986) 1228-1244.
[28] M. K. Hu, Visual pattern recognition by moment invariants, IRE Trans. Inf. Theory 8 (1962) 179-187.
[29] J. Serra, Image Analysis and Mathematical Morphology (Academic Press, New York, 1982).
[30] H. Arnarson and L. F. Pau, Shape classification in computer vision by the syntactic, morphological and neural processing technique PDL-HM, in Proc. ESPRIT-BRA Workshop on Specialized Processors for Real Time Image Analysis, Barcelona, Spain, Sept. 1991.
[31] K. Fukunaga, Introduction to Statistical Pattern Recognition (Academic Press, New York, 1972) 260-267.
[32] A. K. C. Wong, Knowledge representation for robot vision and path planning using attributed graphs and hypergraphs, in A. K. C. Wong and A. Pugh (eds.), Machine Intelligence Knowledge Engineering Robotic Applications (Springer-Verlag, New York, 1977).
[33] J. Kittler and M. J. B. Duff (eds.), Image Processing System Architectures (Research Studies Press Ltd, UK, 1985).
[34] L. Uhr, K. Preston, S. Levialdi and M. J. B. Duff (eds.), Evaluation of Multicomputers for Image Processing (Academic Press, Orlando, 1986).
[35] J. L. C. Sanz, Which parallel architectures are useful/useless for vision algorithms?, Machine Vision and Applications 2, 3 (1989).
[36] J. Heldbo, Information teknologi og Productionsstyring i Konsumfiske industrien (in Danish), Ph.D. Report EF201, Technical University of Denmark, 1989.
[37] H. Arnarson, K. Bengoetxea and L. F. Pau, Vision applications in the fishing and fish product industries, Int. J. Pattern Recogn. Artif. Intell. 2, 4 (1988) 657-673.
[38] H. Arnarson, Fish and fish product sorting, in [5].
[39] Honeywell, USA, Product Information. [40] K. Bengoetxea, Lighting setup in the automatic detection of ventral skin and blood spots in cod fish fillets, Report No. 497, EMI, Technical University of Denmark, 1988. [41] CID Technologies Inc., 1988. Product Information, Liverpool, USA.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 711-736 Eds. C. H. Chen, L. F. Pau and P. S. P. Wang © 1998 World Scientific Publishing Company
CHAPTER 4.2 APPROACHES TO TEXTURE-BASED CLASSIFICATION, SEGMENTATION AND SURFACE INSPECTION
MATTI PIETIKAINEN, TIMO OJALA and OLLI SILVEN
Machine Vision Group, Infotech and Department of Electrical Engineering, University of Oulu, FIN-90570 Oulu, Finland
E-mail: {mkp,skidi,olli}@ee.oulu.fi

Over the last few years significant progress has been made in applying methods using distributions of feature values to texture analysis. Very good performance has been obtained in various texture classification and segmentation problems. This chapter overviews recent progress and presents some examples to demonstrate the efficiency of the approach. Problems of analyzing textured surfaces in industrial applications are also discussed. A general overview of the problem space is given, presenting sets of solutions proposed and their prerequisites.

Keywords: Texture, classification, segmentation, feature distributions, visual inspection, quality control.
1. Introduction

Texture analysis is important in many applications of computer image analysis for classification, detection, or segmentation of images based on local spatial variations of intensity or color. Important applications include industrial and biomedical surface inspection, for example for defects and disease, ground classification and segmentation of satellite or aerial imagery, segmentation of textured regions in document analysis, and content-based access to image databases. Most of the texture analysis applications can be regarded as texture classification or segmentation problems, or as a combination of both. A wide variety of techniques for discriminating textures have been proposed. For recent surveys of texture analysis, see [1-5]. The methods can be divided into four categories: statistical, geometrical, model-based and signal processing. Various methods of each category are described in Chap. 2.1 [4]. Among the most widely used approaches are statistical methods based on co-occurrence matrices of second-order gray level statistics or first-order statistics of local property values (difference histograms), signal processing methods based on local linear transforms, multichannel Gabor filtering or wavelets, and model-based methods based on Markov random fields or fractals. In texture classification, the goal is to assign an unknown sample to one of several predefined categories. The choice of proper texture measures is crucial for
the performance of classification. Most of the approaches to texture classification quantify texture measures by single values (means, variances etc.), which are then concatenated into a feature vector. The feature vector is fed to an ordinary statistical pattern recognition procedure or neural network for purposes of classification. Recent research results, however, demonstrate that methods based on comparison of distributions of feature values provide very good classification performance for various types of textures [6,7], and should be considered for many applications. Segmentation is a process of partitioning the image into regions which have more or less homogeneous properties with respect to color, texture, etc. [8]. The methods of texture segmentation are usually classified as region-based, boundary-based or as a hybrid of the two. For a recent survey of texture segmentation techniques, see [3]. The segmentation can be supervised or unsupervised. In unsupervised segmentation, no a priori information about the textures present in the image is available. This makes it a very challenging research problem in which only limited success has been achieved. The application of the histogram comparison approach to texture segmentation appears to provide significant improvement in performance [9]. There are many potential areas of application for texture analysis in industry [10-12], but only a limited number of examples of successful exploitation of texture exist. A major problem is that textures in the real world are often not uniform, due to changes in orientation, scale or other visual appearance. In addition, the degree of computational complexity of many of the proposed texture measures is very high. Before committing effort into selecting, developing and using texture techniques in an application, it is necessary to thoroughly understand its requirements and characteristics. Section 2 of this chapter presents an approach to texture classification using feature distributions. It describes two simple but powerful texture measures and a classification principle based on comparison of feature distributions, and demonstrates the performance of the approach in a case study. Section 3 describes a powerful unsupervised texture segmentation method utilizing texture discrimination techniques discussed in Section 2. The focus of Section 4 is more application specific and not directly linked to the previous sections. The use of texture-based approaches in industrial surface inspection problems is considered.

2. Texture Classification Using Feature Distributions

In texture classification an unknown image sample is assigned to one of a priori known texture categories (Fig. 1). The features derived from the unknown sample are compared to the features of each category in the training set, and the sample is assigned to the class with the closest match. The performance of texture classification is largely dependent on the efficiency of the texture features used. Chap. 2.1 [4] provides a description of various types of texture features and Chap. 2.2 [5] discusses model-based approaches in more detail.
can be obtained by using distributions of simple texture measures, like absolute gray level differences, local binary patterns and center-symmetric auto-correlation. The performance is usually further improved with the use of two-dimensional distributions of joint pairs of complementary features [7]. In experiments involving various applications we have obtained very good results with the distribution-based classification approach. Among the studies we conducted are the determination of the composition of mixtures of two materials [24] (Section 2.3) and of the average grain size of chrome concentrate [25] (Section 4.2.1), metal strip inspection [26], discrimination of melanoma (skin cancer) cell samples from naevus cell samples [27], and rotation-invariant texture classification [28]. A similar approach has also been successfully applied to accurate color discrimination [29].

2.1.1. Gray level difference method
The method based on histograms of absolute differences between pairs of gray levels or of average gray levels has been successfully used for texture classification, for example in [13,30,7]. It should be noted that the difference histograms can also be derived from the co-occurrence matrices which were described in Chap. 2.1 [4]. For any given displacement d = (dx, dy), where dx and dy are integers, let f'(x, y) = |f(x, y) - f(x + dx, y + dy)|. Let P' be the probability density function of f'. If the image has m gray levels, this has the form of an m-dimensional vector whose ith component is the probability that f'(x, y) will have the value i. P' can be easily computed by counting the number of times each value of f'(x, y) occurs. For a small d the difference histograms will peak near zero, while for a larger d they are more spread out. As an example consider the 4 x 4 image used in Chap. 2.1:
1 1 0 0
1 1 0 0
0 0 2 2
0 0 2 2
The difference histogram for this image for a displacement vector of d = (1,0) is DIFFX = [8, 2, 2]. The rotation invariant feature DIFF4 used in Section 2.3 is computed by accumulating, in the same one-dimensional histogram, the absolute gray level differences in all four principal directions at the chosen displacement D. If D = 1, for example, the displacements d = (0,1), (1,1), (1,0) and (1,-1) are considered.
The difference histogram approach is not invariant with respect to gray scale variance, which means that the textures to be analyzed should be gray scale corrected by, e.g., histogram equalization.
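As an illustration of the gray level difference features, the following Python sketch (our own code, not part of the original chapter) computes the absolute difference histogram for one displacement and the rotation-invariant DIFF4 histogram that accumulates the four principal directions; function names and the use of NumPy are assumptions made for this example.

```python
import numpy as np

def diff_histogram(image, dx, dy, levels=256):
    """Histogram of absolute gray level differences for displacement (dx, dy)."""
    h, w = image.shape
    # Crop the image and its shifted copy so every pixel has a displaced partner.
    y0, y1 = max(0, -dy), min(h, h - dy)
    x0, x1 = max(0, -dx), min(w, w - dx)
    a = image[y0:y1, x0:x1].astype(int)
    b = image[y0 + dy:y1 + dy, x0 + dx:x1 + dx].astype(int)
    return np.bincount(np.abs(a - b).ravel(), minlength=levels)

def diff4_histogram(image, D=1, levels=256):
    """DIFF4: differences accumulated over the four principal directions
    (0, 45, 90 and 135 degrees) at displacement D, as described above."""
    hist = np.zeros(levels, dtype=int)
    for dx, dy in [(0, D), (D, D), (D, 0), (D, -D)]:
        hist += diff_histogram(image, dx, dy, levels)
    return hist

# The 4 x 4 example image and displacement d = (1, 0) from the text:
img = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 2, 2], [0, 0, 2, 2]])
print(diff_histogram(img, 1, 0, levels=3))   # -> [8 2 2]
```

In practice the histograms would be quantized into a fixed number of bins and compared with the G statistic described in Section 2.2.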
2.1.2. Local binary patterns

Recently, Ojala et al. [7] introduced a Local Binary Pattern (LBP) texture operator, which is a two-level version of the texture operator proposed by Wang and He [31]. The LBP histogram computed over a region is then used for classification. LBP provides knowledge about the spatial structure of the local image texture. However, LBP does not address the contrast of texture, which is important in the discrimination of some textures. For this purpose, we can combine LBP with a simple contrast measure C. By considering joint occurrences of LBP and C we usually achieve better discrimination than with LBP alone. The descriptions of LBP and C are shown in Fig. 2. For each local 3 x 3 neighborhood with center pixel P0 and neighbors P1, ..., P8:

1. Threshold the pixels Pi by the value of the center pixel: P'_i = 1 if Pi >= P0, and P'_i = 0 if Pi < P0.
2. Count the number n of resulting non-zero pixels: n = \sum_{i=1}^{8} P'_i.
3. Calculate the local binary pattern LBP = \sum_{i=1}^{8} P'_i 2^{i-1}.

The contrast C is the difference between the average gray level of the neighbors at or above the center value and the average gray level of the neighbors below it. In the example of Fig. 2, LBP = 1 + 8 + 32 + 128 = 169 and C = (6+7+9+7)/4 - (5+2+1+3)/4 = 4.5.

Fig. 2. Computation of Local Binary Pattern (LBP) and contrast measure C.
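A minimal Python sketch of the LBP and C computation for a single 3 x 3 neighborhood, following the three steps above; the neighbor ordering and the helper name are our own choices (any fixed ordering of the eight weights gives an equivalent operator).

```python
import numpy as np

def lbp_and_contrast(window):
    """LBP and contrast C of one 3 x 3 window (cf. Fig. 2)."""
    window = np.asarray(window, dtype=float)
    center = window[1, 1]
    # The eight neighbors of the center pixel, in a fixed order.
    neighbors = np.array([window[0, 0], window[0, 1], window[0, 2],
                          window[1, 2], window[2, 2], window[2, 1],
                          window[2, 0], window[1, 0]])
    bits = (neighbors >= center).astype(int)        # step 1: thresholding
    n = bits.sum()                                   # step 2: count of ones
    lbp = int((bits * 2 ** np.arange(8)).sum())      # step 3: weights 2**(i-1)
    if 0 < n < 8:
        contrast = neighbors[bits == 1].mean() - neighbors[bits == 0].mean()
    else:
        contrast = 0.0        # all neighbors on one side of the center value
    return lbp, contrast
```

Scanning an image region with this operator and accumulating the (LBP, quantized C) pairs yields the two-dimensional LBP/C histogram used for classification and segmentation in the following sections.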
LBP and LBP/C perform well also for small image regions (e.g. 16 x 16 pixels), which is very important in segmentation applications. A simple way to define a “multiresolution” LBP would be to choose the eight neighbors of the center pixel from the corresponding positions in different neighborhoods (3 x 3, 5 x 5, 7 x 7, etc.). By definition, LBP is invariant against any monotonic gray scale transformation.
The method is rotation variant, which is undesirable in certain applications. A rotation invariant version of LBP is considered in [32,28].
2.2. Classification Using Feature Distributions
A log-likelihood-ratio pseudo-metric, the G statistic, is used for comparing feature distributions in the following experiments, but it could be replaced with some other related method, like histogram intersection [33] or the statistical chi-square test. The value of the computed G statistic indicates the probability that the two sample distributions come from the same population: the higher the value, the lower the probability that the two samples are from the same population. The distribution of G asymptotically follows the chi-square distribution, but G has some theoretical advantages and it is computationally simpler. For a goodness-of-fit test the G statistic is

G = 2 \sum_{i=1}^{n} s_i \log(s_i / m_i),    (2.1)

where s and m are the sample and model distributions, n is the number of bins and s_i, m_i are the respective sample and model probabilities at bin i.

In the experiments presented in this section, a single model distribution for every class is not used. Instead, every sample is in its turn classified using the other samples as models; hence the leave-one-out approach is applied. The model samples are ordered according to their probability of coming from the same population as the test sample. This probability is measured by a two-way test of independence:
G = 2 [ \sum_{s,m} \sum_{i=1}^{n} f_i \log f_i - \sum_{s,m} ( \sum_{i=1}^{n} f_i ) \log ( \sum_{i=1}^{n} f_i ) - \sum_{i=1}^{n} ( \sum_{s,m} f_i ) \log ( \sum_{s,m} f_i ) + ( \sum_{s,m} \sum_{i=1}^{n} f_i ) \log ( \sum_{s,m} \sum_{i=1}^{n} f_i ) ],    (2.2)

where s, m are the two texture samples (test sample and model), n is the number of bins and f_i is the frequency at bin i. For a detailed derivation of the formula, see Sokal and Rohlf [34]. After the model samples have been ordered, the test sample is classified using the k-nearest neighbor principle, i.e. the test sample is assigned to the class of the majority among its k nearest models.

The feature distribution for each sample is obtained by scanning the texture image with the local texture operator. The distributions of local statistics are divided into histograms having a fixed number of bins. The histograms of features with continuous-valued output, like contrast C in LBP/C, are quantized by adding together the feature distributions of every single model image into a total distribution, which is divided into N bins having an equal number of entries. Hence, the cut values of the bins of the histograms correspond to the 100/N percentiles of the combined data (Fig. 3).
Fig. 3. Quantization of the feature space, when 4 bins are requested. Single distributions are added together in a total distribution (a), which is divided into 4 equal portions, i.e. the cut value between bin 0 and bin 1 corresponds to the 25% percentile of the combined data (b).
Deriving the cut values from the total distribution and allocating to every bin the same amount of the combined data guarantees that the highest resolution of the quantization is used where the number of entries is largest, and vice versa. The output of discrete operators like LBP does not require any quantization; the operator outputs are simply accumulated into a histogram. Empty bins are set to one. To compare distributions of complementary feature pairs, like LBP/C, the metric G is extended in a straightforward manner to scan through the two-dimensional histograms. If quantization of the feature space is required, it is done separately for both features using the same approach as with single features.
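The following Python sketch shows one way to implement the two-way G comparison of Eq. (2.2) and the k-nearest-neighbor rule described above; it is an illustrative outline with our own function names, not the authors' code, and it handles empty bins by the 0 log 0 = 0 convention rather than by setting them to one.

```python
import numpy as np
from collections import Counter

def g_statistic(s, m):
    """Two-way log-likelihood statistic G between two frequency histograms
    (Eq. (2.2)); a smaller value means the two samples are more likely to
    come from the same population."""
    f = np.stack([np.asarray(s, float), np.asarray(m, float)])   # 2 x n table
    xlogx = lambda a: np.where(a > 0, a * np.log(np.where(a > 0, a, 1.0)), 0.0)
    return 2 * (xlogx(f).sum()
                - xlogx(f.sum(axis=1)).sum()    # totals of the two samples
                - xlogx(f.sum(axis=0)).sum()    # totals of each bin
                + xlogx(f.sum()))               # grand total

def knn_classify(test_hist, model_hists, model_labels, k=3):
    """Assign the test sample to the majority class among its k nearest
    model histograms, nearness being measured by the G statistic."""
    dists = [g_statistic(test_hist, m) for m in model_hists]
    nearest = np.argsort(dists)[:k]
    votes = Counter(model_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

With a leave-one-out protocol, each sample histogram is classified by knn_classify against the histograms of all remaining samples, which is the test procedure used in the experiments below.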
2.3. Case: Determining Composition of Mixtures of Materials

Kjell proposed that the composition of grain mixtures can be determined by texture classification when different compositions are seen as different textures and the classification into discrete classes is considered as a measurement event [35]. The accuracy of the measurement is heavily dependent on the number of classes and the discriminative power of the texture features. Kjell examined the performance of Laws' texture energy measures [23] and ordinary feature vector based classification using images of eleven different mixtures of rice and barley as test material. He achieved promising results when using all nine of Laws' 3 x 3 features at the same time. About 61 percent of the samples were classified into correct classes, and misclassified samples were close to their own classes on the diagonal of the confusion matrix. A sample size of 128 x 128 pixels was used.

Recently, the distribution-based classification approach was applied to the same problem [24]. In order to have comparable test material, eleven different mixtures of rice and barley grain were prepared. Four images of each mixture were taken using a SONY 3 CCD DXC-755 color camera. The images were converted to 512 x 512 gray scale images with square pixels. Four test images of different mixtures are shown in Fig. 4.
Fig. 4. Rice 100% (a), Barley 100% (b), Rice 70%, barley 30% (c), Rice 30%, barley 70% (d).
We present results of distribution based classification for the gray scale difference feature DIFF4 (Section 2.1.1), with histograms quantized into 32 and 256 bins, respectively. For nearest-neighbor selection, a value of 3 was used for k (3-NN classification). The sample data was split into training and test sets using the leave-one-out method: the classifier was designed by choosing all but one sample for inclusion in the design set, and the single remaining sample was then classified. This procedure was repeated for all samples and the classification error rate was determined as the percentage of misclassified samples out of the total number of samples. The effects of sample size and image preprocessing were also examined. The samples were obtained by dividing the original 512 x 512 images into non-overlapping subimages, resulting in 176, 704, 2816 and 11264 samples in total for sample sizes of 256 x 256, 128 x 128, 64 x 64 and 32 x 32 pixels, respectively. Histogram equalization was performed prior to feature extraction to remove the effects of unequal brightness and contrast. It was applied to the whole set of 512 x 512 test images instead of the separate samples. Table 1 shows the results. The numbers in the tables denote the percentages of misclassified samples. The misclassification rate does not reveal how close the misclassified samples are to their correct classes, but it gives enough information to decide which features are suitable for this kind of application. More detailed information can be extracted from confusion matrices.
Table 1. Distribution based classification. DIFF4.

sample size        32 bins              256 bins
                   D = 1     D = 2      D = 1     D = 2
128 x 128          38.21     38.35      43.61     42.61
256 x 256          28.41     30.11      30.68     34.66
EQ 32 x 32         43.47     63.22       7.28     14.58
EQ 64 x 64         22.83     37.82       0.11      0.32
EQ 128 x 128       13.07     21.16       0.00      0.00
EQ 256 x 256       11.93     22.73       0.00      0.00

(EQ denotes samples taken from histogram equalized images.)
The results are tabulated for two different quantizations (32 and 256 bins) and displacements (D = 1, 2). Histogram equalization significantly improved the performance. The reason for this is that DIFF4 is not invariant with respect to gray scale variance, which means that the textures to be analyzed should be gray scale corrected in order to have gray level differences of equal scale in all textures. For example, using a 128 x 128 pixel sample size and distributions with 256 bins, the error for DIFF4 was reduced from 43.61% to 0.00%. Even with 64 x 64 samples, the classification error was as small as 0.11%; this means that only 3 out of all 2816 samples were misclassified. All these misclassified samples were assigned to the class neighboring their correct class, which means that the measurement error of the composition for these samples was 10%. The sample size 32 x 32 appeared to be too small for this kind of measurement purpose: even though the total error rate was only 7.28% for D = 1, many samples were classified far away from their correct classes. One reason for this might be that with a sample size as small as 32 x 32 the actual composition within a sample does not necessarily match the nominal mixture.
Table 2. Robustness tests. DIFF4 for sample size 64 x 64.

image size (k)        32 bins              256 bins
                      D = 1     D = 2      D = 1     D = 2
512 x 512 (k = 1)     22.43     42.40      0.11      0.64
512 x 512 (k = 3)     22.83     37.82      0.11      0.32
512 x 512 (k = 5)     22.02     36.08      0.11      0.32
256 x 256 (k = 1)     36.79     55.11      0.43      0.00
256 x 256 (k = 3)     32.81     51.28      0.00      0.00
256 x 256 (k = 5)     34.52     50.71      0.00      0.14
In order to test the robustness of the presented approach, experiments with three different values of k (k = 1, 3, 5) and with two image resolutions (the original 512 x 512 images and 256 x 256 images obtained by bilinear interpolation from the original images) were performed. A sample size of 64 x 64 was used. Table 2 presents the results. It can be seen that distributions with 256 bins provide robust performance with respect to variations of k, displacement D, and image resolution, achieving error rates of 0.64% or less in all cases.

3. Texture Segmentation Using Feature Distributions
Segmentation of an image into differently textured regions is a difficult problem. Usually one does not know a priori what types of textures exist in an image, how many textures there are, and which regions have which textures [4]. In order to distinguish reliably between two textures, relatively large samples of them must be examined, i.e. relatively large blocks of the image. But a large block is unlikely to be entirely contained in a homogeneously textured region, and it becomes difficult to correctly determine the boundaries between regions. The performance of texture segmentation is largely dependent on the performance of the texture features used. The features should easily discriminate various types of textures. The window size used for computing textural features should be small enough to be useful for small image regions and to provide small error rates at region boundaries. Recent comparative studies performed by Ohanian and Dubes [15] and Ojala et al. [7] indicate that texture measures based on co-occurrence matrices, difference histograms and local binary patterns (LBP) perform very well for various types of textures. The performance of these features is good for small window sizes as well.

Recently, an unsupervised texture segmentation algorithm utilizing the LBP/C texture measure (Section 2.1.2) and histogram comparison (Section 2.2) was developed. The method has performed very well in experiments. It is not sensitive to the selection of parameter values, does not require any prior knowledge about the number of textures or regions in the image, and seems to provide significantly better results than existing unsupervised texture segmentation approaches. The method can be easily generalized, e.g. to utilize other texture features, multiscale information, color features, and combinations of multiple features. This section presents an overview of the method and shows some experimental results.

3.1. Segmentation Method

The segmentation method consists of three phases: hierarchical splitting, agglomerative merging and pixelwise classification. First, hierarchical splitting is used to divide the image into regions of roughly uniform texture. Then, an agglomerative merging procedure merges similar adjacent regions until a stopping criterion is met. At this point, we have obtained rough estimates of the different textured regions present in the image, and we complete the analysis by a pixelwise classification to improve the localization.
Fig. 5. Texture mosaic #1; the main sequence of the segmentation algorithm.
The method does not require any prior knowledge about the number of textures or regions in the image, as many existing approaches do. Figure 5 illustrates the steps of the segmentation algorithm on a 512 x 512 mosaic containing five different Brodatz [36] textures.

3.1.1. Hierarchical splitting
A necessary prerequisite for the agglomerative merging to be successful is that the individual image regions be uniform in texture. For this purpose, we apply a hierarchical splitting algorithm, which recursively splits the original image into square blocks of varying size. The decision on whether a block is split into four subblocks is based on a uniformity test. Using Eq. (2.2) we measure the six pairwise G distances between the LBP/C histograms of the four subblocks. If we denote the largest of the six G values by Gmax and the smallest by Gmin, the block is found to be non-uniform, and is thus split further into four subblocks, if a measure of relative dissimilarity within the region is greater than a threshold:
R = Gmax / Gmin > X.    (3.1)
Regarding the proper choice of X, one should rather choose too small a value than too large a value. It is better to split too much than too little, for the following agglomerative merging procedure is able to correct errors in cases where a uniform block of a single texture has been needlessly split, but error recovery is not possible if segments containing several textures are assumed to be uniform.

To begin with, we divide the image into rectangular blocks of size Smax. If we applied the uniformity test to arbitrarily large image segments, we could fail to detect small texture patches and end up treating regions containing several textures as uniform. The next step is to apply the uniformity test: if a block does not satisfy the test, it is divided into four subblocks. This procedure is repeated recursively on each subblock until a predetermined minimum block size Smin is reached. It is necessary to set a minimum limit on the block size, for the block has to contain a sufficient number of pixels for the LBP/C histogram to be reliable.
Figure 5(b) illustrates the result of the hierarchical splitting algorithm with X = 1.2, Smax = 64 and Smin = 16. As expected, the splitting goes deepest around the texture boundaries.
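A compact sketch of the hierarchical splitting step, reusing the g_statistic helper sketched in Section 2.2; the recursion structure, the histogram callback and the parameter defaults are our own illustrative choices rather than the authors' implementation.

```python
from itertools import combinations

def split_block(image, x, y, size, histogram, X=1.2, s_min=16):
    """Recursively split a square block into four subblocks as long as the
    relative dissimilarity R = Gmax / Gmin of the subblock histograms exceeds
    the threshold X (Eq. (3.1)). Returns a list of (x, y, size) blocks of
    roughly uniform texture; histogram() computes e.g. the LBP/C histogram."""
    if size <= s_min:
        return [(x, y, size)]
    half = size // 2
    quads = [(x, y), (x + half, y), (x, y + half), (x + half, y + half)]
    hists = [histogram(image[yy:yy + half, xx:xx + half]) for xx, yy in quads]
    g = [g_statistic(hists[i], hists[j]) for i, j in combinations(range(4), 2)]
    if min(g) > 0 and max(g) / min(g) > X:          # non-uniform block: split
        blocks = []
        for xx, yy in quads:
            blocks += split_block(image, xx, yy, half, histogram, X, s_min)
        return blocks
    return [(x, y, size)]
```

The full splitting phase would first tile the image into blocks of size Smax and then call split_block on each of them.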
3.1.2. Agglomerative merging

Once the image has been split into blocks of roughly uniform texture, we apply an agglomerative merging procedure, which merges similar adjacent regions until a stopping criterion is satisfied. At a particular stage of the merging, we merge the pair of adjacent segments which has the smallest Merger Importance (MI) value. MI is defined as

MI = p x G,    (3.2)

where p is the number of pixels in the smaller of the two regions and G is the distance measure defined in Eq. (2.2). In other words, at each step the procedure chooses, of all possible merges, the merge which introduces the smallest change in the segmented image. Once the pair of adjacent segments with the smallest MI value has been found, the regions are merged and the two respective LBP/C histograms are summed to form the histogram of the new image region. Before moving to the next merge we compute the G distances between the new region and all regions adjacent to it.

Merging is allowed to proceed until the stopping rule

MIR = MIcur / MImax > Y    (3.3)

triggers, i.e. merging is halted if MIR, the ratio of MIcur, the Merger Importance of the current best merge, to MImax, the largest Merger Importance of all the preceding merges, exceeds a preset threshold Y. The threshold Y determines the scale of texture differences in the segmentation result and therefore the choice of Y depends on the application. In theory, it is possible that the very first merges have a zero MI value (i.e. there are adjacent regions with identical LBP/C histograms), which would lead to a premature termination of the agglomerative merging phase. To prevent this, the stopping rule is not evaluated for the first 10% of all possible merges. Figure 5(c) shows the result of the agglomerative merging phase after 174 merges. The MIR of the 175th merge is 9.5 and the merging procedure stops. For comparison, the highest MIR value up to that point had been 1.2.

3.1.3. Pixelwise classification

If the hierarchical splitting and agglomerative merging phases have succeeded, we have obtained quite reliable estimates of the different textured regions present in the image. Treating the LBP/C histograms of the image segments as our texture models, we switch to a texture classification mode. If an image pixel is on the boundary of at least two distinct textures (i.e. the pixel is 4-connected to at least one pixel with a different label), we place a discrete disc with radius r on the pixel
and compute the LBP/C histogram over the disc. The reason for using a discrete disc instead of a square window is that the latter weighs the four principal directions unequally. We compute the G distances between the histogram of the disc and the models of the regions which are 4-connected to the pixel in question. We relabel the pixel if the label of the nearest model is different from the current label of the pixel and there is at least one 4-connected adjacent pixel with the tentative new label. The latter condition improves the smoothness of texture boundaries and decreases the probability of small holes occurring inside the regions. If the pixel is relabeled, i.e. it is moved from an image segment to an adjacent segment, we update the corresponding texture models accordingly; hence the texture models become more accurate during the process. Only those pixels at which the disc is entirely inside the image are examined; hence the final segmentation result will contain a border r pixels wide. In the next sweep over the image we only check the neighborhoods of those pixels which were relabeled in the previous sweep. The process of pixelwise classification continues until no pixels are relabeled or a maximum number of sweeps is reached. This maximum is set to two times Smin, based on the reasoning that the boundary estimate of the agglomerative merging phase can be at most this far away from the "true" texture boundary. Setting an upper limit on the number of iterations ensures that the process will not wander around endlessly if the disc is not able to capture enough information about the local texture to be stable. According to our experiments the algorithm generally converges quickly with homogeneous textures, whereas with locally stochastic natural scenes the maximum number of sweeps may be used. Figure 5(d) demonstrates the final segmentation result after the pixelwise classification phase. A disc with a radius of 11 pixels was used and 16 sweeps were needed. The segmentation error is 1.7%.
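To make the bookkeeping of the merging phase (Sec. 3.1.2) concrete, the following sketch implements the MI / MIR loop under an assumed data representation (region histograms as NumPy arrays, adjacency as a set of frozen pairs); it is illustrative only and omits the pixelwise refinement step.

```python
def agglomerative_merge(hists, sizes, adjacency, Y=2.0):
    """Merge adjacent regions in order of increasing Merger Importance
    MI = p * G (Eq. (3.2)) until MIR = MIcur / MImax > Y (Eq. (3.3)).

    hists: {region_id: histogram}, sizes: {region_id: pixel count},
    adjacency: set of frozenset({a, b}) for adjacent region ids a, b."""
    def mi(pair):
        a, b = tuple(pair)
        return min(sizes[a], sizes[b]) * g_statistic(hists[a], hists[b])

    n_merges = len(hists) - 1
    mi_max = 0.0
    for step in range(n_merges):
        if not adjacency:
            break
        pair = min(adjacency, key=mi)            # cheapest possible merge
        mi_cur = mi(pair)
        # Stopping rule, not evaluated for the first 10% of possible merges.
        if step > 0.1 * n_merges and mi_max > 0 and mi_cur / mi_max > Y:
            break
        mi_max = max(mi_max, mi_cur)
        a, b = tuple(pair)
        hists[a] = hists[a] + hists[b]           # sum the LBP/C histograms
        sizes[a] += sizes[b]
        del hists[b], sizes[b]
        # Rewire the adjacency relation of b to point at the merged region a.
        adjacency = {frozenset({a if r == b else r for r in p})
                     for p in adjacency if p != pair}
        adjacency = {p for p in adjacency if len(p) == 2}
    return hists, sizes
```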
3.2. Experimental Results

Next we present some quantitative results obtained with the method. The segmentation results for three texture mosaics and a natural scene are presented. The same set of parameter values was used for all texture mosaics to demonstrate the robustness of the approach: b = 8 (number of bins for the contrast measure C in the texture transform), Smax = 64 (largest allowed block size in the hierarchical splitting algorithm), Smin = 16 (smallest allowed block size in the hierarchical splitting algorithm), X = 1.2 (threshold determining when a block is divided into four subblocks in the hierarchical splitting algorithm), Y = 2.0 (threshold determining when the agglomerative merging algorithm is to be halted), and r = 11 (radius of the disc in the pixelwise classification algorithm).

For each image, we provide the original image, the rough segmentation result after the agglomerative merging phase and the final segmentation result after the pixelwise classification phase. The segmentation results are superposed on the original image and the final segmentation result contains only the area processed by the disc.
Fig. 7. Texture mosaic #3.
Also, in each case we collect the values MIRstop and MIRhi: MIRstop is the value of MIR when the agglomerative merging is stopped and MIRhi is the highest MIR of the preceding merges. The relationship between MIRstop, MIRhi and the threshold Y reflects the reliability of the result of the agglomerative merging phase.

Mosaic #2 is a 512 x 512 image with a background made by a GMRF process and four distinct regions; the square and the circle are painted surfaces with different surface roughnesses and the ellipse and the triangle are made by a fractal process [15]. As we can see from the values of MIRstop (8.1) and MIRhi (1.2), the rough estimate (Fig. 6(b)) of the texture regions is obtained relatively easily. Figure 6(b) contains 4.6% and Fig. 6(c) only 1.9% misclassified pixels.

Mosaic #3 (Fig. 7(a)), which is 384 x 384 pixels in size, is composed of textures taken from outdoor scenes [37]. Our method gives a very good segmentation result of 2.1%. Note that the pixelwise classification clearly improves the result of the agglomerative merging phase (7.8%). The difference between MIRstop (2.8) and MIRhi (1.2) is still noticeable, but by far the smallest in the three cases, reflecting the inherent difficulty of this problem.

We also applied the texture segmentation method to natural scenes. The scenes were originally in RGB format [38], but we converted them to gray level intensity
by straightforwardly computing the desired feature for suitably symmetrical discrete neighborhoods of any size, such as disks, or boxes of odd and even size. The remaining question is how to combine the multiple feature channels obtained with several features and/or scales. We can hardly expect to reliably estimate joint distributions for a large number of features; moreover, multidimensional histograms with large numbers of bins are computationally intensive and consume a lot of memory. An alternative is to use an approximation with marginal distributions: each feature is employed separately, as a 1-D histogram, to compute a similarity score such as G, and the individual scores are then integrated into an aggregate similarity score, as was done in recent texture and color classification experiments [32,29]. Combined with a carefully chosen set of nonredundant complementary features, this can be expected to further improve the performance of the segmentation method. In a similar way, joint pairs of features, like LBP/C, can be combined with other single features or feature pairs. It would also be possible to use single features or joint features one by one, e.g. by first comparing the uniformity of regions with respect to texture and then with respect to color.

4. Texture Analysis in Surface Inspection

Textured materials may have defects that should be detected and identified, as in crack inspection of concrete or stone slabs, or the quality characteristics of the surface should be measured, as in granulometry. In many applications both objectives must be pursued simultaneously, as is regularly the case with wood, steel and textile inspection. Because these and most natural and manufactured surfaces are textured, one would expect this application characteristic to be reflected in the methodological solutions used in practical automatic visual inspection systems. However, only a few examples of successful explicit exploitation of texture techniques in industrial inspection exist, while most systems, including many wood inspection devices, attempt to cancel out or disregard the presence of texture, transforming the problems into ones solvable with other detection and analysis methods, e.g. as done by Dinstein et al. [39]. This is understandable given the high costs of texture inspection, and the fact that often the defects of interest are not textured, but embedded in texture, like cracks.

The inspection of textured surfaces is regularly treated more as a classification and less as a segmentation task, simply because the focus is on measuring the characteristics of regions and comparing them to previously trained samples. Actual working texture based industrial inspection solutions are available mostly for homogeneous periodic textures, such as wallpaper and fabric, where the patterns normally exhibit only minimal variation, making defect detection a two-category classification problem. Natural textures are more or less random with large non-anomalous deviations, as anyone can testify by taking a look at a wood surface; this results in the need to add features just to capture the range of normal variation, not to mention the detection and identification of defects.
Defect detection may require continuous adaptation or adjustment of features and methods based on the background characteristics, possibly resulting in a complex multi-category classification task already at the first step of inspection. Solutions providing adaptability have been proposed, among others, by Dewaele et al. [40] and Chetverikov [41]. Proprietary adaptation schemes are regularly used in commercial inspection systems.

In most industrial applications inspection systems must process 10-40 Mpixels/s per camera, thus requiring dedicated hardware for at least part of the system, so the calculation of each new texture feature can be a significant expense that should be avoided. Therefore, system developers try to select a few powerful, straightforwardly implementable features and tune them precisely for the application problem. A prototypical solution, depicted in Fig. 9, uses a bank of filters or texture transforms characterising the texture and also defect primitives, each transform producing a feature image that is used in either pixel-by-pixel or window based classification of the original image data.
Fig. 9. Typical methodological architecture of texture based inspection systems.
The dimensions of the filters used in applications have ranged up to 63 x 63 for pixel classification [42], while most implementors rely on 3 x 3 Laws' masks [23] or other convolution filters in the classification of partially overlapping or non-overlapping windows, e.g. based on means and variances of texture measures. The developments in feature distribution based classification of texture should have a major simplifying impact on future systems, as the techniques have recently matured to the brink of real applicability. The improved efficiency in using the texture measures cuts the number of features needed in an application, enables classifying small regions, and potentially reduces training effort by relieving the dimensionality problem of classification. Nevertheless, many applications will always demand dedicated techniques for the detection of their vital defects.

Regardless of the feature analysis methodology, the effort needed for training an inspection system to detect and identify defects against a sound background remains a key cost driver for system deployment and use. As texture inspection methods are notoriously fragile with respect to resolution, a minor change in the distance between the camera and the target may result in a need for retraining. This need may also arise from normal variations between product batches. Typically, training done in the laboratory turns out to be useless after an inspection system has been installed on-line.
Fig. 10. Alternatives for training defect detection methods. (a) Defect; (b) Pixel based training; and (c) Region based training.
Furthermore, on-line training performed by production personnel tends to concentrate on teaching in "near-misses" and "near-hits" rather than representative defects and background, so non-parametric classifiers should be favored.

Figure 10 shows two basic approaches to training defect detection. Pixel-based training assumes that a human operator is able to correctly pinpoint pixels belonging to defects in the image and pixels that are from sound background. In region-based training the operator roughly labels regions that contain a defect or defects, but may also have a substantial portion of sound background, while the non-labeled regions are assumed sound. We strongly advocate the latter approach, because it is less laborious, and because it is difficult for a human to precisely determine the boundaries of defects. It should be noticed that pixel based training disregards the transition region to the defect, the characteristics of which may have high importance. For instance, the grain around a suspected defect in a lumber board helps in discriminating frequent stray bark particles from minor knots.
Fig. 11. Categories of defects on textured surfaces: texture deviations, non-textured blobs, and line-type defects of known shape or meandering.
The detection and recognition of large defects on textured surfaces is relatively straightforward as changes of texture characteristics, but many defects are small local imperfections rather than 'real' texture defects, such as knots with exactly the color of the defectless background in wood. The detection of minor flaws against the background requires application specific knowledge. In addition, segmentation may be required for measuring the defects and determining their characteristics. Figure 11 presents a categorization of defects on textured surfaces. In the following we briefly review means for detecting defects from texture and then proceed to application examples on quality measurements and defect detection.
4.1. Detection of Defects from Texture
The detection and segmentation of "sufficiently" large defects in texture images can be performed reliably with pure texture measures, both for periodic and random textures, using the proposed texture measures [4]. But because texture is a statistical concept, texture measures are good only for regions that have the minimum size that allows the definition of the features [41]. The relative sizes of these minimum patches for various features and textures can be roughly concluded from the boundaries in texture segmentation results given in the literature: the lower the error, the smaller the defects that can be detected using that family of features. With a small patch size even most local texture imperfections can be detected, reducing, if not eliminating, the need for application specific detection solutions for purposes such as locating non-textured blobs against the background.

In practice, choosing the patch size for an application depends on the desired balance between false alarm and error escape rates. A smaller patch size increases the number of misdetections from normal variations, while using larger patches may contribute to detection failures. Normally all detections are subjected to further scrutiny, so in the end the patch size is defined by the general purpose computational resources available for detailed analysis.

The minimum patch size is smallest for periodic textures such as in textiles, which must be inspected for both large and small weaving flaws that are generally multiples of the mesh size. In textile inspection, Ade et al. [43] found that, using an imaging resolution of three pixels per mesh width and averaged outputs of 3 x 3 to 5 x 5 pixel filters derived via a Karhunen-Loeve expansion of the spatial covariance matrix, the diameter of the minimum patch is around 15 pixels. The smallest detected defects in [43] appear to be around 10% of the patch area. Neubauer [44], using approximately the same imaging resolution, exploited three 5 x 5 FIR filters and performed classification using histograms of features calculated from 10 x 10 pixel regions, achieving 1.6% false alarm and 9.3% escape rates. The tests with the LBP/C method for quasi-periodic textures (Fig. 5) performed by Ojala [9], with a 16 x 16 pixel patch size and distribution classification, detected 100% of the cases where more than about 25% of the block area did not belong to the same category. With natural textures (Fig. 7), the average detection threshold of other categories increased to around 35%. It is evident that the inspection accuracy may significantly benefit from dedicated methods for detecting small defects. Crack or scratch detection is undoubtedly the most common purpose for which specific techniques have been included in visual surface inspection systems.

4.1.1. Crack detection

The relative difficulty of detecting cracks depends on whether their shape and typical orientation is known a priori, whether they start from the edge of the object, and on whether the texture is periodic or random. A key problem is the
typically very small transverse dimensions and poor contrast of cracks: the human visual system may easily detect them, but they may actually consist of "chains" of nonadjacent single pixels in the image. In the worst case, the surface is randomly textured and the cracks may meander freely, starting and ending anywhere, leaving few application specific constraints that could be exploited.

The detection of cracks having a known shape is often a straightforward application of the Hough transform or RANSAC to edge detected or high-pass filtered versions of the image. For instance, Gerhardt et al. [45] used the Hough transform in this manner for detecting wrinkles in sandpaper. With meandering cracks, the problem of discriminating them from other high frequency components in the image is very difficult. If the texture is periodic or quasi-periodic, texture measures characterizing the background may be powerful enough for detecting their presence. An alternative, rather unusual, simple method for defect detection from periodic patterns, based on a model of human preattentive visual detection of pattern anomalies, has been proposed by Brecher [46]. Detection is performed by comparing local and global first order density statistics of contrast or edge orientation. Song et al. [47] have presented a trainable technique based on a pseudo-Wigner model for detecting cracks from periodic and random textures. The motivation behind selecting the technique is the better conjoint spatial and spatial-frequency resolution offered by the Wigner distribution when compared to Gabor, difference-of-Gaussians and spectrogram approaches; this is an important factor due to the localness of cracks. The technique is trained with defectless images. During inspection it produces probabilistic distance images that are then postprocessed using rough assumptions on the shape of the cracks in the application.

4.2. Application Cases
Before committing effort into selecting, developing and using texture techniques in an application, it is necessary to thoroughly understand its requirements and characteristics. The developer should consider at least the following questions:

• Is the surface periodically, randomly, or only weakly textured? Strongly periodic textures can be efficiently characterized using linear filtering techniques that are also relatively cheap to implement with off-the-shelf hardware. For random textures, LBP/C and gray level difference features with distribution based classification are computationally attractive and rank among the very best. With weakly textured surfaces, plain gray-level and color distribution based classification may work very well [48].
• Are any of the properties of the defects known? In particular, are there any defects that cannot be discriminated from the background by their color or intensity? Due to their cost, texture methods should usually be the last ones to be thrown in.
Texture methods are generally much better at characterizing surfaces than at detecting anomalies. Thus, whenever feasible, application specific non-texture solutions may be justified for detection, while texture measures may be powerful in eliminating false alarms and recognizing the defects.

The following application cases, particle size determination, carpet wear assessment and leather inspection, are demonstrations of analysis of random and quasi-periodic textures, and of defect detection from random textures, respectively.

4.2.1. Case 1: determination of particle size distribution

On-line measurement of the size distribution of granular materials, e.g. coke, minerals, pellets, etc., is a common problem in the process industry, where knowledge of the mean particle size and the shape of the distribution is used for control. The traditional particle size distribution measurement instruments, such as sieves, are suitable for off-line use in the laboratory. The off-the-shelf machine vision systems developed for this purpose are based on blob analysis and require mechanical set-ups for separating the particles from each other. Separation is often necessary, because smaller particles may fall into the spaces between the bigger ones and are no longer visible, so the particle size distribution of the surface may not be representative. This happens if the relative size range of the particle diameter is around 1.5 or larger.

Texture analysis has clear potential in granulometric applications, as has been demonstrated by the previous example on mixture determination in Section 2.3. In principle, a measurement instrument could be trained with pictures of typical distributions, but the preparation of samples with known distributions is a laborious task, making this approach unattractive. Furthermore, the training problem is amplified by the need for frequent recalibrations, because the appearance of the material may change with time. The desired approach is to train the instrument with sieved fractions of the material, or to eliminate the need for training, as is the case with particle separation based measurements.

Rautio et al. [25] performed distribution measurement experiments using chrome concentrate that was sieved into 15 fractions, 37 to 500 µm, for use as training samples; mixtures of three adjacent fractions were prepared for use as test samples. Various texture features, and distribution based and ordinary statistical classifiers, were used in the analysis. Figure 12 shows examples of the training material and mixtures, imaged at 7 x 7 µm resolution. The relative diameter range of particles in each mixture was 1.7, which results in only a minor "autosieving" phenomenon. Gray level differences were found to be the best performing features with all classification schemes. Using the G metric and a kNN classifier (k = 3), the error of the leave-one-out test for training samples was 6% when the gray level difference histograms with displacement 2 were used. The classification of mixture samples was
manufacturing shoes, belts, furniture and other leather goods, hides were selected on the basis of their characteristics and cut into pieces of various shapes using moulds, in a manner such that the pieces have the desired quality, taking into account acceptable minor defects. The defects can be categorized as area faults that are local variations of gray-level or texture, line defects that are often scars or folds of skin, and point faults that are groups of spots whose gray-levels differ from the background. The dimensions of the smallest defects that should be detected are around 2 mm.

A methodology for inspecting leather hides has been investigated by Wamback and his co-workers [49], who found that gray-level distributions for hides are symmetric even for the areas with defects, making plain histogram based detection schemes insufficient. They make the simplifying assumption that the gray values in the image are Gaussian distributed, and check whether the pixels in a 5 x 5 neighborhood are from the distribution determined for the good part of the hide using mean, variance and edginess tests. Because parts of the faults have the same characteristics as the defectless regions, the most deviating parts of the flaws are located first, using stricter confidence intervals and requiring a certain number of detections in the 5 x 5 neighborhood to avoid overdetection. The reported difficulties with the methodology were mostly with very small spot and weak line faults.

5. Conclusion

This chapter has presented some recent approaches to texture analysis using a classification principle based on comparison of distributions of feature values. The choice of proper texture measures for classification or segmentation is extremely important. Very good performance has been obtained by using distributions of simple texture features, like gray level difference histograms or local binary patterns, in various texture classification and segmentation problems. The results suggest that the presented approach should be considered for many practical applications of texture analysis.

Despite the progress in texture analysis methodology, the application of texture analysis to industrial problems is usually not easy. A major problem is that textures in the real world are often not uniform, due to changes in orientation, scale or other visual appearance. In addition, the degree of computational complexity of many of the proposed texture measures is very high. Before committing effort into selecting, developing and using texture techniques in an application, it is necessary to thoroughly understand its requirements and characteristics.
References

[1] L. Van Gool, P. Dewaele and A. Oosterlinck, Texture analysis anno 1983, Comput. Vision Graph. Image Process. 29 (1985) 336-357.
[2] R. M. Haralick and L. G. Shapiro, Computer and Robot Vision, Vol. 1 (Addison-Wesley, 1992).
[3] T. R. Reed and J. M. H. Du Buf, A review of recent texture segmentation and feature extraction techniques, CVGIP: Image Understanding 57 (1993) 359-372.
[4] M. Tuceryan and A. K. Jain, Texture analysis, in Handbook of Pattern Recognition and Computer Vision, 2nd edn., C. H. Chen, L. F. Pau and P. S. P. Wang (eds.) (World Scientific Publishing Co., Singapore, 1997).
[5] R. Chellappa, R. L. Kashyap and B. S. Manjunath, Model-based texture segmentation and classification, in Handbook of Pattern Recognition and Computer Vision, 2nd edn., C. H. Chen, L. F. Pau and P. S. P. Wang (eds.) (World Scientific Publishing Co., Singapore, 1997).
[6] D. Harwood, T. Ojala, M. Pietikäinen, S. Kelman and L. S. Davis, Texture classification by center-symmetric auto-correlation, using Kullback discrimination of distributions, Pattern Recogn. Lett. 16 (1995) 1-10.
[7] T. Ojala, M. Pietikäinen and D. Harwood, A comparative study of texture measures with classification based on feature distributions, Pattern Recogn. 29 (1996) 51-59.
[8] R. M. Haralick and L. G. Shapiro, Image segmentation techniques, Comput. Vision Graph. Image Process. 29 (1985) 100-132.
[9] T. Ojala and M. Pietikäinen, Unsupervised texture segmentation using feature distributions, Report CAR-TR-837, Center for Automation Research, University of Maryland, 1996.
[10] T. S. Newman and A. K. Jain, A survey of automated visual inspection, Comput. Vision Image Understanding 61 (1995) 231-262.
[11] M. Pietikäinen and T. Ojala, Texture analysis in industrial applications, in Image Technology - Advances in Image Processing, Multimedia and Machine Vision, J. L. C. Sanz (ed.) (Springer-Verlag, Berlin, 1996) 337-359.
[12] K. Y. Song, M. Petrou and J. Kittler, Texture defect detection: a review, SPIE Vol. 1708 Applications of Artificial Intelligence X: Machine Vision and Robotics, 1992, 99-106.
[13] J. Weszka, C. Dyer and A. Rosenfeld, A comparative study of texture measures for terrain classification, IEEE Trans. Syst. Man Cybern. 6 (1976) 269-285.
[14] J. M. H. Du Buf, M. Kardan and M. Spann, Texture feature performance for image segmentation, Pattern Recogn. 23 (1990) 291-309.
[15] P. P. Ohanian and R. C. Dubes, Performance evaluation for four classes of textural features, Pattern Recogn. 25 (1992) 819-833.
[16] W. Siedlecki and J. Sklansky, On automatic feature selection, Int. J. Pattern Recogn. Artif. Intell. 2 (1988) 197-220.
[17] P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach (Prentice-Hall, London, 1982).
[18] K. Fukunaga, Statistical pattern recognition, in Handbook of Pattern Recognition and Computer Vision, 2nd edn., C. H. Chen, L. F. Pau and P. S. P. Wang (eds.) (World Scientific Publishing Co., Singapore, 1997).
[19] Y.-H. Pao, Neural net computing for pattern recognition, in Handbook of Pattern Recognition and Computer Vision, 2nd edn., C. H. Chen, L. F. Pau and P. S. P. Wang (eds.) (World Scientific Publishing Co., Singapore, 1997).
[20] A. L. Vickers and J. W. Modestino, A maximum likelihood approach to texture classification, IEEE Trans. Pattern Anal. Mach. Intell. 4 (1982) 61-68.
[21] M. Unser, Sum and difference histograms for texture classification, IEEE Trans. Pattern Anal. Mach. Intell. 8 (1986) 118-125.
[22] D. Harwood, M. Subbarao and L. S. Davis, Texture classification by local rank correlation, Comput. Vision Graph. Image Process. 32 (1985) 404-411.
[23] K. I. Laws, Textured image segmentation, Report 940, Image Processing Institute, Univ. of Southern California, 1980.
[24] T. Ojala, M. Pietikäinen and J. Nisula, Determining composition of grain mixtures by texture classification based on feature distributions, Int. J. Pattern Recogn. Artif. Intell. 10 (1996) 73-82.
[25] H. Rautio, O. Silvén and T. Ojala, Grain size measurement using distribution classification, submitted to 10th Scandinavian Conf. Image Analysis.
[26] M. Pietikäinen, T. Ojala, J. Nisula and J. Heikkinen, Experiments with two industrial problems using texture classification based on feature distributions, SPIE Vol. 2354 Intelligent Robots and Computer Vision XIII, 1994, 197-204.
[27] J. Kontinen, J. Röning and R. M. MacKie, Texture features in classification of melanocytic samples, Computer Engineering Laboratory, University of Oulu, 1996.
[28] M. Pietikäinen, Z. Xu and T. Ojala, Rotation-invariant texture classification using feature distributions, submitted to 10th Scandinavian Conf. Image Analysis.
[29] M. Pietikäinen, S. Nieminen, E. Marszalec and T. Ojala, Accurate color discrimination with classification based on feature distributions, in Proc. 13th Int. Conf. Pattern Recognition, Vol. 3, Vienna, Austria, 1996, 833-838.
[30] L. Siew, R. Hodgson and E. Wood, Texture measures for carpet wear assessment, IEEE Trans. Pattern Anal. Mach. Intell. 10 (1988) 92-105.
[31] L. Wang and D. C. He, Texture classification using texture spectrum, Pattern Recogn. 23 (1990) 905-910.
[32] T. Ojala, Multichannel approach to texture description with feature distributions, Report CAR-TR-846, Center for Automation Research, University of Maryland, 1996.
[33] M. Swain and D. Ballard, Color indexing, Int. J. Comput. Vision 7 (1991) 11-32.
[34] R. R. Sokal and F. J. Rohlf, Introduction to Biostatistics (W. H. Freeman and Co., New York, 1987).
[35] B. Kjell, Determining composition of grain mixtures using texture energy operators, SPIE Vol. 1825 Intelligent Robots and Computer Vision XI, 1992, 395-400.
[36] P. Brodatz, Textures: A Photographic Album for Artists and Designers (Dover Publications, New York, 1966).
[37] A. K. Jain and K. Karu, Learning texture discrimination masks, IEEE Trans. Pattern Anal. Mach. Intell. 18 (1996) 195-205.
[38] D. K. Panjwani and G. Healey, Markov random field models for unsupervised segmentation of textured color images, IEEE Trans. Pattern Anal. Mach. Intell. 17 (1995) 939-954.
[39] I. Dinstein, A. Fong, L. Ni and K. Wong, Fast discrimination between homogeneous and textured regions, in Proc. 7th Int. Conf. Pattern Recognition, Montreal, Canada, 1984, 361-363.
[40] P. Dewaele, L. Van Gool, P. Wambacq and A. Oosterlinck, Texture inspection with self-adaptive convolution filters, in Proc. 9th Int. Conf. Pattern Recognition, Rome, Italy, 1988, 56-60.
[41] D. Chetverikov, Texture imperfections, Pattern Recogn. Lett. 6 (1987) 45-50.
[42] B. K. Ersbøll and K. Conradsen, Automated grading of wood slabs: The development of a prototype system, Industrial Metrology 2 (1992) 317-342.
[43] F. Ade, N. Lins and M. Unser, Comparison of various filter sets for defect detection in textiles, in Proc. 7th Int. Conf. Pattern Recognition, Montreal, Canada, 1984, 428-431.
[44] C. Neubauer, Segmentation of defects in textile fabric, in Proc. 11th Int. Conf. Pattern Recognition, Vol. 1, The Hague, The Netherlands, 1992, 688-691.
[45] L. A. Gerhardt, R. P. Kraft, P. D. Hill and S. Neti, Automated inspection of sandpaper products and processing using image processing, SPIE Vol. 1197 Automated Inspection and High-Speed Vision Architectures III, 1989, 191-201.
[46] V. Brecher, New techniques for patterned wafer inspection based on a model of human preattentive vision, SPIE Vol. 1708 Applications of Artificial Intelligence X, 1992, 452-459.
[47] K. Y. Song, M. Petrou and J. Kittler, Texture crack detection, Mach. Vision Appl. 8 (1995) 63-76.
[48] O. Silvén and H. Kauppinen, Recent developments in wood inspection, Int. J. Pattern Recogn. Artif. Intell. 10 (1996) 83-95.
[49] P. Wamback, M. Mahy, G. Noppen and A. Oosterlinck, Visual inspection in the leather industry, in Proc. IAPR Workshop on Computer Vision, Tokyo, Japan, 1988, 153-156.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 737-764
Eds. C. H. Chen, L. F. Pau and P. S. P. Wang
© 1998 World Scientific Publishing Company
CHAPTER 4.3

CONTEXT RELATED ISSUES IN IMAGE UNDERSTANDING
L. F. PAU
Ericsson, PO Box 1505, S-125 85 Älvsjö, Sweden

This chapter gives a formal model for scene understanding, as well as for context information; it helps in adapting image understanding procedures and software to varying contexts, when some formal assumptions are satisfied. We have defined and formalized context separation and context adaptation, which are essential for many applications, including achieving robustness of the understanding results in changing sensing environments. This model uses constraint logic programming and specialized models for the various interactions between the objects in the scene and the context. A comparison is made with, and examples are given of, the context models in more classical frameworks such as multilevel understanding structures, object based design in scenes, the knowledge based approach, and perceptual context separation.

Keywords: Image understanding, computer vision, context models, scene models, constraint logic programming, object oriented design, Prolog, object recognition, noise filtering.
1. Introduction

1.1. Context Definitions

Biophysics as well as perceptual studies, and also image understanding/computer vision research [7,46], continuously stumble over the issue of context adaptation and separation: "Given a well defined image understanding task pertaining to objects or dynamic events in a scene, how do we render the formalization and implementation of this task independent from the scene and its varying parameters?"

Context adaptation means the ability to redesign an image understanding task and software by removing the context dependent information about one known context, and updating it with similar information for another known context. Context separation means the ability to design an image understanding task not knowing the context and calibration information (e.g. when mapping 2-D information into 3-D information [27,36]). If, exceptionally, only a finite number of contexts are assumed possible, and if context separation applies, then context recognition is the task of identifying in the specific scene the applicable context selected from that finite list.
If, furthermore, context adaptation applies, then image understanding redesign can be carried out for the recognized context. In general, however, the number of possible contexts is infinite, even if scene, sensor, object and geometric calibration models are applied. The fundamental reasons for this are phenomena, geometric projections, or processes which violate the underlying required context separation:

• Object-object interactions with direct effects on the context, e.g. geometric occlusion.
• Object-context and context-object interactions, each obeying causality relations, e.g. shadows from objects onto the environment or, vice versa, from the environment onto the objects in the scene; this becomes even more complex when non-visible phenomena are taken into account, such as induced irradiation.
• Context unstationarity, due to random or slowly changing events, e.g. weather or failure in the lighting systems.

Four basic approaches to context adaptation have been taken so far:

• multilevel image understanding structures
• object oriented design in scenes
• knowledge based approach
• perceptual context separation
as sometimes exemplified in defect recognition in machine vision, obstacle avoidance modeling in robot navigation [17], aerial imagery [15,18], and a diversity of other areas.
1.2. Multilevel Image Understanding Structures

The ability to represent information extracted from image data in a multilevel knowledge structure facilitates the hierarchical analysis needed for image understanding (object detection, location, and completion of the understanding task at hand). In the now classical approaches, intermediate-level image processing operators (typically region and shape related) invoke lower-level operators (typically registration, feature extractors and measurements), whose results are then passed to higher-level operators to derive complex relations between objects and their task related meanings.
1.3. Object Oriented Design in Scenes

More recently, in relation to implementations of the previous multilevel understanding structures, object-oriented design has attempted to group similar low-level features or middle-level elements (such as regions or neighborhoods), while also separating those which are context related, into classes corresponding immediately to a hierarchical representation. In addition, the programming environments selected for the software solutions provide polymorphism, inheritance and encapsulation. This has been easing the updating of the instances and operators (methods) when
changing contexts and tasks or objects. Method and object inheritance [13,16] offer a code-sharing alternative to supplying special-purpose operators for handling specific classes of objects.
1.4. Knowledge Based Approach

Work has also taken place to break the hierarchical understanding structure by having a fully-fledged knowledge based system reason about all allowable combinations of low-level, intermediate-level, and higher-level concepts or objects, but by introducing different depths into the selection and search according to the nature and ontology of these concepts or objects (e.g. [9]). Fundamentally, the image understanding task has become a goal, and backward inference is carried out to search for evidence along each possible explanation path. There was in this type of work an implicit hope, now largely lost, that the knowledge base itself could be segmented into context related and context independent pieces of knowledge; for example, it was hoped that generic rules would apply to, e.g., illumination, object centered geometrical transformations, clusters of physically related objects, etc.
1.5. Perceptual Context Separation

Perception and psychophysics research sometimes suggest that image understanding tasks can almost always be carried out by the human, with the exception of illusions, thanks to the filtering out of perceptual cues about the task irrespective of the context, thus carrying out at once both context separation and adaptation. For example, car driving in day time and night time relies on perceptual image flow and object distance cues which are analyzed equally well in both contexts. However, no one yet knows how to implement and generalize these perceptual cues or groupings (also called collated features), apart from some simple scenes or processes. Some consideration has gone into using neural or Hopfield networks to coalesce and discriminate perceptual groupings such as edges and gaps between lines, line intersections, parallels, U's, rectangles, etc. The hope here was to be able to add or remove them according to the geometry of the problem and of the context in general. In such neural networks, after training on the understanding task, weights are learned for the links in a global competition between collated features.

1.6. Review

In the best case, image understanding work has focussed on representation and control issues [24,30,35,42], such as those related to semantic network representations and their execution on distributed architectures; some work has focussed on the opposite of context modeling, that is, designing universal application development environments able to cope with all kinds of specifics. As a result, context modeling has been largely ignored. When the context was not or was insufficiently known, the hypotheses were simply ranked by experimental tolerances or assigned likelihoods.
Experience proves that none of the first three approaches above, or combinations hereof, can deliver context adaptation and separation, except for very simple problems and scenes. The multilevel image understanding structures fundamentally cannot allow for lower-level feature parametrization and detection in changing contexts, even if mixed forward-backward reasoning is applied between levels; and the more cluttered the environment, the more numerous the interactions between levels. The object oriented design in scenes is useful in highly structured environments, and especially for those understanding tasks which concern themselves with just a few objects with few interactions with the context (e.g. shadows, reflections, changes in reflectance). But that design is even more hierarchical, and thus more rigid, than the previous approach, and suffers from all the drawbacks hereof as well. The knowledge based approach suffers from the well known knowledge elicitation and accumulation problems; it is probably a never-ending process to acquire heuristic information or models to cover all possible separate context relations and context related processes. Perceptual context separation in humans and animals seems very powerful, especially as it achieves robustness versus perceptual deficiencies and anomalies. It is certainly a research goal of high importance to be able to formalize it, yet little is understood about how to implement it for complex tasks. In [51] contextual reasoning from text is analyzed, which has parallels with reasoning about scenes.
1.7. Plan

In this chapter, we will formalize the context adaptation and separation problems, in both theoretical and practical ways, which have been shown to help significantly, although not yet to resolve the full range of context related issues. The formal models for the scene and context, and especially for their interactions, are given in Sec. 2. Sections 3, 4 and 5 illustrate those models, stressing especially the context modeling, via one example from car recognition in traffic images, by taking successively the multilevel understanding approach, the object oriented design of the same, and finally by addressing some perceptual context cues. Conclusions are given in Sec. 6, while two appendices give introductory definitions or explanations about object oriented design and constraint logic programming.
2. Formal Image Understanding and Context Description

2.1. Approach

The basic approach proposed here is to:

(i) provide a formal description of the scene images via logic;
(ii) assume massive parallelism in both the spatial description as well as in the processing/understanding;
(iii) describe each context as a set of logic predicates and constraints propagated through the formal description (i);
(iv) model the basic object ↔ context interactions.

The need for the formal description is due to the context separation requirement; the need for the massive parallelism is due to the implementation requirement; while the need for the context model description and interaction modeling is due to the context adaptation requirement. It should be highlighted right away that image context simulation from physical processes has made significant progress, as evidenced by the flurry of image synthesis applications, and that they all contribute to the content of context modeling via call in-call out facilities to a battery of physical or other behavioral models.

In the following the notation in logic shall be the one from the Prolog language [11], although the image understanding task solution implementation may very well be done later on in other languages. The notation a ⇒ b is a predicate saying that a is true iff b is true; the notation A ← B is a rewriting rule saying that the list B is syntactically rewritten as A. The arity of a predicate is the number of arguments it has.

One fundamental remark is also that, although difficult to achieve, the goal of these scene and context models is to help in image understanding tasks even in unstructured environments; this leads to the use of rather general context information data structures, i.e. the causal graphs and influencing domains (as defined below). This of course would be impossible unless the physical and causal image formation processes are taken into account, and therefore we have to assume known the range of such processes existing in a given scene. The other assumption is that the image understanding task relates to objects which are significant in terms of their overall presence in the scene, as e.g. measured by the total solid angle of these objects from the point of view of the sensor viewing the scene.

2.2. Scene Model
Assuming in general a four-dimensional space (x, y, z, t), each sensed pixel gray value/color code "pixel (x, y, z, t)" is true iff its value is true, which in logic corresponds to the fact/statement: pixel (x, y, z, t). To each location (x, y, z, t) is attached a causal graph G of all other locations having an influence on its pixel code value/color. This dependency is explicitly shown by increasing the arity of the "pixel" predicate:

    pixel (x, y, z, t, G) .
The graph G is built from the causal influences mapped out pairwise on an influencing domain D (x, y, z, t, p):

    influenced (x, y, z, t, x', y', z', t') .
and this predicate is true iff (x', y', z', t') belongs to D (x, y, z, t, p), where p is a causal process type. The influence "influenced" can in general not be related to a single process p, as the paths in the graph G (x, y, z, t) leading to the location (x, y, z, t) may travel through a sequence of pixels each influenced by the previous pixel, but in different ways.
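To make the scene model concrete, the sketch below states it directly as Prolog facts and rules; every name in it (the grey-value term, the in_domain helper, the 5-pixel neighbourhood) is an assumption introduced for illustration only, and is not part of the environment described later in this chapter.

    % Sensed pixel codes are stated as facts; the last argument carries the value.
    pixel(10, 12, 0, 3, grey(117)).          % hypothetical sensed value at (x, y, z, t)

    % A causal process vocabulary {p} (the default set given in Sec. 2.3 below).
    causal_process(lighting).
    causal_process(sensor).
    causal_process(optics).
    causal_process(shadows).
    causal_process(orthogonal_projection).

    % influenced/8 is true iff (X1, Y1, Z1, T1) belongs to the influencing
    % domain D(X, Y, Z, T, P) for some causal process P.
    influenced(X, Y, Z, T, X1, Y1, Z1, T1) :-
        causal_process(P),
        in_domain(P, X, Y, Z, T, X1, Y1, Z1, T1).

    % One assumed domain description: a shadow-casting location within a
    % 5-pixel neighbourhood at the same depth and time can influence the pixel.
    in_domain(shadows, X, Y, Z, T, X1, Y1, Z, T) :-
        abs(X - X1) =< 5,
        abs(Y - Y1) =< 5.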
2.3. Causal Processes p

The range of causal processes {p} is not bound, as they may be physical, relational, geometric, qualitative, behavioral [50], or model-based. It is here assumed that the same range of causal processes applies to the context related information. A "default" set of causal processes to be considered is:

    {p} = {lighting, sensor, optics, shadows, orthogonal-projection} .

In [48] an example is given of such process models in a simple case of context separation, irrespective of the image understanding task relating to objects in the scene.

2.4. Image Understanding Task
The image understanding task [33,35] is then a goal I to be satisfied in the scene in view of a finite number n of logical conditions applying to sets of pixels in the scene:

    I ⇒ cond-1 (pixel (.)), cond-2 (pixel (.)), ..., cond-n (pixel (.))

or equivalently via the composite condition (applying for example to a composite region [29,33]):

    I ⇒ cond (pixel (.)) .
The n logical conditions cond-1, ..., cond-n are here treated as constraints in a constraint logic programming framework (see Appendix B). The understanding process itself is then the search process S (set-of (pixel), set-of (G), set-of (D), set-of (p)) needed to establish the previous goal I as true or false. The predicate "set-of" is self-explanatory. For reasons of clarity we assume here that S is the sequential ordered list of all nodes traversed in the total image, although of course a fundamental assumption made here is that a massively parallel architecture is used and reflected by a propagation scheme in this architecture. In an earlier work [21], the satisfaction of the conditions cond (pixel (.)) was defined as a truth maintenance problem [31,32], in view of sensor fusion and of the disambiguation of scene contexts in a three-dimensional fusion task.
2.5. Context Model

The context is another massively parallel field "context-pixel (x, y, z, t)" with the corresponding causal graphs "Context-G (x, y, z, t)" and influencing domains "Context-D (x, y, z, t, p)".
We can then formalize the basic assumptions and definitions:

(i) there is context separation iff the following implication holds true:
    {I ⇒ cond (pixel (. , . , . , Context-G)), for all Context-G}

which means that all constraints "cond-i (pixel (.))" are independent of all "Context-G" for the range of processes {p}.

(ii) context adaptation, assuming context separation, can be carried out by the following rewriting process:

    S (set-of (pixel), set-of (Context-G), set-of (D), set-of (p)) ← S (set-of (pixel'), set-of (Context-G'), set-of (D'), set-of (p))

where the primed symbols pertain to the same problem/goal but in an old context.
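As a rough operational reading of these two definitions (the predicate names below are assumptions, not the chapter's library), the separation test checks the task goal against every candidate context graph, and adaptation is a plain term rewriting of the search process:

    % Goal is a closure taking one extra argument, the context graph.
    % Separation holds when the task goal succeeds under every candidate Context-G.
    context_separated(Goal, ContextGraphs) :-
        forall(member(ContextG, ContextGraphs),
               call(Goal, ContextG)).

    % Adaptation, once separation holds: the terms describing the old context
    % are rewritten into those of the new one, leaving pixels and processes alone.
    adapt_context(s(Pixels, _OldContextG, _OldD, Processes),
                  NewContextG, NewD,
                  s(Pixels, NewContextG, NewD, Processes)).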
2.6. Interaction Models: Object ↔ Object

Defining an object is done easily by defining the "influenced" predicate for influencing domains D (. , . , . , object-name) covering the spatial and time extent of this object. Context objects are defined equivalently. The influence of an object on a context object, or vice versa, is a predicate and rewriting rule, which in the most general form is in two parts:

    new-D (. , . , . , new-object-name) ⇒ intersect (D (. , . , . , object-name), Context-D (. , . , . , context-object-name)).
    new-object-name ← (object-name context-object-name)

where the "intersect" predicate says whether the two sets indeed intersect. The last rewriting rule is a possible, but not compulsory, object relabeling.

Example: Intersections of two objects
We find the intersections between objects A and B ; these intersections divide the boundary of each object into contour segments. The contour segments of each object are then assigned to one of three disjoint sets, one containing segments that lie outside the other object, one containing segments that lie inside the other object, and one containing segments shared by the two objects. The relations between various collated features are represented in the context-graph G, which is labeled as a causal graph, in such a way that collated features which support each other perceptually are connected via positively weighted links, while mutually conflicting collations are linked via negatively weighted links. The following cases exist, each modeled by specific predicates, rewriting rules, and attribute changes eventually described by an attributed grammar [22]; the
definitions below apply to any pair of objects, but we are especially interested in the case where A is a real object and B a context-object:

(i) Subsumption. If the outside-segment set of a shape A is empty, and the shared-segment set non-empty, and the edge support for segments in the inside-segment set is non-existent, then we say that object A is subsumed by object B, and can be removed.
(ii) Occlusion. If the contour segments of A inside B have strong edge support, and those of B inside A have weak intensity edge support, then A occludes B. This applies even if the rest of the contour segments of A and B belong to the shared set or outside set.
(iii) Merger compatibility. If the segments in the inside-segment and shared-segment sets for both objects A and B have poor edge support, then A and B represent the segmentation of one object into two parts, and can thus be merged into one object.
(iv) Disconnected. If A and B have null inside-segment sets and null shared-segment sets, they are disconnected. If A and B have a non-empty shared-segment set, and null inside-segment sets, but the shared segments have good edge support, then A and B are still unrelated though adjoining.
(v) Incompatible. If A and B have non-empty inside-segment sets and the elements of the inside-segments of both A and B have strong edge support, then at least one of A and B is a wrong structural grouping and must be deleted.
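A hedged sketch of how the first two cases might be encoded; segment_sets/5 (returning the outside, inside and shared contour-segment sets of the first object with respect to the second) and edge_support/2 (an evidence score between 0 and 1) are assumed helpers, and the numeric thresholds are illustrative only:

    % (i) Subsumption: A has no outside segments, shares part of its boundary
    % with B, and its inside segments have no edge support, so A can be removed.
    subsumed(A, B) :-
        segment_sets(A, B, [], Inside, Shared),
        Shared \== [],
        edge_support(Inside, S),
        S =:= 0.

    % (ii) Occlusion: the segments of A inside B are strongly supported by
    % intensity edges, while those of B inside A are not, so A occludes B.
    occludes(A, B) :-
        segment_sets(A, B, _, InsideA, _),
        segment_sets(B, A, _, InsideB, _),
        edge_support(InsideA, SA), SA > 0.8,     % "strong" support (assumed threshold)
        edge_support(InsideB, SB), SB < 0.2.     % "weak" support (assumed threshold)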
2.7. Interaction Models: Object ↔ Context
The influence between an object and the context is a predicate and rewriting rule, which in the most general form is in two parts:

    pixel (. , . , . , new-G) ← (pixel (. , . , . , G) pixel (. , . , . , Context-G))
    new-D (. , . , . , object-name) ⇒ intersect (D (. , . , . , object-name), Context-D (. , . , . , object-name)).

which shows that the pixel value or code may be changed because of the change in the influenced domains.
2.8. Interaction Models: Context Unstationarity

This unstationarity is of course first caused by the stochastic point processes linked to the location (x, y, z, t) and thus to the domains G and D. The latter are the most interesting, as they combine stochastic deviations from spatial stationarity and from temporal stationarity to modify the influences. In practice, it is indeed very difficult to have or estimate the characteristics of these point processes, and thus to compensate for them in the search processes S. The simpler case is when (x, y, z) is deterministic but t is an independent variable driving the scene, task and context.
2.9. Context Causal Graph G Operations

The context causal graph Context-G (.) can be manipulated by standard predicates operating on that causal graph and its attributes. It can for example be built (see [25,38] for complete predicate definitions) using:

(i) causal graph merger, and adjacencies;
(ii) coalescence by graph join operations corresponding to overlaps between image scene contexts, with respect to an angle of view and perspective transformations;
(iii) a perception graph for the context, resulting from joining all context graphs for context-objects;
(iv) extensions to sensor fusion tasks [21,49].
2.10. Constraint Based Languages as Resolution Strategies

Once the image understanding task has received a formal description as above, the big question is of course how to synthesize the search processes S. Here is where the impact of a new research field is felt the most, that is, of constraint based logic programming (see Appendix B for an introduction). These languages do exist and are in use in the industrial world under trade-names such as: Prolog III, CHIP, CHARME, PRINCE, etc. [1-6]. They include constraint domains (as formalized via the constraints cond (pixel (.))) which can be finite or infinite trees [5,14], linear algebras with infinite precision or rational numbers, boolean algebras, and lists. They also allow for domains such as finite domains (as related to the objects or the influencing domains D), and interval arithmetic (for pixel value rewriting rules such as most gray value "mixing" operations or thresholding). These languages also include the constraint solving algorithms right in their kernels (see Appendix B), while maintaining the declarative nature of the goal I and of all the predicates "influenced". Most of these languages have pre-compilers or compilers, which is most appreciated in applications development; the search strategies S may be synthesized interactively in the interpreted mode, or compiled [28], with all the underlying constraint propagation carried out by the constraint solving algorithms. One area still unexplored is their implementation on massively parallel storage and processing architectures, although Digital Equipment is collaborating with some research partners on this subject.
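As an illustration only (the library name, predicate and numeric bounds below are assumptions, and the system shown is not one of the products listed above), a finite-domain constraint on a binarization threshold can be posted declaratively and left to the kernel's constraint solver:

    :- use_module(library(clpfd)).

    % Choose a threshold T separating an assumed background mean (40) from an
    % assumed object mean (200) with a 30-grey-level guard band on each side;
    % label/1 only enumerates values already consistent with the constraints.
    threshold(T) :-
        T in 0..255,
        T #>= 40 + 30,
        T #=< 200 - 30,
        label([T]).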
2.11. Implementation of the Context Adaptation and Separation

This implementation follows from the logic formalism and problem specification described above:
• If there is no context separation over {p}, the assertion of the context separation definition will be false. This "fail" can in turn be used to authorize or deny further rewritings which assume this separation, via the standard "/" predicate, or via the delayed "dif" predicate [11]. If the "fail" happens, then the developer
has the option to change the vocabularies in {p} and change the range and types of processes.
• If there is context adaptation and this can be carried out, the simple rewriting rule of the context adaptation definition applies. This can be further eased by separating out all predicates and data structures for each Context-G and Context-D in separate "worlds" or "modules" of the asserted predicate knowledge base. This segmentation is precious for context adaptation and modularity. Incidentally, Prolog allows us to write very simply regular grammars, and others, to implement the rule rewriting.
It should also be noted, and this is very important in practice, that consistency of all "influenced" predicates is maintained, as dynamic updates in the "worlds" or "modules" will check for possible tautologies/contradictions and reject any that occur.

2.12. Time Dependencies in the Context
It is worth underlining once more that the context graph Context-G (x, y, z, t) and the influencing domain Context-D (x, y, z, t, p, context-object-name) are both time dependent. This is mandatory as the context-objects move, and also as the causal graphs G change over time. In the threat assessment [20], scene monitoring [34], or target tracking problems [20,37], there is an allowed domain attached to the transitions over time between spatial zones occupied by the objects in the scene, corresponding to constraint "scripts".

2.13. Comparison
This model is much more formal and powerful than the approaches surveyed in Sec. 1, except perceptual context separation. It opens one way to the latter by having "retinas" specified via the influencing domains D, and matching vision processes (see Sec. 5). In Sec. 3, an example will show how easily the multilevel image understanding structures can be represented, with knowledge bases also, and Sec. 4 will show how object oriented design can be incorporated if needed.

3. Multilevel Image Context Representation in a Logic Programming Environment

This section describes in a simple case how a simple context information model can be combined with multilevel image understanding, as discussed in Sec. 1, to carry out a simple recognition task. The task I is to recognize car objects in a real-life scene (see Fig. 9). Extensions have been made to three-dimensional scenes in [21], with a full example therein.
All image processing predicates mentioned below are available as Prolog [11] predicates in an environment described in [8,38], and organized into a three-level hierarchy summarized in Table 1. The car objects are defined by clauses too, with orientation as a parameter (see Fig. 1). It should be noted that all predicates affected by the context are found in the upper Context level layer, and only there. About the implementation of the search process S, using the Prolog unification algorithms, see [2,8] for extensive details.
Table 1. Hierarchy of principal image operators. All listed here are predicates which allow for unification/backtracking. Parameters are explained in Sections 3 and 4.

    CONTEXT LEVEL
        car (Img, Theta, ObjList)                 /* ObjList forms a car in Img at angle Theta */
        carSide (W1, W2, Theta)                   /* W1, W2 have similar orientation Theta */
        carCorner (W1, W2)                        /* W1, W2 are roughly perpendicular */
        carWindow (Img, ObjNum, W)                /* car window W has pixel value ObjNum */

    APPLICATION LEVEL
        trapezoid (Img, ObjNum, T)
        parallelogram (Img, ObjNum, P)
        rectangle (Img, ObjNum, R)
        quadrilateral (Img, ObjNum, Q)
        orderSides (W, Edges)                     /* Edges = top, bottom, left, right sides of W */
        aspectRatio (W, Ratio)                    /* Ratio is Perim**2 / Area */
        orientation (W, Or)                       /* based on variance ratios */

    FEATURE (LOW) LEVEL
        binarize (Src, Dest, Thresh)
        lowPass (Src, Dest, Iter)
        traceContour (Src, Dest, Obj, DIR)        /* follows edge in DIR direction */
        closedPolygon (Corners)
        locateCorners (Img, ObjNum, Corners, [SCAN, Thresh, COUNT])
    car(Img, Orientation, [W1, W2, W3]) :-
        carWindow(Img, PixVal1, W1),
        nextTo(Img, W1, W2),
        carWindow(Img, PixVal2, W2),
        carSide(W1, W2, Orientation),
        nextTo(Img, W1, W3),
        carWindow(Img, PixVal3, W3),
        carCorner(W1, W3).
Fig. 1. Example clause within 'car' predicate, showing one allowable transformation.
3.1. Lower Level Predicates: Image Features
A common characteristic of the predicates at this level is that each is task ( I ) and context (G) independent: edges, vertices, geometric features, etc. Predicates
such as binary threshold and low-pass filter predicates preprocess the original image pixel gray level values. It is however object registration, scaling, labeling, contour tracking, and corner detection which are the principal means for extracting information at this level. The attributes derived include, for each labeled object/region, its area, perimeter, centroid, x- and y-variances, chain-coded contour, and the number and locations of corners. Descriptions of the labeling and contour tracking algorithms are provided in [8]. For example, the general form of the corner location predicate could be [8]:

    locateCorners (Src, Dest, ObjNum, Corners, [SCAN, Thres, COUNT])

in which Src, Dest are source and destination atoms, ObjNum is the object label number/name, Corners is the corner pixel location pointer pixel (x, y, z, t), and Thres is the gray level threshold for ObjNum. SCAN and COUNT are parameters which can be bound, unbound, or constrained. According to the way these bindings are specified at query time by constraints "cond (.)", one can easily get, declaratively, answers to questions such as (see Fig. 2):

• How many corners COUNT can be found using a scan window length of SCAN = 22 pixels?
• What scan length SCAN should be used to find exactly COUNT = 4 corners?
and any combinations of similar questions, including on other atoms.

    locateCorners (Src, Dest, ObjNum, Corners, [SCAN, Thresh, LINECOUNT]).

    (a) locateCorners (Src, Dest, ObjNum, Corners, [12, 0.75, LINECOUNT]).
        LINECOUNT = 5

    (b) locateCorners (Src, Dest, ObjNum, Corners, [SCAN, 0.75, 4]).
        SCAN = 12 ;
        SCAN = 16 ;
        SCAN = 20

    (c) locateCorners (Src, Dest, ObjNum, Corners, [SCAN, 0.75, LINECOUNT]).
        SCAN = 6, LINECOUNT = 5 ;
        SCAN = 8, LINECOUNT = 5 ;
        SCAN = 12, LINECOUNT = 4 ;
        SCAN = 24, LINECOUNT = 3
Fig. 2. Three predicate calls to illustrate effects of parameter bindings.
The search process yielding answers to these questions is based on unification and backtracking, and is explained in [8,10,38]. Essentially, unification can not only trigger a search for a suitable parameter value satisfying the goal constraints, but
Fig. 3. Class allocation control structure: a graph linking the goal "list-of-car-models-found" to the "car-model (Img, Y, V)" instances.
can also match the unbound parameter against all possible values which satisfy the predicate's constraints.

3.2. Intermediate Level Predicates: Application Dependent Predicate Knowledge Base

The predicates defined at this level are all I-application specific and are employed to identify car windows as objects belonging to a particular class of shapes (trapezoid, rectangle, parallelogram), and with shape attributes (e.g. aspect ratio) conforming to specified ranges of values for cars. In addition, allowable transformations are defined, to permit the classification of shapes with incomplete contours. Spatial relationships between pairs of objects in the cars are also specified here; thus we see at this level predicates for identifying adjacent objects in the cars, objects with the same orientation, etc. The same predicates can exclude objects which do not belong to a car although belonging to the same class of shapes. These predicates altogether constitute what might be called a formal specification of the object model from regions [40], the objects being cars and car elements. They should be stored in a separate predicate base, or world; the use of such worlds is mandatory in sensor fusion tasks [49]. In [8,38] examples are given in detail as to how the search by unification/backtracking allows detection of the shapes of the object model, if any exist, and how to adapt the parameters as explained above for the lower level predicates:

• list of edges for a shape [8];
• region attributes for context dependent regions, producing attributes of the influencing domains Context-D.
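A minimal sketch of one such application-level predicate, composed from the operators of Table 1; the admissible aspect-ratio range is an assumed value for illustration, not the one actually used in [8,38]:

    % A region W is a candidate car window if it is a quadrilateral whose
    % aspect-ratio attribute (Perim**2 / Area) lies in an assumed admissible range.
    carWindow(Img, ObjNum, W) :-
        quadrilateral(Img, ObjNum, W),
        aspectRatio(W, Ratio),
        Ratio >= 18,
        Ratio =< 30.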
    class polygon (Image, ObjNum)
        checks ( get_cornerData (Image, ObjNum, CornerList),
                 closedPolygon (CornerList) )
        body   img (Image) => ( ! ) .

    quad (Image, ObjNum) class polygon
        checks ( get_cornerNum (Image, ObjNum, 4),
                 orderEdges (Image, ObjNum, Edges),
                 assert_once (quadrilateral (Image, ObjNum, Edges)) )
        body   edges ([T, B, L, R]) =>
                   clause (quadrilateral (Image, ObjNum, [T, B, L, R]))
               &
               topBottomTheta (Theta) =>
                   clause (quadrilateral (Image, ObjNum, [T, B | _])),
                   transOrigin (T, B, Tnew, Bnew),
                   rotateToXaxis (Tnew, Bnew, Theta) .

    creation     new (quad (image01, 141), Quad01)
                 Quad01 # topBottomTheta (ThetaQ01)
Fig. 4. Example class and subclass definitions, showing object creation and message passing. 'Body' predicates (methods) are inherited to subclasses. 'Checks' are also inherited, meaning that the call shown to create an instance of 'quad' will first evaluate the checks in 'polygon' and then in 'quad' before completing the instantiation.
3.3. Higher Level Predicates: Goal Satisfaction Constraints
At this level, coexisting in a multilevel representation are two very different types of predicates:

• the conditions "cond-i" which are constraints defining the image understanding task I;
• all context related predicates, such as "influenced", "Context-G", "Context-D", etc.
As to the constraints, they essentially specify, in the specific case at hand, that an object may be classified as a car of a given car model/type if all component car windows are spatially dispersed in a certain way, specified for example by the constraints of the following car detection goal:

    I ⇒ cond (Img (.)).
    I = car (Img, Orientation, [W1, W2, W3])
    cond-1 (Img (.)) = carWindow (Img, PixVal1, W1), nextTo (Img, W1, W2).
    cond-2 (Img (.)) = carWindow (Img, PixVal2, W2), carSide (W1, W2, Orientation), nextTo (Img, W1, W3).
    cond-3 (Img (.)) = carWindow (Img, PixVal3, W3), carCorner (W1, W3).
I + cond(Img(.)). I = list ( Y1, match (Img, Y1 , YZ), List-car-types). cond-1 (Img (.)) = different (Yl, Y2), car-model(Img, Y1, V l ) , car-model (Img, Y2, VZ), element-of (Yl, list-car-models), element-of (Y2, list-car-models). cond-2 (Img (.)) = car-model (Img, Y1, Vl). cond-3 (Img (.)) = car-model (Img, Y2, V2). cond-4 (Img (.)) = equal (Vl, [Thetal, Cornersl, SCAN, Threshl, COUNTl]). cond-5 (Img (.)) = equal (V2, [ThetaZ, Corners2, SCAN, Thresh2, COUNT21). In the above, the search process S will return the list of car models seen in the image “Img” , by ensuring that these car instances are spatially distinct, meaning that there is yet no object +) object interaction. The car-model predicate will first unify the unbound V parameter lists before it can backtrack with one or several solutions. The predefined “different” predicate should guarantee the difference at the term level, and spatial distinction. Figure 5 illustrates the class allocation constraints nesting and hierarchy, and is made graphical to help in object oriented design. We have however to analyze this predicate “different” in more detail below. 3.4. Context Model
The above goal satisfaction constraints are still not fully specified, precisely because the "different" predicate referred to above is obviously context dependent, as it pertains to spatial distinction and non-occlusion, which falls into the class of object ↔ object interaction models discussed in Sec. 2. More precisely, the context separation in a multi-object recognition task can only be achieved by:

• first, a scene model for isolated single objects alone;
• next, a context model for each additional object;
[Fig. 5 diagram: the goal tree rooted at car (Img, Theta, ObjList), expanding through the carWindow, carSide, trapezoid, locateCorners and traceContour subgoals.]
Fig. 5. Example. Invoking the 'car' predicate sequentially activates its subgoals. When the first call to 'carWindow' succeeds, W1 is bound to an object in 'img' with pixel assignment PixVal1. The subgoal 'nextTo' succeeds in finding a neighbor object, unifying it with W2. Now suppose that the second call to 'carWindow' succeeds (visiting the nodes shown in the expansion), but 'carSide' fails. Backtracking returns to the second 'carWindow' call, but since W2 is still bound, the search does not look for a new object. Instead, 'locateCorners' is invoked via backtracking to analyze the object with new scan parameters, to derive a new corner placement which may satisfy the 'carSide' constraints. A new object would be searched for if backtracking returned to 'nextTo', freeing W2 to be unified with a new neighbor object.
so that each isolated car is located and recognized individually, and overlapping cars are treated as an object ↔ object interaction as formalized in Sec. 2. Reusing the goal specification for the car recognition task, the scene model in which the isolated car must be located and recognized is car-model (Img, Y1, V1). The context model for this object ↔ object interaction perfectly fits the
generic definition:

    new-D (. , . , . , new-object-name) ⇒ intersect (D (. , . , . , object-name), Context-D (. , . , . , context-object-name)).
    new-object-name ← (object-name context-object-name)

provided the application level specifies the predicates or definitions:

    object-name = car-model (Img, Y1, V1)
    context-object-name = car-model (Img, Y2, V2)
    different (Y1, Y2) ⇒ or (diff (new-object-name, Y1), diff (new-object-name, Y2)).

In case of spatial overlap, the rewriting rule must be defined otherwise, e.g. in relation to the attributes of the domains D-1 and D-2 and thus of the domain "new-D" (see the formal definitions above leading to the corresponding formal predicates). In this car recognition example, the influencing domains D-1, D-2, new-D are explicitly described by the simple causal process of spatial, stationary neighborhood, as explicated by the predicates "nextTo" in relation to the car windows. In this example too, it is especially powerful to use the causal graphs G (.) and Context-G (.) to represent all possible relative attitudes of the "car-windows" and "car-corners" with respect to the sensor. For example, it is obvious that the car detection goal "car (Img, Orientation, [W1, W2, W3])" made explicit above corresponds to one such causal graph. In this specific example, as discussed in [8], there is no object ↔ context interaction, nor context unstationarity:

• The object ↔ context interaction would apply both to shadows cast by the isolated cars, as well as to reflections between cars and the physical surroundings. However the causal graphs G and Context-G allow for the filtering out of the shadow seen in Fig. 9 (once compared to Fig. 8), and no further object ↔ context interaction model is needed.
• The context unstationarity would be capturing the randomness in the car speeds, but not speed itself (as pixel (. , . , . , t) encapsulates time).
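One possible reading of the "different" predicate is sketched below with an assumed bounding-box helper; the real environment works on the influencing domains D-1, D-2 rather than on boxes, so this is only an approximation of the spatial-distinction constraint:

    % different(Y1, Y2): two recognized car instances are spatially distinct if
    % their bounding boxes do not intersect. bounding_box/2 is an assumed helper
    % returning box(Left, Bottom, Right, Top) for a recognized instance.
    different(Y1, Y2) :-
        bounding_box(Y1, box(L1, B1, R1, T1)),
        bounding_box(Y2, box(L2, B2, R2, T2)),
        ( R1 < L2 ; R2 < L1 ; T1 < B2 ; T2 < B1 ).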
3.5. Constraint Resolution Engine for Object Classification
We can illustrate, in relation to the example treated in this section, how a constraint satisfaction engine would operate for the car classification goal I. We will further particularize this procedure, borrowed from [8], by jointly treating scene objects such as cars and context objects such as trees or road signs, assuming context separation. Consider "obj" and "c-obj" as facts belonging to the lists "m" and "c-n" respectively, where the prefix c- applies to context information, and "a" is the context. The fact "r" as observed in the scene with its context will be resolved by the following
predicate base, which constitutes a constraint resolution engine for the constraint that the object "obj" is found in that context "a":

    (i)   r ⇒ obj c-obj.
    (ii)  class (r, a.l) ⇒ obj c-obj class (obj, m) class (c-obj, c-n) element-of (a, m) element-of (a, c-n).
    (iii) class (r, nil) ⇒ fail.
    (iv)  delete (r, obj c-obj) ⇒ class (r, nil).
    (v)   add-predicate (no (r), obj c-obj) ⇒.

where:

• the first rule (i) belongs to the goal specification in the predicate base;
• the other rules (ii)-(v) belong to the constraint satisfaction engine, deleting inconsistent facts in the predicate base and adding conditions by (iv) and (v) to get smaller consistent fact lists;
• the notation "a.l" designates the list made of "a" followed by "l", and "nil" the null fact; the other predicates are self-explanatory.
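The same filtering idea can be approximated in plain, runnable Prolog; the fact and predicate names below are stand-ins chosen for this sketch and do not come from the engine of [8]:

    :- dynamic scene_fact/2, rejected/1.

    scene_fact(obj, car).            % an object classified into the list m
    context_fact(c_obj, road).       % a context object classified into the list c-n
    allowed(road, car).              % assumed compatibility between context and object classes
    allowed(road, truck).

    % An object classification is consistent if some context class allows it.
    consistent(Obj) :-
        scene_fact(Obj, Class),
        context_fact(_, Context),
        allowed(Context, Class).

    % Rules (iii)-(v) in spirit: delete the object fact when no consistent
    % classification remains, and record the rejection in the predicate base.
    filter(Obj) :-
        (   consistent(Obj)
        ->  true
        ;   retract(scene_fact(Obj, _)),
            assertz(rejected(Obj))
        ).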
4. Image Understanding and Object Oriented Context Modeling
This section deals with the issue of showing how object oriented design can be used for context modeling, and the limitations hereof (see Sec. 1). To ease the reading, this is exemplified by extending the example of Sec. 3. Basic definitions of object oriented design terminology are given in Appendix A.

4.1. Object Oriented Context Representation
Even when using a logic based representation, as in Secs. 2 and 3, it is possible to extend the predicate representations by class definition capabilities, including mechanisms for inheritance and method inheritance. A class definition is composed of two groups of predicates:

• The group labeled "checks" is evaluated when an instance is requested via the predicate "new (<class name>)"; if these predicates are satisfied, the instantiation is successful;
• The group labeled "body" contains the predicates available as methods.

Both groups are inherited to subclasses (see Fig. 4).
The context classes then correspond to classes of "context-object"-names, the classes of influencing domains "Context-D", and the classes of causal processes {p}, as featured via the "Context-D (. , . , . , context-object-name)" term in the formal model of Sec. 2. A more specific example of this, but by no means the only one, is the class of context region shapes, found in a shape library as used in [9].
Method definitions allow a segment of Prolog code to be encapsulated within a class definition and invoked by sending a message to the class instance. Inheritance allows methods defined in a class to be available (inherited) to any of its subclasses. Figure 3 illustrates definitions for a class (e.g. "car models") and subclass, as well as the syntax for object creation and message passing; "body" and "checks" of the corresponding class definitions are illustrated in Fig. 4; also [43] gives another example for ship classification. Methods applicable to context classes obviously include the rewriting rules specified above in the context model of Secs. 2 and 3, for context adaptation and object-context interactions.

    ?- car(0, Theta, ObjList).
     1 CALL car(0,_361,[_452,_454])
     2 CALL carWindow(0,_483,_452)
     3 CALL locCorner(0,1,[_1488,_1490,_1492,img1],_1486)
     4 EXIT locCorner(0,1,[16,0.5,4,img1],[[110,210],[180,170],[190,200],[120,250]])
     5 EXIT carWindow(0,1,trapezoid(0,1,img1))
     6 CALL nextTo(trapezoid(0,1,img1),_473,_494)
     7 CALL locCorner(0,1,[_21696,_21698,_21700,img1],_21694)
     8 FAIL locCorner(0,1,[16,0.5,4,img1],[[110,210],[180,170],...])
     9 REDO locCorner(0,2,[_21696,_21698,_21700,img2],_21694)
    10 EXIT locCorner(0,2,[12,0.5,4,img2],[[200,160],[240,120],...])
    11 EXIT nextTo(trapezoid(0,1,img1),quad(0,2,img2),104.043)
    12 CALL carWindow(quad(0,2,img2),_454)
    13 CALL carWindow(0,2,_454)
    14 CALL locCorner(0,2,[_29070,_29072,_29074,img2],_29068)
    15 EXIT locCorner(0,2,[12,0.5,4,img2],[[200,160],[240,120],...])
    16 EXIT carWindow(0,2,trapezoid(0,2,img2))
    17 EXIT carWindow(quad(0,2,img2),trapezoid(0,2,img2))
    18 CALL carSide(trapezoid(0,1,img1),trapezoid(0,2,img2),_361)
    19 FAIL carSide(trapezoid(0,1,img1),trapezoid(0,2,img2),_361)
    20 REDO carWindow(0,2,trapezoid(0,2,img2))
    21 REDO carWindow(quad(0,2,img2),trapezoid(0,2,img2))
    22 CALL locCorner(0,2,[_29070,_29072,_29074,img3],_29068)
    23 EXIT locCorner(0,2,[18,0.5,4,img3],[[200,150],[240,120],[280,130],[220,190]])
    24 EXIT carWindow(0,2,trapezoid(0,2,img3))
    25 EXIT carWindow(quad(0,2,img2),trapezoid(0,2,img3))
    26 CALL carSide(trapezoid(0,1,img1),trapezoid(0,2,img3),_361)
    27 EXIT carSide(trapezoid(0,1,img1),trapezoid(0,2,img3),-1.47424)
    28 EXIT car(0,-1.47424,[trapezoid(0,1,img1),trapezoid(0,2,img3)])

    Theta = -1.47424,
    ObjList = [trapezoid(0,1,img1),trapezoid(0,2,img3)]
Fig. 6. Example Prolog trace showing toplevel goal 'car' (1); subgoal 'carWindow' identifies an object as a car window (5); 'nextTo' finds a neighboring object (11); neighbor object identified as a car window (17); these two windows fail to satisfy 'carSide' (19); backtracking to 'carWindow' causes corner detection to retry the same object with new parameters and derives a new corner placement (22); two windows satisfy the 'carSide' predicate (27).
The visualization of the search process S is displayed in Fig. 5, while Fig. 6 gives the trace of the "I" car recognition goal execution. Figure 7 displays the corner detection results achieved by this search process, for varying values of the bound SCAN parameter (see Sec. 3). Figure 8 displays the regions characterizing the car class instance, in the binary/thresholded image in Fig. 9. Please note that in Fig. 9 there are reflections from the roof and shadows, with the shadows eliminated by the context modeling, whereas the roof is treated as a supplementary "window" but not found to abide by the neighborhood graph Context-G, and thus is eliminated by the constraint satisfaction.
Fig. 7. Four corner detection results achieved by backtracking, varying the scan window length. Each result shows the same contour with a different corner placement.
Fig. 8. Image of car after preprocessing (upper left) is identified as a car instance, matching the three extracted contours shown as car windows to satisfy one of the 'car' predicates.
4.2. Extensions to Sensor Fusion
The object oriented design comes in handy to represent context diversity and changes in the case of sensor fusion tasks [25,26,49], where different detection ranges of heterogeneous sensors overlap in the feature domains while having distinct attributes.
Fig. 9. Regions extracted by the understanding procedure in the car image. Some connected components have been filtered out thanks to context knowledge, kept separate from the generic one, and also the thresholds are adapted locally.
Above the context classes, a super-class must be defined for the instances of the same contexts according to different perceptual/sensor ranges. The sensor fusion tasks apply to this root class level. Inheritance applies for the context classes and the methods underneath, but the context causal graphs and influencing domains stay specific to each sensor.

4.3. Comparisons
A summary comparison of full object oriented design to the object oriented elements considered above is presented in Table 2. But some comments are required.

Table 2. Summary comparison of full object oriented design and elements used here.

    Attributes        | Full OO Design | Current Image Understanding Environment
    Objects/Classes   | Yes            | Yes
    Inheritance       | Full           | Inheritance for methods; no inheritance for instance variables
    Encapsulation     | Yes            | No data in instances; methods exist only in class definitions
    Late binding      | Yes            | Yes - inherent in Prolog's interpretive environment
An instantiation exists only within the predicate which created the object (with "new"). Thus we do not require here full object oriented design; this however is consistent with a formal object-oriented inheritance model such as [13]. Class inheritance is limited to methods; there are no instance variables per se. Furthermore, code is not actually encapsulated: objects, e.g. relating to the context, contain no code of their own, but forward messages to their class definitions. This type of behavior resembles "delegation" [16], but includes the message based inheritance presented in [13]. There is no loss of generality with respect to Prolog, as the class definitions are translated by Prolog into Prolog; inheritance is also implemented within Prolog [23,41,44,47]. Unification/backtracking/constraint satisfaction are fully applicable to class inheritances, especially for the context model, because of the object oriented modalities above. The form of polymorphism presented here for context modeling is most flexible when the instances of influencing domains "Context-D", processes {p}, and possibly "context-object"-names each strictly abide by a class tree structure. This holds if basically the image understanding context contains few context elements (i.e. few "Context-D" domains), but with a high variability within each class. This variability is well modeled by the subclasses, instances, and inheritance. Such an object oriented design is insufficient if the "Context-G" graphs are rich, with many arcs, labels, and loops all corresponding to contextual ambiguities.
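A small sketch of how such method inheritance can be written directly in Prolog; send/2, owns/2, isa/2 and the stand-in methods are assumptions for illustration, not the actual translation produced by the environment of Fig. 4:

    isa(quad, polygon).                                % quad is a subclass of polygon

    % Stand-in method clauses; in the environment of Fig. 4 these bodies would
    % be generated from the 'body' parts of the class definitions.
    method(polygon, perimeter(Obj, P))    :- assumed_perimeter(Obj, P).
    method(quad, topBottomTheta(Obj, Th)) :- assumed_theta(Obj, Th).

    % A class owns a method directly or inherits it from its superclass.
    owns(Class, Msg) :- method(Class, Msg).
    owns(Class, Msg) :- isa(Class, Super), owns(Super, Msg).

    % send(Class, Msg): dispatch a message, searching up the class hierarchy.
    send(Class, Msg) :- owns(Class, Msg).

    assumed_perimeter(obj1, 48).
    assumed_theta(obj1, -1.47).

    % ?- send(quad, perimeter(obj1, P)).   succeeds with P = 48, inherited from polygon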
5. Perceptual Context Modeling

5.1. Perceptual Context Modeling

This fourth approach is also covered by the model of Sec. 2, by a suitable joint selection of the topologies and of the causal structures in the context graphs "Context-G" and the influencing domains "Context-D".
Example: Perceptual noise filtering in images

Many image understanding tasks consist in removing the perceptual consequences of noise, as this interferes with the interpretation. Many noise removal algorithms rely on bandpass filtering of the gray levels, thus blurring, deleting or fragmenting e.g. linear elements, such as the car element border lines in the example. Context based modeling consists in representing the "influenced" process as a stochastic process, triggered by the local variations in gray values in the influencing domain "Context-D" around each pixel. This context model will be parametrized also by the orientation and curvature of edges or lines found in the influencing domain. Areas containing a lot of structure, with strong contrast between edges/lines and the background, have a higher gray value variance in local areas. This determines, by a context predicate, whether noise reduction by low-pass filtering is to be
applied or not. The result is that noise can be reduced, while the edges in areas with prominent structures are kept sharp. By implementing separation of the context model and of the filter, noise removal filters can be designed with no regard to the structure of the image content.
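A condensed sketch of such a context predicate; local_variance/4 and the variance threshold are assumptions standing in for the statistics actually computed over the influencing domain Context-D:

    % Low local structure: low-pass filtering of the pixel neighbourhood is allowed.
    smooth_here(Img, X, Y) :-
        local_variance(Img, X, Y, V),
        V < 60.0.                    % assumed threshold on the local grey-value variance

    % Prominent edges or lines nearby: the pixel is left untouched to stay sharp.
    keep_sharp(Img, X, Y) :-
        local_variance(Img, X, Y, V),
        V >= 60.0.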
5.2. Ontologies

One difficulty with perceptual context modeling is that, when this approach is used, it is often not clear what the categories, objects, attributes, entities and conceptual structures are. This is what an ontology is about, i.e. the study of the concepts and categories of the world. The understanding and semantic depth of an image in a domain depend on the richness of the ontology of that domain. Most AI ontologies treat situations or states of the world as objects which can participate in relations; situations get changed by events and generate scripts. Underlying the ontologies are causal links which dictate the behavior: causal links are specialized relational links which indicate the propagation of change. The knowledge and mapping of the causal links driving perceptual changes is still not known (see however [50] and [51]).

6. Conclusions
Context modeling is of course difficult, but the time has come where the availability of context simulation models on the one hand, and the urgent need for code sharing and reuse on the other hand, simply impose the use of some still imperfect solutions. Until recently, there was little or no progress, just because of a lack of formal descriptions of the understanding task and of the context. While the jury is still out on the specific forms of inheritance and polymorphism in object oriented design implementations, dynamic inheritance is clearly lacking in image understanding. By dynamic inheritance we mean allowing a method to change the inheritance specifications of the methods within an object. This is required to simplify the representation of context related physical phenomena and constraints, like changes in contrast, shape discontinuities, etc., and to be able to address perceptual cues better than now. Furthermore, image understanding research should not ignore work on contexts in natural language [19,45]. In this field, a cohesive and coherent discourse relies on a three-tiered representation system based on linguistic and knowledge bases; semantic plausibility relies finally on an overall discourse model, with the invocation of context specific handlers which:

• specify which types of interpretations are possible for each specific handler type;
• combine all information they get into a single text which reflects constraints on the plausibility of a proposed interpretation.
This is very similar to the approach proposed here.
Finally, the extension of context modeling to sensor fusion should not necessarily be viewed as a "complication"; to the contrary, sensor diversity may often allow for disambiguation, provided the causal and physical processes are entirely known.
References

[1] P. Van Hentenryck, Constraint Satisfaction in Logic Programming (CHIP) (MIT Press, Cambridge, MA, 1989).
[2] J. Cohen, Constraint logic programming languages, J. ACM 33, 7 (1990) 52-67.
[3] J. Jaffar and J.-L. Lassez, Constraint logic programming, in Proc. 14th ACM Symp. on Principles of Programming Languages (POPL-87), München, 1987, 111-119.
[4] W. Leler, Constraint Programming Languages: Their Specification and Generation (Addison Wesley, 1987).
[5] A. Colmerauer, An introduction to Prolog III, J. ACM 33, 7 (1990) 67-90.
[6] P.-J. Gailly et al., The Prince project and its applications, in G. Comyn and N. E. Fuchs (eds.), Logic Programming in Action, Lecture Notes in Artificial Intelligence, Vol. 636 (Springer Verlag, Berlin, 1992) 55-63.
[7] D. H. Ballard and C. M. Brown, Computer Vision (Prentice-Hall, Englewood Cliffs, NJ, 1982).
[8] B. Bell and L. F. Pau, Contour tracking and corner detection in a logic programming environment, IEEE Trans. Pattern Anal. Mach. Intell. 12, 9 (1990) 913-916.
[9] D. Cruse, C. J. Oddy and A. Wright, A segmented image data base for image analysis, Proc. 7th Int. Conf. on Pattern Recognition, Montreal, Canada (IEEE, 1984) 493-496.
[10] E. C. Freuder, Backtrack-free and backtrack-bounded search, in L. Kanal and V. Kumar (eds.), Search in Artificial Intelligence (Springer-Verlag, New York, NY, 1988) 343-369.
[11] F. Giannesini, H. Kanoui, R. Pasero and M. van Caneghem, Prolog (Addison Wesley, Reading, MA, 1986).
[12] A. Goldberg and D. Robson, Smalltalk-80: The Language and Its Implementation (Addison-Wesley, Reading, MA, 1983).
[13] B. Hailpern and Van Nguyen, A model for object-based inheritance, in B. Shriver and P. Wegner (eds.), Research Directions in Object-Oriented Programming (MIT Press, Cambridge, MA, 1987) 147-164.
[14] R. M. Haralick and G. L. Elliot, Increasing tree search efficiency for constraint satisfaction problems, Artif. Intell. Journal 14 (1980) 263-313.
[15] A. Huertas and R. Nevatia, Detecting buildings in aerial images, Comput. Vision Graph. Image Process. 41 (1988) 131-152.
[16] H. Lieberman, Using prototypical objects to implement shared behavior in object oriented systems, Proc. ACM Conf. on Object Oriented Programming, Systems, Languages, and Applications, Portland, OR, 1986, 214-223.
[17] S. Matwin and T. Pietrzykowski, Intelligent backtracking in plan-based deduction, IEEE Trans. Pattern Anal. Mach. Intell. 7, 6 (1985) 682-692.
[18] D. M. McKeown, Jr., W. A. Harvey, Jr. and J. McDermott, Rule-based interpretation of aerial imagery, IEEE Trans. Pattern Anal. Mach. Intell. 7, 5 (1985) 570-585.
[19] B. Neumann, Natural language description of time-varying scenes, in D. Waltz (ed.), Advances in Natural Language Processes, Vol. I (Morgan Kaufmann, 1984).
[20] L. F. Pau, Knowledge-based real-time change detection, target image tracking, and threat assessment, in A. K. C. Wong and A. Pugh (eds.), Machine Intelligence and Knowledge Engineering for Robotic Applications, NATO ASI Series, Vol. F-33 (Springer Verlag, Berlin, 1987) 283-297.
[21] L. F. Pau, Knowledge representation for three-dimensional sensor fusion with context truth maintenance, in A. K. Jain (ed.), Real Time Object Measurement and Classification (Springer-Verlag, Berlin, 1988) 391-404.
[22] K. C. You and K. S. Fu, A syntactic approach to shape recognition using attribute grammars, IEEE Trans. Syst. Man Cybern. 9, 6 (1979) 334-345.
[23] L. Leonardi, P. Mello and A. Natali, Prototypes in Prolog, J. Object Oriented Programming 2, 3 (1989) 20-28.
[24] A. R. Rao and R. Jain, Knowledge representation and control in computer vision systems, IEEE Expert, Spring (1988) 64-79.
[25] L. F. Pau, Knowledge representation for sensor fusion, in Proc. IFAC World Congress 1987 (Pergamon Press, Oxford, 1987).
[26] S. B. Pollard, J. E. W. Mayhew and J. P. Frisby, PMF: a stereo correspondence algorithm using a disparity gradient limit, Perception 14, 449-470.
[27] R. A. Brooks, Model based 3-D interpretation of 2-D images, in Proc. 7th Int. J. Conf. on Artificial Intelligence, 1981, 619-623.
[28] L. Wos, R. Overbeek, E. Lusk and J. Boyle, Automated Reasoning: Introduction and Applications (Prentice Hall, Englewood Cliffs, NJ, 1984).
[29] L. Kitchen and A. Rosenfeld, Scene analysis using region-based constraint filtering, Pattern Recogn. 17, 2 (1984) 189-203.
[30] Y. Ohta, Knowledge Based Interpretation of Outdoor Natural Scenes (Pitman Advanced Publishing Progr., 1985).
[31] J. Doyle, A truth maintenance system, Artif. Intell. J. 12 (1979) 231-272.
[32] J. De Kleer, Choices without backtracking, in Proc. AAAI Nat. Conf. on Artificial Intelligence, Aug. 1984.
[33] A. Rosenfeld et al., Comments on the Workshop on Goal-Directed Expert Vision Systems, Comput. Vision Graph. Image Process. 34, 1 (1986) 98-110.
[34] Harbour change of activity analysis, AD 744332, NTIS, Springfield, VA, 1982.
[35] Proc. DARPA Image Understanding Workshop, Science Applications Report SAI-84176-WA, or AD 130251, NTIS, Springfield, VA, June 1983.
[36] J. Ebbeni and A. Monfils (eds.), Three-Dimensional Imaging, Proc. SPIE, Vol. 402, Apr. 1983.
[37] N. Kazor, Target tracking based scene analysis, CAR-TR-88, CS-TR-1437, Univ. of Maryland, College Park, MD, Aug. 1984.
[38] B. Bell and L. F. Pau, Context knowledge and search control issues in object oriented Prolog-based image understanding, Pattern Recogn. Lett. 13 (1992) 279-290.
[39] P. Coad and E. Yourdon, Object Oriented Analysis (Prentice Hall, Englewood Cliffs, NJ, 1991).
[40] A. Palaretti and P. Puliti, A Prolog approach to image segmentation, J. Appl. Artif. Intell. 3, 4 (1990) 56-68.
[41] R. Knaus, Message passing in Prolog, AI Expert, May 1990, 21-27.
[42] D. T. Lawton, Image understanding environments, Proc. IEEE 76, 8 (1988) 1036-1050.
[43] S.-S. Chen (ed.), Image Understanding in Unstructured Environments (World Scientific, Singapore, 1988).
[44] D. Pountain, Adding objects to Prolog, BYTE, Aug. (1990).
[45] S. Luperfoy and E. Rich, A computational model for the resolution of context-dependent references, MCC Technical Report NL-068-92, MCC, Austin, TX, Mar. 1992.
[46] B. Jähne, Digital Image Processing (Springer Verlag, Berlin, 1991).
[47] B. Bell and L. F. Pau, Prolog object oriented embedded manager, Tech. Report, Technical University of Denmark, 21 Jul. 1989.
[48] G. Ciepel and T. Rogon, Background modelling in an object oriented, logic programming, image processing environment, Tech. Report, Technical University of Denmark, May 1990.
[49] L. F. Pau, Sensor and Data Fusion (Academic Press, NY, 1993).
[50] L. F. Pau, Behavioral knowledge in sensor and data fusion systems, J. Robotic Syst. 7, 3 (1990) 295-308.
[51] V. Akman and M. Surav, Steps toward formalizing context, AI Magazine 17, 3, 55-72.
Appendix A. Object Oriented Definitions
A good introductory article and a more detailed glossary can be found e.g. in [12,39].

Class: A set of elements sharing the same behavior and structure characteristics, which are represented in a "class definition". A class which adopts the behavior and structure of another class, but specializes some characteristics to form a subcategory, is a "subclass".

Constraints: Predicates or parameters which control goal evaluation and backtracking so that the resulting object classification or scene understanding is consistent with the known physical characteristics of the object or class.

Context: A consistent subset of facts derived during the evaluation of a higher level goal. Context knowledge includes predicates to establish a context, i.e. to evaluate a top-level image understanding goal, and constraints to enforce consistency.

Instance: An element created from the descriptions in a class definition. An image object is classified when it is determined to be an instance of a class.

Method: A segment of code appearing within a class definition, which can be invoked by sending a message to any instance of that class to evaluate the named method.
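As a purely illustrative aid (this is not the chapter's Prolog-based system), the glossary terms can be mapped onto a conventional object-oriented language; the Bridge and RailwayBridge classes below are hypothetical scene-model objects invented for the example.

```python
# Illustrative only: the glossary terms expressed in Python. "Bridge" and its
# subclass "RailwayBridge" are hypothetical object classes from an imagined
# scene model, not part of the chapter's system.
class Bridge:                                   # a class: shared structure and behavior
    def __init__(self, length_m, width_m):
        self.length_m = length_m
        self.width_m = width_m

    def aspect_ratio(self):                     # a method, invoked on any instance of the class
        return self.length_m / self.width_m

    def satisfies_constraints(self):            # constraints: consistency with known physical characteristics
        return self.length_m > self.width_m > 0

class RailwayBridge(Bridge):                    # a subclass: adopts and specializes the parent class
    def satisfies_constraints(self):
        return super().satisfies_constraints() and self.width_m < 20

obj = RailwayBridge(120, 8)                     # an instance created from the class definition
print(obj.aspect_ratio(), obj.satisfies_constraints())   # 15.0 True
```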
Appendix B. Constraint Logic Programming

Logic is a powerful way of representing problems. Constraints appearing in logic programming are tests used for checking a solution. With CLP (Constraint Logic Programming), the constraints are part of the description of the problem, which in this chapter is the image understanding task. The way they are used allows for a more efficient search for solutions. Further details on various CLP concepts and implementations are found in [la].
B.1. Syntax
A problem is represented in CLP by a set of clauses (or rules), as in logic programming languages such as Prolog [11]. However, the syntax of the clauses is
different. The common part is that a clause consists of a term, the head of the clause, and a list of terms (which can be empty) in the body of the clause; in both cases the meaning is "the head term is true if the body terms are true". The difference with respect to logic programming languages is that CLP clauses can also contain a list of constraints (see next section), and in this case the meaning becomes: "the head term is true if the body terms are true and the constraints are not violated".
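To make the clause form concrete, the following minimal data-structure sketch (ours, not an actual CLP engine) records a head, a possibly empty body, and an optional list of constraints; the example clause and predicate names are invented.

```python
# A minimal data-structure sketch of the clause form described above: a head,
# a possibly empty body, and an optional constraint list. Not a CLP engine;
# the predicate names are invented for illustration.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Clause:
    head: str                                            # "the head term is true if ..."
    body: List[str] = field(default_factory=list)        # "... the body terms are true ..."
    constraints: List[str] = field(default_factory=list) # "... and the constraints are not violated"

# e.g. a hypothetical image-understanding rule: object(X) holds if X is a region
# whose area A satisfies A > 100.
c = Clause(head="object(X)",
           body=["region(X)", "area(X, A)"],
           constraints=["A > 100"])
print(c)
```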
B.2. Constraints

Constraints apply to terms, boolean values, identifiers, lists and trees, and numbers (integers, rationals and/or reals, depending on the CLP language used). Constraints are equations, inequations, logical expressions, or fixed lists. The variables in the constraints behave like unknown quantities in mathematical equations:
Examples:
- X = X/2 + 1 implies X = 2;
- X = X + 1 has no solution;
- Y = X + 1 means that, if the two variables X, Y are unbound, they belong to one line;
- A ← B, C says that A is true if B and C are;
- 0 < T < 3X/4 + 5Y defines a region for the unbound variable T.

Execution efficiency depends very much on the time at which the constraint is treated, and on the algorithm used for testing the satisfiability of systems of constraints, in what is called the constraint solver.
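The behavior of these example constraints can be reproduced with a general-purpose solver; the sketch below uses SymPy as a stand-in for the CLP constraint solver (our choice of tool, not one discussed in the chapter), with X = 4 and Y = 1 picked arbitrarily for the inequality.

```python
# A small sketch of how a constraint solver treats the examples above, using
# SymPy as a stand-in for the CLP constraint solver.
from sympy import symbols, Eq, solve, reduce_inequalities

X, Y, T = symbols('X Y T', real=True)

print(solve(Eq(X, X/2 + 1), X))   # [2]      : X = X/2 + 1 implies X = 2
print(solve(Eq(X, X + 1), X))     # []       : X = X + 1 has no solution
print(solve(Eq(Y, X + 1), Y))     # [X + 1]  : with X and Y unbound, the solutions lie on a line
# 0 < T < 3X/4 + 5Y only delimits a region for T; e.g. with X = 4 and Y = 1:
print(reduce_inequalities([T > 0, T < 3*4/4 + 5*1], T))   # 0 < T < 8
```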
B.3. Resolution

The resolution mechanism in CLP languages is based both on a constraint solving mechanism, which is in charge of testing whether the constraints can have at least one solution, and on a unification algorithm which attempts to prove each term of the goal by replacing it by an equivalent set of terms and constraints, as they appear in a clause. At each step of the attempt to prove a logical goal, the constraint solver must decide if there is at least one solution for the set of constraints on the variables which appear in the terms considered. The mechanism is as follows: given a set of variables W (appearing in the query), a list of terms t0, t1, ..., tn, and a list of currently satisfiable constraints S, two states are defined:
(1): (W, t0, t1, ..., tn, S)
(2): (W, s1, ..., sm, t1, ..., tn, S ∪ R ∪ {s0 = t0})

where ∪ denotes set union. An inference step consists of making a transition from state (1) to state (2) by applying the program rewriting rule (r):

r: s0 ← s1, ..., sm, R
in which the si's are terms and R is the set of constraints specified by the CLP rule. The new state after inference becomes (2) if the new set of constraints S ∪ R ∪ {s0 = t0} is satisfiable, i.e. has at least one possible solution. Here, (s0 = t0) represents the set of constraints applied to variables so that s0 and t0 become identical. If the new set of constraints is not satisfiable, another rule has to be tried in the CLP program to attempt to replace the term. There are two types of non-determinism that arise in the sequential interpretation of such CLP programs: the first is the selection of the term in the list of terms that will be processed first, and the second is the choice of an applicable rule in the CLP program. The constraint solver must be incremental to minimize the computational effort required to check whether the constraints remain satisfiable. If the set S of constraints has solutions, adding a new set of constraints R should not require solving (S ∪ R) from scratch, but rather transforming the solutions of S into solutions of (S ∪ R).
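The accept/reject decision made at each inference step can be sketched as follows; SymPy again stands in for the constraint solver, the constraints are toy equations written as expressions equal to zero, and the sketch is deliberately not incremental.

```python
# A toy sketch of the satisfiability test performed at each inference step:
# given the current constraint set S and the new constraints R (together with
# s0 = t0), the transition to state (2) is made only if the combined set still
# has at least one solution. SymPy stands in for the constraint solver.
from sympy import symbols, solve

def satisfiable(constraints, variables):
    """Return True if the constraint set has at least one solution."""
    return bool(solve(list(constraints), list(variables), dict=True))

X, Y = symbols('X Y', real=True)
S = {X - 2*Y, X - 4}          # current constraints, written as expr = 0: X = 2Y, X = 4
R = {Y - 2}                   # new constraints contributed by the rule: Y = 2

if satisfiable(S | R, [X, Y]):
    S = S | R                 # the transition to state (2) is made
print(S)
```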
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 765-796. Eds. C. H. Chen, L. F. Pau and P. S. P. Wang. © 1998 World Scientific Publishing Company
CHAPTER 4.4

POSITION ESTIMATION TECHNIQUES FOR AN AUTONOMOUS MOBILE ROBOT - A REVIEW
RAJ TALLURI and J. K. AGGARWAL
Computer and Vision Research Center, Department of Electrical Engineering, University of Texas at Austin, Austin, Texas 78712, USA

In this paper, we review various methods and techniques for estimating the position and pose of an autonomous mobile robot. The techniques vary depending on the kind of environment in which the robot navigates, the known conditions of the environment, and the type of sensors with which the robot is equipped. The methods studied so far are broadly classified into four categories: landmark-based methods, methods using trajectory integration and dead reckoning, methods using a standard reference pattern, and methods using a priori knowledge of a world model which is matched to the sensor data for position estimation. Each of these methods is considered and its relative merits and drawbacks are discussed.

Keywords: Autonomous navigation, mobile robots, position estimation, self-location, landmarks, world model.
1. Introduction

Autonomous mobile robots are one of the important areas of application of computer vision. The advantages of a vehicle that can navigate without human intervention are many and varied, ranging from providing access to hazardous industrial environments to battlefield surveillance vehicles. A number of issues and problems must be addressed in the design of an autonomous mobile robot, from the basic scientific issues to state-of-the-art engineering techniques. The tasks required for successful autonomous navigation by a mobile robot can be broadly classified as (1) sensing the environment; (2) building its own representation of the environment; (3) locating itself with respect to the environment; and (4) planning and executing efficient routes in this environment. It is advantageous for a robot to use different types of sensors and sensing modalities to perceive its environment, since information available from one source can be used to better interpret information from other sources, and can be synergistically fused to get a much more meaningful representation. Some of the different sensor modalities considered by previous researchers are visual sensors (both monocular and binocular stereo), infrared sensors, ultrasonic sensors, and laser range finders.
Building a world model, also termed map-making, is an important problem in mobile robot navigation. The type of spatial representation system used by a robot should provide a way to consistently incorporate the newly sensed information into the existing world model. It should also provide the necessary information and procedures for estimating the position and pose of the robot in the environment. Information to do path planning, obstacle avoidance and other navigational tasks must also be easily extractable from the built world model. Section 3 presents a review of various map-making strategies and their associated position estimation methods. Determining the position and the pose of a robot in its environment is one of the basic requirements for autonomous navigation. In this discussion, position refers to the location of the robot on the ground plane and pose refers to the orientation of the robot. We use the term position estimation to refer to the estimation of both position and pose. The problem of self-location has received considerable attention, and many techniques have been proposed to address it. These techniques vary significantly, depending on the kind of environment in which the robot is to navigate, the known conditions of the environment, and the type of sensors with which the robot is equipped. Most mobile robots are equipped with wheel encoders that can be used to estimate the robot's position at every instant; however, due to wheel slippage and quantization effects, these estimates of the robot's position contain errors. These errors build up and can grow without bound as the robot moves, and the position estimate becomes more and more uncertain. So, most mobile robots use an additional form of sensing, such as vision or range, to aid the position estimation process. In this paper we review various techniques studied for estimating the position and pose of an autonomous mobile robot. Broadly, we classify the position estimation techniques into four categories: landmark-based methods, methods using trajectory integration and dead reckoning, methods using a standard reference pattern, and methods using a priori knowledge of a world model which is matched to the sensor data for position estimation. These four methods are briefly described below. In landmark-based methods, typically the robot has a list of stored landmark positions in its memory. It then senses these landmarks using the onboard sensors and computes the position and pose using the stored and the sensed information. Section 2 reviews the different approaches using landmarks for self-location. In the second type of position estimation technique, the position and pose of a mobile robot are estimated by integrating over its trajectory and by dead reckoning, i.e. the robot maintains an estimate of its current location and pose at all times and, as it moves along, updates the estimate by dead reckoning. Section 3 reviews these methods. A third method of estimating the position and pose of the mobile robot accurately is to place standard patterns in known locations in the environment. Once the robot detects these patterns, the position of the robot can be estimated from the known location of the pattern and its geometry. Different researchers have used different
kinds of patterns or marks, and the geometry of their methods and the associated techniques for position estimation vary accordingly. These methods are discussed in Section 4. Finally, some researchers consider the position estimation problem using a priori information about the environment in which the robot is to navigate, i.e. a preloaded world model is given. The approach is to sense the environment using onboard sensors and to match these sensory observations to the preloaded world model to arrive at an estimate of the position and pose of the robot with a reduced uncertainty. Section 5 presents a review of the different methods studied in solving these issues. Once the robot has the capability to sense its environment, build a representation of it, and estimate its position accurately, then navigational tasks such as path planning and obstacle avoidance can be performed.
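As a concrete illustration of the wheel-encoder dead reckoning and unbounded error growth mentioned in this introduction, the following sketch integrates noisy odometry readings; all noise levels and readings are invented for illustration.

```python
# A minimal dead-reckoning sketch: the pose (x, y, theta) is updated from wheel
# encoder readings (travelled distance d and heading change dtheta), both of which
# are corrupted by slippage and quantization noise, so a crude scalar uncertainty
# grows at every step unless corrected by external sensing. All values are invented.
import math
import random

x, y, theta = 0.0, 0.0, 0.0
position_variance = 0.0                          # crude scalar uncertainty measure

for d, dtheta in [(1.0, 0.0), (1.0, math.pi / 8), (1.0, 0.0)]:   # odometry readings
    d_noisy = d + random.gauss(0.0, 0.01)            # wheel slippage / quantization
    dtheta_noisy = dtheta + random.gauss(0.0, 0.005)
    theta += dtheta_noisy
    x += d_noisy * math.cos(theta)
    y += d_noisy * math.sin(theta)
    position_variance += 0.01 ** 2 + (d * 0.005) ** 2   # error accumulates without bound

print((round(x, 3), round(y, 3), round(theta, 3)), position_variance)
```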
2. Landmark-Based Methods

Using landmarks for position estimation is a popular approach. The robot uses the knowledge of its approximate location to locate the landmarks in the environment. Once these landmarks are identified and the range/attitude of these relative to the robot is measured, in general, the position and pose of the robot can be triangulated from these measurements with a reduced uncertainty. Landmarks used for position estimation can include natural or man-made features in the outdoor environment, such as the tops of buildings, roof edges, hilltops, etc., or can be identifiable beacons placed at known positions to structure the environment. One basic requirement of the landmark-based methods is that the robot be able to identify and locate the landmarks, which is not an easy task. The position estimation methods based on landmarks vary significantly depending upon the sensors used (e.g. range or visual sensors); the type of landmarks (i.e. whether they are point sources or lines, etc.); and the number of landmarks needed. Case [4] summarizes the landmark-based techniques quite well and presents a new method, called the running fix method, for position estimation. Case classifies the sensor data for navigation purposes as either angular or range type inputs. In general, any combination of two of these is sufficient to yield a fix. In Fig. 1, for instance, the angle between the x-axis and each of the two landmarks is used to construct two lines of position (LOP) which intersect at the robot's location. In Fig. 2, arcs are struck at the measured range corresponding to the two landmarks. The intersection points of these two arcs yield two possible positions for the robot, thus requiring either correlation with an estimated position or the use of a third landmark to resolve the ambiguity. A range and an angle may also be used, as in Fig. 3. This requires only one landmark, but requires either multiple sensors or a sensor capable of measuring both range and attitude. The angle measurements can be either absolute or relative. Absolute angle measurements require the robot to maintain an
Fig. 1. Two landmarks and lines of position.

Fig. 2. Ranges from two landmarks.
Fig. 3. LOP and range from one landmark.
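The geometry of Fig. 1 can be sketched as follows: each absolute bearing to a known landmark defines a line of position through that landmark, and the robot's fix is the intersection of the two LOPs. This is an illustrative sketch, not any particular author's implementation; the landmark coordinates and the true robot position are invented.

```python
# A sketch of the Fig. 1 case: two absolute bearings to landmarks at known positions
# define two lines of position (LOPs) whose intersection is the robot's location.
# Landmark coordinates and the true robot position are invented for illustration.
import numpy as np

def fix_from_bearings(landmarks, bearings):
    """Intersect the LOPs through each landmark (least squares if more than two)."""
    A, b = [], []
    for (lx, ly), bearing in zip(landmarks, bearings):
        # The robot lies on the line through (lx, ly) with direction (cos b, sin b);
        # its normal is (sin b, -cos b), giving the linear constraint n . x = n . L.
        n = np.array([np.sin(bearing), -np.cos(bearing)])
        A.append(n)
        b.append(n @ np.array([lx, ly]))
    solution, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return solution

landmarks = [(10.0, 0.0), (0.0, 8.0)]             # known landmark positions
robot_true = np.array([2.0, 3.0])
bearings = [np.arctan2(ly - robot_true[1], lx - robot_true[0]) for lx, ly in landmarks]
print(fix_from_bearings(landmarks, bearings))     # approximately [2. 3.]
```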
internal reference using a gyrocompass or an inertial sensor. Any error in this reference usually affects the position estimation accuracy. Case also presents a method called the running fix for use in areas with a low density of landmarks. The underlying principle of the technique is that an angle or range to a landmark obtained at a time t-1 can be used at a time t. To do this the cumulative measurement vector recorded since the reading was obtained is added to the position vector of the landmark, thus creating a virtual landmark. Case presents experimental results using an ARCTEC Gemini mobile robot with an infrared beacon/detector pair. Clare D. McGillem et al. [35] also describe an infrared location system for autonomous vehicle navigation. They present an efficient method of resection for position estimation using three infrared beacons to structure the environment and an optical scanner on the robot capable of measuring the angles between a pair of beacons. They also discuss the sensitivity of the method to errors. They point out that by judiciously placing the beacons in the environment, regions of high error sensitivity can be minimized or avoided. They demonstrate the feasibility of their approach by implementing it on an inexpensive experimental system. Nasr and Bhanu [42] present a new approach to landmark recognition based on the perception, reasoning, and expectation (PREACTE) paradigm for autonomous mobile robot navigation. They describe an expectation-driven, knowledge-based landmark recognition system that uses a priori map and perceptual knowledge. Bloom [2] also describes a landmark-based system for mobile robot navigation that uses a grid-based terrain map (MAP), a landmark database, and a landmark visibility map (LVM). Bloom describes the landmarks by one or more of three distinctive attribute sets: a color attribute set, a textural attribute set, and a 2-D geometry attribute set. He then describes the contents of these attribute sets and shows how the vision system uses these attributes to recognize the landmarks. Sugihara [47,48] presents algorithms for the position estimation of a mobile robot equipped with a single visual camera. He considers the problem of a robot given a map of a room in which it navigates. Vertical edges are extracted from the images taken by the robot's camera, with the optical axis parallel to the floor. The points from which the vertical edges can arise are assumed to be given. Sugihara then considers two classes of problems. In the first class, all vertical edges are identical, and he searches for the point where the image is taken by establishing a correspondence between the vertical edges in the images and those in the map. In the second class of problems, the vertical edges are not distinguishable from each other and the exact directions in which the edges are seen are not given; only the order in which they are found in the image is given. The problems are considered mainly from the point of view of computational complexity. In the case where the vertical lines are distinguishable from one another, Sugihara shows that if we establish a correspondence between three image points and three poles (vertical lines) and measure the angles between the rays joining the image points to the lines, we can uniquely determine the camera's position.
Fig. 4. The unique camera position determined by three rays and the corresponding mark points.
In Fig. 4, p1, p2, and p3 are the three poles and R is the robot's position. In the case involving four poles, the solution is not necessarily unique. So, in general, when we have k poles and, hence, k rays, Sugihara suggests using the first three rays, r1, r2, r3, to determine the position R and then using the other rays to check if the solution R is correct. Now, in the general case when the k lines are not distinguishable from one another, the suggested approach is: first, choose and fix any four rays, say r1, r2, r3, r4, and next, for any quadruplet (pi, pj, pk, pl) of marks (vertical lines), solve for the position on the assumption that r1, r2, r3, r4 correspond to (pi, pj, pk, pl), respectively. Then repeat for the n(n-1)(n-2)(n-3) different quadruplets for a consistent solution. The above naive procedure can solve for the position in O(n^4) time. He then gives a less naive algorithm for the position estimation, with n identical marks, which runs in O(n^3 log n) time with O(n) space or in O(n^3) time with O(n^2) space. Sugihara also considers variations of this problem of n indistinguishable vertical lines, such as: (1) the existence of spurious edges; (2) the existence of opaque walls; (3) linearly arranged marks; and (4) a case in which the robot has a compass. He discusses the possible solutions and simplifications of the original algorithm for these special cases. The case in which the marks are distinguishable from one another but the directions are inaccurate is considered in the second part of the paper. He shows that this case is essentially the same as the problem of forming a region which generates a circular scan list in a given order. Krotkov [25] essentially followed Sugihara's work of localizing a mobile robot navigating on a flat surface with a single camera, using the vertical lines in the image plane as landmarks. He formulates the problem as a search in a tree of interpretations (pairings of landmark directions and landmark points). The algorithm he uses is the naive algorithm, discussed by Sugihara, that runs in O(n^4) time. In his work, Krotkov also considers the errors in the ray directions and, using worst-case analysis, derives bounds on the position and pose estimated using this method of localization. He shows that in the case when the angles of the rays
Fig. 5. Possible locations given by noisy rays (R denotes the region of possible robot locations).
are erroneous, the robot position estimated lies not on one point but in a region of possible points, R (see Fig. 5). He also presents simulation results with random ray errors and worst-case ray errors and makes the following observations from his analysis: (1) the number of solution poses computed by the algorithm depends significantly on the number k of angular observations and the observation uncertainty; and (2) the distribution of solution errors, given angular observation errors that are either uniformly or normally distributed, is approximately Gaussian, with a variance that is a function of the observation uncertainty. Krotkov also presents real data results using CCD imagery. Most of the landmark-based approaches considered above suffer from the disadvantages of: (1) assuming the availability of landmarks in the scene around the robot; (2) depending on the visibility and the ability to recognize these landmarks from the image to estimate the range/attitude to them from the current location; (3) requiring an approximate starting location to check for the landmarks; and (4) needing a database of landmarks in the area to look for in the image.
2.1. Photogrammetric Methods

Photogrammetry generally deals with the mathematical representation of the geometrical relations between physical objects in three-dimensional space based on their images recorded on a two-dimensional medium. Over the years, photogrammetry has been routinely used in aerial photography, cartography, and remote sensing [61]. One of the problems of cartography is to determine the location of an airborne camera from which a photograph was taken by measuring the positions of a number of known ground objects or landmarks on the photograph. This problem is sometimes known as the camera calibration problem. The orientation and position of the camera in the object space are traditionally called the camera's exterior orientation parameters, as opposed to its interior orientation parameters, which are independent
of the co-ordinate system of the ground objects. The interior orientation parameters include such elements as the camera's effective focal length, lens distortion, decentering, image plane scaling, and optical axis orientation. These parameters generally do not vary as much, or as quickly, as the exterior orientation parameters and need not be updated at image sampling rates. For nonmetric cameras, standard off-line calibration procedures are available for determining the elements of the interior calibration. The problem of estimating the position and pose of an autonomous mobile robot is, in essence, similar to the camera exterior orientation problem in photogrammetry. However, since the robot is ground-based and has position encoders and other sensors on it, these can be used to constrain the possible orientation and pose. In general, the exterior camera orientation problem involves solving for six degrees of freedom, three rotational and three translational. Traditionally, in single camera photogrammetry, by observing the object's feature points on the image, it is possible to solve the exterior orientation calibration problem using a traditional method known as space resection [12]. The method is based on the perspective geometry of a simplified camera model, derived from pinhole optics, in which the image of each feature point is projected onto the image plane by a ray connecting the feature point with the pinhole lens. This collinearity condition results mathematically in two nonlinear equations for each feature point. Hence at least three non-collinear points are required to solve for the six degrees of freedom. These collinearity equations are linearized and solved in an iterative fashion. When the images are noisy, more than three points can be used, with least squares criteria, to take advantage of data smoothing. These methods are now standard in the photogrammetry literature [61]. Iterative solutions are generally more computationally demanding, so that simplifying assumptions are usually necessary for real-time applications. Over the years, a number of alternate methods have been proposed in an effort to improve the efficiency of the camera calibration procedure. Some of these are reviewed below. Szczepanski [49] surveys nearly 80 solutions, beginning with one by Schrieber of Karlsruhe in 1879. The first robust solution in the computer vision literature is by Fischler and Bolles [15]. They studied the exterior calibration problem in connection with the concept of random sample consensus (RANSAC), a methodology proposed for processing large data sets with gross errors or outliers. They argue against the classical techniques of parameter estimation, such as least squares, that optimize (according to a specified objective function) the fit of a functional description (model) to all the presented data. Their argument is that the above techniques are usually averaging techniques that rely on the smoothing assumption, which is not usually valid when the data has outliers or gross errors. The RANSAC paradigm they present can be stated as follows: Given a model that requires a minimum of n data points to instantiate its free parameters and a set of data points P such that the number of points in P is greater than n, randomly select a subset S1 of n data points from P and instantiate the
model. Use the instantiated model M1 to determine the subset S1* of points in P that are within some error tolerance of M1. The set S1* is called the consensus set of S1. If #(S1*) is greater than some threshold t, which is a function of the estimate of the number of gross errors in P, use S1* to compute (possibly using least squares) a new model M1*. If #(S1*) is less than t, randomly select a new subset S2 and repeat the above process. If, after some predetermined number of trials, no consensus set with t or more members has been found, either solve the model with the largest consensus set found or terminate in failure. Fischler and Bolles then discuss methods to determine the three unspecified parameters in the RANSAC paradigm: (1) the error tolerance, (2) the number of subsets to try, and (3) the threshold t. They then present a new solution to the Location Determination Problem (LDP) based on the RANSAC paradigm. They reduce the LDP to the perspective-n-point problem, i.e. if we can compute the lengths of the rays from three landmarks to the center of perspective projection, then we can directly solve for the location and orientation of the camera. They obtain solutions in a closed form for three and four coplanar feature points; the latter, as well as the case of six points in general position, are demonstrated to be unique. Unfortunately, these analytic solutions cannot be extended to the general case involving more than four points. Nevertheless, the paper does demonstrate graphically the existence of multiple solutions with four or five noncoplanar points. Beyond these qualitative observations, however, no conclusion was offered regarding existence and uniqueness in the general case. The four point solution has been implemented in a power line inspection system [29]. Ganapathy [19] presents a noniterative, analytic technique for recovering the six exterior orientation parameters as well as four of the interior orientation parameters (two for scaling and two for the location of the origin in the image plane). His method assumes that the perspective transformation matrix relating the world model and image plane points is determined by experimental means. Ganapathy essentially presents an algorithm to decompose the given transformation into the various camera parameters that constitute the components of the matrix. He linearizes the system of equations represented by the transformation matrix by increasing the number of unknowns. He then adds additional constraints, drawn from the properties of the rotation matrix, to solve these systems of equations. The algorithm is independent of the number or distribution of the feature points, since this information has already been distilled into the transformation matrix. Although the matrix may be obtained through experimental means, it is not known whether the effort will be feasible for operation in real time. Kumar and Hanson [28] report that their implementation of the method is extremely susceptible to noise, and suggest that the susceptibility may be due to the nonlinear least square minimization used, where it is assumed that all the parameters are linearly independent while they actually are not.
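The RANSAC paradigm stated above can be sketched generically; for concreteness the model below is a 2-D line rather than the location determination problem itself, and the error tolerance, trial count, and threshold t are invented values.

```python
# A generic sketch of the RANSAC loop described above, applied to 2-D line fitting
# for concreteness. Error tolerance, number of trials, and the threshold t are
# invented; the original formulation targets the location determination problem.
import random
import numpy as np

def fit_line(pts):
    """Least-squares fit of y = a*x + c."""
    xs, ys = np.array(pts).T
    a, c = np.polyfit(xs, ys, 1)
    return a, c

def ransac_line(points, n=2, tol=0.1, trials=100, t=8):
    best = []
    for _ in range(trials):
        sample = random.sample(points, n)             # minimal subset S1
        a, c = fit_line(sample)                       # instantiated model M1
        consensus = [(x, y) for x, y in points        # consensus set S1*
                     if abs(y - (a * x + c)) < tol]
        if len(consensus) >= t:                       # enough support: refit by least squares
            return fit_line(consensus)
        if len(consensus) > len(best):
            best = consensus
    return fit_line(best)                             # fall back to the largest consensus set

pts = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.02)) for x in np.linspace(0, 1, 10)]
pts += [(0.3, 5.0), (0.7, -4.0)]                      # gross errors (outliers)
print(ransac_line(pts))                               # approximately (2.0, 1.0)
```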
Tsai [58] presents a two-stage technique for the calibration of both the exterior and interior parameters of the camera that is probably the most complete camera calibration method proposed so far. The interior parameters include the effective focal length, the radial lens distortion, and the image scanning parameters. The basic idea is to reduce the dimensionality of the parameter space by finding a constraint or equation which is only a function of a subset of the calibration parameters. Tsai introduces a constraint called the radial alignment constraint, which is a function of only the relative rotation and translation (except for the z component) between the camera and the calibration points. Although the constraint is a nonlinear function of the above mentioned calibration parameters (called Group I parameters), a simple and efficient way exists for computing them. The rest of the calibration parameters (called Group II parameters) are computed with normal projective equations. A good initial estimate of the Group II parameters can be obtained by ignoring the lens distortion and using simple linear equations in two unknowns. The precise values for these Group II parameters can then be computed in one or two iterations, minimizing the perspective equation error. One of the limitations of this technique is that although the method calls for a minimum of five coplanar feature points (seven in the non-coplanar case), a much larger number is required for accuracy (60 points were used in the experiment). Furthermore, restrictions on the relative positions between the objects and the camera exist. For instance, the plane containing the feature points must not be exactly parallel to the image plane of the camera. Although these conditions can be easily arranged in a laboratory environment, they cannot be guaranteed to hold in a real life operating environment for a mobile robot. Finally, the range parameter (Group II) must still be generated by a nonlinear optimization procedure (specified only as a steepest descent), the choice of which could have a major influence on the efficiency of the algorithm. Horaud et al. [21] consider the perspective-4-point problem. They derive an analytic solution for the case of four non-coplanar points, namely a biquadratic polynomial in one unknown. Roots of such an equation can be found in closed form or by an iterative method. Finding a solution for four non-coplanar points is equivalent to finding a solution to a pencil of three non-coplanar lines: the three lines share one of the four points. The authors show the various line and point configurations that are amenable to solving the P4P problem. Liu, Huang, and Faugeras [31] present a new method for determining the camera location using straight line correspondences. Since lines can be created from given points, the method can be used for point correspondences also. They show that the rotation matrix and the translation vector can be solved for separately. Both linear and nonlinear algorithms are presented for estimating the rotation. The linear method needs eight line correspondences or six point correspondences, while the nonlinear method needs three line or point correspondences. For the translation vector, the method needs three line correspondences or two point correspondences and the algorithm is linear. The authors argue that since the nonlinear methods
need fewer correspondences and have a wide convergence range, they may be preferable in practical problems. The constraint used by Liu, Huang, and Faugeras is that the 3-D lines in the camera coordinate system must lie on the projection plane formed by the corresponding image line and the optical center. Using this fact, the constraints on rotation can be separated from those on translation. They suggest two methods to solve the rotation constraint. In the first, they represent the rotation as an orthonormal matrix and derive an eigenvalue solution. However, they do not enforce the six orthonormality constraints for an orthonormal matrix. The second method represents rotation by Euler angles and is a nonlinear iterative solution obtained by linearizing the problem about the current estimate of the output parameters. The translation constraint is solved by a linear least-squares method. Kumar [27] argues that the decomposition of the solution into the two stages of solving first for rotation and then for translation does not use the set of constraints effectively. His argument is that since the rotation and translation constraints, when used separately, are very weak constraints, even small errors in the rotation stage become amplified into large errors in the translation stage. This, he says, is particularly true in the case of an autonomous mobile robot in an outdoor environment, where the landmark distances from the camera are large. He suggests solving for both the rotation and translation matrices simultaneously to achieve better noise immunity. He uses the same constraints as Liu, Huang, and Faugeras but a different nonlinear technique. The technique he uses is one adapted from Horn [22] to solve the problem of relative orientation. Kumar presents two algorithms, R-then-T and R-and-T. The former solves for the rotation first and then for the translation using the rotation matrix. The latter solves for both rotation and translation simultaneously. He presents experimental results which show that R-and-T performs better in all cases. In addition, he also develops a mathematical analysis of the uncertainty measure, which relates the variance in the output parameters to the noise present in the input parameters. For the analysis, he assumes that there is no noise in the 3-D model data and that the only input noise occurs in the image data. To handle the problem of outliers or errors in the data and landmark correspondences, Kumar and Hanson [28] present a technique that performs quite well even in the presence of up to 49.9% outliers or gross errors. The work is basically an extension of their previous work [27]. They present an algorithm called Med-R-and-T which minimizes the median of the square of the error over all lines, or the LMS (least median of squares) estimate. The outliers can be arbitrarily large. The algorithm is based on the robust algorithm by Rousseeuw [45]. LMS algorithms have been proven to have a 49.9% breakdown point. Haralick et al. [20] summarize the various cases of the position estimation problem using point data. They consider the pose estimation problem to involve, essentially, the estimation of the object position and orientation relative to a model reference frame or relative to the object position and orientation at a previous time, using a camera sensor or a range sensor. They divide the problem into four cases, depending on the type of model and sensor data: (1) 2-D model data and 2-D sensor
data, (2) 3-D model data and 3-D sensor data, (3) 3-D model data and 2-D sensor data, and (4) two sets of 2-D sensor data. All data considered is point data, and the correspondence between the model and sensor data is assumed. The 2-D sensor data is usually the camera perspective projection. The 3-D sensor data refers to range data. The authors refer to Case 3 as absolute orientation and Case 4 as relative orientation. Case 4 occurs in multicamera imagery or time-varying imagery. Haralick et al. present a solution to each of the above four problems and characterize their performance under varying noise conditions. They argue for robust estimation procedures in machine vision, since all machine vision feature extractors, recognizers, and matchers seem to make occasional errors which are indeed blunders. Their thesis is that the least square estimators can be made robust under blunders by converting the estimation procedure to an iterative, reweighted least squares procedure, where the weight for each observation depends on the residual error and its redundancy number. So, they first find the form of the least-square solution, establish its performance as a baseline reference, put the solution technique in an iterative reweighted form, and, finally, evaluate the performance using non-normal noise, such as slash noise. The least-squares solutions for both the 2-D-2-D and the 3-D-3-D cases are constrained to produce rotation matrices guaranteed to be orthonormal. Yuan [60] presents a general method for determining the 3-D position and orientation of an object relative to a camera based on a 2-D image of known feature points located on the object. The problem is identical to the camera exterior calibration problem. In contrast to the conventional approaches, however, the method described here does not make use of the collinearity condition, i.e. the condition of the pinhole camera and the perspective projection. Instead, the algebraic structure of the problem is fully exploited to arrive at a solution which is independent of the configuration of the feature points. Although the method is applicable to any number of feature points, Yuan says that no more than five points are needed from a numeric standpoint and, typically, three or four points suffice. A necessary condition for the existence of the solution is presented, along with a rigorous proof of the uniqueness of the solution in the case of four coplanar points. He shows with simulation results that in the case of four feature points, non-coplanar configurations generally outperform coplanar feature point configurations, in terms of robustness, in the presence of image noise. In a more recent work, Chen [6] describes a polynomial solution to the pose estimation problem that does not require an a priori estimate of the robot location, using line-to-plane correspondences. He describes the situations in which such a problem arises. In the case of a mobile robot, the lines are the 2-D image features and the planes are the projection planes joining these lines to the 3-D world model features. As do Liu et al. [31], Chen also solves for the rotations first and then for the translations. The crux of the approach is that it converts a problem with three unknowns (the three rotation angles) into one that has two unknowns
by transforming the co-ordinate system into a canonical configuration. The two unknowns are then computed by evaluating the roots of an eighth degree polynomial using an iterative method. Chen also presents closed-form solutions for orthogonal, co-planar and parallel feature configurations. He also derives the necessary and sufficient conditions under which the line-to-plane pose determination problem can be solved.

3. Trajectory Integration and Dead Reckoning
In this section, we consider techniques for estimating the position and pose of a mobile robot by integrating over its trajectory and dead reckoning, i.e. the robot maintains an estimate of its current location and pose at all times and, as it moves along, updates the estimate by dead reckoning. In order to compute an accurate trajectory, the robot detects features from the sensory observations in one of the positions and these are used to form the world model. As the robot moves, these features are again detected, correspondence is established between the new and the old features, and the trajectory of the robot is computed. These techniques do not rely on the existence of landmarks and the robot's ability to identify them. However, to successfully implement such techniques a fundamental problem of environment perception and modeling must be addressed. Indeed, the model of the environment and the location model are the two basic data for position estimation, path planning, and all other navigation tasks involving interaction between the robot and its environment.

3.1. Spatial Representation

Using preloaded maps and absolute referencing systems can be impractical because they constrain the robot's navigation to a limited, static, and structured environment. In this section we survey the various approaches for map making and position estimation using trajectory integration and dead reckoning and evaluate their relative merits. The various approaches are influenced by the environment in which the robot navigates and the type of sensing used. Some approaches try to reason away errors and uncertainties to simplify the map-making process, while others take explicit account of errors and uncertainties using either static error factors or stochastic approaches that use probability distributions to model the errors. Most of the methods deal with an indoor factory or office-type environment made up of walls, corridors, and other man-made obstacles in which the robot navigates. Map-making in an outdoor scenario is a much more complex problem which relies on the existence of landmarks and digital elevation maps of the area. The different methods of mobile robot map-making studied so far are quite varied and differ chiefly in terms of:

- The environment in which the mobile robot is to navigate. The map-making strategies for an indoor office-type robot differ significantly from those of an outdoor terrain autonomous land vehicle.
- The type of world representation (either 2-D or 3-D). Most methods consider a 2-D representation or a floor-map type approach. Since the mobile robot is essentially interested only in obstacle avoidance and path planning, a map of the vacant/occupied areas of the floor should suffice for these tasks.
- The types of sensors used. To a certain extent, the sensing modality affects the mapping strategies used. Typically, all mobile robots use some kind of range sensor. If a passive range sensor such as binocular stereo is used, the map so constructed will usually be sparse and feature-based. On the other hand, a laser range finder gives dense, high-resolution depth estimates, which affect the mapping strategy differently. Sonar-based range finding techniques give less accurate and hence more uncertain depth estimates, so the map-making technique used should have the capability to deal with these uncertain readings.
- The navigational tasks to be accomplished by the robot. Most mobile robots consider tasks such as position estimation, obstacle avoidance, and path planning.
Keeping in view the above differences, map-making approaches can be broadly classified into the following four types: (1) object feature-based methods; (2) graph-based approaches; (3) certainty grid-based approaches; and (4) qualitative methods. These categories are not exacting, since some approaches do not fit into any of the categories and some have properties of more than one approach. However, such a classification may help put things in a better perspective.

3.2. Object Feature-Based Methods
In these methods, object features detected from the sensory observations in one of the robot's positions are used to form the world model. As the robot moves, these features are again detected, and correspondence is established between the new and the old features. Usually the motion of the robot is known to a certain degree of accuracy as given by its position sensors. These motion estimates are then used to predict the new positions of the features in the world model. The prediction is then used as an aid to limit the search space and to establish a correspondence between the detected features and those already in the current world model. A mechanism to consistently update the world model is also provided when new features are detected. One significant advantage of the object feature-based methods is that after the position sensors are used to establish correspondence, the motion parameters of the robot between the old and the new positions can be solved for explicitly. The solution provides a much more accurate estimate of the robot's position. The loop then continues, and the world model is continuously updated. The type of sensing used is typically stereo triangulation or other types of visual sensing [38,39,33,43]. Crowley [8] uses a ring of 24 sonar sensors for a similar paradigm.

Moravec's Cart [38] was one of the first attempts at autonomous mobile robot navigation using a stereo pair of cameras. He defines an interest operator to locate the features in a given image. Essentially, the interest operator picks regions that
are local maxima of a directional variance and uses these to select a relatively uniform scattering of good features over the image. A coarse-to-fine correlation strategy is used to establish correspondence between the features selected by the interest operator in different frames. The Cart uses a unique variable-baseline stereo mechanism called slider stereo. At each pause, the computer slides its camera left to right on a 52 cm track, taking nine pictures at 6.5 cm intervals. A correspondence is established by using a coarse-to-fine correlation operator between the central image and the other eight images, so that the features' distance is triangulated in the nine images. These are then considered as 36 stereo pairings and the estimated (inverse) distance of the feature is recorded in a histogram. The distance to the feature is indicated by the highest peak in the histogram if it crosses a given threshold; otherwise, it is forgotten. Thus, the application of a mildly reliable (correlation) operator is used to make a very reliable distance measurement. Position estimation in the Cart is carried out in exactly the same manner as described before, i.e. the features used to establish correspondence are then used to estimate the motion parameters and, hence, the location. The world model developed by the Cart is a set of these matched object features. The uncertainty and error modeling of the object features used in the Cart is a simple scalar uncertainty measure. This measure was proportional to the distance of the feature from the robot location; the further the feature, the larger the error associated with it, and, hence, the less reliable the measure. Matthies and Shafer [33] show that Moravec's approach is very similar to using a spherical probability distribution of error centered around the object feature. They argue that a 3-D Gaussian distribution is a much more effective way to explicitly deal with stereo triangulation errors and errors due to the image's limited resolution. They detail a method to estimate the 3-D Gaussian error distribution parameters (mean and covariance) from the stereo pair of images. They then present a method to consistently update the robot's position, explicitly taking into account the Gaussian error distribution of the feature points and motion parameters and their error covariances. A Kalman filter approach is used to recursively update the robot position from the detected features and the previously maintained world model. They assume the correspondence problem to be solved, and show by simulation data and experimental results that the Gaussian error model results in a more accurate stereo navigation paradigm. The Cart suffers from the requirement of a huge memory to store the object features, the lack of speed (typically it moves in lurches of 1 meter in 10 to 15 minutes), and errors in position estimation due to insufficient error modeling. However, as one of the first autonomous mobile robots, it performed very well and made clear the various problems associated with autonomous navigation. The CMU Rover dealt with and corrected many of these problems [39]. Faugeras and Ayache [14,13,1] also address the problem of autonomous navigation. They use trinocular stereo to detect object features. The features they use are
3-D line segments. They propose a paradigm to coherently combine visual information obtained at different places to build a 3-D representation of the world. To prevent the system using line segments as primitives from running out of memory, they want their system to "forget intelligently", i.e. if a line segment S is detected at different positions 1, 2, 3, ..., n of the robot as S1, S2, ..., Sn, they want to establish a correspondence between all of these, to form the line segment S from them, and to forget all others. Thus, the end result is a representation of the environment by a number of uncertain 3-D line segments attached to co-ordinate frames and related by an uncertain rigid motion. The measurements are combined in the presence of these uncertainties by using the Extended Kalman Filtering technique. The authors present these ideas, and detail with experimental data the technique for building, registering, and fusing noisy visual maps. A framework presented by Smith and Cheeseman [46] for the representation and estimation of position uncertainty is relevant in this context. They describe a general method for estimating the nominal relationships and expected error (covariance) between coordinate frames representing the relative locations of objects. They introduce the concept of Approximate Transformations (ATs), consisting of an estimated mean of one co-ordinate frame relative to another and an error co-variance matrix that expresses the uncertainty of the estimate. They present two basic operations that allow the estimation of the relationship between any two coordinate frames given another relative transformation linking them. The first, Compounding, allows a chain of ATs to be collapsed (recursively) into a single AT. The final compounded AT has a greater uncertainty than its components. The second operation, Merging, combines information from parallel ATs to produce a single resulting AT with an uncertainty less than either of its components. Crowley [8,9] takes an approach similar to that of Ayache and Faugeras [14]; he also uses a line segment based representation of the free space and Extended Kalman filtering techniques for dealing with error covariances. However, he uses a circular ring of 24 Polaroid ultrasonic sensors, while Faugeras and Ayache use trinocular stereo.

3.3. Graph-Based Approaches

Rodney Brooks [3] was one of the first to suggest the idea of using a relational map, which is rubbery and stretchy, rather than placing observations in a 2-D coordinate system. The key idea is to represent free space as freeways, elongated rectangular regions, which naturally describe a large class of collision-free straight-line motions of objects to be moved. Some places are described as convex regions and called meadows. So a map representation of free space is a graph; nodes of the graph are meadows, and arcs of the graph are freeways. Meadows and freeways are further described with metric and relational position and orientation properties. Brooks also suggests a method for dealing with uncertainties in the position of the robot. He argues that the position uncertainty is a 2-D manifold in a 3-D
space and that dealing with this explicitly makes it mathematically complex. He proposes to use instead an upper bound on the uncertainty which is cylindrical and mathematically easier to handle. Brooks also suggests how the uncertainty in the position estimation can be reduced if landmarks can be detected in a meadow. Miller [37] presents a surface representation for real-world indoor robots equipped with a ranging device such as sonar and robot odometry as sensors. He assumes that the world consists of a flat, open plane on which walls and obstacles are placed. Since the robot is limited to motion on the plane of the floor, the projection of walls and obstacles onto the plane of the floor captures all the relevant world information. The basic unit of the spatial representation system is the map, composed of linked regions. Regions have a local coordinate frame. Walls and obstacles are themselves represented by line segments, whose end point positions are designated by coordinates in the frame of the region. The borders of the regions are marked with labels that specify the adjoining regions. Regions can be of four types, 0-F, 1-F, 2-F, and 3-F, since a floor-dwelling mobile robot has three degrees of freedom, two translational (x and y) and one rotational (orientation θ). A type designation of j-F means that a sensor (here a sonar range sensor with a maximum range of Dmax) can be used to eliminate j degrees of freedom. Regions are made up of a set of edges. Each edge is represented by a pair of end points, whose Cartesian coordinates are specified in the frame of reference of a particular region. The relative positions of features in two different regions cannot be known with great precision. The more 0-F, 1-F, and 2-F regions on the path between the two regions in question, the less the accuracy with which the two regions can be related. It is, however, possible to arrive at an approximate idea of the distance to be traveled between regions by using the lower bounds of the regions. Having set out the mapping scheme, Miller then presents methods for position estimation as a heuristic search paradigm. The type of region in which the robot operates determines the amount of position information that can be calculated. If the robot is in a 0-F region, then the only position information available would be extrapolations from the last known position, based on the robot's ability to do dead reckoning. If the robot is known to be in a region that is 1-F or greater, then position information can be found by taking several sensor readings and conducting a heuristic search over the tree of possible matches between the observations and the edges in the map. Chatila and Laumond [5] present a world modeling and position referencing system on their mobile robot HILARE. They take a multisensor approach, using a laser range finder for measuring depth and optical shaft encoders on the drive wheel axis for trajectory integration. The random errors are modeled as Gaussian distributions and their parameters are determined experimentally. They present a three-layer model consisting of geometric, topological, and semantic levels. In the model construction paradigm, the robot at every instant has:
(1) a current environmental model with geometric, topological, and semantic levels related to an absolute reference frame, (2) knowledge about the attitude and position of the robot, and (3) a robot-centered geometric model of the environment perceived at that point. The central problem is to update the models of (1) using (2) and (3), and to correct the information of (2), if possible.

3.4. Certainty Grid-Based Methods
Moravec and Elfes [11,40] use a grid-based representation for mapping the environment a mobile robot will inhabit. The basic idea is to represent the floor as a rectangular grid and to store the information about the occupancy of different portions of the floor on this grid as probability distributions. A sensor range reading provides information concerning empty and occupied volumes in a cone in front of the sensor. The readings are modeled as probability profiles and are projected onto a rasterized 2-D map where somewhere-occupied and everywhere-empty regions are represented. Range measurements from multiple points of view (taken from multiple sensors on the robot and from the same sensor after the robot moves) are symmetrically integrated into the map. Overlapping empty volumes reinforce each other and serve to condense the range of the occupied volumes. The map definition improves as more readings are added. The final map shows regions that are probably occupied, regions that are probably empty, and unknown areas. The method deals effectively with clutter and can be used for motion planning and extended landmark recognition. The system was tested and implemented on a CMU mobile robot called Neptune. The authors also develop and present a fast algorithm for relating two maps of the same area to determine relative displacement, angle, and goodness of the match. These can then be used to estimate the position and pose of the robot. A measure of the goodness of the match between two maps at a trial displacement and rotation is found by computing the sum of products of corresponding cells in the two maps. An occupied cell falling on an occupied cell contributes a positive increment to the sum, as does an empty cell falling on an empty cell. An empty cell falling on an occupied one reduces the sum, and any comparison involving an unknown value causes neither an increase nor a decrease. Moravec and Elfes then offer more efficient versions of this naive algorithm, which take into account only the occupied cells and also use a hierarchy of reduced-resolution versions of each map. The authors argue that the advantages of the sonar maps are that they: (1) are much denser than stereo maps, (2) require less computation, (3) can be built more quickly, and (4) can be used for position estimation. Of course, the disadvantages of sonar maps are the large uncertainty areas associated with the detected features and the difficulties associated with active sensing.
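The sum-of-products match measure just described can be sketched directly; the cell encoding (occupied = +1, empty = -1, unknown = 0) and the example maps are our own choices, made so that agreeing cells raise the score and disagreeing cells lower it.

```python
# A sketch of the map-matching measure described above: the goodness of the match
# at a trial displacement is the sum of products of corresponding cells, with
# occupied = +1, empty = -1, unknown = 0 (our encoding), so occupied-on-occupied and
# empty-on-empty raise the score, occupied-on-empty lowers it, unknown contributes 0.
import numpy as np

def match_score(map_a, map_b, dx, dy):
    """Score map_b shifted by (dx, dy) cells against map_a."""
    h, w = map_a.shape
    score = 0.0
    for y in range(h):
        for x in range(w):
            ys, xs = y + dy, x + dx
            if 0 <= ys < h and 0 <= xs < w:
                score += map_a[y, x] * map_b[ys, xs]
    return score

a = np.zeros((20, 20)); a[5:8, 5:15] = 1.0; a[10:15, 2:18] = -1.0   # toy certainty grid
b = np.roll(a, shift=(2, 3), axis=(0, 1))                           # same map, displaced
best = max((match_score(a, b, dx, dy), dx, dy)
           for dx in range(-4, 5) for dy in range(-4, 5))
print(best)                            # highest score at dx = 3, dy = 2: the displacement
```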
Moravec [40,41] presents a new Bayesian statistical foundation for the map-making strategies in the certainty grid framework which seems to hold promise. The fundamental formula used is for the two occupancy cases of a cell, o (the cell is occupied) and ō (the cell is empty), with prior likelihoods p(o) and p(ō) and new information M; Bayes' theorem can be expressed as
$$P(o \mid M) = \frac{P(M \mid o)\,p(o)}{P(M \mid o)\,p(o) + P(M \mid \bar{o})\,p(\bar{o})}.$$
The new information, M, occurs in terms of the probability of M in the situation that a cell is or is not occupied, i.e. P(M | o) and P(M | ō), respectively. This inversion of o and M is the key feature of using the Bayesian framework, and it combines independent sources of information about o and M into a single quantity P(o | M). Moravec then elaborates on this principle, derives formulas for the various cases of multiple sensor readings, and presents a Context-Free and a Context-Sensitive method. The former is much faster, but the latter is much more reliable. The former has a linear cost while the latter has a cost proportional to the cube of the volume. These methods are illustrated by simulations. The certainty grid representation also provides an easy framework for fusing information from different sensing modalities, such as sonar, stereo, thermal, proximity, and contact sensors. Matthies and Elfes [34] present several approaches and results in integrating sonar and stereo in a certainty grid.
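A minimal sketch of this Bayesian cell update, with an invented two-value sensor model rather than Moravec's actual formulas for multiple readings, might look as follows.

def bayes_update(prior_occ, p_m_given_occ, p_m_given_emp):
    """Update the occupancy probability of one cell after a reading M.

    prior_occ      : prior probability p(o) that the cell is occupied
    p_m_given_occ  : sensor model term P(M | o)
    p_m_given_emp  : sensor model term P(M | not o)
    Returns the posterior P(o | M) given by Bayes' theorem.
    """
    num = p_m_given_occ * prior_occ
    den = num + p_m_given_emp * (1.0 - prior_occ)
    return num / den

# Example: a cell with unknown prior 0.5 and a range reading that is
# four times more likely if the cell is occupied than if it is empty.
posterior = bayes_update(0.5, 0.8, 0.2)   # -> 0.8
# Readings from further viewpoints can be folded in by reusing the
# posterior as the next prior, assuming conditionally independent readings.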
3.5. Qualitative Methods
Levitt et al. [30] and Kuipers et al. [26] argue that the existing robot navigation techniques use absolute range information and, hence, tend to be brittle, to accumulate error, and to use little or no perceptual information. They propose qualitative methods which do not depend as much upon metrical information as on perceptual information to build a topological map. Levitt et al. [30] describe a formal theory that depends on visual landmark recognition for the representation of environmental locations. They encode perceptual knowledge in structures called viewframes. Paths in the real world are represented as a sequence of sets of landmarks, viewframes, and other distinctive visual events. Approximate headings are computed between viewframes that share lines of sight to common landmarks. Range-free, topological descriptions called orientation regions are rigorously abstracted from viewframes to yield a coordinate-free model of the visual landmark memory that can also be used for navigation and guidance. With this approach, a robot can opportunistically observe and execute visually cued short-cuts. Map and metric information are not required but, if available, are handled in a uniform representation with the qualitative navigation technique. Most of the examples they present are of simulated outdoor scenes using a visual sensor. Kuipers et al. [26] present an approach similar in spirit for an indoor robot with a sonar sensor. They draw a parallel from cognitive science and argue that a powerful description of the environment is a topological description. Their topological
description consists of a set of nodes and arcs. The nodes represent distinctive places and the arcs represent travel edges connecting them. A distinctive place is defined as the local maximum of some measure of distinctiveness appropriate to its immediate neighborhood and is found by a hill-climbing search. Local travel edges are described in terms of the local control strategies required for travel. How to find the distinctive places and how to follow edges is the procedural knowledge which the robot learns dynamically during the exploration stage and which guides the robot in the navigation stage. An accurate topological model is created by linking places and edges, and allows metrical information to be accumulated with reduced vulnerability to metrical errors. The authors describe a simulated robot called NX to illustrate the technique. The position estimation strategies that use trajectory integration and dead reckoning thus rely on the robot's ability to sense the environment, to build a representation of it, and to use this representation effectively and efficiently. Each of the approaches detailed above has relative merits and works well in different environments. The sensing modalities used significantly affect the map-making strategy. Error and uncertainty analyses play an important role in accurate position estimation and map building. It is important to take explicit account of the uncertainties; modeling the errors by probability distributions and using Kalman filtering techniques are good ways to deal with these errors explicitly. Qualitative methods propose to overcome the brittleness of the traditional approaches by relying on perceptual techniques. In general, a 2-D floor map of the environment is good enough for most navigation problems such as path planning and position estimation; such a map conserves memory and is easier to build. Certainty grid-based methods are novel and use Bayesian probability techniques to advantage in combining information from various viewpoints consistently.

4. Techniques Using a Standard Pattern
Another method of estimating the position and pose of the mobile robot accurately is to place standard patterns in known locations in the environment. Once the robot detects these patterns, the robot's position can be estimated from the known location of the pattern and its geometry. The pattern itself is designed to yield a wealth of geometric information when transformed under the perspective projection, so that ambiguous interpretations are avoided and a minimum of a priori knowledge about the camera is required. These methods are particularly useful in those applications where a high degree of accuracy in the positioning of the robot is required only after it is near a particular workstation. Simple trajectory integration systems could be used to locate the robot near the workstation. Then, by identifying the mark (standard pattern) located near the workstation, the robot can be positioned more accurately. Researchers have used different kinds of patterns or marks, and the geometry of the method and the associated techniques for position estimation vary accordingly.
Fig. 6. Determination of point Q by r and ψ: a circular arc of constant distance r from the origin and a circular arc of constant angular width ψ subtended by the segment AB.
Fukui [18] uses a square mark rotated by 45 degrees. As the robot moves on the floor, Fukui determines the position of the robot by two co-ordinates, r and ρ, where r is the distance between the standard point and the robot and ρ is the angle between the r vector and the normal line to the mark. Figure 6 shows this situation. Two circles are drawn: one with its center at the standard point and with a radius of r, the other being the circular arc of constant visual angle ψ subtended by the segment AB on the mark. Generally these two circles intersect at two points, and it is easy to judge in another way which is the real point. However, if ψ is a right angle, the two circles become the same and the position cannot be determined. The height of the camera is adjusted to that of the center of the square mark ACBD, and the mark is imaged. If θ is the visual angle made by viewing CD in the square and ψ is that made by viewing AB, then the robot position in polar co-ordinates (ρ, r) can be determined from the following relations:
$$r = \frac{w\,(1 + \cos\theta)}{\sin\theta},$$
$$\rho = \pm\arctan\frac{\sqrt{(r^{2}+w^{2})^{2}\cos^{2}\psi - (r^{2}-w^{2})^{2}}}{(r^{2}-w^{2})\,\sin\psi},$$
where AB is equal to CD, which equals 2w in length, and r ≠ w (see Fig. 7). To know the sign of ρ, Fukui measures the two angles in the image which correspond to ∠CAD and ∠CBD, and decides that if ∠CBD ≤ ∠CAD then ρ ≤ 0, or else ρ > 0. The angles ψ and θ are determined as follows. Using all the data of the image and the method of least squares, the equations of the lines AC, AD, CB, and DB are determined on the image. Then the four points A, B, C, D are determined as the points of intersection of these lines. Next, assuming that ψ is proportional to the corresponding length AB on the image and
Fig. 7. Diagram to measure the distance from the origin (rhombus mark ACBD).
that θ is proportional to CD, Fukui determines the proportionality coefficients that can be used to convert the measured lengths into the desired angles. Fukui also outlines image processing techniques to extract the lines of the pattern from the images. In addition, he presents experimental results to determine the camera position using this method and discusses the effects of errors in the measurement of the angles θ and ψ on the position (ρ, r).

Courtney, Magee, and Aggarwal [7] use the same mark as Fukui but relax the constraint of having the lens center at the same height as the mark center by partitioning the problem into two planes. Each plane passes through the lens center and either the vertical or the horizontal diagonal. Their results initially yield two equations in three unknowns, which must be further constrained by adding a second mark at a known height above the original mark or by assuming that the height of the camera relative to the mark is known. Since adding the second mark forces the solving of a system of six nonlinear equations, they opt for the latter solution, which involves straightforward substitution.

Magee and Aggarwal [32] consider the use of a standard mark which would always directly produce at least one of the three position parameters (distance, elevation, or azimuth) and whose geometric properties would be such that its basic shape would be unchanged when its center is viewed along the optical axis. A sphere is such an object, and its projection is always a circle whose radius may be used to determine the distance. On the other hand, an unmarked sphere produces no information regarding the orientation, and so horizontal and vertical great circles are added to the sphere for computing the elevation and azimuth. The resulting self-locator system is mathematically quite simple. The preprocessing stage requires that four values be determined. These are the center and radius of the sphere's projected circle and the co-ordinates of the points on the projections of the great circles that are closest to the center of the sphere's outline. The three position
Fig. 8. The robot locator sphere.
Fig. 9. Geometry for finding the distance to the center of the sphere.
estimation parameters used are: (1) the distance D of the lens center from the center of the sphere, (2) the elevation angle φ of the lens center above the horizontal great circle, and (3) the azimuth angle θ of the lens center with respect to the plane of the vertical great circle. The value of D can be computed from the relation
$$D = \frac{f R}{r},$$
where f is the focal length of the camera, R is the radius of the sphere, and r is the radius of the circular projection of the sphere on the image plane (see Figs. 8 and 9 for details). Similarly, the authors give relations to determine the azimuth and elevation angles from the projections of the great circles. The preprocessing used to extract these primitives from the images is also discussed and experimental results in estimating the position are shown. From the error analysis presented, the authors show that the errors in the computed distance increase as the camera is moved farther from the sphere and the errors in the computed angles increase as their respective great circles approach the edge of the sphere. This method is robust as long as the primary features are not lost in the sphere's shadow.

Drake et al. [10] present a method of estimating the position and pose of a mobile robot using a Navigation Line, for use in factory environments. The navigation line is a long line with parallel edges on the floor that does not intersect other lines.
Fig. 10. The sensor geometry.
Fig. 11. Mobile robot with −θ orientation angle error and −x0 x-shift position error.
Fig. 12. Global coordinates of the navigation line.
Once the line is imaged and detected by the robot, the position of the robot with respect to the line can be easily computed. The geometry used by the authors, illustrated in Figs. 10, 11, and 12, explains the method. θ is the pan angle and φ is the tilt angle of the sensor. Two coordinate systems are shown: the unprimed coordinates (x, y, z) represent the global coordinate system and the primed coordinates (x', z') represent the coordinate system of the image plane. In Fig. 10, the gimbal center coincides with the focal point and is centered on the xy plane, and z = 0 is defined as the ground plane so that the sensor is at height z0. The authors also assume that the sensor is centered on the navigation line so that the pan angle θ = 0. Since the navigation line is coincident with the y-axis, the line has the co-ordinates shown in Fig. 11. The robot's position may be described by two parameters: the lateral position along the x-axis between the sensor and the navigation line (the x-shift), denoted by x0, and the angle between the robot's orientation vector (the direction of travel) and the navigation line, denoted by θ. The authors develop a relation for these two parameters in terms of the focal length f of the camera and the image plane co-ordinates x' and z' of the edges of the line, given below:
$$x = x_{0} - \frac{z_{0}\,x'\cos\theta + (z'\sin\phi - f\cos\phi)\,z_{0}\sin\theta}{z'\cos\phi + f\sin\phi}. \qquad (4.4)$$
While Eq. (4.4) is only one equation in two unknowns, by using a number of (x', z') points along the line and using numerical techniques the values of x0 and θ can be solved for quite accurately (a small numerical sketch of this step is given at the end of this section). The authors also present a specialized operator to detect edges in an image that occur at a specific angle, and use the operator to detect the edges of the navigation line. In addition, a Hough transform is used to completely segment the navigation line from the image. The authors also present experimental results to illustrate the robustness of the method.

Kabuka and Arenas [23] consider the problem that the robot might end up in a position that does not allow it to view the standard pattern. To alleviate the problem, they suggest using multiple patterns in the navigation environment. It is assumed that the location of each pattern in some standard world coordinate system is known. They associate a unique code with each pattern that enables the robot to identify and distinguish that pattern from all the others. The pattern consists of two parts: a relative displacement part and an identification code. The relative displacement pattern is used, as in the previous methods, to obtain the relative position of the viewing point with respect to the pattern by analysis of the particular geometric characteristics of its projection onto the image plane. The identification codes serve two purposes: they provide a unique code to discern the viewed pattern from other patterns in the environment, and they provide an aid to scan for the pattern in a minimal amount of time. The displacement pattern used by the authors is a circle, and the identification codes used are similar to bar codes. The authors present a detailed analysis of the application of this method and study the effects of errors in the input parameters on the position estimation.
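Returning to relation (4.4) as reconstructed above, the following sketch illustrates the numerical step of recovering x0 and θ from several detected edge points. The function, the brute-force search over θ and the parameter values are illustrative assumptions rather than the procedure of Drake et al.; for a fixed θ the relation is linear in x0, so the sketch keeps the θ that makes the per-point x0 estimates most consistent.

import numpy as np

def estimate_pose(points, x_edge, z0, f, phi):
    """Estimate the x-shift x0 and orientation theta of the robot.

    points : N x 2 array of image coordinates (x', z') lying on a
             navigation-line edge whose global x-coordinate is x_edge
    z0     : sensor height above the floor
    f      : focal length of the camera
    phi    : tilt angle of the sensor
    """
    xs, zs = points[:, 0], points[:, 1]
    best = None
    for theta in np.linspace(-np.pi / 4, np.pi / 4, 721):
        term = (z0 * xs * np.cos(theta)
                + (zs * np.sin(phi) - f * np.cos(phi)) * z0 * np.sin(theta))
        term /= zs * np.cos(phi) + f * np.sin(phi)
        x0_candidates = x_edge + term        # from x_edge = x0 - term
        spread = np.std(x0_candidates)       # consistency of the estimates
        if best is None or spread < best[0]:
            best = (spread, x0_candidates.mean(), theta)
    return best[1], best[2]                  # (x0, theta)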
5. Model-Based Approaches
Some researchers consider the problem of the position estimation of a mobile robot when a priori information is available about the environment in which the robot is to navigate. This could be provided in terms of a CAD model of the building (or a floor map, etc.) in the case of an indoor mobile robot, or a Digital Elevation Map (DEM) in the case of an outdoor robot. In these cases, the position estimation techniques used take on a different flavor. The basic idea is, of course, to sense the environment using onboard sensors on the robot and to match these sensory observations to the preloaded world model to arrive at an estimate of the position and pose of the robot with a reduced uncertainty. One problem with such an approach is that the sensor readings and the world model may be in different forms. For instance, given a CAD model of the building and a visual camera, the problem is to match the 3-D descriptions in the CAD model to the 2-D visual images. This is the problem addressed by Kak et al. [24]. They present PSEIKI, a system that uses evidential reasoning in a hierarchical framework for image interpretation. They discuss how the PSEIKI system can be used for mobile robot self-location and how their approach is utilized by the navigational system of the autonomous mobile robot PETER. The robot's position encoders are used to maintain an approximate estimate of its position and heading at each point. However, to account for the quantization effects of the encoders and the slippage of the wheels, a visual sensor in conjunction with a CAD model of the building is used to derive a more accurate estimate of the robot's position and pose. The basic idea is that the approximate position from the encoders is utilized to generate, from the CAD model, an estimated visual scene that would be seen. This scene is then matched against the actual scene viewed by the camera. Once the matches are established between the features of the two images (expected and actual), the position of the robot can be estimated with a reduced uncertainty.

Tsubouchi and Yuta [59] discuss the position estimation techniques used in their YAMABICO robot, which use a color camera and a map of the building in which the robot navigates. The authors propose a vision system using image and map information with consideration of real-time requirements. This system consists of three operations. The first operation is the abstraction of a specified image from a TV camera: the image is processed, and highly abstracted information, called the real perspective information, is generated. The second operation is the generation of the estimated perspective information by coordinate transformation and map information, using information about the robot's position and direction. The third operation is the establishment of correspondence between the two perspectives. The authors use color images in their real perspective views, arguing that color is relatively invariant under changes of lightness and shadow. From the color images, the authors extract regions of similar color and fit trapezoids to these regions. From the map information, trapezoids are also extracted and, in the matching process, these trapezoids from the two sources are used as matching primitives.
The authors provide a method of representing the map information efficiently and also discuss techniques for matching trapezoids. Real image data are provided as examples. As pointed out earlier, one of the key issues involved in determining the position of a mobile robot given a world model is to establish a correspondence between the world model (map) and the sensor data (image). Once this correspondence is established, the position of the robot in the environment can be determined easily as a coordinate transformation. Indeed, this problem of image/map correspondence is of fundamental importance not only to the mobile robot position estimation problem, but also to many other computer vision problems, such as object recognition, pose estimation, airborne surveillance and reconnaissance, etc. Other work addressing this image/map correspondence problem is described in [30,24,36,16,17,44].

Freeman and Morse [17] consider the problem of searching a contour map for a given terrain elevation profile. Such a problem is encountered, for example, when locating the ground track of an aircraft (the projection of the flight path on the ground) given the elevation of the terrain below the aircraft during the flight. The authors describe a solution that takes advantage of the topological properties of the contour map. A graph of the map topology is used to identify all the possible contour lines that would have been intersected by the ground track. Thus, the topological constraints of the terrain elevation profile and the geometric constraints of the flight path are used in estimating the location of the elevation profile in the given map. Ernst and Flinchbaugh [16] consider the problem of determining the correspondence between maps and terrain images in low altitude airborne scenarios. They assume that an initial estimate of the three-dimensional position is available. Their approach consists of partially matching the detected and expected curves in the image plane. Expected curves are generated from a map using the estimate of the sensor position, and the simulated curves are matched with the curves in the image plane. Rodriguez and Aggarwal [44] consider the problem of matching aerial images to a Digital Elevation Map (DEM). They use a sequence of aerial images to perform stereo analysis on successive images and recover an elevation map. Then they present a method to match the recovered elevation map to the given DEM and thereby estimate the position and pose of the airborne sensor.

Talluri and Aggarwal [50-52] describe a position estimation technique for autonomous mobile robots navigating in an outdoor, mountainous environment, equipped with a visual camera that can be panned and tilted. A DEM of the area in which the robot navigates is provided to the robot. The robot is also assumed to be equipped with a compass and an altimeter to measure the altitude. Typical applications could be that of an autonomous land vehicle, such as a planetary rover. The approach presented formulates the position estimation problem as a constrained search problem. The authors also follow the idea of computing the expected image and comparing it to the actual image. In particular, the main idea of their work is to hypothesize a robot location, render the model (DEM) data, extract the Horizon Line Contour (HLC) and compare it to the HLC extracted from the camera images. In order
to reduce the complexity of the search, the authors propose a two-stage search strategy. First, all possible camera locations are checked by comparing the predicted (from the DEM) HLC height at the center of the image with the height computed from the camera image. This is done in the four geographic directions (N, S, E, and W). Only the candidate locations with the HLC height within some threshold of the actual height remain. Second, for each remaining candidate location the terrain image is rendered and the complete HLC is extracted and matched with the actual HLC (from the camera image). Examples of the position estimation strategy using real terrain data and simulated images are presented. The algorithm is made robust to errors in the imaging process by accounting for the worst-case errors. In separate work, Talluri and Aggarwal [53-57] also consider the navigational aspects of an autonomous mobile robot navigating in an outdoor, urban environment consisting of polyhedral buildings. The 3-D descriptions of the rooftops of the buildings are assumed to be given as a world model and the robot is assumed to be equipped with a visual camera. The position and pose are estimated by establishing a correspondence between the lines that constitute the rooftops of the buildings (world model features) and their images. A tree search is used to establish a set of consistent correspondences. The tree is pruned using the geometric constraints between the world model features and their images. To effectively capture the geometric relations between the world model features with respect to their visibility from various positions of the robot, the free space of the robot is partitioned into a set of distinct, non-overlapping regions called Edge Visibility Regions (EVRs). Associated with each EVR is a list of the world model features that are visible in this region, called the Visibility List (VL). Also stored for each entry in the VL is the range of orientations of the robot for which this feature is visible. The use of these EVRs in pruning the tree while searching for a consistent set of correspondences between the world model and the image features is discussed, and an algorithm for forming such an EVR description of the environment from the given world model is presented. The authors also derive worst-case bounds on the maximum number of EVRs that will be generated for a given world model and show that this number is polynomial in the number of world model features. The uses of this EVR description in the path-planning tasks of the robot are also outlined. Results of the position estimation are provided using a model of a real airport scene.

6. Conclusions
In this paper we have illustrated the various aspects of the problem of estimating the position and pose of a mobile robot and provided a comprehensive review of the various methods and techniques used. These techniques vary significantly depending on the known conditions of the navigation environment and the type of sensors with which the robot is equipped. Landmark-based methods are suitable for robots with the ability to identify the landmarks and measure the range/attitude
to them. This usually requires the robot to have a database of landmarks occurring in the environment and an approximate location from which to start searching the database. Computing the exterior orientation parameters in the camera calibration problem is dealt with in the photogrammetry literature. This problem is quite similar to the position estimation problem of a mobile robot using landmarks. Some of the techniques used in photogrammetry can thus be modified and applied in localizing the robot's position and orientation. Methods using trajectory integration and dead reckoning usually require the robot to address the problem of environment perception and modeling. Various methods of modeling the environment and forming a map of it for navigation tasks are also reviewed in this paper. The position estimation techniques used depend on the map-making strategy and representation used. Techniques which structure the environment by placing a standard reference pattern at known locations are particularly useful in those applications where a high degree of accuracy in positioning the robot is required only after it nears a particular workstation. Simple trajectory integration techniques could be used to locate the robot near the workstation, and then the standard pattern can be used. Model-based methods are best applied when a priori information about the robot's environment is available in the form of a world model. The problem to be solved in this instance is to match the model and the sensor observations, which may be in different forms.
Acknowledgments
This research was supported by the Army Research Office under contract DAAL03-91-G-0050.
References
[1] N. Ayache and O. D. Faugeras, Building a consistent 3-D representation of a mobile robot environment by combining multiple stereo views, in Proc. 10th IJCAI, 1987, 808-810.
[2] B. C. Bloom, Use of landmarks for mobile robot navigation, in SPIE Proc., Intelligent Robots and Computer Vision, Vol. 579, 1985, 351-355.
[3] R. A. Brooks, Visual map making for a mobile robot, in Proc. IEEE Int. Conf. on Robotics and Automation, St. Louis, MO, 1985, 824-829.
[4] M. Case, Single landmark navigation by mobile robots, in SPIE Proc., Mobile Robots, Vol. 727, Oct. 1986, 231-238.
[5] R. Chatila and J.-P. Laumond, Position referencing and consistent world modeling
for mobile robots, in Proc. IEEE Int. Conf. on Robotics and Automation, St. Louis, MO, 1985, 138-145.
[6] H. H. Chen, Pose determination from line-to-plane correspondences: Existence condition and closed-form solutions, IEEE Trans. Pattern Anal. Mach. Intell. 13, 6 (1991) 530-541.
[7] J. Courtney, M. Magee and J. K. Aggarwal, Robot guidance using computer vision, Pattern Recogn. 17, 6 (1984) 585-592.
[8] J. L. Crowley, Dynamic world modeling for an intelligent mobile robot using a rotating
ultra-sonic ranging sensor, in Proc. IEEE Int. Conf. on Robotics and Automation, St. Louis, MO, 1985, 128-135.
[9] J. L. Crowley, World modeling and position estimation for a mobile robot using ultrasonic ranging, in Proc. IEEE Int. Conf. on Robotics and Automation, Scottsdale, May 1989.
[10] K. C. Drake, E. S. McVey and R. M. Inigo, Experimental position and ranging results for a mobile robot, IEEE Trans. Robotics and Automation 3, 1 (1987) 31-42.
[11] A. Elfes, Sonar based real-world mapping and navigation, IEEE Trans. Robotics and Automation 3, 3 (1987) 249-265.
[12] I. M. El Hassan, Analytical techniques for use with reconnaissance from photographs, Photogrammetric Eng. Remote Sensing 47, 12 (1981) 1733-1738.
[13] O. D. Faugeras, N. Ayache and B. Faverjon, Building visual maps by combining noisy stereo measurements, in Proc. IEEE Conf. on Robotics and Automation, San Francisco, CA, 1986, 1433-1438.
[14] N. Ayache and O. Faugeras, Maintaining representations of the environment of a mobile robot, IEEE Trans. Robotics and Automation 5, 6 (1989) 804-819.
[15] M. A. Fischler and R. C. Bolles, Random sample consensus: A paradigm for model fitting with application to image analysis and automated cartography, Commun. ACM 24, 6 (1981) 726-740.
[16] M. D. Ernst and B. E. Flinchbaugh, Image/map correspondence using curve matching, Texas Instruments Technical Report, CSC-SIUL-89-12, 1989.
[17] H. Freeman and S. P. Morse, On searching a contour map for a given terrain elevation profile, Journal of the Franklin Institute 284 (1967) 1-25.
[18] I. Fukui, TV image processing to determine the position of a robot vehicle, Pattern Recogn. 14, 1-6 (1981) 101-109.
[19] S. Ganapathy, Decomposition of transformation matrices for robot vision, in Proc. 1st IEEE Int. Conf. on Robotics, Atlanta, GA, Mar. 1984, 130-138.
[20] R. M. Haralick et al., Pose estimation from corresponding point data, IEEE Trans. Syst. Man Cybern. 19, 6 (1989) 1426-1445.
[21] R. Horaud, B. Conio and O. Leboulleux, An analytical solution to the perspective 4-point problem, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, CVPR '89, San Diego, CA, Jun. 1989, 500-507.
[22] B. K. P. Horn, Relative orientation, in Proc. Image Understanding Workshop, Vol. 2, 1988, 826-837.
[23] M. R. Kabuka and A. E. Arenas, Position verification of a mobile robot using a standard pattern, IEEE Trans. Robotics and Automation 3, 6 (1987) 505-516.
[24] A. Kak, K. Andress, C. Lopez-Abadia and M. S. Carroll, Hierarchical evidence accumulation in the PSEIKI system and experiments in model-driven mobile robot navigation, in Uncertainty in Artificial Intelligence, Vol. 5 (Elsevier Science Publishers B.V., North-Holland, 1990) 353-369.
[25] E. Krotkov, Mobile robot localization using a single image, in Proc. IEEE Int. Conf. on Robotics and Automation, Scottsdale, May 1989, 978-983.
[26] B. J. Kuipers and Y. T. Byun, A robust qualitative method for robot spatial learning, in AAAI-88, The Seventh Nat. Conf. on Artificial Intelligence, St. Paul/Minneapolis, MN, 1988, 774-779.
[27] R. Kumar, Determination of the camera location and orientation, in Proc. DARPA Image Understanding Workshop, 1988, 870-881.
[28] R. Kumar and A. Hanson, Robust estimation of the camera location and orientation from noisy data having outliers, in Proc. Workshop on Interpretation of 3-D Scenes, Austin, TX, Nov. 1989, 52-60.
[29] J. Lessard and D. Laurendeau, Estimation of the position of a robot using computer vision for a live-line maintenance task, in Proc. IEEE Int. Conf. on Robotics and Automation, Raleigh, NC, 1987, 1203-1208.
[30] T. S. Levitt, D. T. Lawton, D. M. Chelberg and P. C. Nelson, Qualitative navigation, in Proc. DARPA Image Understanding Workshop, 1987, 447-465.
[31] Y. Liu, T. Huang and O. Faugeras, Determination of the camera location from 2-D to 3-D line and point correspondences, IEEE Trans. Pattern Anal. Mach. Intell. 12, 1 (1990) 28-37.
[32] M. J. Magee and J. K. Aggarwal, Determining the position of a robot using a single calibration object, in Proc. 1st IEEE Int. Conf. on Robotics, Atlanta, GA, Mar. 1984, 140-149.
[33] L. Matthies and S. A. Shafer, Error modeling in stereo navigation, IEEE Trans. Robotics and Automation 3 (1987) 239-248.
[34] L. Matthies and A. Elfes, Integration of sonar and stereo range data using a grid based representation, in Proc. IEEE Int. Conf. on Robotics and Automation, Philadelphia, PA, Apr. 1988, 727-733.
[35] C. D. McGillem and T. S. Rappaport, Infra-red location system for navigation of autonomous vehicles, in Proc. IEEE Int. Conf. on Robotics and Automation, Philadelphia, PA, Apr. 1988, 1236-1238.
[36] G. Medioni and R. Nevatia, Matching images using linear features, IEEE Trans. Pattern Anal. Mach. Intell. 6, 6 (1984) 675-685.
[37] D. Miller, A spatial representation system for mobile robots, in Proc. IEEE Int. Conf. on Robotics and Automation, St. Louis, MO, 1985, 122-127.
[38] H. P. Moravec, Robot Rover Visual Navigation (UMI Research Press, Ann Arbor, MI, 1981).
[39] H. P. Moravec, The Stanford Cart and the CMU Rover, Proc. IEEE 71, 7 (1983) 872-884.
[40] H. P. Moravec, Sensor fusion in certainty grids for mobile robots, AI Mag. 9, 2 (1988) 61-74.
[41] H. P. Moravec and D. W. Cho, A Bayesian method for certainty grids, in AAAI Spring Symposium Series on Mobile Robot Navigation, Stanford, CA, Apr. 1989.
[42] H. Nasr and B. Bhanu, Landmark recognition system for autonomous mobile robots, in Proc. IEEE Int. Conf. on Robotics and Automation, Philadelphia, PA, Apr. 1988, 1218-1223.
[43] A. Robert de Saint Vincent, A 3-D perception system for the mobile robot HILARE, in Proc. IEEE Conf. on Robotics and Automation, San Francisco, CA, 1986, 1105-1111.
[44] J. J. Rodriguez and J. K. Aggarwal, Matching aerial images to 3-D terrain maps, IEEE Trans. Pattern Anal. Mach. Intell. 12, 12 (1990) 1138-1149.
[45] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection (John Wiley and Sons, NY, 1987).
[46] R. C. Smith and P. Cheeseman, On the representation and estimation of spatial uncertainty, Int. J. Rob. Res. 5, 4 (1987) 56-58.
[47] K. Sugihara, Some location problems for robot navigation using a single camera, Comput. Vision Graph. Image Process. 42, 1 (1988) 112-129.
[48] K. Sugihara, Location of a robot using sparse visual information, in Robert Bolles and Bernard Roth (eds.), Robotics Research: The Fourth International Symposium (MIT Press, 1987) 319-326.
[49] W. Szczepanski, Die Lösungsvorschläge für den räumlichen Rückwärtseinschnitt, Deutsche Geodätische Kommission, Reihe C: Dissertationen, Heft Nr., 1958, 1-44.
[50] R. Talluri and J. K. Aggarwal, A position estimation technique for a mobile robot in an unstructured environment, in Proc. IEEE Workshop on Intelligent Robots and Systems, IROS '90, Tsuchiura, Japan, Jul. 1990, 159-166.
[51] R. Talluri and J. K. Aggarwal, A positional estimation technique for an autonomous land vehicle in an unstructured environment, in Proc. AIAA/NASA Int. Symp. on Artificial Intelligence and Robotics Applications in Space, ISAIRAS '90, Kobe, Japan, Nov. 1990, 135-138.
[52] R. Talluri and J. K. Aggarwal, Position estimation for an autonomous mobile robot in an outdoor environment, IEEE Trans. Robotics and Automation 8, 5 (1992) 573-584.
[53] R. Talluri and J. K. Aggarwal, Edge visibility regions: a new representation of the environment of a mobile robot, in Proc. IAPR Workshop on Machine Vision Applications, MVA '90, Tokyo, Japan, Nov. 1990, 375-380.
[54] R. Talluri and J. K. Aggarwal, Positional estimation of a mobile robot using edge visibility regions, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, CVPR '91, Hawaii, Jun. 1991, 714-715.
[55] R. Talluri and J. K. Aggarwal, Positional estimation of a mobile robot using constrained search, in Proc. IEEE Workshop on Intelligent Robots and Systems, IROS '91, Osaka, Japan, Nov. 1991.
[56] R. Talluri and J. K. Aggarwal, Transform clustering for model-image feature correspondence, in Proc. IAPR Workshop on Machine Vision Applications, MVA '92, Tokyo, Japan, Dec. 1992, 579-582.
[57] R. Talluri and J. K. Aggarwal, Autonomous navigation in cluttered outdoor environments using geometric visibility constraints, in Proc. Int. Conf. on Intelligent Autonomous Systems, IAS-3, Pittsburgh, PA, Feb. 1993.
[58] R. Y. Tsai, A versatile camera calibration technique for high accuracy 3-D machine vision metrology using off-the-shelf TV cameras and lenses, IEEE Trans. Robotics and Automation 3, 4 (1987) 323-344.
[59] T. Tsubouchi and S. Yuta, Map assisted vision system of mobile robots for reckoning in a building environment, in Proc. IEEE Int. Conf. on Robotics and Automation, Raleigh, NC, 1987, 1978-1984.
[60] J. S.-C. Yuan, A general photogrammetric method for determining object position and orientation, IEEE Trans. Robotics and Automation 5, 2 (1989) 129-142.
[61] P. R. Wolf, Elements of Photogrammetry (McGraw Hill, New York, 1974).
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 797-815
Eds. C. H. Chen, L. F. Pau and P. S. P. Wang
© 1998 World Scientific Publishing Company
CHAPTER 4.5
COMPUTER VISION IN POSTAL AUTOMATION*
G. GARIBOTTO and C. SCAGLIOLA
Elsag Bailey, a Finmeccanica Company, R&D Department, Genova, Italy
E-mail: giovanni.garibotto@elsag.it

The objective of this chapter is to provide a critical analysis of Computer Vision within the context of Postal Automation services. The main functional requirements of this application field are briefly reviewed, as well as the involved Vision functions, which are considered here in a broad sense, including Pattern Recognition, Image Processing and understanding, Signal Processing and Robot Vision. New trends as well as new services emerging in Postal Automation are also discussed, in an attempt to highlight the expected impact on the development of Computer Vision technology. The aim of the chapter is also to refer to the most relevant achievements of Computer Vision as well as to discuss why other promising techniques did not succeed, in spite of the advanced results obtained at prototypical level in laboratory experiments. The ultimate goal is to provide a contribution to stimulate the basic and applied research efforts in this important field of industrial automation and possibly support the recent initiatives of technology transfer from research to industry.

Keywords: Mail processing, character recognition, image processing, material handling, electronic reading systems.
1. Introduction
Mail sorting and postal automation has always represented an important area of application for Image Processing and Pattern Recognition techniques. Since the early developments in the first half of this century, postal mechanisation grew considerably in the sixties and seventies, pushed primarily by the initiatives of the different national postal administrations. Around the middle of the seventies, the escalating use of faxes and data transfer, and more recently of e-mail, led to predictions that within 20 years relatively few people would communicate by letter. In spite of such predictions mail volume grows steadily, reaching levels of about 5.6 billion pieces a year in Italy in 1994 [1], 10.3 billion in France in 1994 [2], 10.6 billion in Canada in 1992/1993 [3], and 181 billion in the USA in 1995 [4]. However, the scenario of postal services is rapidly changing all around the world, because of the gradual transformation of national post institutions into private companies looking for service quality and efficiency. On the other hand,

*The paper has been partially supported by ECVnet, the European Computer Vision network, an ESPRIT Network of Excellence.
the telematics revolution enables software developers to offer alternate services in competition with traditional providers. Finally, global market rules force the break-up of national monopolies in communication and mail services, also encouraging information exchange among all providers. In conclusion, the new situation is such that Postal Administrations are hard-pressed by strong competition. In order to meet the new challenges while remaining profitable or at least avoiding losses, and at the same time ensuring the continuity of service to small, often rural, post offices, Postal Administrations move essentially along two lines:
• To improve their service through a re-engineering process of mail handling;
• To introduce new services to adapt their operations to the changing needs of the users.
Many of the foreseen improvements and new services are based on Image Processing and Computer Vision functions, as will be described later in this report. The following section is devoted to the description of the present situation and trends in postal mechanisation and mail processing, while the next one describes new developments that are under study to improve the mechanisation process. Section 4 points out new image-based functions that are needed for a further improvement of the postal service and/or for introducing new services to the customers, and finally some conclusions are drawn.
2. Description of the Industrial Sector and the Current Trend: Postal Mechanisation and Mail Processing
Mail handling is a very labour intensive process and labour costs have been increasing during the last three decades. In addition to the cost factor, the knowledge level required for the sorting process is quite considerable. Mail has to be sorted for a large number of destinations. In the US, the national delivery network reaches nearly 128 million addresses [4]. For important destinations like large cities, direct bundles are formed, but for small villages mail is combined into bundles and dispatched to regional sorting centres for further inward sorting. The policy of most Postal Administrations is to introduce new postcodes containing information which can be used for the entire mail-handling process up to the final delivery point. Sorting is usually performed by machines that read a barcode as an identifier of the mail piece destination. The barcode is impressed on the envelope by an "encoding" function, performed either automatically by a postal OCR machine or by an employee through a videocoding station. Traditionally, encoding mirrors sorting: a first encoding step identifies the destination city for outward sorting, by reading the classical postcode. The second encoding step identifies the final delivery point for inward sorting by reading the new and complete postcode (where it exists and
when the user has written it on the address). Otherwise the full address must be read, i.e. street name and number and possibly apartment number, or Post Office Box number, or the name of a large customer, like a bank or a company. In order to reduce the cost of this encoding function, the tendency of Postal Administrations is to gradually increase the percentage of mail that is encoded automatically, and to perform this operation only once, i.e. to encode mail to the destination point directly in the first sorting centre. This operation, which requires the on-line consultation of a nation-wide address database, can already be done for a very large proportion of typewritten mail, but a goal of most Postal Administrations is to also automatically encode handwritten mail to the destination point. Just to give an example, we may refer to the figures provided by the Royal PTT in the Netherlands [5] as the service objectives for the near future:
• 98 per cent of mail items smaller than 380 x 265 x 32 mm (machinable mail) will be sorted automatically.
• So-called non-standard sorting machines will handle larger items.
Fig. 1. Flow chart of a network service for mail handling (standard and non-standard sorting machines, videocoding, postcode address directories).
• As many addresses as possible will be read automatically using OCR systems (over 90% of all items are expected to be read in this way).
• Sorting will take place down to the level of an individual postman's delivery route.
• Parcels will be handled in separate infrastructures which will be newly constructed according to a standardised design.
The flow chart of Fig. 1 shows a network architecture proposed to manage all information in a uniform way and to share the appropriate resources for the mail sorting process.
2.1. Mail Processing: Statement of the Problem and Application Requirements
Automatic reading is necessary for all the address fields needed by the carrier to bring the mail to the final destination. The current mail flow is roughly sketched in Fig. 2, from the collection of all mail items to the first office. All items are separated (culling), oriented and packed together (facing and cancelling), and are sorted according to the respective post codes. At the destination office, two further sorting processes are implemented in order to obtain the final carrier sequencing of the mail.
Fig. 2. Current mail flow: acceptance and payment; induction, culling, facing and cancelling; outward sorting; transport; inward sorting; sequencing; delivery; return services.
Letter processing
A brief description of the basic components is given as follows:
• A Culler Facer Canceller (CFC) machine is commonly used as a pre-processor. It also provides for image capturing to allow image processing while the physical mail is being transported in the centre.
• An OCR machine reads addresses written on the face of the mail and prints a fluorescent bar code on the mail item. Should it be unable to recognise the address, it will capture the mail image and send it for on-line video coding. If the mail cannot be resolved by OCR or on-line coding, the image will be sent for off-line video coding.
• An off-line OCR and the video coding system will be used to process mail images from the CFC.
• A Bar Code Sorter (BCS) machine can pre-sort mail items whose images have been resolved during either off-line video coding or off-line OCR. It can also pre-sort letters pre-printed with a bar code by bulk mailers.
• A Delivery Bar Code Sorter (DBCS) operates on a two-pass sorting process by sorting bar-coded letters to postman delivery routes in the first pass and to a delivery point sequence in the second pass.
In Appendix A, the functional architecture of a typical letter reading process is presented.
3. Main Functions Involving Computer Vision and Related Computer Vision Techniques
The main function required in postal automation and involving Computer Vision is definitely address reading and interpretation. In this sense it belongs to the basic perceptual functions of biological vision. Nevertheless, due to the inherent 2-D nature of the problem, its computer implementation is strongly based on basic technologies such as image processing and pattern recognition. At present, the most challenging tasks performed by such Vision technologies in postal automation are handwritten address reading, including cursive handwriting recognition, flats handling and reading, grey level and colour image processing, improved man-machine interaction, and robotic material handling.

3.1. Handwritten Address Reading
The new frontiers in handwriting recognition make extensive use of context to achieve unconstrained address reading, using both large-vocabulary and grammar constraints as well as heuristics. A first short-term objective consists in reading the last line, including the postcode, city name and state, and integrating such information in order to improve the reliability of the system and minimise the use of off-line video coding. The French Postal Administration expects a 10% increase in automatically sorted (outward
sorting) handwritten mail [6], by integrating the reading of the city name with that of the postcode. Next generation machines will also include the capability to read the full handwritten address line, with street name and civic number, in order to manage the final postman's delivery. The goal of USPS, the United States Postal Service, is to encode to the delivery point 50% of the handwritten mail, with a 1% error rate [7]. Laboratory tests indicate that this goal can be achieved [8]. Automatic reading of off-line cursive handwriting is presently a forefront technology for automatic reading systems. Considered too difficult to yield useful solutions for common use until the early 90's, it is now one of the most important subjects studied by research groups in Pattern Recognition. There are several difficulties in reading cursive handwriting: letters are usually connected, there is a large variety of letter shapes and individual styles, and pixel patterns have an intrinsic ambiguity when taken in isolation. A given stroke pattern could be equally well interpreted as a "u", a double "l", a double "e", an "n" or part of an "m". In the classical approach to OCR, characters are first segmented and then recognised. Lexical knowledge is used at the end, in a postprocessing stage, to correct possible recognition errors and find the correct word interpretation. This approach works pretty well with typewritten images, and even with handwritten ones, if characters are hand-printed separately. In the case of cursive handwriting, unfortunately, it is quite difficult to segment without recognising, and of course it is also difficult to recognise without segmenting. Moreover, as seen in the example before, the real identity of character patterns can often be determined only with the aid of contextual knowledge. An over-segmentation approach is generally recommended [9] to produce a series of segmentation hypotheses, which are measured against the alphabet of characters without taking a crisp decision. The optimal interpretation of the image is found by looking for the sequence of character hypotheses that best matches one of the sequences of characters allowed by the a priori knowledge, i.e. by the lexicon and/or grammar. Other similar promising approaches have recently been experimented with, and some of them are already operative in prototypical installations [10].
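The following sketch illustrates the general lexicon-directed idea (not any of the operational systems cited above): hypothetical per-character scores over groups of adjacent segments are combined by dynamic programming, and the lexicon word with the best total score is returned. The segmentation, the scoring function and the lexicon are placeholders to be supplied by a real recogniser.

from functools import lru_cache

def read_word(segments, lexicon, char_score, max_group=3):
    """Pick the lexicon word best explained by the over-segmented image.

    segments   : list of primitive image segments (over-segmentation output)
    lexicon    : iterable of allowed words (a priori knowledge)
    char_score : char_score(group_of_segments, character) -> log-likelihood
    A group of up to max_group adjacent segments may form one character.
    """
    def word_score(word):
        @lru_cache(maxsize=None)
        def best(seg_idx, char_idx):
            if char_idx == len(word):
                # all characters placed: valid only if all segments consumed
                return 0.0 if seg_idx == len(segments) else float("-inf")
            scores = []
            for k in range(1, max_group + 1):
                if seg_idx + k <= len(segments):
                    group = tuple(segments[seg_idx:seg_idx + k])
                    scores.append(char_score(group, word[char_idx])
                                  + best(seg_idx + k, char_idx + 1))
            return max(scores) if scores else float("-inf")
        return best(0, 0)

    return max(lexicon, key=word_score)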
3.2. Flats Sorting Machines
Flats reading represents the most challenging objective for postal sorting machines, beyond the more conventional letter manipulation. There are different categories of flats to be handled. One class includes large A4-size envelopes with more or less additional information printed on them (sender and destination address, advertisement messages, stamps and mail class service information). Another class consists of journals, newspapers and catalogues with or without plastic covers. Moreover, the flat category often also includes small parcels with a maximum thickness of about 40 mm.
Only very few flat sorting machines are in operation, often without automatic reading capability, but the traffic for this kind of mail is constantly growing, and the need for automatic encoding is emerging rapidly. However, automatic reading of addresses on flats cannot be achieved through a simple re-engineering of a letter reader. The different characteristics of the mail pieces pose different problems to the image acquisition subsystem and to the processing algorithms, which are not yet fully satisfactory in unconstrained operating conditions and require more sophisticated image processing and computer vision capabilities. One of the main problems comes from the management of plastic covers, which prevents a sharp and well contrasted image acquisition from the input vision sensor. Improvements in the acquisition process (both in resolution and dynamic range), as well as adaptive grey level image processing tools, represent key factors in the solution of this problem. The second critical point is Address Block Location. In fact, in some cases, like magazines and advertisements, the destination address is usually written on a small label floating under a plastic cover, which means that it may be found in any position and with any orientation. In other cases the destination address has to be located in a complex image, full of text and graphics, as in newspapers, where the statistical properties of the address block are very close to those of the full size image. Appendix B refers to a processing scheme and the current main problems in flat sorting machines.
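A very crude sketch of address block candidate detection on a flat image is given below; it simply smears dark pixels horizontally so that text merges into blobs and returns wide blobs as candidates. The thresholds and the use of SciPy morphology are arbitrary illustrative choices, far from the adaptive grey level processing that production systems require.

import numpy as np
from scipy import ndimage

def candidate_address_blocks(gray, dark_thresh=100, min_area=2000):
    """Rough address-block candidate detector for a flat image.

    gray : 2-D uint8 array (0 = black ink, 255 = background).
    Dark pixels are dilated horizontally so that characters and words
    merge into text-like blobs; connected blobs are returned as
    bounding boxes (slice pairs), largest first.
    """
    ink = gray < dark_thresh                          # binary ink mask
    smeared = ndimage.binary_dilation(ink, structure=np.ones((3, 25)))
    labels, _ = ndimage.label(smeared)
    boxes = []
    for sl in ndimage.find_objects(labels):
        h = sl[0].stop - sl[0].start
        w = sl[1].stop - sl[1].start
        if h * w >= min_area and w > h:               # wide, text-like regions
            boxes.append((h * w, sl))
    return [sl for _, sl in sorted(boxes, key=lambda b: b[0], reverse=True)]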
Fig. 3. An example of a flat with plastic cover: address block on a label, advertisement messages and postage code information.
3.3. Man-Machine Interaction and Intelligent Interface
There has been much improvement in man-machine interaction tools for video coding, which represent an increasingly important component of the system, especially in the current approach aimed at increasing off-line and remote encoding by system operators. Ergonomic problems, friendly interfaces, and quick and easy panning and scrolling of the image on the screen are priority issues. One of the most important problems is still the optimal display of the letter or flat image on the screen, with the appropriate resolution, grey level scale and adaptive contrast adjustment, to remove the uneven perception of the foreground and background information. Many efforts have also been made to use a combination of technologies and the integration of different sensors (e.g. speech recognition to simplify the input of address information). Eye tracking is also a new area of research [11] to speed up the localisation of the address block on the screen when very complex mail images are involved. These technologies may have an important role in the new generation of postal automation systems.

3.4. Parcel Classification
Huge and complex machines are currently used for 3-D parcel sorting. State of the art sorting equipment can handle some 250,000 items per day. Presently parcel processing is highly labour intensive; in fact parcels are introduced manually by human operators, and during this input stage a preliminary selection is already performed according to their size and shape (rolls and cylinders, regular and irregular packets, etc.). They are also labelled with an ID code label, and the operators place the parcel item on the sorter tray with the label side facing up. At the input stage an overhead scanner is installed to automatically read the label and assign the destination information to the sorter tray. During the last 10 years there has been a significant research effort, carried out by the most important Postal Administrations, to automate the parcel input stage and reduce the cost associated with this very low-level work. The main objective is to estimate the correct dimensions of the parcel for billing purposes. Quite interesting results using Computer Vision technology were achieved in the late eighties, with the realisation of prototype systems which made use of active-light laser sensors to recover the 3-D shape of the parcels and allow a pre-sorting of the mail items [12]. 3-D reconstruction has been a hot research topic for a long time in the Computer Vision scientific community, with proposed solutions based on both passive vision (stereovision and motion analysis) and geometric analysis of projected light patterns (lasers or white light projectors). The main industrial thrust in the development of this technology came from robotic metrology and co-ordinate measuring machines, in order to substitute the traditional slow contact mechanical sensors with optical non-contact area sensors and improve speed in surface reconstruction.
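As an illustration of the active-light principle mentioned above (a generic sheet-of-light triangulation, not the prototype systems of [12]), a detected stripe pixel defines a viewing ray through the camera centre, and the 3-D surface point is obtained by intersecting that ray with the known laser plane; the calibration parameters are assumed to be given.

import numpy as np

def triangulate_sheet_of_light(pixel, f, plane_n, plane_d):
    """Recover a 3-D point from a sheet-of-light (laser stripe) sensor.

    pixel   : (u, v) image coordinates of the detected stripe pixel,
              expressed relative to the principal point
    f       : focal length in pixel units
    plane_n : unit normal of the laser plane in the camera frame
    plane_d : plane offset, i.e. the plane satisfies  n . X = d
    """
    ray = np.array([pixel[0], pixel[1], f], dtype=float)  # viewing ray direction
    t = plane_d / np.dot(plane_n, ray)                    # solve n . (t * ray) = d
    return t * ray                                        # 3-D point on the parcel surface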
There are many industrial products available in the market for the reconstruction of 3-D surfaces, besides some new results from advanced research laboratories [13]. In any case, the final solution requires careful system integration, with controlled lighting conditions and an optimal arrangement of the sensor with respect to the positioning fixture of the samples to measure. Actually, the requirements of 3-D parcel reconstruction in terms of precision are not as severe as in robotic metrology. As far as 3-D shape representation is concerned, the most common representation techniques can be classified into volumetric or surface based schemes. A volumetric reconstruction by voxels is quite heavy in terms of data storage, but classical arrangements of the data such as octrees [13] or skeletons may save a lot of memory space. The use of deformable models, like superquadrics [14], allows for a very efficient 3-D shape representation, as a composition of individual elements, and this approach has been successfully investigated for parcel modeling too. The most consolidated technique for 3-D shape representation is definitely the Delaunay triangulation, which allows an optimal interpolation of sparse data points. The extension of such a tool from two dimensions to three dimensions is thoroughly discussed in [16], where formal definitions and examples are given, as well as the relationship which exists between the Delaunay triangulation of a set of points on the boundary of an object and the skeleton of that object. Nowadays there are a few installations of 3-D dimensioning systems for parcel measuring and classification. It is worthwhile to mention a light curtain technology, as well as an infrared Laser Rangefinder technology [17], which is currently in use for on-line parcel dimensioning in different European countries.

3.4.1. 2-D Bar-code image readers
The term "2-D bar-codes" refers to any of the new ID codes that do not rely on a single row of marks/spaces to encode data. 2-D codes provide high capacity (up to 2,000 characters per label, compared to 50 characters for conventional 1-D bar codes) and very robust error correction capabilities. There are two main types of 2-D codes: stacked linear (PDF) and matrix codes (like the Maxicode shown in Fig. 4) [18]. Postal administrations are currently using new generation portable 2-D image readers for parcel delivery and high speed sortation. UPS is starting to roll out Maxicode, by marking packages with this 2-D matrix code and reading the code in its large-scale hub sortation centres. The main image processing and recognition problem is the development of a reliable location and positioning of the 2-D code in the acquired image. For instance, in the example of Fig. 4 the detected elliptical shape of the inner circles is used to identify the central position of the code, and further geometric reasoning techniques are used to properly locate the polygonal shape of the code, which may be affected by perspective distortion. The subsequent decoding process is quite straightforward and is based on standard encoding/decoding procedures.
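A schematic sketch of locating such a concentric-ring finder pattern is shown below, assuming the OpenCV 4 Python API: contours are fitted with ellipses, and the image point on which the most ellipse centres agree is taken as the candidate code centre. The thresholds and the voting scheme are illustrative assumptions, not the actual reader implementation.

import cv2

def find_bullseye(gray, min_points=20, center_tol=5.0):
    """Locate the candidate centre of a Maxicode-style finder pattern.

    The central finder pattern is a set of concentric rings, so ellipses
    are fitted to the detected contours and the point on which the
    largest number of ellipse centres agree is returned, together with
    the number of supporting centres.
    """
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST,
                                   cv2.CHAIN_APPROX_NONE)
    centres = [cv2.fitEllipse(c)[0] for c in contours if len(c) >= min_points]

    best_centre, best_votes = None, 0
    for cx, cy in centres:
        votes = sum(1 for (ox, oy) in centres
                    if abs(ox - cx) < center_tol and abs(oy - cy) < center_tol)
        if votes > best_votes:
            best_centre, best_votes = (cx, cy), votes
    return best_centre, best_votes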
Fig. 4. An example of Maxicode printed on a parcel.
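As a purely illustrative sketch of the bull's-eye localisation idea mentioned above (it is not the processing chain actually used by the authors, and the thresholds are arbitrary), concentric rings can be found by fitting ellipses to image contours and voting for a point where several ellipse centres coincide:

```python
# Illustrative sketch only: locate a Maxicode-style bull's-eye by fitting
# ellipses to contours and voting for a point where several centres agree.
# Assumes OpenCV and a single-channel grey-level image.
import cv2
import numpy as np

def find_bullseye(gray):
    """Return an approximate (x, y) centre of a concentric-ring finder pattern,
    or None if no convincing cluster of ellipse centres is found."""
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    found = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    contours = found[0] if len(found) == 2 else found[1]   # OpenCV 3 vs 4

    centres = []
    for cnt in contours:
        if len(cnt) < 5:                       # fitEllipse needs >= 5 points
            continue
        (cx, cy), axes, _ = cv2.fitEllipse(cnt)
        if 5 < max(axes) < gray.shape[0] / 2:  # drop tiny noise and huge blobs
            centres.append((cx, cy))

    # The bull's-eye rings produce several ellipses with nearly the same centre.
    best, best_votes = None, 0
    for cx, cy in centres:
        votes = sum(1 for ox, oy in centres if abs(ox - cx) + abs(oy - cy) < 6)
        if votes > best_votes:
            best, best_votes = (cx, cy), votes
    return best if best_votes >= 3 else None
```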
However, a wider range of image recognition applications is being considered, using both fixed and hand-held image readers. They include postal-specific symbols like Postnet, as well as digital indicia, a new service that the United States Postal Service (USPS) will offer to its customers, allowing them to purchase postage over the Internet, download the postage to their PC and then use their own laser printers to output certified and secure electronic postage.

3.5. Material Handling and Station Loading/Unloading
One of the most labour-intensive tasks to be performed in a large mail distribution centre is the transportation of mail items between different machines and to/from the input/output stage of the centre. The loading/unloading of the letters to/from mail sorting machines is still mainly performed by human operators, and the automation of this process represents an essential target to work towards. A commonly agreed approach consists in the realisation of standardised letter containers which could be handled automatically by robotic machines. Recent studies have shown that about 80% of the letters are transported using trolleys pushed by human operators, and just 20% are managed by electric trucks (again mainly driven by human operators). The available technologies to solve such problems of transportation between different working cells (intercell service) are:
- Electrical trucks, often used to tow a convoy of passive trolleys, with obvious problems of manoeuvrability and the requirement of human driving.
- Rail transport systems, with the well-known disadvantages of fixed installations and no flexibility in the management of the mail distribution centre.
- Roller chains, which are efficiently used for point-to-point service, but again with strong limitations due to space occupancy and lack of flexibility.
- AGVs (Autonomous Guided Vehicles), which nowadays represent the most flexible solution. There are conventional systems using inductive guides buried in the floor, as well as new-generation navigation systems, mainly based on active sensors (magnetic tags or lasers) for self-orientation. Still, there is poor flexibility in the reconfiguration of the navigation map, there are high precision requirements in the positioning of the loads (pallets), and there is difficulty in switching from automatic to manual driving of the vehicle.

New-generation mobile robots, based on advanced sensors, with free-ranging capabilities, the possibility of easy reconfiguration of the navigation route, the ability to detect the presence of other vehicles at crossing points, and both automatic and manual driving capability, represent the expected solution. In this domain Computer Vision should play a fundamental role in giving flexibility and intelligence to the robotic logistics system.

3.5.1. Computer vision for autonomous navigation
There are not many industrial examples of Computer Vision applied to AGVs, even though this is commonly considered one of the most promising solutions to achieve the necessary flexibility and performance. On the other hand, there are interesting vision-based control techniques which have been developed in the automotive industry as a support to the driver, to provide information about the distance to the nearest vehicles ahead and to perform autonomous tracking and car following in traffic congestion [19]. Autonomous navigation has been strongly pushed by military research for both normal road driving and off-road navigation. Interesting results can be found in [20] and in [21], where this subject has been investigated for many years and has been tested with prototype vehicles operating at nearly normal speed (about 100 km/h, with obstacle detection capability). There have also been attempts to use Computer Vision to drive AGV systems and in mobile robotics for service applications, as in [22] for hospital transportation functions. Some experiments using AGV systems in mail distribution have been carried out [23]. Quite recently, a fully passive vision approach has been proposed to allow self-positioning and autonomous navigation of a fork-lift carrier named Robolift [24]. In this case Computer Vision is used to recognise artificial geometric landmarks placed along the navigation pathway and to correct the relative odometer estimates. Furthermore, Computer Vision is also used for docking control, to recognise the correct position of the pallet so that the Robolift controller can suitably correct the displacement of the fork prongs before loading.
Advanced research is in progress on the use of Computer Vision and multisensor integration (laser, inertial sensors) for autonomous navigation, in order to make use of existing landmarks and features already present in the environment and to allow easy and fast reconfiguration of the navigation mission.

4. A Projection of Future Trends
Today, large-volume mailers increasingly demand faster, more reliable service and customised products. They want day-certain delivery, shipment and piece tracking, and an electronic data interface. Moreover, the most important administrations now compete with express mail service companies like DHL, UPS and FedEx, newspapers, telecommunications companies, and alternative delivery services. A key factor in the survival and success of the Postal Administrations is the integration of the service into a network, by connecting most of the plants with each other, with transportation suppliers, mailer plants and postal customers. For instance, the information infrastructure has to be redesigned in order to avoid the repetitive capture of the same data and to have all available information ready at each stage of the physical mail handling process. Standardisation of tools and processing interfaces is therefore another fundamental requirement for the next generation of machines. Moreover, there is a great effort to improve the efficiency of the service through the realisation of distributed architectures which may provide remote access and widen access to the processing resources (both geographically and logically). However, besides re-engineering the mail handling process through wise use of Information Technology, other benefits can be achieved by the implementation of new functions based on Image Processing and Computer Vision capabilities. The previous section has already referred to the main directions of research and investment aimed at improving the performance of mail processing systems in the fields of handwriting recognition, address block location, parcel processing, etc. In the following we try to focus on new emerging services which represent the new frontiers for competition and for system providers, in order to enhance the level of service and increase the added value to the final customer.

4.1. New Functions and Services in Mail Processing Systems
The mail handling process can be divided into three main functional areas, i.e. acceptance and payment, sorting and transportation, and delivery services. A thorough review of the whole postal process, from customer payment for and submission of mail through to delivery to the addressee, is the main goal of the Esprit project TIMBRE (Technology In Mail Business Re-Engineering), conducted by a consortium of postal administrations and technology providers and led by IPC Technology [25]. While most of the improvements in the postal process would be based on a heavy use of communication and networking functions, new image-based functions
and services can also be used to improve efficiency and provide new services. We indicate here a list of such possible functions, some of which would most probably require colour image processing:

- verification of the presence and value of stamp(s) and postage, with reference to the weight, destination and class of the postal object;
- detection of false or recycled stamps;
- identification of special stamps and logos for "controlled delivery by time";
- reading of the amount of additional stamps for effecting payments;
- reading and verification of postal permits;
- reading the name of the addressee for redirection services;
- reading the address of the sender for "return to sender" mail.
As a whole, the listed functions would constitute what could be called a "postal image understanding system". While such a complete system most probably would never be implemented, some of the above services are under consideration in different Postal Administrations. For instance, Canada Post Corporation is considering the automation of "Return to Sender" mail [3] and electronic redirection of mail [26].

4.1.1. Paper-to-electronic mail service

The mail process has been primarily considered as an end-to-end paper mail service. Over the last few years, the rapid growth of computer and communication technology has led to a corresponding rapid growth of end-to-end electronic mail, especially in the business sector. On the other hand, the integration of different technologies can provide excellent opportunities for new postal services such as hybrid mail, an example of an electronic-to-paper mail service. In this case large mailers can produce and forward messages in electronic form and use the distributed postal network for the printing and delivery of the mail. Image-based technologies now make it possible to cross the paper/electronic barrier in the opposite direction and implement paper-to-electronic services. An example of a new service of this type is Reply Card Processing, or the automation of Business Reply Mail. This service allows the interception of all business reply cards or courtesy reply cards addressed to a specific customer (e.g. a mail-order company) and the capture of the image of both sides of the card. At this point, the image can be stored in appropriate electronic mail boxes and later transmitted to the customer's fulfilment centre instead of the physical card. This paper-to-image transformation alone can reduce the delivery time from 2-4 days down to 12-18 hours [27], and it is the first step foreseen for a new service in the USA [4]. The next steps are the automatic electronic reading of the content (handwritten information) and its translation into frameworks suitable for computer processing,
its decoding and delivery to the customer using data transmission networks. In this way the customer's fulfilment centre would have the information transferred directly into its database, thus avoiding the usual manual data entry operation. This kind of new service is being considered not only by USPS, but also by other Postal Administrations, such as the UK's Royal Mail [28], and by technology providers [29].

5. Conclusions
The mail automation sector is a quickly evolving area of industrial automation, where developments in Electronics and Parallel Processing, Sensors, Robotics, Information Technology and Telecommunications have opened up new perspectives. In the last twenty years the Postal sector has been a rather closed domain, with special-purpose solutions and approaches and a poor connection with similar applications and research disciplines. The mail sorting process has been, and still is, heavily dependent on mechanical constraints. It is always an exciting experience to visit a Mail Sorting and Distribution Centre, with huge and complex machines and thousands of letters running back and forth at incredibly high speed along the rubber transport chains. The other foremost technology has definitely been OCR for address reading. But now there are new problems and new solutions emerging, as we have tried to briefly outline in this chapter. The reading process is much more complex than simply measuring the accuracy or speed of the individual character recognition module. The intelligence to distinguish between the different pieces of information on the mail (background, form structures, etc.), and the possibility of taking advantage of all available context information to drive the OCR and text reading with feedback control, are the real challenges of the new generation of reading systems. The availability of ever increasing processing power, and the possibility of off-line distributed reading systems, is pushing the development of software solutions, possibly with hardware accelerators. But severe requirements are also placed on the imaging sensor, which has to be able to deal with grey-scale information (and possibly colour) at very high resolution and large formats, and with fast pre-processing tools. This is the area where real-time hardware solutions are unavoidable. Finally, 3-D Computer Vision appears to have become an important enabling technology in the Postal Automation sector, mainly focused on supporting robotic applications in material handling and logistics services. The most relevant impact of Computer Vision is expected in the following areas:
- 3-D object recognition and classification (parcel sorting).
- Pose recognition for the handling of standardised trays in loading/unloading stations.
- Vision-based autonomous navigation (self-localisation, obstacle detection, docking), to provide the required flexibility and free-navigation capability in a crowded environment, for inter-cell transportation services.
It is time for new individuals to come into the arena of Mail Automation and Postal Services, in order to introduce fresh ideas from the closely related technological fields of Vision and Pattern Recognition and to give a new impulse to the improvement of this communication service of vital importance in modern society.

Appendix A. Functional Architecture of a Postal Address Reader

In a very schematic way, address reading may be described as a data compression process, from the raw data coming in at 8 pixels/mm and 256 grey levels (about 2 Mbytes of data) down to a few bytes, corresponding to the content of the postal destination address. It consists essentially of three modules:
(1) acquisition and image processing;
(2) segmentation and recognition of the individual characters on the mail piece;
(3) context analysis and address recognition.
Appendix A.1. Acquisition and Image Processing

The objective of this stage is the compression of the grey-level image into a binary image. This represents a fundamental step of the process, since any information lost at this stage can no longer be recovered. Moreover, adaptive processing capability is essential: some postal items have good image contrast, but many others are much more confused, with poor contrast and limited reading possibilities. A further strong requirement is real-time processing, due to the high speed of the mail stream in front of the input sensor (about 17 letters/second).
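A minimal sketch of the kind of locally adaptive binarisation this stage calls for is given below; it uses a standard OpenCV operator and illustrative parameter values, since the chapter does not specify the actual algorithm employed:

```python
# Minimal sketch of locally adaptive binarisation for a mail-piece image
# (illustrative only; not the operator actually used in the system described).
import cv2

def binarise(gray, block_size=31, offset=10):
    """Adaptive mean thresholding: each pixel is compared with the mean of its
    local neighbourhood, which copes with uneven contrast across the item."""
    return cv2.adaptiveThreshold(
        gray, 255,
        cv2.ADAPTIVE_THRESH_MEAN_C,   # local mean as the reference value
        cv2.THRESH_BINARY_INV,        # dark ink becomes foreground (255)
        block_size,                   # odd neighbourhood size in pixels
        offset)                       # constant subtracted from the mean
```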
Appendix A.2. Individual Character Recognition

The input is the bit-map obtained from the single letter image, and this module consists of the following steps.

Appendix A.2.1. Localisation

The objective of this processing step is to identify:
- text lines (both in the case of handwritten and of typewritten text);
- other geometric or information features to be detected on the mail item (stamps, codes, etc.).
This stage represents a classical binary pre-processing (using clustering techniques and morphological processing tools).
Appendix A.2.2. Segmentation

From the localised block of text it is necessary to segment the individual characters for the subsequent recognition. The major problems here are the correct segmentation of broken and touching characters, especially for handwritten text (both numerals and alphabetic characters).

Appendix A.2.3. Character recognition

The literature on character recognition is extremely wide and rich [30], including feature-based statistical approaches, a variety of pattern matching schemes and a combination of neural network techniques. There has always been great interest in establishing evaluation criteria and benchmarking procedures, to help attain a quantitative and objective comparison of such a wide range of solutions and implementation techniques. The US National Institute of Standards and Technology (NIST) has organised specific conferences [31] to promote a thorough comparison of results on the basis of selected databases representative of machine-print and hand-print text. The conference and the related comparison exercises focused on a single step in the reading process: machine recognition of individual (or segmented) characters, with no context. To further improve the performance of handwritten character recognition it is quite common to use the following schemes:
- to maintain multiple hypotheses until the end of the process, avoiding an early pruning of the decision tree;
- to use a combination, in parallel or in sequence, of different (possibly uncorrelated) character recognition techniques (statistical, neural, etc.);
- to apply a combination of character recognition methods to pairs of consecutive characters, rather than to the individually segmented ones.
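The parallel-combination idea in the second point above can be illustrated with a minimal sketch: the per-class scores of independent classifiers are averaged and several hypotheses are kept rather than committing to one early. The classes, score values and the averaging rule below are purely illustrative:

```python
# Illustrative sketch of parallel classifier combination: average the per-class
# scores of several independent character classifiers and keep several ranked
# hypotheses instead of an early hard decision.
import numpy as np

def combine_and_rank(score_lists, top_n=3):
    """score_lists: one 1-D score array per classifier (one score per class).
    Returns the top_n class indices of the combined ranking, best first."""
    scores = np.mean(np.stack(score_lists, axis=0), axis=0)   # simple average
    return list(np.argsort(scores)[::-1][:top_n])

# Example with two hypothetical classifiers over 10 digit classes:
statistical = np.array([0.05, 0.70, 0.02, 0.03, 0.01, 0.05, 0.04, 0.06, 0.02, 0.02])
neural      = np.array([0.02, 0.55, 0.01, 0.25, 0.01, 0.03, 0.05, 0.05, 0.02, 0.01])
print(combine_and_rank([statistical, neural]))   # -> [1, 3, 7]
```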
Appendix A.3. Context Analysis Module

This module is relevant not only for the obvious objective of minimising the error rate on the address, but also because there are often writing errors in the original address (estimated at more than 10% in the UK). This module takes into account the following information:
- a database which describes the whole spectrum of expected addresses (including some possible errors);
- some coding rules which describe how the address is supposed to be arranged.
Different solutions are usually implemented for typewritten and for handwritten context analysis, since different fields with different content are typically involved. Ultimately, the performance achieved with context analysis is significantly better than the possible result of a simple code reader, as shown in the following table [32].
                              Reading rate    Rejection rate
Typewritten   code reader        ≥ 72%           ≤ 1.6%
              context address    ≥ 92%           ≤ 0.5%
Handwritten   code reader        ≥ 62%           ≤ 1.8%
              context address    ≥ 69%           ≤ 0.9%
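As a toy illustration of the lexicon-driven correction performed by such a context analysis module, a noisy OCR hypothesis can be snapped to the closest entry of an address database; the place names and the similarity cutoff below are hypothetical, not taken from the chapter:

```python
# Toy illustration of lexicon-driven context analysis: snap a noisy OCR word
# to the closest entry of an address database (hypothetical place names).
import difflib

lexicon = ["GENOVA", "VERONA", "VENEZIA", "NOVARA", "TORINO"]

def correct(word, cutoff=0.6):
    """Return the best lexicon match for an OCR hypothesis, or None to reject."""
    matches = difflib.get_close_matches(word.upper(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(correct("GEN0VA"))   # OCR confused 'O' with '0' -> "GENOVA"
print(correct("XQZII"))    # nothing close enough     -> None
```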
Appendix B. Address Block Location in Flat Sorting

This section describes the essential features of a module for address block location in flat sorting machines.
Input data

The input consists of grey-level images of size 2000 x 2000 pixels or more. The visual criteria and the information content of such mail pieces are briefly summarised in the following. The address block usually contains dark ink characters on a brighter background (which may be a white label or the grey colour of the envelope). The format and size of the characters are arbitrary and cannot be established a priori, especially for handwritten addresses. The address lines do not have a fixed, known direction, although typewritten text is mostly oriented horizontally or vertically (except when a free label is inserted in a plastic wrapping). The most critical noise source comes from plies or folds on the surface (especially for plastic wrapping). Other useless information which has to be identified and discarded includes headlines, patches of text, photos, graphics, etc., in a large variety of colours.
Address block location: a short description of the process

(1) Pre-processing and noise removal
The objective of this stage is image enhancement, to remove input noise and minimise the effect of interferences (such as light reflections, plies and other overlapping noise structures).

(2) Multiresolution Region of Interest analysis
Data reduction is the primary objective, together with a more efficient data representation to find candidate regions for the address block. The detection of regular and repetitive patterns, local measures of density and frequency, and blob analysis are common tools used at this stage of the process.

(3) Segmentation of block candidates
Geometrical constraints are commonly used to segment and isolate some blocks and rank them on the basis of similarity measures with respect to the available
address prototype models. The number of detected text lines (horizontal or vertical), their alignment (left, centre or right), and the size and shape of the block are used as discriminating features.

(4) Context analysis
Topological constraints as well as heuristic criteria are used to classify the detected blocks and to decide on their arrangement on the mail piece. It is worth pointing out that this stage of the process is usually carried out at a lower resolution, at which text characters cannot be recognised and interpreted.

Many research results on this subject, as well as on the other topics of mail process automation, can be found in the proceedings of the USPS Advanced Technology Conference [33]. In the previous scheme we have described a traditional forward processing approach, but the present availability of ever increasing processing power allows the exploitation of feedback information to improve the selection of the processing parameters and to converge to better reading results.

References
[1] Poste Italiane, Volume Traffico Nazionale, World Wide Web site: http://www.nettuno.it/fiera/posteitaliane/roma/html/4bc-corr.gif
[2] La Poste, Progress Report, 11th Int. Conf. Postal Mechanisation, Melbourne, March 7-11, 1994.
[3] Canada Post Corporation, Progress Report, 11th Int. Conf. Postal Mechanisation, Melbourne, March 7-11, 1994.
[4] United States Postal Service, Postal Facts for Fiscal Year 1995, World Wide Web site: http://www.usps.gov/history/pfact95f.htm
[5] Postal Technology International '96, UK & Int. Press, ISSN 1362-5209.
[6] J. J. Viard, New technologies and their impact on communication markets, Troika '95, Postal Service Infrastructure of a Modern Society Int. Symp., Saint Petersburg, Russia, June 12-16, 1995.
[7] S. N. Srihari, V. Govindaraju and A. Shekhawat, Interpretation of handwritten addresses in US mail stream, First European Conf. dedicated to Postal Technologies JETPOSTE 93, Nantes, June 14-16, 1993.
[8] F. Kimura and M. Shridhar, Handwritten address interpretation using extended lexicon word matching, in Progress in Handwriting Recognition, A. C. Downton and S. Impedovo (eds.) (World Scientific, Singapore, 1997); Proc. 5th Int. Workshop on Frontiers in Handwriting Recognition, Colchester, Sept. 2-5, 1996.
[9] C. Scagliola, Search algorithms for the recognition of cursive phrases without word segmentation, Proc. 6th Int. Workshop on Frontiers in Handwriting Recognition, Taejon, Korea, Aug. 12-14, 1998.
[10] C. Saulnier, Revolution in OCR processing, Postal Technology '98, UK & International Press, 1998, 112-115.
[11] F. Morgan, R. V. O'Toole, D. A. Simon and M. Blackwell, Optotrak Validation Experiments, Technical Report CMU-RI-TR-95-26, The Robotics Institute, Carnegie Mellon Univ., 1995.
[12] P. Mulgaonkar, Automated postal system, SRI Int. Report ITAD-733-MK-95,026, 13 pp., February 1995.
[13] T. Kanade, A. Gruss and L. R. Carley, A very fast VLSI rangefinder, Proc. 1991 IEEE Int. Conf. Robotics and Automation, Sacramento, CA, Apr. 1991, 1322-1329.
[14] H. Samet, Design and Analysis of Spatial Data Structures: Quadtrees, Octrees, and Other Hierarchical Methods (Addison-Wesley, 1989).
[15] R. Bajcsy and F. Solina, Three-dimensional object representation revisited, Proc. 1st Int. Conf. Computer Vision, June 1987.
[16] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint (The MIT Press, Cambridge, Massachusetts).
[17] J. Spandow, New dimensions in data capture, Postal Technology International '98, UK & International Press, 1998, 136-139.
[18] D. Flynn, Advances in barcode scanning, Postal Technology '98, UK & International Press, 1998, 132-134.
[19] B. Ulmer, VITA II - Active collision avoidance in real traffic, IEEE Symp. Intelligent Vehicles '94, Paris, Oct. 1994, 1-6.
[20] C. Thorpe and M. Hebert, Mobile robotics: perspectives and realities, Proc. ICAR'95, Saint Feliu, Spain, Sept. 1995.
[21] E. D. Dickmanns, R. Behringer, D. Dickmanns, T. Hildebrandt, M. Maurer, F. Thomanek and J. Schielen, The seeing passenger car VaMoRs-P, IEEE Symp. Intelligent Vehicles '94, Paris, Oct. 1994, 68-73.
[22] J. M. Evans, HelpMate: A service robot success story, Service Robot: An Int. J. 1, 1 (1995) 19-21 (MCB University Press).
[23] S. Tansey and O. Holland, A system for automated mail portering using multiple mobile robots, 8th Int. Conf. Advanced Robotics, ICAR'97, Monterey, CA, 1997, 27-32.
[24] G. Garibotto, ROBOLIFT: Vision guided autonomous fork-lift, Service Robot: An Int. J. 2, 3 (1996) 31-36 (MCB University Press).
[25] European Commission ESPRIT, Information Technologies RTD Programme, Domain 7: Technologies for Business Processes, Summaries of Projects of the Fourth Framework Programme, Sept. 1996.
[26] Canada Post Corporation, Lettermail Mechanization, 11th Int. Conf. Postal Mechanisation, Melbourne, March 7-11, 1994.
[27] D. Bartnik, V. Govindaraju and S. N. Srihari, Reply Card Processing, WWW site:
http://www.cedar.buffalo.edu/RCP/
[28] J. Kavanagh, Investment spawns spin-offs, Financial Times, Jan. 14, 1996.
[29] D. Roetzel, Strategic aspects for electronic and hybrid mail services, Troika '95, Postal Service Infrastructure of a Modern Society Int. Symp., Saint Petersburg, Russia, June 12-16, 1995.
[30] Proc. Third Int. Conf. Document Analysis and Recognition (IEEE Computer Society Press, Aug. 1995).
[31] The First Census Optical Character Recognition System Conference, NISTIR 4912, U.S. Dept. of Commerce, NIST, Aug. 1992.
[32] B. Belkacem, Une Application Industrielle de Reconnaissance d'Adresses, 4ème Colloque National sur l'Ecrit et le Document, CNED'96, Nantes, July 1996.
[33] Proc. Advanced Technology Conf., USPS, 1992.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 817-854
Eds. C. H. Chen, L. F. Pau and P. S. P. Wang
© 1998 World Scientific Publishing Company
CHAPTER 4.6
VISION-BASED AUTOMATIC ROAD VEHICLE GUIDANCE
DIETER KOLLER
EE Dept., 109 Moore Labs, MC 136-93, California Institute of Technology, Pasadena, CA 91125, USA

QUANG-TUAN LUONG*
Artificial Intelligence Center, SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, USA

JOSEPH WEBER
Autodesk, 2465 Latham St., Suite 101, Mountain View, CA 94040, USA

JITENDRA MALIK
Computer Science Division, University of California, Berkeley, CA 94720, USA

During the last decade, significant progress has been made towards the goal of using machine vision as an aid to highway driving. This chapter describes a few pieces of representative work which have been done in the area. The two most important tasks to be performed by an automatic vehicle are road following and collision avoidance. Road following requires the recognition of the road and of the position of the vehicle with respect to the road, so that appropriate lateral control commands (steering) can be generated. Collision avoidance requires the detection of obstacles and other vehicles, and the measurement of the distances of these objects to the vehicle. We first explain the significance of vision-based automatic road vehicle guidance. We then describe the different road models, and contrast the approaches based on model-based lane marker detection with adaptive approaches. We describe in detail the important approach of road following by recursive parameter estimation, which is the basis for the most successful systems. We then address the issue of obstacle detection, first detailing monocular approaches. We finally describe an integrated stereo approach which is beneficial not only for obstacle detection, but also for road following.

Keywords: visual navigation, intelligent highway vehicle systems, intelligent control, real-time machine vision, recursive estimation, spatio-temporal modeling, stereovision, video image processing.

*Corresponding author:
[email protected]
1. Introduction

1.1. Vision for Automatic Vehicle Guidance

There has been much research in machine vision on the basic problem of building a dynamically updated map of the three-dimensional environment. This includes techniques based on binocular stereopsis - using the slight differences in the images of the scene from two cameras - and techniques based on structure from motion, which is the problem of using the optical flow extracted from a sequence of images acquired from a single moving camera. The literature on these problems is extensive; we refer the reader to Horn [18] and Faugeras [13]. Autonomous vehicle navigation was one of the first applications of machine vision to have been investigated (see for instance [36]). In the United States, many earlier projects were supported by the Department of Defense, and the desired capability focused on cross-country, all-terrain driving. The CMU NavLab project [46] is the key university-based project on this theme, with major activity at sites such as Martin Marietta. The cross-country terrain requirement means that the problem is quite unstructured and hard. Other recent developments in this line of research include applications to planetary rovers [30]. By contrast, this chapter will focus on work specifically aimed at using vision for guiding a vehicle driving on a road. Since automobile transportation is a major component of modern societies, the social and economic implications are far-reaching in terms of comfort, economy, and safety. The ultimate goal is to perform fully autonomous driving within an urban environment. However, achieving this goal within the next decade seems to be out of reach. Because of the complexity of such an environment, many tasks such as identification of the road configuration (including intersections), of other vehicles' behaviors, traffic sign reading, and omnidirectional sensing have to be performed in addition to those required for autonomous highway cruising under standard conditions. On the other hand, highway driving is sufficiently constrained that simple assistance systems, such as autonomous cruise control, lane keeping, and vehicle following, will be operational in the next few years, and fully autonomous highway cruising can be envisioned within this decade. The two most basic tasks to be performed by an automatic vehicle are road following and collision avoidance. Road following requires the recognition of the road and of the positioning of the vehicle with respect to the road, so that appropriate lateral control commands (steering) can be generated. Collision avoidance requires the detection of obstacles and other vehicles, and the measurement of the distances of these objects to the vehicle. These two capabilities are sufficient for highway driving under normal conditions, and will be studied in detail in this chapter.
1.2. Vision Compared to Other Sensors

Vision (i.e. the real-time processing of video images) is a rich source of information; however:
- The processing is computationally more expensive and complex than with other sensors.
- The visual input is degraded in particular atmospheric conditions and at night.

Many other sensors have been investigated for use in autonomous vehicles. For the purpose of lateral control, magnets embedded under the road have been proposed. They require an upgrade of the highway infrastructure. However, nearly 70% of single-vehicle roadway departure accidents occur in rural or suburban settings on undivided two-lane roads. Since it is unlikely that these roads will be upgraded in the foreseeable future, a system for preventing these crashes must rely on the existing road structure. Another approach is to use radar to detect the roadsides, using the fact that at a low incident angle asphalt possesses a lower reflectivity than typically rough roadside surfaces. However, such an approach does not take advantage of the existing lane markings. For the purpose of longitudinal control, Doppler radars, laser range-finders, and sonars have been proposed. Each of these sensors has its own weaknesses. For instance, the magnetic field of embedded magnets can be perturbed by the presence of nearby magnetic bodies, and the sensitivity of the sensors might not be sufficient for performing lane changes. Sonars need a reflective surface, and there are problems with detection range and time considerations. The advantage of vision is that it is a passive sensor: it does not send out a signal. Active sensors such as radar can be a potential problem in crowded scenarios; issues such as environmental safety and interference need to be addressed. While all these different sensors have been shown to provide adequate information to support lateral and longitudinal control, they might not provide enough information for more complex tasks, where the environment is complex and a large number of possibilities must be taken into account in order to provide safe automation. Examples of such tasks include lane change maneuvers, or obstacle detection in cluttered environments and/or on curved roads. Many approaches to automatic vehicle guidance take a multi-sensor fusion approach in order to gain robustness through redundancy, and to combine the strengths of each sensor. Visual sensors are always considered in such approaches, because they have properties which complement non-visual sensors. The combination can provide a reliable system under all weather conditions.
1.3. The State of the Art

Coordinated research programs towards autonomous vehicle guidance have been developed around the world. In Europe, impressive work on a visually guided autonomous vehicle has been done in the group of Prof. E. D. Dickmanns at the Universität der Bundeswehr, Munich, Germany [11]. Their work resulted in a demonstration in 1987 of their 5-ton van, the VaMoRs, running autonomously on a stretch of the Autobahn at speeds of up to 100 km/h. Vision was used to provide input for
both lateral and longitudinal control on free roads. Subsequently they have demonstrated successful operation on cross-country roads (at lower speeds), where the road boundaries are difficult to determine. All this was achieved with rather simple hardware - IBM PCs. Subsequently, the PROMETHEUS initiative by the major European car manufacturers has led to implementations such as VITA [48] and VaMP. The platform for the latter project is a passenger vehicle which demonstrated in 1995 autonomous driving over more than 95% of a 1700 km trip at an average speed of 120 km/h, during which hundreds of lane change maneuvers were performed automatically [3]. In the US, the IVHS research initiative triggered the development of several projects. The most visible example is the Navlab-5, which drove in 1995 98% of a 4500 km trip with automatic lateral control [39]. Another example is the LaneLock and LaneTrack projects at General Motors [1,27], which have resulted in real-time implementations. In Japan, research is being conducted at a number of industrial and academic research laboratories. These include the Harunobo project at Yamanashi University, ongoing since 1982, which has resulted in an autonomous vehicle tested on roads. The Japanese Ministry of Transport started the AVHS and PVS projects in 1991, to which major car manufacturers such as Toyota, Nissan and Honda contributed individually. Test vehicles exhibiting various functionalities were demonstrated at the 1996 IEEE Symposium on Intelligent Vehicles. Korea has a similar project, PRV. For an extensive survey, we recommend the proceedings of the IEEE Symposium on Intelligent Vehicles, which has been held yearly since 1990 in Tokyo (1990, 1991, 1993, 1996), Detroit (1992, 1995), and Paris (1994).

1.4. A Few Representative Approaches
There are many ways to design a system to perform lateral and longitudinal control based on vision. In this chapter, we will not attempt to conduct an extensive survey, but instead detail some representative approaches. A first critical choice is that of the type of representation to choose for lane keeping. In the work conducted at UBM by Dickmanns et al., a totally explicit representation has been chosen. The idea is that of recursively maintaining a set of road and vehicle state parameters. The road parameters include estimates of the horizontal and vertical curvature of the road; the vehicle parameters include estimates of heading angle, slip angle and lateral offset relative to the road. The dynamical model represents knowledge about the motion of a vehicle and serves as a tool both for the fusion of conventionally measured data (such as speed and distance from an odometer) and for control determination, as well as for the prediction of the effects of this control input on the evolution of the trajectory and on the corresponding changes in the perspective image over time. The image analysis required is thus made much simpler than in the general machine vision setting. Only at the initialization stage is there a need to search extensively to determine the lane boundaries. After that the task becomes much simpler by exploiting the temporal continuity conditions captured in the dynamical model. By contrast, in the approach of Pomerleau et al.,
little or no explicit representation is used. As a consequence, temporal continuity is not fully exploited. To obtain real-time operation, the images have to be considerably sub-sampled. The knowledge of the vehicle's position and of the road available to the system is much less precise. However, it is argued that it is easier for such a system to adapt to new road configurations, since the absence of an explicit representation would make it more flexible. A second critical choice is the type of camera system to use, monocular or binocular. With a monocular (or bifocal) system it is necessary to use model-based techniques which exploit heuristics such as the symmetry of the vehicles. The drawbacks are that more general obstacles (such as pedestrians) cannot be recognized, and the estimation of the distance is not precise. For this reason, in monocular systems the obstacle detection part is often performed using radar. A binocular system requires more visual processing. However, it makes it possible to detect general obstacles and to obtain a precise estimate of their distances. Therefore, for the purpose of obstacle detection, binocular stereopsis is becoming a popular option. Interestingly, the task of highway driving seems to be constrained enough that systems with different design options are relatively robust and have exhibited relatively similar performance. For instance, Dickmanns' system assumes a relatively fixed road appearance, an assumption easily violated in practice. Pomerleau's system uses a very simplified geometric model and does not take dynamics into account. Yet both systems exhibit reasonable performance. In Section 2, we concentrate on road modeling and detection, with a focus on the low-level techniques. We first describe the different road models, and contrast the approaches based on model-based lane marker detection with more adaptive approaches. In Section 3, we describe in detail the important approach of road following by recursive parameter estimation, which is the basis for the most successful systems. In Section 4 we address the issue of obstacle detection by monocular systems. We finally describe in Section 5 an integrated stereo approach which is beneficial not only for obstacle detection, but also for road following.
2. Road Modeling and Localization
2.1. Road Models

The road is often represented as a triangle in the image plane (for example [8,22,23]), which assumes that the road is straight and planar. In this model there are only three parameters: the road width, and the orientation and offset of the vehicle. The advantage of this model is its simplicity. Since the features to be detected (lane markers or road borders) are parallel, techniques based on vanishing point detection and Hough transforms work well. An example of lane markers detected this way is shown in Fig. 1. Within a more sophisticated model, this technique could be used when the system is started (or reinitialized in the case of detected inconsistencies). However, modeling the horizontal curvature of the road as well makes it possible to predict more accurately the feature locations on curved sections, at a cost of only a few
additional parameters. The curvature information provides a look-ahead parameter which also helps to generate smoother steering commands. This approach is favored in some of the most successful systems, including UBM's approach (see Section 3.4), YARF [25], one of the later lane finders used in the NavLab project, the RALPH system (see Section 2.4), and Berkeley's StereoDrive project (see Section 5). In Europe, the roads are built using the clothoid model described in more detail in Section 3.4, whereas in North America a constant curvature model is used. This latter model is easier to implement, since it provides a closed-form solution (a circular arc is projected into a section of an ellipse), which can even be linearized by a parabolic approximation [25]. In more refined representations, the three-dimensional road shape is also modeled. This is part of UBM's approach. Other attempts make only the assumption that the road does not bank [9,21], or that the road edges are locally parallel [47]. These approaches did not appear to be robust: due to the lack of constraints, errors in feature localization could easily result in changes in road shape.

2.2. Structured Roads: Methods Based on Lane Markers

On structured roads like highways, specific techniques can be applied to take advantage of the presence of lane markers, since their position, as well as their appearance, can be modeled. In [40] it is argued that the observation of a single point along a lane marker, together with its optical flow, could be sufficient for road following. The fact that the observation of the lane flow is a reliable source of information for human drivers had been established in [16]. One typical approach [11,25,28] to find the road parameters is as follows:

- Predict new parameters for each lane marker.
- Define horizontal search bands in the image (an example is given in Fig. 1) based on the predicted parameters and their uncertainties. This limits the amount of image processing to be done.
- Localize within the search zone the centers of the lane markers. This is done by applying an operator tuned to respond to the pattern corresponding to the lane marker. In the case of a bright bar, an example of such an operator is a convolution with a double derivative of a Gaussian (DDG) of a width matching that of the bar. An example of the points found by this method is shown in Fig. 1.
- Backproject the points just found to the ground plane. It is well known that there is a one-to-one correspondence between any plane in space and the image plane, which takes the form of a projective homography:
x = (h11 X + h12 Y + h13) / (h31 X + h32 Y + h33),    y = (h21 X + h22 Y + h23) / (h31 X + h32 Y + h33)
Fig. 1. The initialization of the algorithm is done by detection of portions of straight lines of common orientation (top left). Within the predicted search zone (top right), a precise localization of lane marker points is performed (bottom).
where x, y are image coordinates and X, Y are coordinates in the 3-D plane. This plane is taken to be the road plane, assuming that the road is locally planar. The coefficients hij are defined up to a common scale factor, and depend on the camera parameters as well as on the position and orientation of the camera with respect to the road. They can alternatively be determined by specifying four corresponding points on the road and in the image. This construction is fundamental in autonomous driving and has been used by many authors. It makes it possible to build a bird-eye's view from the ground-level view. In the bird-eye's view, important features such as lane markings, which converged towards the top of the original image, now appear parallel.

- Fit a new road model to the backprojected points, using a statistical fitting method and the equation of the model. A linear least-median-of-squares (LMedS) algorithm is efficient and provides robustness against outliers.

In the UBM's approach [11] a variant of this technique is used, which makes an even stronger use of temporal constraints. Provided that the correct operator is applied, this technique has proven to be very effective. The amount of image
processing is relatively limited, and it is possible to take into account relatively sophisticated road models (i.e. with curvature) and to use sound statistical techniques. Difficulties might arise if an inappropriate operator is used. To cope with this problem, the LANELOCK [22,23] and YARF [25] systems integrate several feature trackers. Another refinement, made possible by the use of stereopsis in [28], is to exclude from the search bands the areas which correspond to obstacles, which is potentially useful in crowded traffic scenes.
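A minimal sketch of the backprojection step described in this section is given below, using standard OpenCV routines; the four point correspondences that define the homography are made-up calibration values, not data from the chapter:

```python
# Sketch of the image-to-road-plane backprojection used for lane-marker
# fitting. The four correspondences are made-up calibration values; a real
# system would obtain them from its own camera geometry. Requires OpenCV/NumPy.
import numpy as np
import cv2

# Four image points (pixels) and the corresponding road-plane points (metres).
image_pts  = np.float32([[310, 400], [330, 300], [500, 400], [480, 300]])
ground_pts = np.float32([[-1.8, 5.0], [-1.8, 15.0], [1.8, 5.0], [1.8, 15.0]])

# 3 x 3 homography mapping image coordinates to road-plane coordinates
# (the inverse of the plane-to-image mapping is itself a homography).
H = cv2.getPerspectiveTransform(image_pts, ground_pts)

def backproject(points_xy):
    """Map detected lane-marker points (pixels) onto the road plane."""
    pts = np.float32(points_xy).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

# Ground-plane coordinates are then ready for fitting a straight or
# constant-curvature lane model, e.g. with a robust LMedS fit.
print(backproject([[320, 350], [470, 350]]))
```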
2.3. ALVINN: A Learning-Based System

The system previously described relies on a precise modeling of the road appearance. In this section, we describe the opposite approach: systems with weak and flexible representations, designed to cope with variation in road appearance. Systems which have to deal with unstructured roads have to cope with widely changing road appearances, and must be adaptive. They often have to rely only on the road boundary for road localization, and therefore use region-based segmentation schemes, using color [8] or texture [52]. A favored technique to recover the road parameters is parameter-space voting: detected features vote for all possible roads with which they are consistent. These techniques are relatively robust to segmentation errors, but they are impractical with more than three parameters, and are therefore compatible only with the simplest road representations. A typical example of a system for driving on unstructured roads is SCARF [8], which repeats the following steps: classify image pixels, find the best-fitting road model, and update the color models used for classification. Alternative approaches that combine machine vision and machine learning techniques have demonstrated an enhanced ability to cope with variations in road appearance. ALVINN [37,38], which we describe next, is a typical system of this type. It evolved from the Navlab project, which was initially concerned with all-terrain driving. Other related approaches are [24,29,41]. ALVINN (Autonomous Land Vehicle In a Neural Network) is a perception system which learns to control the NAVLAB vehicles by watching a person drive. ALVINN's architecture consists of a single-hidden-layer back-propagation network. The input layer of the network is a 30 x 32 unit two-dimensional "retina" which receives input from the vehicle's video camera. Each input unit is fully connected to a layer of five hidden units, which are in turn fully connected to a layer of 30 output units. The output layer is a linear representation of the direction in which the vehicle should travel in order to keep it on the road. To drive the vehicle, a video image from the onboard camera is injected into the input layer. Activation is passed forward through the network, and a steering command is read off the output layer. The most active output unit determines the direction in which to steer. To teach the network to steer, ALVINN is shown video images from the onboard camera as a person drives, and told it should output the steering direction
in which the person is currently steering. The backpropagation algorithm alters the strengths of the connections between the units so that the network produces the appropriate steering response when presented with a video image of the road ahead of the vehicle. After about three minutes of watching a person drive, ALVINN is able to take over and continue driving on its own. Because it is able to learn what image features are important for particular driving situations, ALVINN has been successfully trained to drive in a wider variety of situations than other autonomous navigation systems which require fixed, predefined features. The situations ALVINN networks have been trained to handle include single-lane dirt roads, single-lane paved bike paths, two-lane suburban neighborhood streets, and lined divided highways. In this last domain, ALVINN has successfully driven autonomously at speeds of up to 70 mph, and for distances of over 90 miles on a public highway north of Pittsburgh. While systems of this type have been quite successful at driving on a wide variety of road types under many different conditions, they have several shortcomings. First, the process of adapting to a new road requires a relatively extended "retraining" period, lasting at least several minutes. While this adaptation process is relatively quick by machine learning standards, it is unquestionably too long in a domain like autonomous driving, where the vehicle may be traveling at nearly 30 meters per second. Second, the retraining process invariably requires human intervention in one form or another. These systems employ a supervised learning technique such as back propagation, which requires the driver to physically demonstrate the correct steering behavior for the system to learn. One attempt to solve these problems has been to train specialized networks for each road type. In order to overcome the problem of which network to use, a connectionist superstructure, MANIAC [19], incorporates multiple ALVINN networks, with the hope that the superstructure will learn to combine the data from each ALVINN network.
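To make the quoted architecture concrete, the following toy forward pass uses ALVINN's layer sizes (a 30 x 32 retina, five hidden units and 30 steering outputs) with random weights standing in for the trained network; it illustrates the structure only, not the actual ALVINN code:

```python
# Toy forward pass with ALVINN's quoted layer sizes (30 x 32 retina, 5 hidden
# units, 30 steering outputs). Random weights stand in for the trained network.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(30 * 32, 5))   # retina -> hidden units
b1 = np.zeros(5)
W2 = rng.normal(scale=0.1, size=(5, 30))        # hidden -> steering units
b2 = np.zeros(30)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def steer(retina_image):
    """retina_image: (30, 32) array of pixel intensities in [0, 1].
    Returns the index of the most active steering unit (0 = hard left,
    29 = hard right, the middle units roughly straight ahead)."""
    x = retina_image.reshape(-1)
    hidden = sigmoid(x @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    return int(np.argmax(output))

print(steer(rng.random((30, 32))))
```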
2.4. RALPH: A Hybrid System

ALVINN relied entirely on learning, which has been found to be insufficient; therefore its successor, RALPH [39], uses a more hybrid approach which involves a partial modeling of road features. Both systems (ALVINN and RALPH) use simple calculations on low-resolution images, which makes it possible to implement them in real time. Temporal continuity, however, is not exploited. It is argued [39] that even for driving on structured roads, adaptive systems are necessary. There are some variations in road markings depending on the type of road (e.g. suburban street vs. interstate highway), and on the state or country in which it is located. For example, many California freeways use regularly spaced reflectors embedded in the roadway, not painted markings, to delineate lane boundaries. Further challenges result from the fact that the environmental context can impact road appearance. RALPH is an adaptive approach in which two parameters are implicitly determined. These parameters are the road curvature, and the vehicle's lateral position
Fig. 2. (a) RALPH's control panel showing the view of the road and the resampled bird-eye's view. (b) The mapping of the image into a bird-eye's view. Illustration courtesy of D. Pomerleau.
Fig. 3. RALPH's method for determining curvature. Illustration courtesy of D. Pomerleau.
Fig. 4. RALPH's method for determining offset. Illustration courtesy of D. Pomerleau.
relative to the lane center. The latter parameter is used to generate directly a steering command. RALPH does not take into account other parameters such as vehicle heading, or variations between the road plane and the camera position and orientation.
In the RALPH system, the image is first sub-sampled and transformed so as to create a low-resolution (30 x 32 pixels) bird-eye's view image, as illustrated in Fig. 2. To determine the curvature of the road ahead, RALPH utilizes a "hypothesize and test" strategy. RALPH hypothesizes a possible curvature for the road ahead, subtracts this curvature from the parallelized low-resolution image, and tests to see how well the hypothesized curvature has "straightened" the image. After differentially shifting the rows of the image according to a particular hypothesis, the columns of the resulting transformed image are summed vertically to create a scanline intensity profile. When the visible image features have been straightened correctly, there will be sharp discontinuities between adjacent columns in the image. By summing the maximum absolute differences between intensities of adjacent columns in the scanline intensity profile, this property can be quantified to determine the curvature hypothesis that best straightens the image features. The next step in RALPH's processing is to determine the vehicle's lateral position relative to the lane center. This is accomplished using a template matching approach on the scanline intensity profile generated in the curvature estimation step. The scanline intensity profile is a one-dimensional representation of the road's appearance as seen from the vehicle's current lateral position. By comparing this current appearance with the appearance of a template created when the vehicle was centered in the lane, the vehicle's current lateral offset can be estimated. Because RALPH uses procedural methods to determine the two relevant parameters, the techniques assume only that there are visible features running parallel to the road, and that they produce a distinct scanline intensity profile, which is more general than other schemes. A second strength of the approach stems from the simplicity of its scanline intensity profile representation of road appearance. The 32-element template scanline intensity profile vector is all that needs to be modified to allow RALPH to drive on a new road type. There are several strategies for adapting this template to changing conditions:
- A human driver centers the vehicle in its lane and presses a button to indicate that RALPH should create a new template.
- RALPH selects one from a library of stored templates recorded previously on a variety of roads.
- RALPH slowly "evolves" the current template by adding a small percentage of the current scanline intensity profile to the template.
- RALPH uses the appearance of the road in the foreground to determine the vehicle's current lateral offset and the curvature of the road ahead, as described above. At the same time, RALPH is constantly creating a new "rapidly adapting template" based on the appearance of the road far ahead of the vehicle (typically 70-100 meters ahead). If the appearance of the road ahead changes dramatically, the new template is used.
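The hypothesize-and-test curvature search described above can be sketched as follows; the array sizes, the quadratic row-shift model and the scoring rule are assumptions made for illustration and are not taken from the RALPH implementation:

```python
# Rough sketch of a hypothesize-and-test curvature search on a low-resolution
# bird-eye's view (shift model and scoring are illustrative assumptions).
import numpy as np

def best_curvature(birdseye, hypotheses):
    """birdseye: (rows, cols) intensity image, last row = closest to vehicle.
    hypotheses: candidate curvature values; each implies a lateral shift that
    grows with look-ahead distance. Returns (best_hypothesis, profile)."""
    rows, cols = birdseye.shape
    best, best_score, best_profile = None, -np.inf, None
    for c in hypotheses:
        straightened = np.empty_like(birdseye)
        for r in range(rows):
            look_ahead = rows - 1 - r                  # how far ahead this row is
            shift = int(round(c * look_ahead ** 2))    # assumed quadratic shift model
            straightened[r] = np.roll(birdseye[r], -shift)
        profile = straightened.sum(axis=0)             # scanline intensity profile
        score = np.abs(np.diff(profile)).sum()         # sharp columns = well straightened
        if score > best_score:
            best, best_score, best_profile = c, score, profile
    return best, best_profile

img = np.random.rand(30, 32)
print(best_curvature(img, hypotheses=np.linspace(-0.02, 0.02, 9))[0])
```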
There are several possible drawbacks of lacking an explicit representation of the geometry of the road and of the vehicle position: the approach generates only
very coarse steering actions, and it is difficult to apply refined control strategies (which require precise measurements of the vehicle status) in order to obtain a smooth ride. It might also be more difficult to plan more complex maneuvers such as lane changes.
3. Road Following by Recursive Parameter Estimation

3.1. Theoretical Background

Visual-based road following is a special application of temporal image sequence processing. Using temporal image sequences is also referred to as dynamic scene analysis, which means that we are analyzing dynamically changing scenes, as opposed to stationary scenes. The only changes which can occur in temporal image sequences are changes due to moving objects, a moving observer, or a change in lighting conditions. Physics tells us all about the dynamic behavior of moving objects or moving observers through well-established (continuous) differential equations, the so-called motion equations. These motion equations describe the motion of movable objects under the application of forces. The simplest motion is force-free motion, in which case a moving object keeps its state constant over time. Such a simple motion, however, plays only a theoretical role, since we always have to cope with at least some kind of frictional forces. Motion equations are usually established by variational methods minimizing the action, resulting in Euler-Lagrange equations (see e.g. [15]). Motion equations are second-order differential equations in the spatial coordinates which involve the spatial coordinates, their velocities, and the time. In this context, motion estimation or model-based tracking refers to estimating the motion parameters according to a motion model. We first provide a brief background before we describe the most popular parameter estimation method in Section 3.2. A very crucial issue is establishing a correct motion model, which should approximate the real motion as closely as possible. We will address this issue in Section 3.3. We conclude this section by describing one of the most successful approaches for road following using recursive parameter estimation techniques in Section 3.4.

3.1.1. Continuous time linear dynamic systems

We start by considering a system whose dynamic behavior is reasonably well approximated(a) by a set of time-continuous differential equations, also known as plant equations, or motion equations in the case of moving objects:
$\dot{s}(t) = A(t)\,s(t) + B(t)\,u(t)$   (3.1)
^a We use the term approximate here, since we cannot expect to have an exact description of the dynamic system.
with
s(t) = the system state vector at time t
u(t) = the system input or control vector at time t
A(t) = the system matrix at time t
B(t) = the input gain matrix at time t.

The system state s is defined to be the smallest vector which summarizes the entire state of the system. The system output z(t), which is actually observable, is in general a vector of dimension less than that of the system state:

$z(t) = C(t)\,s(t)$   (3.2)
with the measurement matrix C(t), which maps the system state to the measurement. Other important terms in the context of continuous time linear dynamic systems are: (i) controllability, which states that for any arbitrary combination of initial and destination points in state space there is an input function for which the destination state can be reached from the initial point in finite time, and (ii) observability, which holds if the initial state can be fully and uniquely recovered in finite time from its observed output and given input.

3.1.2. Discrete time linear dynamic systems

In image sequence analysis, however, we have to deal with discrete time linear dynamic systems, with the time step being the inverse video-rate. In this case we have to turn the continuous differential equation (3.1) into a time discrete difference equation:
$s_{k+1} = F_k\,s_k + G_k\,u_k$   (3.3)

with

s_{k+1} = the system state vector at time t_{k+1}
u_k = the system input or control vector at time t_k
F_k = the transition matrix at time t_k
G_k = the input gain matrix at time t_k.
The transition and input gain matrices are obtained in a straightforward manner when the system matrix and the input gain matrix of the continuous case are integrable: s_{k+1} is the result of integrating Eq. (3.1) along a time interval T = t_{k+1} - t_k. The measurement equation is then given by:

$z_k = H_k\,s_k$   (3.4)

with the measurement matrix H_k at time t_k. Similar expressions for controllability and observability are defined as for the continuous case, where finite time is replaced by a finite number of steps and a finite number of observations, respectively.

3.2. Linear Recursive Estimation in Dynamic Systems - Kalman Filtering

We do not provide a discussion of all necessary properties which lead to the so-called Kalman filter equations given below. We will just summarize the basic concept and refer the interested reader to one of the excellent textbooks [2,14,32,43]. We confine the equations to the discrete time case. Consider a discrete time linear dynamic system described by a vector difference equation with additive white Gaussian noise that models unknown disturbances or inaccuracies in the plant equation:

$s_{k+1} = F_k\,s_k + G_k\,u_k + w_k$   (3.5)
with the n-dimensional state vector s_k, the m-dimensional known input or control vector u_k, and w_k a sequence of zero-mean white Gaussian process noise with covariance $E[w_k w_k^T] = Q_k$.^b The measurement equation is

$z_k = H_k\,s_k + v_k$   (3.6)
where z_k denotes the l-dimensional measurement vector and v_k is a sequence of zero-mean white Gaussian measurement noise with covariance $E[v_k v_k^T] = R_k$. The matrices F_k, G_k, H_k, Q_k, and R_k are assumed to be known and possibly time varying. The measurement noise and process noise are assumed to be uncorrelated: $E[v_k w_l^T] = 0$ for all l, k. Regarding Eq. (3.5), the question is how we estimate the state vector s_k from measurements z_k corrupted by noise, given an initial estimate $\hat{s}_0$ with initial covariance $P_0$. Optimal estimates that minimize the estimation error, in a well-defined statistical sense, are of particular interest. The optimal estimate of s_k will be denoted by $\hat{s}_k$. If we have more measurements than parameters to estimate (l > n) we have an over-determined system and can apply weighted least squares estimation to obtain (we drop the time index k here):
$\hat{s} = (H^T R^{-1} H)^{-1} H^T R^{-1} z$   (3.7)

which reduces to the least squares (pseudo-inverse) solution for uniform measurement noise ($R = \sigma^2 I$). We obtain the same solution if we apply a probabilistic approach using maximum a-posteriori (MAP) estimation.
^b $w_k^T$ denotes the transpose of the vector $w_k$, and $E[x]$ computes the expectation value of the vector x.
3.2.1. Discrete Kalman filter
A major breakthrough in parameter estimation was achieved in [20] by formulating a time recursive estimation technique, which is now called the Kalman filter (KF) technique. Filtering refers here to estimating the state vector at the current time, based upon all past measurements. Kalman formulated and solved the Wiener problem for Gauss-Markov sequences through a state space representation and the viewpoint of conditional distributions and expectations. His results reduce to Eq. (3.7), which shows that for Gaussian random variables weighted least squares, MAP, and Bayes estimation lead to the same result as long as the same assumptions are used. Given a prior estimate of the system state at time t_k, denoted $\hat{s}_k^-$, we seek an updated estimate, $\hat{s}_k^+$, based on use of the measurement z_k. In order to avoid a growing memory filter, this estimate is sought in a linear, recursive form^c

$\hat{s}_k^+ = K_k'\,\hat{s}_k^- + K_k\,z_k$   (3.8)
where $K_k'$ and $K_k$ are yet to be determined time-varying weighting matrices. The optimum choice of these matrices can be obtained by requiring that the estimator be unbiased and that the associated updated error covariance matrix have a minimal weighted scalar sum of its diagonal elements (e.g. [14]). The result is the set of so-called Kalman filter equations for discrete time linear dynamic systems:

$\hat{s}_k^+ = \hat{s}_k^- + K_k\,(z_k - H_k\,\hat{s}_k^-)$   (3.9)
$P_k^+ = (I - K_k H_k)\,P_k^-$   (3.10)
$K_k = P_k^- H_k^T\,(H_k P_k^- H_k^T + R_k)^{-1}$   (3.11)
Equation (3.9) is referred to as the state update equation and Eq. (3.10) is called the state error covariance update equation. $K_k$ denotes the so-called Kalman gain and I stands for the identity matrix. The transition equations used for computing predictions of the state and the state error covariance become:

$\hat{s}_{k+1}^- = F_k\,\hat{s}_k^+ + G_k\,u_k$   (3.12)
$P_{k+1}^- = F_k\,P_k^+\,F_k^T + Q_k$   (3.13)
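For concreteness, the predict/update cycle of Eqs. (3.9)-(3.13) can be written in a few lines. This is only an illustrative sketch in plain NumPy; the variable names simply mirror the symbols above and no particular library API is implied.

```python
import numpy as np

def kalman_update(s_pred, P_pred, z, H, R):
    """Measurement update, Eqs. (3.9)-(3.11)."""
    S = H @ P_pred @ H.T + R                        # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)             # Kalman gain, Eq. (3.11)
    s_upd = s_pred + K @ (z - H @ s_pred)           # state update, Eq. (3.9)
    P_upd = (np.eye(len(s_pred)) - K @ H) @ P_pred  # covariance update, Eq. (3.10)
    return s_upd, P_upd

def kalman_predict(s_upd, P_upd, u, F, G, Q):
    """Time update (prediction), Eqs. (3.12)-(3.13)."""
    s_pred = F @ s_upd + G @ u
    P_pred = F @ P_upd @ F.T + Q
    return s_pred, P_pred
```

In a tracking loop the two functions alternate: each new image yields a measurement z_k, the update step corrects the predicted state, and the prediction step propagates it to the next frame.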
Kalman's technique, originally formulated only for linear motion equations and update equations with a linear relation between measurements and the estimated state, has been extended to apply also to nonlinear equations, by linearizing the problem and using only first order terms: the so-called extended Kalman filter (EKF). Using an iterative minimization method in addition is referred to as the iterated extended Kalman filter (IEKF).

^c Throughout the text we denote with - and + entities immediately before and immediately after a discrete measurement, respectively.

3.2.2. Discrete extended Kalman filter
In an extended Kalman filter the measurement equation is linearized by expanding the nonlinear measurement function h_k up to first order:

$z_k \approx h_k(\hat{s}_k^-) + H_k(\hat{s}_k^-)\,(s_k - \hat{s}_k^-) + v_k$

where $H_k(\hat{s}_k^-)$ is the Jacobian of the measurement function h_k. We can still use the nonlinear plant equation for the state vector corresponding to Eq. (3.12); however, we have to linearize the transition function f_k to predict the state covariance. Instead of Eq. (3.13) we use:

$P_{k+1}^- = F_k(\hat{s}_k^+)\,P_k^+\,F_k(\hat{s}_k^+)^T + Q_k$

where $F_k(\hat{s}_k^+)$ is the Jacobian of the transition function f_k.
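When the measurement function is available only as code, the Jacobians needed above are often approximated numerically. The following sketch is illustrative only (a finite-difference Jacobian and an EKF-style measurement update); the function and variable names are assumptions, not part of the original system.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of a vector-valued function f at x."""
    x = np.asarray(x, dtype=float)
    J = np.zeros((len(f(x)), len(x)))
    for i in range(len(x)):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

def ekf_update(s_pred, P_pred, z, h, R):
    """EKF measurement update: linearize h about the predicted state."""
    H = numerical_jacobian(h, s_pred)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    s_upd = s_pred + K @ (z - h(s_pred))            # innovation uses nonlinear h
    P_upd = (np.eye(len(s_pred)) - K @ H) @ P_pred
    return s_upd, P_upd
```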
3.2.3. Kalman filter summary
A Kalman filter is basically a time recursive formulation of weighted least squares and exhibits the following key features:

Recursive: Time recursive means that we do not need to store previous images or measurements, as opposed to applying batch weighted least squares, where all images (measurements) need to be available at once. This makes it well suited for real-time applications, where we have only one image at a time.

Predictive: Based on current estimates and on the applied motion model we can compute a predicted state for the next image frame. This predicted state is an expectation and enables new measurements to be confined to a limited (expected) search space in the image. This provides a major performance improvement, which is also a requirement for real-time performance.

Optimal: Kalman filters are optimal in the sense of the minimization method. For the linear case it can be shown that they provide the optimal solution. The nonlinear case provides only sub-optimal solutions.
Fig. 5. Block diagram of a Kalman filter: a real world system is assumed to be completely described by a state vector s(t) and its uncertainty P(t). The system's dynamic behavior is covered by the transition function f(t) and the system (process) noise. An internal representation of the system, initialized by $\hat{s}_0$ and $P_0$, is updated using a measurement (observation) z(t), resulting in system state estimates $s^+(t)$ and $P^+(t)$. These values are then used, after a time delay, to compute predictions $s^-(t+1)$ and $P^-(t+1)$, which in turn support the measurement process for the next time step.
Another nice feature of using Kalman filters is that by exploiting dynamical models, all state variables of the state vector can be recovered, even though only a subset of the output variables can be observed at a time. Such a situation is inevitable in vision, since images basically allow only 2-D measurements, whereas the process to be recovered is described in 3-D. A block diagram overview of the Kalman filter technique is given in Fig. 5.

3.3. Scene Modeling

Applying Kalman filtering to dynamic scene analysis requires an extensive modeling of the entire scene with spatial representations of all objects (movable and non-movable) and motion models (motion equations) for all movable objects. In dynamic scene analysis we have to distinguish between objects and subjects: objects denote those entities which just obey physical laws without any internal mission (e.g. a ball kicked by a soccer player), whereas subjects are entities with a certain mission or goal (e.g. the soccer player who kicked the ball). Subjects are much harder to model, since they can exhibit very complex behavior. In the context of traffic scenes and driving we have mainly to cope with subjects, e.g. other cars, each being driven by a driver with a certain goal, or pedestrians. An easy and straightforward approach, however, is to model subjects like objects with their simple physical motion behavior and to model any unpredictable deviations due to higher level decisions of the subject as process noise in the Kalman filter. This approach is often feasible, since the time sampling rate of the input measurements is high compared to the rate of high level changes in the motion of subjects.

3.4. Dickmanns' 4-D Approach
The first successful and by far the most important and impressive work on a visually guided autonomous vehicle has been done in the group of Prof. E. D. Dickmanns of the Universität der Bundeswehr, Munich, Germany (see [11] and the references cited therein). Their work resulted in a demonstration in 1987 of their 5-ton van, the VaMoRs (Versuchsfahrzeug für autonome Mobilität und Rechnersehen), running autonomously on a stretch of the Autobahn (freeway) at speeds of up to 100 km/h (limited only by the power of the engine). Increased computer performance and smaller and lighter hardware have allowed them since 1994 to implement their system in a passenger car, the VaMP (VaMoRs Passenger) car, a Mercedes Benz S-class car. This car runs autonomously at normal cruising speeds of about 130 km/h using bifocal visual sensor input, which provides a long enough look-ahead range. A long distance test ride between Munich (Germany) and Odense (Denmark) in November 1995 has been reported in [3]. Their current implementation employs a set of transputers and PowerPCs [31]. The key feature of Dickmanns' approach is what he coined the 4-D approach, which makes extensive use of spatial-temporal models of objects and processes. His approach works very well since for this type of application he can use special spatial and temporal models, the parameters of which are recursively updated while the vehicle is driving.

3.4.1. Spatial-temporal models - 4-D approach
The basic idea of this 4-D approach, illustrated in Fig. 6, is to generate an internal representation of the real world using measurements from the real world. The internal representation consists of models instantiated through model hypothesis verification and model parameter estimation. Measurements from the real world are compared to predictions from the internal representation, which describe geometric and dynamic models. The entire internal model is parameterized through a state vector which is updated based on the difference between the measurements and predictions using minimum square error estimation. This time recursive estimation process is basically an extended Kalman filter, applied to the problem of state reconstruction from image sequences. The non-linear mapping from 3-D shape parameters and 3-D object states into 2-D image feature locations by perspective projection is locally approximated by first order relations and covered by the
Fig. 6. The 4-D approach to visual road tracking. The left part symbolizes the real world (3D + time). Snapshots are thinned down to a set of features by image processing techniques. Objects are represented by their state parameters, which describe geometric and dynamical models. Illustration courtesy of R. Behringer.
Jacobian matrix of the measurement function as described in Subsection 3.2.2. It turns out that a crucial parameter in these equations is the focal length determined by the camera lens. This is why bifocal vision is used: one camera with a wide angle lens providing a large field of view and a second camera with a tele lens for the large look-ahead ranges required for high speed driving. In addition to visual input from video cameras they also use conventionally measured data such as velocity from the odometer. The Kalman filter application is therefore embedded in a sensor fusion approach. The next subsection covers a description of the road and vehicle model used for lateral vehicle control. Further development of this work has been in collaboration with von Seelen's group in Bochum [44,55] and the Daimler Benz VITA project [48,49], in order to combine both lateral and longitudinal control. We will discuss longitudinal control in the context of obstacle detection in Section 4. Their system is now also capable of performing autonomous lane changes on request ([10]).

3.4.2. Road and vehicle modeling for lateral control

Dickmanns' work models roads in accordance with the European road layout using the so-called clothoid model. Front-wheel steered road vehicles follow a clothoid path when driven at constant speed and a constant steering rate, which is a reasonable driver model. This is why civil engineers in Europe build roads in accordance with clothoids. Clothoids are planar curves which are characterized by a linear change in their curvature C = 1/r, the inverse of the turning radius:
$C = C_0 + \frac{dC}{dl}\,l = C_0 + C_1\,l\,.$   (3.17)
C_0 is a constant and l is the arc length, i.e. clothoids change their curvature linearly with the arc length. C_1 = 1/A² is piecewise constant and A is the so-called clothoid parameter. An essential task for smooth road vehicle guidance is hence to recover the clothoid coefficients C_0 and C_1 of the road using vision. The arc length l is conventionally recovered from velocity readings of the odometer, since the cycle time is known. Since the ideal skeletal line of the road is usually not visible, the road parameters have to be recovered using only the visible parts of the road boundary. In order to robustly detect the road boundary and to resolve the ambiguities between image features from road boundaries and from shadows cast on the road by trees, buildings and other structures next to the road, the assumption of parallel road boundaries is applied. Dickmanns also accounts for vertical road curvature by applying a 3-D road model. The vertical mapping geometry is mainly determined by the camera position above the local tangential plane and the camera pitch angle. It is assumed that both horizontal and vertical road curvatures are so small compared to the visual look-ahead range that they can be recovered independently by decoupled differential equations, which makes the problem much more tractable. In addition to the two decoupled horizontal and vertical road curvature models, he applies a simplified but sufficient dynamic model for the vehicle motion and steering kinematics. This model accounts for the slip angle due to softness of the tires and for tire slipping, which itself is constrained by a differential equation. The overall dynamical model for 3-D road recognition and relative vehicle egomotion estimation consists of three subsystems: (i) for the lateral vehicle dynamics, (ii) for the horizontal, and (iii) for the vertical road curvature dynamics. The full state vector comprises a total of nine parameters. However, the system can be almost completely decoupled into the above three subsystems, of which only the horizontal road curvature affects the lateral vehicle dynamics. Finally a prediction error feedback scheme for recursive state estimation according to the extended Kalman filter technique (Subsection 3.2.2) is applied for continuous state update. Special care has to be taken during the initialization phase, when good object hypotheses are required, a crucial issue in all Kalman filter applications. Dickmanns applies certain constraints on the size of the road and also assumes low initial road curvature as well as a normal initial vehicle position to help the initialization phase. Image measurements are taken in subwindows given by the predicted image location of the road boundary.

4. Monocular Obstacle Detection

The problem of obstacle detection and avoidance emerges in the context of longitudinal vehicle control and in car following. The question here is how computer vision can be used to keep a safe driving distance to a car driving in front.
The latter is of special interest for the platooning concept being studied in the PATH project [50]. Keeping a safe driving distance requires: (i) detecting cars driving ahead on the road, (ii) tracking a car driving ahead, and (iii) accurately measuring the cars' changing relative distance for speed control. Contrary to stereo approaches, single camera approaches for obstacle detection require additional information in order to reconstruct the depth coordinate, which is lost in the projection from 3-D object points to 2-D image points. In this section we describe some of the attempts at recovering range information from monocular images.
4.1. Optical Flow Based Methods

Classical approaches for monocular obstacle detection are based on motion stereo or optical flow interpretation [6,12]. The key idea of these approaches is to predict the optical flow field for a moving observer under constrained motion (e.g. planar motion). Obstacles are then detected by a significant difference between the predicted and the actually observed optical flow field. The major drawbacks of these approaches are: (a) the computational expense and (b) the lack of reliable and accurate optical flow fields and the associated 3-D data (it is well known that structure-from-stereopsis approaches perform better than structure-from-motion approaches). A combination of stereo and optical flow is suggested in [7] in order to perform a temporal analysis of stereo image sequences of traffic scenes. They do not explicitly address the problem of obstacle detection in the context of vehicle guidance, but the general problem of object identification. They extract and match contours of significant intensity changes in (a) stereo image pairs for 3-D information and (b) subsequent frames to obtain their temporal displacement ([34,35]). Object descriptions are finally obtained by grouping the Kalman filtered 3-D trajectories of these contours using a constant image velocity model. In order to distinguish between obstacles and road boundaries or lane markers, they also exploit some heuristics, like horizontally and vertically aligned contour segments, as well as 3-D information extracted from the stereo data [33].

4.2. Methods Based on Qualitative 3-D Reconstruction
In contrast to conventional approaches for obstacle detection based on full 3-D reconstruction, which is known to be error prone, it has also been suggested to use only qualitative 3-D reconstruction [53,54]. Reference [54] describes three algorithms for obstacle detection based on different assumptions: two of them just return yes/no answers about the presence of an obstacle in a view without 3-D reconstruction, based only on the solvability of a linear system which expresses the consistency of a set of points under the same motion. Their third algorithm is quantitative in the sense that it continuously updates ground plane estimates and reconstructs partial 3-D structures by determining the height above the ground plane for each point in the scene.
4.3. Model-Based Approaches
Other methods for monocular obstacle detection exploit the use of spatial obstacle models. However, full model-based approaches require detailed models of the obstacles, which is not feasible for car following applications or collision avoidance.
4.3.1. Detecting obstacles using assumptions about mirror symmetry - the CARTRACK system
A more recent approach exploits heuristics such as the symmetry of the bounding box of the vehicle in front, based on the fact that rear or front views of most vehicles exhibit a strong mirror symmetry about the vehicle's vertical axis. This symmetry provides a striking generic shape feature for object recognition [44,55]. They start by using an intensity-based symmetry finder to detect image regions that are candidates for a leading car. The vertical axis from this step is also an excellent feature for measuring the leading car's relative lateral displacement in consecutive images because it is invariant with respect to vertical movements of the camera and changes in object size. To exactly measure the image size of a leading car, a novel edge detector is proposed which enhances pairs of edge points if the local edge orientations at these locations are mutually symmetric with respect to a known symmetry axis. The 2-D symmetry is formed by a systematic coincidence of 1-D symmetries and is hence well suited for parallel processing. The problem is then to detect local 1-D symmetries along horizontal scan lines (only strictly vertical symmetry axes are considered).
A Local Symmetry Measure: Reference [55] defines a measure for local symmetry within a 1-D intensity function by means of a contrast function:

$S(x_s, w) = \frac{\int E_n(u, x_s, w)^2\,du - \int O(u, x_s, w)^2\,du}{\int E_n(u, x_s, w)^2\,du + \int O(u, x_s, w)^2\,du}$   (4.1)

where

$O(u, x_s, w) := \tfrac{1}{2}\,\bigl[\,I(x_s + u) - I(x_s - u)\,\bigr]$   (4.2)

denotes the odd part of the intensity function I(x), and

$E_n(u, x_s, w) := E(u, x_s, w) - \frac{1}{w}\int_{-w/2}^{w/2} E(v, x_s, w)\,dv$   (4.3)

is the normalized even part of the intensity function I(x) (corrected by the bias in order to compare it with the odd counterpart), with

$E(u, x_s, w) := \tfrac{1}{2}\,\bigl[\,I(x_s + u) + I(x_s - u)\,\bigr]\,.$   (4.4)
The parameter x_s stands for the location of a potential symmetry axis and w is the width of the symmetry interval about x_s. Since they are interested in the maximum symmetry support interval along a scan line, they introduce the following confidence measure for the hypothesis that there is a significant symmetry axis originating from an interval of width w about the position x_s:

$S_A(x_s, w) = \frac{w}{2\,w_{\max}}\,\bigl(S(x_s, w) + 1\bigr), \qquad w \le w_{\max}.$   (4.5)

The values of $S_A(x_s, w)$ are recorded in a symmetry histogram. Two-dimensional symmetry detection requires the combination of many such symmetry histograms. This is easily accomplished by summation of the confidence values for each axis position, provided the symmetry axis is straight.
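As an illustration of how Eqs. (4.1)-(4.5) might be evaluated on a discrete scan line, the following sketch computes the local symmetry measure by sampling the even and odd parts around a candidate axis. It is only a schematic rendering of the published formulas, not the CARTRACK code; the discretization, the index handling, and the names are assumptions.

```python
import numpy as np

def local_symmetry(I, xs, w):
    """Symmetry measure S(xs, w) of Eq. (4.1) for a 1-D intensity scan line I.
    Assumes xs - w//2 >= 0 and xs + w//2 < len(I)."""
    half = w // 2
    u = np.arange(1, half + 1)
    even = 0.5 * (I[xs + u] + I[xs - u])       # even part, Eq. (4.4)
    odd = 0.5 * (I[xs + u] - I[xs - u])        # odd part, Eq. (4.2)
    even_n = even - even.mean()                # bias-corrected even part, Eq. (4.3)
    e, o = np.sum(even_n ** 2), np.sum(odd ** 2)
    return (e - o) / (e + o) if (e + o) > 0 else 0.0

def symmetry_confidence(I, xs, w, w_max):
    """Confidence measure S_A(xs, w) of Eq. (4.5)."""
    return (w / (2.0 * w_max)) * (local_symmetry(I, xs, w) + 1.0)
```

A symmetry histogram is then built by accumulating symmetry_confidence over all admissible (xs, w) pairs along each scan line.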
Symmetry Enhancing Edge Detection: Edge detection is finally performed using a feed-forward network whose connection weights represent a symmetry condition, which results from eight oriented filter outputs at two different image locations. It combines evidence for the different categories of discrete orientation symmetry and can serve as a mechanism for detecting edges that are related to other edges by a certain degree of symmetry.

The CARTRACK system: The described approach has been applied to detecting and tracking cars with video input from the viewpoint of a driver. The symmetry finder produces a symmetry histogram, the peaks of which are tentatively taken to represent the horizontal object position. The symmetry enhancing edge detector is then used to verify and improve the localization as well as to estimate the width of the leading car. However, additional heuristics (such as requiring candidates to be confined to the road, as cars and obstacles are) are needed for robust performance, since the proposed symmetry-based approach also picks up any vertically mirror-symmetric object, like road signs or certain buildings. This symmetry-based approach for obstacle detection and longitudinal vehicle control has also been successfully combined with the road detection and lateral control system of Dickmanns (cf. Subsection 3.4).

5. The Use of Stereopsis for Autonomous Driving

In real traffic scenes, other vehicles are usually present, and this raises two problems. First, they are potential obstacles, which need to be detected. Second, lane markers are often obstructed by other vehicles, which might defeat algorithms that do not allow for occlusion of the lane markers.
Stereopsis uses two or more cameras to obtain actual range information for every object visible to both cameras. In this way an easy distinction between road plane and obstacle is available. For instance, stereopsis with linear cameras was considered in the PROMETHEUS project [5]. A system based on stereopsis was first advocated in [26]. While providing more information, this approach has historically been considered computationally too expensive. With the emergence of more powerful processors, however, such systems are becoming common and even necessary for fully autonomous vehicles, as seen in the 1996 edition of the Intelligent Vehicles Symposium. This section will outline the use of stereopsis for autonomous driving. It will examine the exploitation of domain constraints to simplify the search problem in finding binocular correspondences. This reduces the computational load of stereopsis correspondence. Temporal integration of the results of the stereo analysis is used to build a reliable depth map of obstacles. In crowded traffic scenes where substantial portions of the lane boundaries may be occluded, this makes the road boundary detection more robust. In addition to supporting longitudinal control (i.e. maintaining a safe, constant distance from the vehicle in front) by detecting and measuring the distances to leading vehicles, stereopsis also measures the relative position of the road surface with respect to the vehicle. Measurements of the road surface are used for dynamic update of (a) the lateral position of the vehicle with respect to the lane markers and (b) the camera parameters in the presence of camera vibration. Lane markers are detected and used for lateral control, i.e. following the road while maintaining a constant lateral distance to the road boundary as discussed in the previous sections. Since the areas of the image belonging to the ground plane are identified, this ensures that the search area for lane markers is not corrupted by occlusions.
Fig. 7. The flow of information in the integrated stereo approach.
The principle of the approach, which was presented in [28,51], is illustrated in Fig. 7. Recent developments include a real time implementation described in [45]. A similar approach is [4].

5.1. Stereo Imaging of a Planar Surface
In order to simplify the identification of potential obstacles and also limit the required computational load of stereo disparity, a stereopsis based system for autonomous driving makes the assumption that there exists a planar surface within the visual field of the stereo sensors. This surface is the driveable road plane. Obstacles are represented by objects which are either above or below this plane. Under the assumption of a planar surface viewed by a stereo camera system consisting of two cameras with parallel optical axes, the resulting disparity of imaged points from that surface is a linear function of the image coordinates. That is, from the relative orientation of the planar surface and the intrinsic camera parameters, the disparity in the image is a predetermined function linear in both image coordinates. In this section the equations relating the relative orientation of the road plane to this linear function are derived. They lead to a simple linear least squares solution based on the measured disparities.

Assume two identical cameras with parallel optical axes. The baseline separation between camera centers is b. For such a camera setup, it is well known that a visible point in the world is projected onto the same row in both images. This is equivalent to the epipolar lines relating points between the two cameras being horizontal. The column number of the point projections in the two images will be different however. This disparity between column numbers is a function of the distance Z of the point. This distance is measured along the optical axis. Assuming a pin-hole camera model where world points (X, Y, Z) are projected to image points (x, y) via

$x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}\,,$   (5.1)
the pixel disparity between left and right cameras is

$d = \frac{f\,b}{Z}\,.$   (5.2)

A plane in front of the cameras can be parameterized by a unit vector n normal to the plane, and a perpendicular distance h from the plane to the center of the baseline connecting the two camera centers (Fig. 8). The components of the normal vector can be written in the coordinate system of the cameras:

$n = (A, B, C) = (\sin\theta\,\cos\phi,\ \sin\theta\,\sin\phi,\ \cos\theta)$   (5.3)
where θ is the inclination angle of the plane from the line of sight direction Z, and φ is the roll about this direction.

Fig. 8. A plane represented by a unit normal vector n and distance h. The normal vector can be written in terms of the inclination and roll angles (θ and φ) of the camera setup.

Any point on this plane satisfies the following constraint on its coordinates:
$A\,X + B\,Y + C\,Z = h\,.$   (5.4)
Using the projection equations (5.1), this constraint can be written in terms of image coordinates:

$A\,Z\,x/f + B\,Z\,y/f + C\,Z = h\,.$   (5.5)
Solving for Z:

$Z = h\,\bigl(A\,x/f + B\,y/f + C\bigr)^{-1}\,.$   (5.6)
Using this in the disparity equation (5.2), we find that the disparity is linear in both image coordinates x and y:

$d(x, y) = \frac{b}{h}\,(A\,x + B\,y + f\,C) = \alpha\,x + \beta\,y + \gamma\,.$   (5.7)
This equation proves that the disparity of points on a plane imaged by a stereo camera system consisting of identical cameras with parallel optic axes will be a linear function of image coordinates, independent of the orientation of that plane with respect to the cameras. Note that when the camera roll is zero (φ = 0), the coefficient α is zero. The disparity is then a linear function of image row number, y. In fact, the disparity will be zero at the horizon line in the image, and linearly increasing from that line. This linearity in the image plane is referred to as the Helmholtz shear, a configuration in which the process of computing the stereo disparity is tremendously simplified. This insight is due to Helmholtz [17], who more than a hundred years ago observed that objectively vertical lines in the left and the right view perceptually appear slightly rotated. This led him to the hypothesis that the human brain performs a shear of the retinal images in order to map the ground plane to zero disparity.
4.6 Vision-based Automatic Road Vehicle Guidance 843 Then, any object above the ground plane will have non-zero disparity. This is very convenient because the human visual system is most sensitive around the operating point of zero disparity. Given a collection of disparity measurements in the image, it is trivial to recover the parameters of the plane being imaged. The linear relationship between disparity and plane parameters makes this a standard least squares problem. Thus, given at least three, non co-linear disparity measurements di(zi,yi), we can recover the height h and orientation e l # of the road plane by using a least squares fit to the data: N
From the fit parameters a,
and y we can recover the plane parameters via:
b
h= Ja2
+ p2 + y 2 / p
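The plane fit of Eq. (5.8) and the recovery in Eq. (5.9) amount to an ordinary linear least squares problem. The sketch below is illustrative only; it assumes NumPy and a known focal length f and baseline b in consistent units, and the residual-to-offset conversion is derived from Eqs. (5.4)-(5.7) rather than copied from the original system.

```python
import numpy as np

def fit_ground_plane(x, y, d, f, b):
    """Fit disparity d ~ alpha*x + beta*y + gamma (Eq. 5.8) and recover
    the plane height h and unit normal n = (A, B, C) (Eq. 5.9)."""
    A_mat = np.column_stack([x, y, np.ones_like(x)])
    (alpha, beta, gamma), *_ = np.linalg.lstsq(A_mat, d, rcond=None)
    h = b / np.sqrt(alpha ** 2 + beta ** 2 + (gamma / f) ** 2)
    n = (h / b) * np.array([alpha, beta, gamma / f])   # unit normal of the plane
    return h, n, (alpha, beta, gamma)

def offset_from_plane(x, y, d, plane_coeffs, h):
    """Perpendicular offset of a measured point from the fitted plane
    (positive above). Points with |offset| below a tolerance are labeled
    as lying on the road plane, the rest as potential obstacles."""
    alpha, beta, gamma = plane_coeffs
    d_plane = alpha * x + beta * y + gamma   # disparity the plane would produce
    return h * (d - d_plane) / d
```

With dense disparity measurements over the whole image, fit_ground_plane gives the running estimate of the road plane, and offset_from_plane implements the on-plane/obstacle labeling described in the following paragraph.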
Note that in Eq. (5.7) the constants relating image coordinates to disparity are all scaled by the inverse of the height of the plane, h. This can be used to determine if a point is above or below the plane by a certain distance. That is, given a disparity d at location (x, y), Eq. (5.7) gives the height of this point above or below the plane. Based on this, each point can be labeled as being either on the plane, within a small tolerance, or an object which is not on the plane and therefore an obstacle to avoid.

5.2. Effects of Error on the Ground Plane Estimate

Stereo disparities are known to contain a fair amount of noise. Uncertainty in matching between left and right images, due to changes in lighting, inexact camera calibration and different viewpoints, leads to errors in the disparity measurement. This error is usually modeled as consisting of both a stochastic component and an outlier component. The first comes from the finite pixel size of the image, quantization effects and random noise in the intensity values. The second comes from mis-matches in the disparity computation. Whereas the first component is usually on the order of one pixel or less, the second can be many pixels in magnitude. Since the ground plane parameters are recovered from a standard least squares solution, the effect on the solution due to stochastic noise in the data is well known. If this noise is modeled as Gaussian, then the propagation of noise from data to solution is straightforward. For a noise term with standard deviation σ, the resulting noise component in the solution will have a standard deviation roughly equal to $\kappa\,\sigma/\sqrt{N}$, where κ is a scale factor dependent on the measurement distribution. Since the entire image is used to collect disparity measurements, the number N will be on the order of thousands of measurements. This makes the solution quite robust to stochastic noise in disparity measurements.

In addition to the magnitude of the noise, the error in the solution depends also on the distribution of the disparity measurements in the image. Because of the large number of measurements, it is this term which could affect the solution the most. If the data lie on a single line in space, then no unique solution to the plane fit exists. When the data are close to co-linear, small errors in measurements can lead to large errors in the solution. This can be minimized by using many disparity measurements from different regions of the image. Whereas the effect of stochastic noise on disparity measurements is small when there is a sufficient number of measurements, the solution can be affected strongly by an error in camera calibration. The solution for the ground plane's relative orientation with respect to the camera developed in Section 5.1 assumed that the optical axes of both cameras are parallel. If the axes are not parallel it can have a large effect on the accuracy of the estimate. For small angles, the error in the estimated φ as a function of the relative pitch angle p is

$\delta\phi \approx \frac{4h}{b}\,p$   (5.10)

while the error in the estimated θ as a function of the vergence angle v is

$\delta\theta \approx \frac{2h}{b}\,v\,.$   (5.11)
Since the height of the stereo camera rig is often larger than the baseline separation, the scale factor h/b is greater than one. In our test vehicle the baseline separation was 40 cm while the height above the road plane was 1.8 m. Larger baseline systems placed closer to the road will show fewer effects from calibration error. Figure 9 shows the estimated horizon for two cameras whose optical axes were approximately two degrees from parallel, both in vergence and relative pitch. The estimated horizon in the top image, calculated by assuming the axes were parallel, is clearly incorrect. The second image shows the same estimate after the error in calibration is taken into account. The deviation from parallel can be accommodated by warping either the images or the resulting disparity map by an affine warp. This is due to the fact that a pure camera rotation can be compensated for via an affine warping based on the rotation matrix.
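The compensation mentioned above can be made concrete: for a pure rotation R of a camera with calibration matrix K, image points are related by the homography $H = K R K^{-1}$, which for the small angles involved here is well approximated by an affine warp. The following sketch uses OpenCV-style calls purely for illustration; the calibration matrix K, the captured right_image, and the 2-degree correction are assumptions, not values from the original system.

```python
import numpy as np
import cv2

def rotation_compensation_homography(K, rx, ry, rz):
    """Homography H = K R K^{-1} that maps image points of the original view
    to the view rotated by small angles (rx, ry, rz) in radians."""
    R, _ = cv2.Rodrigues(np.array([rx, ry, rz], dtype=float))
    return K @ R @ np.linalg.inv(K)

# Hypothetical usage: undo an approximate 2-degree vergence error of the
# right camera (K and right_image assumed available from calibration/capture).
# H = rotation_compensation_homography(K, 0.0, np.deg2rad(2.0), 0.0)
# rectified = cv2.warpPerspective(right_image, H, right_image.shape[1::-1])
```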
Fig. 9. Estimated horizon location for a pair of cameras with optical axes about two degrees away from parallel. The lower image is the estimate obtained after the images are restored to a parallel axes configuration via an affine warping of one of the images.

5.3. Binocular Stereopsis

Although the algorithms proposed in the literature for computing binocular stereopsis are quite computationally expensive, the complexity can be reduced considerably by using region-of-interest processing and exploiting domain constraints. Region-of-interest processing recognizes that in regions of uniform image brightness, the stereo correspondence between camera views cannot be determined accurately. To reduce the computational load, the search for disparity is not performed in those regions. The domain constraints provide the Helmholtz shear described in Section 5.1, which limits the range of possible disparities, thus reducing the search requirements.

The disparity between images is found by computing the normalized correlation between small horizontal windows in the two images at the locations of the points-of-interest. The normalized correlation for disparity shift τ at horizontal image location x is:

$C(\tau, x) = \frac{a_{r,l}(x,\ x+\tau)}{\sqrt{a_{r,r}(x, x)\;a_{l,l}(x+\tau,\ x+\tau)}}$   (5.12)

where the correlations
$a_{i,j}(x, y)$ are approximated by summations

$a_{i,j}(x, y) = \sum_{u=-W/2}^{+W/2} g_i(x + u)\,g_j(y + u)\,.$   (5.13)

The subscripts (i, j) can be either l or r, representing the left or right image. The summation is calculated over a window of size W.
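A minimal version of this correlation-based disparity search can be written as follows. It is a sketch only (1-D scan lines, NumPy, and the function names are assumptions); in the actual system the Helmholtz shear restricts the disparity range searched at each row, which is reflected here only by the [d_min, d_max] arguments.

```python
import numpy as np

def ncc(gr, gl, x, tau, W):
    """Normalized correlation of Eqs. (5.12)-(5.13) between a window of the
    right scan line gr centered at x and the left scan line gl at x + tau.
    Indices are assumed to stay inside both scan lines."""
    u = np.arange(-(W // 2), W // 2 + 1)
    wr, wl = gr[x + u], gl[x + tau + u]
    denom = np.sqrt(np.sum(wr * wr) * np.sum(wl * wl))
    return np.sum(wr * wl) / denom if denom > 0 else 0.0

def disparity_at(gr, gl, x, d_min, d_max, W=7):
    """Search a band of disparities (e.g. around the ground-plane disparity
    predicted by the Helmholtz shear) and return the best integer shift;
    sub-pixel refinement would interpolate quadratically around it."""
    taus = range(d_min, d_max + 1)
    scores = [ncc(gr, gl, x, t, W) for t in taus]
    return d_min + int(np.argmax(scores))
```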
Fig. 10. (a) View from the left camera. (b) The light region indicates objects detected to be on the road surface; dark indicates objects above the road surface.

Points-of-interest are those locations in the right image where the value of $a_{r,r}$ is above a threshold. The normalized correlation function is calculated only in those regions. Sub-pixel disparities are obtained by quadratic interpolation of the function about the maximum τ. Residual disparities, which appear in the image after the ground plane disparity has been mapped to zero, indicate objects which appear above the ground plane. A simple threshold is used to distinguish between features lying on the ground plane (e.g. lane markers or other markings painted on the road) and features due to objects lying above the ground plane (which may become future obstacles). Figure 10 shows the result on a single frame.

5.4. Determining Ground Plane Objects
The least squares solution for the ground plane's relative orientation assumed that all disparity points belonged to the ground plane. In this way a least squares estimator could be used. In reality there will be many objects in the field of view which do not lie on the ground plane. If the disparity measurements of these objects were used, the ground plane estimate would be incorrect. Therefore only the parts of the image corresponding to points on the road surface should be used for the update. In order to accomplish this the relative orientation of the surface needs to be known. Figure 11 plots the disparity measurements versus image row number for all points in a cluttered traffic scene. If the imaged points all lay on a single plane in front of the cameras (and there was no camera roll, which is assumed here) then all of the points in this graph would lie on a single line. This comes from Section 5.1 where it was shown that the disparity is a linear function of the image row number. This line would intersect the zero disparity line at the image row number corresponding to the horizon in the image. The least squares solution, using all disparity measurements, obtains a plane orientation which places the horizon line more than 50 pixels above its true location. This results in a majority of obstacles being incorrectly labeled as part of the ground plane.
Fig. 11. Disparity versus row number for a crowded traffic scene. All points from the ground plane lie on a single line in this space. All objects below or above the ground plane contribute points which do not lie on this line. Here other vehicles contribute a majority of the disparity points. Estimating the orientation of the ground plane using all points results in a bias toward a plane higher than the true one.
Two different methods are used to delineate the two distributions. The first uses the predicted orientation of the plane obtained from the Kalman filter model of the dynamics of the vehicle. The predicted orientation of the plane, plus the uncertainty in that orientation, defines a search window in disparity space to look for candidate ground plane points. This search region corresponds to a narrow wedge in disparity space. Within this search region, a majority of points belong to the ground plane and the least squares estimate approximates the true value. A systematic bias in the estimate still exists when using only the above method. This is due to the fact that most objects are above the ground plane. To improve the estimate, the assumption that objects are connected to the ground plane is used. That is, objects are not floating above the ground plane. Under this assumption, the image of objects above the road plane will be connected to the ground plane image. The pixels in the image just below a detected obstacle are not used when updating the ground plane estimate. As an example, this method would not include the parts of a vehicle’s tires where they touch the ground plane. If these regions are not removed from consideration, the parts of the tires just above the road surface would be included in the update because of the uncertainty in the predicted ground plane orientation. This new technique combined with the limited search based on the Kalman filter prediction produces better results.
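The wedge-shaped search region and the exclusion of pixels just below detected obstacles can be sketched as a simple selection step before refitting the plane. This is only an illustrative outline of the idea described above; the thresholds, the disparity-tolerance model, and the exclusion margin are assumptions.

```python
import numpy as np

def select_ground_plane_points(x, y, d, plane_pred, tol, obstacle_mask=None,
                               exclusion_rows=5):
    """Keep only disparity measurements consistent with the predicted plane
    (a narrow wedge in disparity space) and not lying just below obstacles."""
    alpha, beta, gamma = plane_pred
    residual = np.abs(d - (alpha * x + beta * y + gamma))
    keep = residual < tol                      # wedge around the predicted plane
    if obstacle_mask is not None:
        below = obstacle_mask.copy()
        for _ in range(exclusion_rows):        # grow obstacle regions downwards
            below[1:, :] |= below[:-1, :]
        keep &= ~below[y.astype(int), x.astype(int)]
    return x[keep], y[keep], d[keep]
```

The surviving measurements are then passed to the least squares fit of Eq. (5.8), and the result updates the Kalman filter that supplied the predicted plane in the first place.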
A similar stereo vision based system developed by the Subaru car company uses only those image points corresponding to lane markers to make its estimate of the ground surface [42]. To make this distinction, the system selects parts of the image containing white bars on a dark background and uses the disparity of those regions to estimate the ground plane orientation. This works as long as white lane markers exist and are visible. On highways in the United States the first assumption is not always valid. Lane markers are often hard to distinguish due to a lack of paint, dirty roads or the use of tactile markers instead of painted ones. Also, the markers can be occluded by other vehicles. Moreover, vehicles with coloration which resembles lane markings would confuse the system. To demonstrate the performance of the system, Fig. 12 shows a single frame of a sequence in which the test vehicle is passing a stalled vehicle on the shoulder of the road. Detecting this vehicle early is important for safe driving. The relative speed between vehicles is very high and the lateral distance between vehicles is small. This figure also shows the estimated positions of the stalled vehicle with respect to the test vehicle (located at the origin in this plot). The ellipses centered at the position
Fig. 12. A single frame from a sequence in which the test vehicle passes a stalled vehicle. The first plot shows the estimated position (with uncertainty ellipses) of the vehicle over time. The test vehicle is at rest at the origin in this reference frame. The second plot shows the estimated relative velocity of the stalled vehicle, with its uncertainty envelope, as a function of time. The frame rate was 30 frames per second so these plots represent less than one second of time. The correct relative velocity was 55 mph.
estimates are the uncertainty in position as given by the Kalman filter. Note the relative horizontal and vertical scales. The second plot shows the estimate of the relative velocity (negative in this case since the vehicle is at rest with respect to ours) and its uncertainty. The estimate improves rapidly with new measurements. This is necessary since the object is detected for only 28 frames, which is less than one second at the 30 frames per second video rate.
Fig. 13. Model of the suspension system of the vehicle. The springs are critically damped with a natural frequency of 1.8 Hz. The cameras are also free to rotate about their viewing direction (roll). The roll dynamics are modeled by a critically damped torsion model.
Fig. 14. The height and inclination angle of the cameras as tracked by the Kalman filter. The measurements obtained from the least squares solution are also shown.
5.5. Dynamic Ground Plane Estimation

The measurements of the current ground plane orientation, (θ, φ, h), are used to update a Kalman filter model of the dynamics of the vehicle. This simple model assumes that the cameras are mounted on top of two critically damped springs (Fig. 13). The stereo camera rig also contains a roll component. The roll is modeled by a second order critically damped (spring) system. The total state dimension is 6, consisting of the position and velocity of both springs, as well as the rotation and velocity of the roll term. By assuming small angles about the rest position, the measurement model is a linear transformation from spring positions to camera angles. Figure 14 shows the state estimate for the height and inclination angle of the cameras, along with the measurements provided by the least squares solution. The filter effectively smoothes the measurements while following the trend of the motion. Using a Kalman filter to estimate the plane orientation offers a number of advantages. The first is a temporal integration and smoothing of measurements. The filter also provides a prediction which is used to search the disparity measurements for the next plane update. Finally, by using a critically damped system, the system will gracefully return to its rest position given erroneous or missing measurement data. The rest position currently is based on calibration parameters. In reality, this rest position is a function of the vehicle load, which must be estimated each time it changes. Integration with accelerometers as well as driver controls would improve the model.

5.6. Temporal Integration

Computing depth from just a pair of images is known to be sensitive to noise. One can improve the accuracy of the depth estimation by exploiting the temporal integration of information over time using the expected dynamics of the scene via Kalman filters. Objects of interest will be assumed to be either other vehicles on the road or stationary objects connected to the road plane. In addition we can exploit the physical constraints of the environment. We can assume we are interested in connected, rigid objects. This allows us to use spatial coherence in identifying objects from the depth map. The spatial coherence of objects is used to segment the depth map into objects of interest. First, connected components are found in a 3-D space consisting of the two image dimensions plus the depth dimension. In the two image dimensions, points are connected if they are one of the four nearest neighbors. In the depth dimension they are connected if the difference in depth is less than the expected noise in the depth estimates. Figure 15 gives an example of two objects which are connected in this image/depth 3-D space. These connected components form the basis of potential objects which are to be tracked over time. If the same object appears in two consecutive frames, a Kalman filter is initialized to track its position and velocity with respect to our vehicle. Figure 16 shows the objects found by this method.
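The grouping of depth measurements into connected components in the combined image/depth space can be sketched as a flood fill with a depth tolerance. This is a schematic illustration of the rule described above (4-neighbors in the image, depth difference below the expected noise); it is not the original implementation, and the names are assumptions.

```python
import numpy as np
from collections import deque

def image_depth_components(depth, valid, depth_tol):
    """Label 4-connected pixels whose depth values differ by less than depth_tol.
    'depth' is a 2-D array; 'valid' marks pixels with a depth measurement."""
    labels = np.zeros(depth.shape, dtype=int)
    next_label = 0
    rows, cols = depth.shape
    for r in range(rows):
        for c in range(cols):
            if not valid[r, c] or labels[r, c]:
                continue
            next_label += 1
            queue = deque([(r, c)])
            labels[r, c] = next_label
            while queue:
                i, j = queue.popleft()
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):  # 4-neighbors
                    ni, nj = i + di, j + dj
                    if (0 <= ni < rows and 0 <= nj < cols and valid[ni, nj]
                            and not labels[ni, nj]
                            and abs(depth[ni, nj] - depth[i, j]) < depth_tol):
                        labels[ni, nj] = next_label
                        queue.append((ni, nj))
    return labels, next_label
```

Components that reappear in consecutive frames then seed per-object Kalman filters for tracking position and velocity relative to the test vehicle.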
Fig. 15. Connected components in image/depth space consist of those pixels which are nearest neighbors in image coordinates as well as having depth differences less than depth uncertainty.
Fig. 16. Objects identified as being in the same lanes of traffic as the test vehicle. On the right side of the image is a “bird’s-eye-view” from above the road surface showing the relative position of the tracked objects with respect to the test vehicle.
References
[1] O. D. Altan, H. K. Patnaik and R. P. Roesser, Computer architecture and implementation of vision-based real-time lane sensing, in Proc. Intelligent Vehicles Symp. (1992) 202-206.
[2] Y. Bar-Shalom and X.-R. Li, Estimation and Tracking: Principles, Techniques, and Software (Artech House, Boston, London, 1993).
[3] R. Behringer and M. Maurer, Results on visual road recognition for road vehicle guidance, in Proc. Intelligent Vehicles Symp. (1996) 415-420.
[4] M. Bertozzi and A. Broggi, Real time lane and obstacle detection on the GOLD system, in Proc. Intelligent Vehicles Symp. (1996) 213-218.
[5] J.-L. Bruyelle and J.-G. Postaire, Disparity analysis for real time obstacle detection by linear stereovision, in Proc. Intelligent Vehicles Symp. (1992) 51-56.
[6] S. Carlsson and J.-O. Eklundh, Object detection using model based prediction and motion parallax, in Proc. European Conf. Computer Vision (1990) 297-306.
[7] S. Chandrashekar, A. Meygret and M. Thonnat, Temporal analysis of stereo image sequences of traffic scenes, in Proc. Vehicle Navigation and Infor. Sys. Conf. (1991) 203-212.
[8] J. Crisman, Color Vision for the Detection of Unstructured Roads and Intersections, PhD Thesis, Carnegie Mellon University, 1990.
[9] D. DeMenthon and L. Davis, Reconstruction of a road by local image matches and 3-D optimization, in Proc. Int. Conf. Robotics and Automation (1990) 1337-1342.
[10] E. D. Dickmanns and N. Muller, Scene recognition and navigation capabilities for lane changes and turns in vision-based vehicle guidance, Control Engineering Practice 4, 5 (1996) 589-599.
[11] E. D. Dickmanns and B. D. Mysliwetz, Recursive 3-D road and relative ego-state recognition, IEEE Trans. Pattern Anal. Mach. Intell. 14 (1992) 199-213.
[12] W. Enkelmann, Obstacle detection by evaluation of optical flow fields from image sequences, in Proc. European Conf. Computer Vision (1990) 134-138.
[13] O. D. Faugeras, Three-dimensional Computer Vision: A Geometric Viewpoint (MIT Press, 1993).
[14] A. Gelb (ed.), Applied Optimal Estimation (MIT Press, Cambridge, MA, 1974).
[15] H. Goldstein, Classical Mechanics (Addison-Wesley Press, Reading, MA, 1980).
[16] D. A. Gordon, Perceptual basis of vehicular guidance, Public Roads 34, 3 (1996) 53-68.
[17] H. v. Helmholtz, Treatise on Physiological Optics (Dover, NY, translated by J. P. C. Southall, 1925).
[18] B. K. P. Horn, Robot Vision (MIT Press, 1986).
[19] T. M. Jochem, D. A. Pomerleau and C. E. Thorpe, MANIAC: A next generation neurally based autonomous road follower, in Image Understanding Workshop (1993) 473-479.
[20] R. E. Kalman, A new approach to linear filtering and prediction problems, J. Basic Engin. (ASME) 82D (1960) 35-45.
[21] K. Kanatani and K. Watanabe, Reconstruction of 3-D road geometry from images of autonomous land vehicles, IEEE Trans. Robotics and Automation 6, 1 (1990) 127-132.
[22] S. K. Kenue, Lanelock: detection of lane boundaries and vehicle tracking using image processing techniques, Part 1, in SPIE Mobile Robots IV, 1989.
[23] S. K. Kenue, Lanelock: detection of lane boundaries and vehicle tracking using image processing techniques, Part 2, in SPIE Mobile Robots IV, 1989.
[24] K. I. Kim, S. Y. Oh, J. S. Lee, J. H. Han and C. N. Lee, An autonomous land vehicle: design concept and preliminary road test results, in Proc. Intelligent Vehicles Symp., 1993.
[25] K. Kluge and C. Thorpe, Representation and recovery of road geometry in YARF, in Proc. Intelligent Vehicles Symp., Detroit, MI, 1992, 114-119.
[26] D. Koller, Q.-T. Luong and J. Malik, Using binocular stereopsis for lane following and lane changing maneuvers, in Proc. Intelligent Vehicles Conf., Paris, France, 1994.
[27] B. B. Litkouhi, A. Y. Lee and D. B. Craig, Estimator and controller for Lanetrak, a vision-based automatic vehicle steering system, in Proc. IEEE Conf. Decision and Control (1993) 468-478.
[28] Q.-T. Luong, J. Weber, D. Koller and J. Malik, An integrated stereo-based approach to automatic vehicle guidance, in Proc. Int. Conf. Computer Vision, Cambridge, MA (1995) 52-57.
[29] W. MacKeown, P. Greenway, B. Thomas and W. Wright, Road recognition with a neural network, in Proc. 1st IFAC Int. Workshop on Intelligent Autonomous Vehicles (1993) 151-156.
[30] L. Matthies, Stereo vision for planetary rovers: Stochastic modeling to near real-time implementation, The Int. J. Computer Vision 8 (1992) 71-91.
[31] M. Maurer, R. Behringer, S. Fürst, F. Thomanek and E. Dickmanns, A compact vision system for road vehicle guidance, in Proc. Int. Conf. Pattern Recogn. (1996) C7A.1.
[32] P. S. Maybeck, Stochastic Models, Estimation and Control (Academic Press, London, 1979).
[33] A. Meygret and M. Thonnat, Object detection in road scenes using stereo data, in Pro-Art Workshop on Vision, Sophia Antipolis, Apr. 19-20, 1990.
[34] A. Meygret and M. Thonnat, Segmentation of optical flow and 3-D data for the interpretation of mobile objects, in Proc. Int. Conf. Computer Vision, Osaka, Japan, Dec. 4-7, 1990, 238-245.
[35] A. Meygret, M. Thonnat and M. Berthod, A pyramidal stereovision algorithm based on contour chain points, in Proc. European Conf. Computer Vision, S. Margherita Ligure, Italy, May 18-23, 1992, 83-88, G. Sandini (ed.), Lecture Notes in Computer Science 588 (Springer-Verlag, Berlin, Heidelberg, New York, 1992).
[36] H. Moravec, Robot Rover Visual Navigation (UMI Research Press, 1981).
[37] D. A. Pomerleau, Neural Network Perception for Mobile Robot Guidance (Kluwer Academic Publishing, 1994).
[38] D. A. Pomerleau, Progress in neural network-based vision for autonomous robot driving, in Proc. Intelligent Vehicles Symp. (1992) 391-396.
[39] D. A. Pomerleau, RALPH: Rapidly adapting lateral position handler, in Proc. Intelligent Vehicles Symp., 1995.
[40] D. Raviv and M. Herman, A new approach to vision and control for road following, in Proc. Int. Conf. Computer Vision and Pattern Recogn. (1991) 217-225.
[41] M. Rosenblum and L. S. Davis, The use of a radial basis function network for visual autonomous road following, in Proc. Intelligent Vehicles Symp., 1993.
[42] K. Saneyoshi, Drive assist system using stereo image recognition, in Proc. Intelligent Vehicles Symp., Tokyo, 1996.
[43] L. E. Scales, Introduction to Non-Linear Optimization (Macmillan, London, 1985).
[44] M. Schwartzinger, T. Zielke, D. Noll, M. Brauckmann and W. v. Seelen, Vision-based car-following: Detection, tracking, and identification, in Proc. Intelligent Vehicles Symp. (1992) 24-29.
[45] C. J. Taylor, J. Malik and J. Weber, A real-time approach to stereopsis and lane-finding, in Proc. Intelligent Vehicles Symp. (1996) 207-212.
[46] C. Thorpe (ed.), Vision and Navigation: The Carnegie-Mellon Navlab (Kluwer Academic Publishers, Norwell, Mass, 1990).
[47] M. A. Turk, D. G. Morgenthaler, K. D. Gremban and M. Marra, VITS - a vision system for autonomous land vehicle navigation, IEEE Trans. Pattern Anal. Mach. Intell. 10 (1988) 342-361.
[48] B. Ulmer, VITA - an autonomous road vehicle (ARV) for collision avoidance in traffic, in Proc. Intelligent Vehicles Symp. (1992) 36-41.
[49] B. Ulmer, VITA II - active collision avoidance in real traffic, in Proc. Intelligent Vehicles Symp. (1994) 1-6.
[50] P. Varaiya and S. E. Shladover, Sketch of an IVHS architecture, in Proc. VNIS '91 Conference, Paper No. 912772, Dearborn, MI, Oct. 1991, 325-341. [51] J. Weber, D. Koller, Q.-T. Luong and J. Malik, New results in stereo-based automatic vehicle guidance, in Proc. Intelligent Vehicles Conf., Detroit, MI, 1995, 530-535. [52] J. Zhang and H. H. Nagel, Texture based segmentation of road images, in Proc. Intelligent Vehicles Symp. (1994) 260-265. [53] Z. Zhang, R. Weiss and A. R. Hanson, Qualitative obstacle detection, in Proc. Int. Conf. Computer Vision and Pattern Recogn., Seattle, WA, June 19-23, 1994, 554-559. [54] Z. Zhang, R. Weiss and A. R. Hanson, Obstacle detection based on qualitative and quantitative 3D reconstruction, IEEE Trans. Pattern Anal. Mach. Intell. 19, 1 (1997) 15-26. [55] T. Zielke, M. Brauckmann and W. von Seelen, Intensity and edge-based symmetry detection with an application to car-following, CVGIP 58, 2 (1993) 177-190.
PART 5 ARCHITECTURE AND TECHNOLOGY
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 857-867, Eds. C. H. Chen, L. F. Pau and P. S. P. Wang © 1998 World Scientific Publishing Company
CHAPTER 5.1
VISION ENGINEERING: DESIGNING COMPUTER VISION SYSTEMS
RAMA CHELLAPPA Department of Electrical Engineering, Center for Automation Research, University of Maryland, College Park, Maryland 20742, USA and AZRIEL ROSENFELD Center for Automation Research, University of Maryland, College Park, Maryland 20742, USA The goal of computer vision is to derive descriptive information about a scene by computer analysis of images of the scene. Vision algorithms can serve as computational models for biological visual processes, and they also have many practical uses; but this paper treats computer vision as a subject in its own right. Vision problems are often ill-defined, ill-posed, or computationally intractable; nevertheless, successes have been achieved in many specific areas. We argue that by limiting the domain of application, carefully choosing the task, using redundant data (multi-sensor, multi-frame), and applying adequate computing power, useful solutions to many vision problems can be obtained. Methods of designing such solutions are the subject of the emerging discipline of Vision Engineering. With projected advances in sensor and computing technologies, the domains of applicability and ranges of problems that can be solved will steadily expand. Keywords: Computer vision, vision engineering.
1. Introduction

The general goal of computer vision is to derive information about a scene by computer analysis of images of that scene. Images can be obtained by various types of sensors; the most common kind is the optical image obtained by a TV camera. An image is input to a digital computer by sampling its brightness at a regularly spaced grid of points, resulting in a digital image array. The elements of the array are called pixels (short for "picture elements"), and their values are called gray levels. Given one or more digital images obtained from a scene, a computer vision system attempts to (partially) describe the scene as consisting of surfaces or objects; this class of tasks will be discussed further in Section 2. Animals and humans have impressive abilities to successfully interact with their environments - navigate over and around surfaces, recognize objects, etc. - using
vision. This performance constitutes a challenge to computer vision; at the same time, it serves as an existence proof that the goals of computer vision are attainable. Conversely, the algorithms used by computer vision systems to derive information about a scene from images can be regarded as possible computational models for the processes employed by biological visual systems. However, constructing such models is not the primary goal of computer vision; it is concerned only with the correctness of its scene description algorithms, not with whether they resemble biological visual processes. Computer vision techniques have many practical uses for analyzing images. Areas of application include document processing (e.g. character recognition), industrial inspection, medical image analysis, remote sensing, target recognition, and robot guidance. There have been successful applications in all of these areas, but many tasks are beyond current capabilities (e.g. reading unconstrained handwriting). These potential applications provide major incentives for continued research in computer vision. However, successful performance of specific tasks on the basis of image data is not the primary goal of computer vision; such performance is often possible even without obtaining a correct description of the scene. Viewed as a subject in its own right, the goal of computer vision is to derive correct (partial) descriptions of a scene, given one or more images of that scene. Computer vision can thus be regarded as the inverse of computer graphics, in which the goal is to generate (realistic) images of a scene, given a description of the scene. The computer vision goal is more difficult, since it involves the solution of inverse problems that are highly underconstrained ("ill-posed"). A more serious difficulty is that the problems may not even be well defined, because many classes of real-world scenes are not mathematically definable. Finally, even well-posed, well-defined vision problems may be computationally intractable. These sources of difficulty will be discussed in Section 3. In spite of these difficulties, vision systems have achieved successes in many domains. The chances of success are greatly increased by limiting the domain of application, simplifying the task to be performed, increasing the amount of image data used, and providing adequate computing power. These principles can be stated concisely as: Define your domain; pick your problem; improve your input; and take your time. They will be illustrated in Section 4. Following these principles in attempting to solve vision problems provides a foundation for a discipline which we may call Vision Engineering, as discussed in Section 5.

2. Vision Tasks
If a scene could be completely arbitrary, not very much could be inferred about it by analyzing images. The gray levels of the pixels in an image measure the amounts of light received by the sensor from various directions. Any such set of brightness measurements could arise in infinitely many different ways as a result of
light emitted by a set of light sources, transmitted through a sequence of transparent media, and reflected from a sequence of surfaces. Computer vision becomes feasible only if restrictions are imposed on the class of possible scenes. The central problem of computer vision can thus be reformulated as follows: given a set of constraints on the allowable scenes, and given a set of images obtained from a scene that satisfies these constraints, derive a description of that scene. It should be pointed out that unless the given constraints are very strong, or the given set of images is large, the scene will not be uniquely determined; the images only provide further constraints on the subclass of scenes that could have given rise to them, so that only partial descriptions of the scene are possible. Computer vision tasks vary widely in difficulty, depending on the nature of the constraints that are imposed on the class of allowable scenes and on the nature of the partial descriptions that are desired. The constraints can vary greatly in specificity. At one extreme, they may be of a general nature - for example, that the visible surfaces in the scene are all of some "simple" type (e.g. quadric surfaces with Lambertian reflectivities). [Constraints on the illumination should also be specified - for example, that it consists of a single distant light source. Note that the surfaces may be "simple" in a stochastic rather than a deterministic sense; for example, they may be fractal surfaces of given types, or they may be smooth surfaces (e.g. quadric) with spatially stationary variations in reflectivity (i.e. uniformly textured surfaces).] At the other extreme, the constraints may be quite specialized - for example, that the scene contains only objects having given geometric ("CAD") descriptions and given optical surface characteristics. Similarly, the desired scene descriptions can vary greatly in completeness. "Recovery" tasks call for descriptions that are as complete as possible, but "recognition" and "navigation" tasks usually require only partial descriptions - for example, identification and location of objects or surfaces of specific types if they are present in the scene. In its earliest years (beginning in the mid-1950s), computer vision research was concerned primarily with recognition tasks, and dealt almost entirely with single images of (essentially) two-dimensional scenes: documents, photomicrographs (which show thin "slices" of the subject, because the depth of field of a microscope image is very limited), or high-altitude views of the earth's surface (which can be regarded as essentially flat when seen from sufficiently far away). The mid-1960's saw the beginnings of research on robot vision; since a robot must deal with solid objects at close-by distances, the three-dimensional nature of the scene cannot be ignored. Research on recovery tasks began in the early 1970's, initially considering only single images of a static scene, but by the mid-1970's it was beginning to deal with time sequences of images (of a possibly time-varying scene) obtained by a moving sensor. By definition, recovery tasks require correct descriptions of the scene; but recognition and navigation tasks can often be performed successfully without
completely describing even the relevant parts of the scene. For example, obstacles can often be detected, or object types identified, without fully determining their geometries. Thirty-five years of research have produced theoretical solutions to many computer vision problems; but many of these solutions are based, explicitly or tacitly, on unrealistic assumptions about the class of allowable scenes, and as a result, they often perform unsatisfactorily when applied to real-world images. As we shall see in the next section, even for static, two-dimensional scenes, many vision problems are ill-posed, ill-defined, or computationally intractable.
3. Sources of Difficulty

3.1. Ill-Posedness
As already mentioned, the gray levels of the pixels in an image represent the amounts of light received by the sensor from various directions. If the scene does not contain transparent objects (other than air, which we will assume to be clear), the light contributing to a given pixel usually comes from a small surface patch in the scene (on the first surface intersected by a line drawn from the sensor in the given direction). This surface patch is illuminated by light sources, as well as by light reflected from other patches. Some fraction of this illumination is reflected toward the sensor and contributes to the pixel; in general, this fraction depends on the orientation of the surface patch relative to the direction(s) of illumination and the direction of the sensor, as well as on the reflectivity of the patch. In short, the gray level of a pixel is the resultant of the illumination, orientation, and reflectivity of a surface patch. If all these quantities are unknown, it is not possible to recover them from the image. Only under limited conditions of smoothly curved Lambertian surfaces with constant albedo can one recover estimates of illuminant direction, surface albedo and shape from a single image [1]. This example is a very simple illustration of the fact that most vision problems are "ill-posed", i.e. underconstrained; they do not have unique solutions. Even scenes that satisfy constraints usually have more degrees of freedom than the images to which they give rise; thus even when we are given a set of images of a scene, the scene is usually not uniquely determined. In some special cases, with the availability of singular points, unique solutions may be obtained [2]. In applied mathematics, a common approach to solving ill-posed problems is to convert them into well-posed problems by imposing additional constraints [3]. A standard method of doing this, known as regularization, makes use of smoothness constraints; it finds the solution that minimizes some measure of nonsmoothness (usually defined by a combination of derivatives). Regularization methods were introduced into computer vision in the mid-1980s, and have been applied to many vision problems [4]. Evidently, however, solutions found by regularization often do not represent the actual scene [5]; for example, the actual scene may be piecewise smooth, but may also have discontinuities, and a regularized solution tends to
smooth over these discontinuities. To handle this problem, more general approaches have been proposed which allow discontinuities [6], but which minimize the complexity of these discontinuities - e.g. minimize the total length and total absolute curvature of the borders between smooth regions. In effect, these approaches [7] find solutions that have minimum-length descriptions (since the borders can be described by encoding them using chain codes). However, the actual scene is not necessarily the same as the scene (consistent with the images) that has the simplest description. Evidently, not all scenes of a given class are equally likely; but the likelihood of a scene depends on the physical processes that give rise to the class of scenes, not on the simplicity of its description, and certainly not on the simplicity of a description of its image. As an alternative to the regularization approach, direct methods have been suggested for shape recovery from radar [8] and visible images [9]. For illumination sources near the camera, good results have been obtained on simple optical images. Direct methods are rigid, in that they cannot be easily generalized to arbitrary illumination directions or to incorporate additional information. In addition, the lack of smoothing may present problems in the presence of noise.
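As a generic illustration of the regularization approach described above (a sketch in standard notation, not a formula taken from the cited papers), a regularized reconstruction seeks the surface f that minimizes a weighted sum of data fidelity and nonsmoothness,

    E(f) = \sum_{(x,y)} [ (Af)(x,y) - d(x,y) ]^2 + \lambda \sum_{(x,y)} || \nabla f(x,y) ||^2 ,

where d denotes the image-derived data, A models the imaging constraints, and the weight \lambda > 0 controls how strongly smoothness is enforced. Discontinuity-preserving variants of the kind cited above replace the quadratic smoothness penalty by one whose cost is bounded at suspected edges, so that borders between smooth regions are charged for their length and curvature rather than being smoothed away.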
3.2. Ill-Definedness

It is often assumed in formulating vision problems that the class of allowable scenes is "piecewise simple", e.g. that the visible surfaces are all smooth (e.g. planar or quadric) and Lambertian. This type of assumption seems at first glance to strongly constrain the class of possible scenes (and images), but in fact, the class of images is not constrained at all unless a lower bound is specified on the sizes of the "pieces". If the pieces can be arbitrarily small, each pixel in an image can represent a different piece (or even parts of several pieces), so that the image can be completely arbitrary. For a two-dimensional scene, it suffices to specify a lower bound on the piece sizes; but for a three-dimensional scene, even this does not guarantee a lower bound on the sizes of the image regions that represent the pieces of surface; occlusions and nearly-grazing viewing angles can still give rise to arbitrarily small or arbitrarily thin regions in the image. Lower bounds on piece sizes are important for another reason: they make it easier to distinguish between the ideal scene and various types of "noise". In the real world, piecewise simple scenes are an idealization; actual surfaces are not perfectly planar or quadric or perfectly Lambertian, but have fluctuating geometries or reflectivities. [Note that these fluctuations are in the scene itself; in addition, the brightness measurements made by the sensor are noisy, and the digitization process also introduces noise.] If the fluctuations are small relative to the piece sizes, it will usually be possible to avoid confusing them with "real" pieces. [Similarly, the noisy brightness measurements - assuming that they affect the pixels independently - yield pixel-size fluctuations, and digitization noise is also of at most pixel size; hence these types of noise too should usually not be
confused with the pieces.] Of course, even if we can avoid confusing noise fluctuations with real scene pieces, their presence can still interfere with correct estimation of the geometries and photometries of the pieces. Most analyses of vision problems (e.g. for piecewise simple ideal scenes) do not attempt to formulate realistic models for the "noise" in the scene; they usually assume that the noise in the image (which is the net result of the scene noise, the sensor noise, and the digitization noise) is Gaussian and affects each pixel independently. Examination of images of most types of real scenes shows that this is not a realistic assumption; thus the applicability of the resulting analyses to real-world images is questionable. The problem of ill-definedness becomes even more serious if one attempts to deal with scenes containing classes of objects that do not have simple mathematical definitions - for example, dogs, bushes, chairs, alphanumeric characters, etc. Recognition of such objects is not a well-defined computer vision task, even though humans can recognize them very reliably.
3.3. Intractability Even well-defined vision problems are not always easy to solve; in fact, they may be computationally intractable [10,11]. An image can be partitioned in combinatorially many ways into regions that could correspond to simple surfaces in the scene; finding the correct (i.e. the most likely) partition may thus involve combinatorial search. For example, even for scenes consisting of polyhedral objects, the problem of deciding whether a set of straight edges in an image could represent such a scene is NP-complete. Even identifying a subset of image features that represent a single object of a given type is exponential in the complexity of the object, if more than one object can be present in the scene, or if the features can be due to noise. Parallel processing (e.g. [12]) is widely used to speed up computer vision computations; it is also used very extensively and successfully in biological visual systems. Very efficient speedup can be achieved through parallelism in the early stages of the vision process, which involve simple operations on the image(s); but little is known about how to efficiently speed up the later, potentially combinatorial stages. Practical vision systems must operate in “real time” using limited computational resources; as a result, they are usually forced to use suboptimal techniques, so that there is no guarantee of correct performance. In principle, the computations performed by a vision system should be chosen to yield maximal expected gain of information about the scene at minimal expected computational cost. Unfortunately, even for well-defined vision tasks, it is not easy to estimate the expected gain and cost. Vision systems therefore usually perform standard types of computations that are not necessarily optimal for the given scene domain or vision task; this results in both inefficiency and poor performance.
4. Recipes for Success
4.1. Define Your Domain

Well-defined vision problems should involve classes of scenes in which both the ideal scene and the noise can be mathematically (and probabilistically) characterized. For example, in scenes that contain only known types of man-made objects, the allowable geometric and optical characteristics of the visible surfaces can be known to any needed degree of accuracy. If the objects are "clean", and the characteristics of the sensor are known, the noise in the images can also be described very accurately. In such situations, the scene descriptions that are consistent with the images are generally less ambiguous (so that the problem of determining these descriptions is relatively well-posed) because of the relatively specialized nature of the class of allowable scenes. If, in addition, the number of objects that can be present is limited, the complexity of the scene description task and the computational cost of recognizing the objects are greatly reduced. For example, it has been shown [11] that when all the features in the image can be assumed to arise from a single object, the expected search cost to recognize the object is quadratic in the number of features, and the number of possible interpretations drops rapidly to one as the number of features extracted from the image increases. The number of interpretations and the search cost are much higher when the scene is cluttered, so that the object of interest may be occluded and a significant part of the data may come from other objects in the scene.

4.2. Pick Your Problem
Even for specialized scene domains, deriving complete scene descriptions from images - the general recovery problem - can still be a very difficult task. However, there is no reason to insist on unique solutions to vision problems. The images (further) constrain the class of possible scenes; the task of the vision system is to determine these constraints. This yields a partial description of the scene, and for some purposes this description may be sufficient. In fact, in many situations only a partial description of the scene is needed, and such descriptions can often be derived inexpensively and reliably. A partial description may require only the detection of a specific type of object or surface, if it is present, or it may require only partial (“qualitative”) characterizations of the objects that are present (e.g. are their surfaces planar or curved). Two illustrations of the value of partial descriptions are: (i) An autonomous vehicle can rapidly and accurately follow the markers on a road; it need not analyze the entire road scene, but need only detect and track the marker edges [13,14]. By using additional domain-specific knowledge about the types of vehicles, their possible motions, etc., significant improvements in 3-D object and motion estimation have been reported in [15].
(ii) An active observer, by shifting its line of sight so that the focus of expansion due to its motion occupies a sequence of positions, can robustly detect independent motion anywhere in the region surrounded by these foci [16]. In this region, independent motion is indicated by the sign of the normal flow being opposite to that of the expansion. 4.3. Improve Your Inputs
Vision tasks that are very difficult to perform when given only a single image of the scene generally become much easier when additional images are available. These images could come from different sensors (e.g. we can use optical sensors that detect energy in different spectral bands; we can use imaging sensors of other types such as microwave or thermal infrared; or we can use range sensors that directly measure the distances to the visible surface points in the scene). Alternatively, we can use more than one sensor of the same type - for example, stereo vision systems use two or more cameras. Even if we use only a single sensor, we can adjust its parameters - for example, its position, orientation, focal length, etc. - to obtain multiple images; control of sensor parameters in a vision system is known as active vision [17]. It has been shown that by using the active vision approach, ill-posed vision problems can become well-posed, and their solutions can be greatly simplified. These improvements are all at the sensor level; one can also consider improving the inputs to the higher levels of the vision process by extracting multiple types of features from the image data using different types of operators (e.g. several edge detectors). This strategy leads to a situation where "less is required from more", i.e. where it is easier to derive the desired results if more input information is available, unlike the traditional situation where "more is required from less". Animals and humans integrate different types of sensory data, and control their sensory apparatus, to obtain improved or additional information (e.g. tracking, fixation). Obtaining additional constraints on the scene by increasing the amount of image data is evidently a sounder strategy than making assumptions about the scene (smoothness, simplicity, etc.) that have no physical justification. Many successful computer vision systems have made effective use of redundant input data. In the following paragraphs we give three examples: (i) In [18], thermal (8.5-12.5 µm) and visual imagery are combined to identify objects or regions such as vehicles, buildings, areas of vegetation and roads. The visual image is used to estimate the surface orientation of the object. Using the surface orientation and other collateral information such as the ambient temperature, wind speed, and the date and time of image acquisition, an estimate of the thermal capacitance of the object is derived. This information, in conjunction with the surface reflectivity of the object (derived from the visual image) and the average object temperature (derived from the thermal image), is used in a rule-based system to identify the types of objects mentioned above.
(ii) Photometric stereo [19] is an excellent example of using more inputs to resolve the inherent ambiguities in recovering shape from shading using a single image irradiance equation. In this scheme, the viewing direction is held constant, but multiple images are obtained by changing the direction of illumination. One then generates as many coupled irradiance equations as there are illumination directions. By solving these equations, robust estimates of the surface orientation can be obtained (a digital sketch of this scheme is given after example (iii) below). Photometric stereo can be very useful in industrial applications where the incident illumination can be controlled. (iii) Stereo matching is the process of fusing two images taken from different viewpoints to recover depth information in the scene. The process involves identifying corresponding points or regions in two views and using their relative displacements together with the camera geometry to estimate their depths. If the baseline (the distance between the two cameras) is large, accurate depth estimates can be obtained, but at considerable added computational cost in the feature matching process. With a short baseline the cost of matching is less, but the depth resolution is low. In [20] a method is described that uses multiple stereo pairs with different baselines generated by lateral displacements of a camera. A practical system with seven cameras has been developed. This is a very good example in which, by using more inputs, the complexity of the algorithms is considerably reduced, while at the same time the results are improved.
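The following is a minimal digital sketch of the photometric-stereo scheme in example (ii); it is an illustration only, assuming a Lambertian surface, known distant illumination directions, no shadows or interreflections, and NumPy conventions (the function name and array layout are hypothetical):

    import numpy as np

    def photometric_stereo(images, light_dirs):
        """Recover per-pixel albedo and unit surface normals.

        images:     array of shape (k, H, W), one image per illumination direction
        light_dirs: array of shape (k, 3), unit illumination direction vectors
        Assumes the Lambertian model I_j = albedo * (L_j . n) for each direction j.
        """
        k, H, W = images.shape
        L = np.asarray(light_dirs, dtype=float)         # (k, 3)
        I = images.reshape(k, -1)                       # (k, H*W)
        # Least-squares solution of L g = I, where g = albedo * n at every pixel
        g, _, _, _ = np.linalg.lstsq(L, I, rcond=None)  # (3, H*W)
        albedo = np.linalg.norm(g, axis=0)
        normals = g / np.maximum(albedo, 1e-8)          # unit normals
        return albedo.reshape(H, W), normals.reshape(3, H, W)

With the viewing direction fixed and at least three non-coplanar illumination directions, the per-pixel system is overdetermined, which is what makes the estimates robust under controlled (e.g. industrial) lighting.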
4.4. Take Your Time

Since the early days of computer vision, the power of general-purpose computational resources has improved by many orders of magnitude. This, combined with special-purpose parallel hardware, both analog [21] and digital (VLSI), has greatly expanded the range of tractable vision tasks. The availability of increasingly powerful computing resources allows the vision system designer much greater freedom to adopt an attitude of "take your time" in vision algorithms, as well as freedom to use redundant input data. With no end in sight to the expected improvements in computing power, the time required to solve given vision problems will continue to decrease. Conversely, it will become possible to solve problems of increased complexity and problems that have wider domains of applicability.
5. Vision Engineering
Perception engineering has been defined by Jain [22] as the study of techniques common to different sensor-understanding applications, including techniques for sensing and for the interpretation of sensory data, and how to integrate these techniques into different applications. He pointed out the existence of a serious communication gap between researchers and practitioners in the area of machine perception, and proposed establishing the field of perception engineering to bridge this gap. However, he did not formulate any principles that could serve as guidelines
for the design of successful machine perception systems. We believe that the principles discussed in Section 4 can serve as foundations for an approach to computer vision that we shall refer to as Vision Engineering. The central task of vision engineering is to make vision problems tractable by applying the four principles: carefully characterizing the domain, choosing the tasks to be performed (breaking a given problem up into subtasks, if necessary), and providing adequate input data and adequate computational resources. We feel that these principles and their extensions will find increasing application in the design and construction of vision systems over the coming years.

References

[1] Q. Zheng and R. Chellappa, Estimation of illuminant direction, albedo and shape from shading, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 680-702. [2] J. Oliensis, Uniqueness in shape from shading, Int. J. Comput. Vision 6 (1991) 75-104. [3] A. N. Tikhonov and V. Y. Arsenin, Solution of Ill-Posed Problems (Winston, New York, 1977). [4] T. Poggio, V. Torre and C. Koch, Computational vision and regularization theory, Nature 317 (1985) 314-319. [5] J. Aloimonos and D. Shulman, Integration of Visual Modules: An Extension of the Marr Paradigm (Academic Press, Boston, MA, 1989). [6] D. Terzopoulos, Regularization of inverse visual problems involving discontinuities, IEEE Trans. Pattern Anal. Mach. Intell. 8 (1986) 413-426. [7] Y. C. Leclerc, Constructing simple stable descriptions for image partitioning, Int. J. Comput. Vision 3 (1989) 73-102. [8] R. L. Wildey, Topography from a single radar image, Science 224 (1984) 153-156. [9] J. Oliensis, Direct method for reconstructing shape from shading, in Proc. DARPA Image Understanding Workshop, San Diego, CA, Jan. 1992, 563-571. [10] L. M. Kirousis and C. H. Papadimitriou, The complexity of recognizing polyhedral scenes, J. Comput. Syst. Sci. 37 (1988) 14-38. [11] W. E. L. Grimson, Object Recognition by Computer (MIT Press, Cambridge, MA, 1990) Chapter 10. [12] V. K. Prasanna Kumar, Parallel Architectures and Algorithms for Image Understanding (Academic Press, New York, 1991). [13] E. D. Dickmanns and V. Graefe, Dynamic monocular machine vision, Mach. Vision Appl. 1 (1988) 223-240. [14] E. D. Dickmanns and V. Graefe, Applications of dynamic monocular machine vision, Mach. Vision Appl. 1 (1988) 241-261. [15] J. Schick and E. D. Dickmanns, Simultaneous estimation of 3D shape and motion of objects by computer vision, IEEE Workshop on Visual Motion, Princeton, NJ, Oct. 1991, 256-261. [16] R. Sharma and J. Aloimonos, Robust detection of independent motion: An active and purposive solution, Center for Automation Research Technical Report CAR-TR-534, University of Maryland, College Park, 1991. [17] J. Aloimonos, I. Weiss and A. Bandophadhay, Active vision, Int. J. Comput. Vision 1 (1987) 333-356. [18] N. Nandhakumar and J. K. Aggarwal, Integrated analysis of thermal and visual images for scene interpretation, IEEE Trans. Pattern Anal. Mach. Intell. 10 (1988) 469-481. [19] R. J. Woodham, Photometric method for determining surface orientation from mul-
tiple images, in B. K. P. Horn and M. J. Brooks (eds.), Shape from Shading (MIT Press, Cambridge, MA, 1989). [20] M. Okutomi and T. Kanade, A multiple-baseline stereo, in Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, Miami, FL, June 1991, 63-69. [21] C. Mead, Analog VLSI and Neural Systems (Addison-Wesley, Reading, MA, 1989). [22] R. Jain, Perception engineering, Mach. Vision Appl. 1 (1988) 73-74.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 869-890, Eds. C. H. Chen, L. F. Pau and P. S. P. Wang © 1998 World Scientific Publishing Company
CHAPTER 5.2
OPTICAL PATTERN RECOGNITION FOR COMPUTER VISION
DAVID CASASENT Department of Electrical and Computer Engineering Center for Excellence in Optical Data Processing Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA Optical processors offer many useful operations for computer vision. The maturity of these systems and the repertoire of operations they can perform is increasing rapidly. Hence a brief updated overview of this area merits attention. Many of the new algorithms employed can also be realized in digital and analog VLSI technology and hence computer vision researchers should benefit from this review. We consider optical morphological, feature extraction, correlation and neural network systems for different levels of computer vision with image processing examples and hardware fabrication work in each area included. Keywords: Classifier neural net, correlator, distortion-invariant filters, feature extractor, Hough transform, Hit-or-Miss transform, morphological processor.
1. Introduction
A book could be written on each aspect of optical pattern recognition. Thus, only the highlights of selected optical processing operations can be noted here. The reader is referred to the references provided, several texts [1-3], recent conference volumes [4,5], journal special issues [6] and review articles [7,8] for more details on each topic. To unify and best summarize this field, we consider (in separate sections) optical systems for low, medium, high and very high-level computer vision operations. Although the boundaries between these different levels are not rigid, we distinguish low-level vision by noise and image enhancement operations, medium-level vision by feature extractors, high-level vision by correlators and very high-level vision by neural net operations. As we shall show, optical processing has a role in each area. We will mainly emphasize recent work at Carnegie Mellon University in these different areas. Section 2 discusses the major optical processing architectures we consider (feature extractors, correlators and neural nets) and presents one possible unified hierarchical approach to the use of all techniques for scene analysis. Section 3 details and provides examples of optical morphological processors for low-level vision and for detection. Section 4 considers the role for optical processing in medium-level vision with attention to feature extractors for product inspection and for subsequent analysis of regions of interest (ROIs). Section 5 details a variety of advanced
distortion-invariant optical correlation filters for several applications in high-level computer vision. We then consider in Section 6 very high-level vision operations with attention to optical neural nets for object identification and brief remarks on their use as production systems. 2. Operations Achievable
Many optical processing architectures exist that are of use in computer vision. The ability to compute the Fourier transform (FT) at P2 of 2-D input data at P1 with a simple lens (Fig. 1(a)) is probably the most widely used concept in optical processing. To simplify analysis, the |FT|^2 is often sampled with a detector with wedge- and ring-shaped detector elements [9] as in Fig. 1(b). This is a very attractive feature space for analysis of an input object since the magnitude FT is shift invariant, the wedge samples are scale invariant and the ring samples are rotation invariant. The use of 32 wedge and 32 ring detector elements also greatly simplifies analysis by dimensionality reduction.
Fig. 1. Optical Fourier transform system (a) and wedge ring detector (b) [23].
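As a digital analogue of the wedge-ring sampling of Fig. 1(b), the following sketch (an illustration under assumed conventions, not the optical implementation; the function name and bin layout are hypothetical) sums the power spectrum |FT|^2 over 32 wedge-shaped and 32 ring-shaped bins:

    import numpy as np

    def wedge_ring_features(image, n_wedges=32, n_rings=32):
        """Sample |FT|^2 of an image with wedge- and ring-shaped bins.

        The power spectrum is shift invariant; the wedge sums are scale
        invariant and the ring sums are rotation invariant.
        """
        P = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
        h, w = P.shape
        y, x = np.indices((h, w))
        cy, cx = h // 2, w // 2
        r = np.hypot(y - cy, x - cx)                        # radial coordinate
        theta = np.mod(np.arctan2(y - cy, x - cx), np.pi)   # angle folded to [0, pi)
        r_bin = np.minimum((r / (r.max() + 1e-9) * n_rings).astype(int), n_rings - 1)
        w_bin = np.minimum((theta / np.pi * n_wedges).astype(int), n_wedges - 1)
        rings = np.bincount(r_bin.ravel(), weights=P.ravel(), minlength=n_rings)
        wedges = np.bincount(w_bin.ravel(), weights=P.ravel(), minlength=n_wedges)
        return wedges, rings

The resulting 64-element feature vector is the digital counterpart of the wedge ring detector output used later in this chapter.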
Many other operations and feature space descriptions of an input object are possible and can be implemented optically [8]. These include: moments, chord distributions, polar-log FT spaces, and the Hough transform. From this we see that optical processors can implement a wide variety of image processing functions beyond the classic FT. Figure 2 shows an optical system that computes the Hough transform at TV frame rates. The input object at P1 is imaged onto a computer generated hologram (CGH) at P2 which forms the Hough transform at P3 in parallel. The CGH consists of a set of N cylindrical lenses at different angles. The Hough transform of the input denotes the position, orientation and length of all lines in the input. Extensions to other curved shapes are possible and have been optically demonstrated. All of the aforementioned feature spaces can be optically produced using CGHs [8]. The optical correlator is also one of the most used optical processors. Figure 3 shows the schematic of a space- and frequency-multiplexed optical correlator. The input is placed at P1; a set of spatially-multiplexed filters is shown at P2.
Fig. 2. Optical Hough transform system using a computer generated hologram.
Fig. 3. Space- and frequency-multiplexed optical correlator architecture [35].
Different laser diodes activated at P0 allow different P2 filters to be accessed. At each spatial P2 location several (e.g. four) frequency-multiplexed filters are placed. When one P0 laser diode is activated it selects a set of filters, and the P3 output is the correlation of the P1 input and a set (e.g. four) of P2 filters, with the four correlations appearing in parallel in the four quadrants of P3. With access to a large filter bank at P2, many operations are possible (with a real-time device at P2, adaptive filters are possible). Optical correlators have two major advantages in scene analysis: they can detect multiple objects in parallel (and are essential for parallel analysis of scenes containing multiple objects), with correlation peaks occurring at the locations of each object in the field of view, and they are the optimum detector systems when noise is present. The optical correlator is also quite versatile. With CGH filters at P2, the P3 output can be any of the feature spaces noted. With large banks of filters possible, one can use different filters and achieve detection, recognition and identification. When the P2 filter used is a structuring element and the P3 output is properly thresholded, the P3 output can be any morphological operation [10].
As we have just noted (to be detailed in subsequent sections), the optical correlator is a most versatile and multifunctional optical image processing architecture. This is of major importance since optical correlators are rapidly reaching a significant level of maturity. As one example, we consider the solid optics correlator fabricated by Teledyne Brown Engineering [11]. The system uses modular optical elements for laser diode collimation, Fourier transform, imaging and beam splitter, etc. components. These are assembled into a rugged optical correlator of small size as shown in Figs. 4 and 5. In Fig. 4, the output from the laser diode light source on the left is collimated and passes through the input spatial light modulator (SLM), and its FT is formed at the right end where a reflective filter is placed. The light reflected from the filter is Fourier transformed and reflected onto the output detector via the beam splitter (BS) to produce the output correlation. Figure 5 shows the actual optical correlator system with a magneto-optic (MO) SLM input and a dichromated gelatin (DCG) filter. This is typical of the high state of maturity that this key optical architecture has reached [12]. With the advanced filters we describe, this system will be most suitable for image processing.
Fig. 4. Schematic diagram of the solid optics correlator [11].
The final basic optical processor architecture we consider is the optical matrix-vector multiplier [13] of Fig. 6. The 1-D P1 input vector g can be realized by a linear LED or laser diode array or by a 1-D SLM. The light leaving P1 is expanded in 1-D to uniformly illuminate the columns of a 2-D matrix mask or SLM at P2 with transmittance M (a matrix). The light leaving P2 is then integrated horizontally to produce the output vector y = Mg that is the matrix-vector product. We consider the use of this system as the basic building block for an optical neural net for object identification and as an artificial intelligence production system (Section 6). As a multilayer neural net, the P1 outputs are the input neurons, the matrix M is a set of weights and the P3 outputs are the hidden layer neurons. A cascade
Fig. 5. Photograph of the solid optics correlator showing its major elements [11].
Fig. 6. Optical matrix-vector neural net processor.
of two such systems yields the standard multilayer neural net of Fig. 7 (one matrix-vector system is an associative processor) and with P3 to P1 feedback it is a production system (as we discuss in Section 6). The optical matrix-vector element has also achieved a high degree of maturity as seen in the schematic of Fig. 8 which shows this system component fabricated in integrated optics [14]. The basic optical image processing architectures can be viewed as low, medium, high and very high level computer vision modules. They can be used in many ways for computer vision. The approach we find to be the most useful is shown in Fig. 9. We consider the general scene analysis problem when multiple objects are present in high clutter. We separate the scene analysis problem into a hierarchy of detection, recognition and identification steps. For detection, we employ morphological correlator processors (Section 3). For recognition, we use distortion-invariant correlation filters (Section 5). For identification, we use feature extractors (Section 4) applied to the regions of interest (ROIs) obtained from detection and a neural net (Section 6) to analyze the feature space and provide the final object identification.
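Digitally, the module of Fig. 6 simply computes y = Mg, and the classifier of Fig. 7 is a cascade of two such products with a nonlinearity between them. A minimal sketch (an illustration only; the weight matrices are assumed to come from conventional training rather than from the optical hardware):

    import numpy as np

    def matrix_vector_layer(M, g):
        # One optical-style layer: spread g over the columns of M, sum along the rows
        return M @ g

    def two_layer_classifier(W1, W2, features):
        # Input -> hidden (cluster) layer -> class layer, as in Fig. 7
        hidden = 1.0 / (1.0 + np.exp(-matrix_vector_layer(W1, features)))
        scores = matrix_vector_layer(W2, hidden)
        return int(np.argmax(scores))   # index of the identified class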
Fig. 7. Three-layer nonlinear neural net classifier used [23].
Fig. 8. Integrated optical matrix-vector processor schematic [14].
Fig. 9. One optical realization of three levels of scene analysis.
3. Low-Level Optical Morphological Processors

The two basic morphological operations are dilation (region growing) and erosion (region shrinking). We achieve both on a correlator using a filter that is a structuring element (typically a disc), whose size determines the size of a hole or inlet to be
filled in or the size of a noise blob or protrusion to be removed. If the correlation output is thresholded low (high), dilation (erosion) results [15]. Thus, including structuring element filters at P2 and an output threshold at P3 allows the system of Fig. 3 to also implement morphological operations. These are local operators. Since filling in holes on a white object (dilation) and removing noise and regions (erosion) distorts the boundary of the object, these operations are generally used in pairs. A dilation followed by an erosion is a closure, and an erosion followed by a dilation is an opening. Figure 10 shows examples of these operations. The noisy input, with holes (the treads of the tank) on the object, is shown in Fig. 10(a). The opening of it is shown in Fig. 10(b) (the erosion removes noisy background smaller than the size of the structuring element used and the dilation restores the boundary). The closure of Fig. 10(b) is shown in Fig. 10(c) (it fills in holes on the object). Edge enhancement (Fig. 10(d)) is also easily achieved by the difference between a dilation and an erosion, and appears to be preferable to conventional edge-enhancement methods [16]. Other operations, such as removal of a nonuniform background [16], are also possible.
Fig. 10. Optical morphological image enhancement [35]. (a) Input, (b) opening of (a), (c) closure of (b), (d) edge-enhanced (b).
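The correlate-and-threshold description of dilation and erosion given above translates directly into digital form. The following sketch (an illustration for binary images using SciPy's 2-D correlation; the function names are hypothetical) mirrors the operations of Fig. 10:

    import numpy as np
    from scipy.signal import correlate2d

    def dilate(binary, se):
        # Dilation: correlate with the structuring element, threshold low (> 0)
        return (correlate2d(binary.astype(int), se.astype(int), mode="same") > 0).astype(np.uint8)

    def erode(binary, se):
        # Erosion: correlate, threshold high (every se pixel must be covered)
        return (correlate2d(binary.astype(int), se.astype(int), mode="same") >= int(se.sum())).astype(np.uint8)

    def opening(binary, se):   # erosion then dilation: removes small noise blobs
        return dilate(erode(binary, se), se)

    def closing(binary, se):   # dilation then erosion: fills small holes
        return erode(dilate(binary, se), se)

    def morphological_edge(binary, se):   # difference of a dilation and an erosion
        return dilate(binary, se) - erode(binary, se)

Here the disc-shaped structuring element se plays the role of the P2 filter, and the two thresholds correspond to thresholding the optical correlation plane low or high.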
We find these standard morphological operations to be most useful to improve the image of an object after detection. To achieve detection, we use a modified [17] hit-or-miss (HOM) [18] morphological transform. In the basic HOM algorithm, the input image is thresholded, correlated with a hit structuring element, and thresholded; the complement of the thresholded input is then correlated with a miss
Fig. 11. Optical morphological HOM detection example. (a) Input scene, (b) H1 hit structuring element, (c) M1 miss structuring element, (d) output data.
structuring element (typically the complement of the hit element with a white border or background present) and thresholded; the intersection of the two correlations is the HOM result. Figure 11 shows an example of our new algorithm. Figure 11(a) shows a scene with hot (bright) and cold (dark) objects present. We threshold the image above the mean and perform an HOM correlation with the structuring element (not to scale with Fig. 11(a)) in Fig. 11(b). We then threshold the image below the mean and perform an MOH (miss or hit) correlation with the structuring element in Fig. 11(c). The union of the two correlations detects all objects (Fig. 11(d)). The hit filter (Fig. 11(b)) has a white region equal to the smallest object and the central dark part of the miss filter (Fig. 11(c)) is the size of the
largest object (the size of the white border region in Fig. 11(c) depends upon the background expected). The HOM correlation detects hot objects and the MOH correlation detects cold objects. The hit correlation detects all objects larger than the smallest object, the miss correlation detects all objects smaller than the largest object, and their union detects only objects within the desired range of sizes. We find this morphological function to be most attractive for the first (detection) phase of scene analysis in Fig. 9. When necessary, we use conventional image enhancement morphological operations prior to the last feature extraction step in Fig. 9.

4. Medium-Level Computer Vision (Feature Extraction)

Once regions of interest (ROIs) have been extracted (detection) from a scene, one must learn more about the contents of each such ROI. One technique that is very general (since it extends to a large number of multiple classes) is to calculate features associated with each ROI. These features include those noted in Section 2 and others. They are a reduced-dimensionality description of each ROI and hence are easier to analyze (from a computation standpoint). They are generally also an in-plane distortion-invariant feature space. They almost always have shift invariance (this is essential since the location of the object in the ROI is not known), and this greatly simplifies training. In conventional pattern recognition, these features (as a feature vector) are input to a linear classifier (consisting of one or a number of linear discriminant functions, LDFs). As an example of the power of a feature space processor, we consider the recognition of multiple classes of objects (two aircraft: an F4 and an F104) with about 128 x 128 pixel resolution and four degree-of-freedom distortions (roll, pitch, and x and y translations). We considered ±60° distortions in both pitch and roll at 2.5° increments (for each roll angle, all ±60° pitch variations are considered). We trained a modified linear Ho-Kashyap classifier [19,20] on distortions every 5° in roll and pitch (625 distorted images per class). For each image, the 32-element wedge FT feature space was calculated and fed to the classifier algorithm. We then tested the classifier on 1152 distorted test images not present in the training set, with σn = 0.1 white Gaussian noise also present, and obtained a very respectable 91.2% correct recognition. This demonstrates the ability of feature extractors to provide object discrimination in the face of very severe object distortions. For our present discussion, their major use is their potential to handle many classes of objects. The wedge ring detector sampled FT is the most widely used optical feature space, with many product inspection applications and with a well engineered system having been fabricated [21]. Here we describe a product inspection application of the Hough transform in which the specific locations and orientations of portions of a product to be inspected are of concern [22]. Figure 12 shows the product, a package of cigarettes. The specific issues of concern are that: (1) the package wrapper be aligned within 1.8°, (2) the closure seal (A) at the top be present, aligned within 3.2°, and that the bottom of it extends properly within 0.5 mm,
Fig. 12. Cigarette package to be inspected [22].
and (3) that the tear strip (B) be present, parallel to the top within tolerances, and be properly positioned within 0.5 mm. To achieve these inspection tasks, we form the Hough transform of each package as it is assembled. We form four slices of the Hough transform at θ = 38°, 142°, 0° and 90°. The 38° and 142° angular slices denote the presence and proper location of the two angular lines (C and D) and hence determine if the package is properly aligned. The 90° Hough transform slice has peaks corresponding to horizontal lines in the object (from top to bottom of the image, peaks occur due to the top of the package, the tear strip and the bottom of the closure seal). These indicate the presence of the tear strip and the seal and whether they are at the proper location from the top of the package within tolerances. If either is at an angle, the corresponding Hough transform peak on the 90° slice becomes broader and its height decreases. The 0° slice of the Hough transform denotes vertical lines, specifically the two edges of the seal. If the seal is perfectly aligned, both Hough transform peaks will be of the same height and in the proper position horizontally on the package. If the seal is not aligned properly, the Hough transform peaks will be different in height. Figure 13 shows the Hough transform of a cigarette package with six regions along the four Hough transform slices noted, with the portions of the product to which they correspond indicated. For each product, we thus investigate the six indicated Hough transform regions for a Hough transform peak and the value of each peak (a digital sketch of this checking logic is given after Fig. 13). The laboratory real-time Hough transform system assembled operated at 30 products per second and exhibited over 99% correct inspection. From errors in the Hough transform peak positions or heights, the nature of each product defect can be determined.
Fig. 13. Real time optical laboratory Hough transform of Fig. 12 [22].
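The inspection logic just described can be sketched digitally as follows (an illustration only; the angles, tolerance windows and peak thresholds are hypothetical stand-ins for the actual system parameters):

    import numpy as np

    def hough_slice(edges, theta_deg):
        """One fixed-angle slice of the Hough transform of a binary edge image.
        Index i corresponds to rho = i - offset, with rho = x cos(t) + y sin(t)."""
        t = np.deg2rad(theta_deg)
        offset = int(np.ceil(np.hypot(*edges.shape)))        # keeps all indices >= 0
        ys, xs = np.nonzero(edges)
        rho = np.round(xs * np.cos(t) + ys * np.sin(t)).astype(int) + offset
        return np.bincount(rho, minlength=2 * offset + 1).astype(float)

    def inspect_package(edges, checks):
        """checks: list of (theta_deg, (lo, hi), min_peak) tuples, one per feature,
        e.g. the 38 and 142 degree wrapper edges, the 0 degree seal edges and the
        90 degree tear-strip and seal lines."""
        for theta_deg, (lo, hi), min_peak in checks:
            s = hough_slice(edges, theta_deg)
            window = s[lo:hi]
            if window.size == 0 or window.max() < min_peak:
                return False        # missing, misplaced or tilted part
        return True

A missing or tilted seal shows up exactly as in the optical system: the expected peak in the corresponding slice is absent, shifted out of its tolerance window, or too low.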
It is important that only one object be present in the field of view and that noise be reduced when feature extractors are employed. The detection ROI location system achieves the one object requirement for scene analysis, and LED or laser diode sensors achieve this for product inspection applications. Morphological processing techniques can be employed to reduce noise and improve the image if needed. Figure 9 allows for such operations prior to feature extraction and object identification.

5. High-Level Computer Vision (Correlators)

For this level of computer vision we consider advanced distortion-invariant filters used in correlators. Such correlation filters use internal object structure or the boundary shape of the object rather than simple rectangular filters as in the morphological HOM detection filters in Section 3. A wide variety of such filters exist and are generally extensions of the synthetic discriminant function (SDF) filters [24]. These SDF filters used a training set of different distorted images. The vector inner product matrix of the training set was used with a control vector that specified the correlation peak value to calculate the filter function. The filter is a linear combination of the training set of images.

5.1. Filter Synthesis
The synthetic discriminant function filters control only one or several points in the correlation plane and hence have limited storage capacity (number of training images NT) before large sidelobes occur that cause false alarms. This filter clutter is due to the reduced SNR that occurs for large NT [25]. The minimum average correlation energy (MACE) filter was the next significant development, since its intent is to reduce correlation plane sidelobes. It achieves this by minimizing the
correlation plane energy [26]

    E = H^+ D H ,                                          (5.1)

where H is the vector version of the FT (Fourier transform) of the desired filter function and D is a diagonal matrix with elements equal to the sum of |FT|^2 of the training images. We minimize (5.1) subject to a constraint on the correlation peak value for all training images,

    X^+ H = u ,                                            (5.2)

where + denotes the conjugate transpose, the columns of the matrix X are the FTs X_i of the training set images, and the elements of the control vector u are the correlation peak values specified for each training image (the elements of u are typically chosen to be one). The solution to (5.1) subject to the constraint in (5.2) is found by Lagrange multiplier techniques to be [26]

    H = D^{-1} X (X^+ D^{-1} X)^{-1} u .                   (5.3)

The MACE filter solution yields a sharp correlation peak, which localizes the target's position well. However, such sharp correlation peaks result because the spectrum of the filter has been whitened, emphasizing high frequencies. As a result, this filter has poor recognition of non-training set intra-class images and it is sensitive to noise. To overcome these problems, we recently introduced the minimum noise and average correlation energy (MINACE) filter [27]. This uses a better bound on the spectral envelope of the images, and it also inherently uses a specified noise power spectrum in synthesis. For a filter with one training image i, the filter solution is

    H = T_i^{-1} X (X^+ T_i^{-1} X)^{-1} u ,               (5.4)

and for NT training images, the filter solution is

    H = T^{-1} X (X^+ T^{-1} X)^{-1} u .                   (5.5)

Its form is the same as in (5.3); however, the preprocessing function T_i is now a diagonal matrix with diagonal elements

    T_i(u, v) = max[D_i(u, v), N(u, v)] ,                  (5.6)

and T is a diagonal matrix with diagonal elements

    T(u, v) = max[T_1(u, v), T_2(u, v), ..., T_NT(u, v)] .  (5.7)

The key step is the choice of the preprocessing function T. Its elements are chosen separately for each spatial frequency (u, v) based on the magnitudes of the spatial frequencies of the signal D and the noise N. Specifically, if the signal is above the noise at some spatial frequency, we select the signal; otherwise, we use the selected noise level N. This comparison is done separately for each spatial frequency and for all training images. This reduces the filter's response at high frequencies and at other frequencies where noise dominates, and hence improves intra-class recognition and performance in noise. This filter has another major advantage for our present problem: we can control the filter's recognition and discrimination performance. This is achieved by varying the amount of noise N (through its variance σ²) used in filter synthesis. We define the control parameter

    c = σ² / DC                                            (5.8)

to be the ratio of the noise energy (for white Gaussian noise) to the DC value of the signal energy. Large values of c emphasize lower spatial frequencies and provide filters that are good for detection (intra-class recognition and noise performance). Low c values emphasize higher spatial frequencies; such filters are good for identification. Medium c values prove useful for recognition. This gives the MINACE filter a flexibility not found in other correlation filters, which are quite rigid. Specifically, by varying the training set and the control parameter c, the same filter synthesis algorithm yields filters suitable for the three different levels in scene analysis (detection, recognition and identification). We now provide initial examples of such results to demonstrate the concepts.
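The synthesis of (5.4)-(5.8) can be rendered digitally as follows (an illustration, not the authors' implementation; it assumes the flat white-noise model of (5.8), flattens the 2-D spectra into vectors, and uses hypothetical naming conventions):

    import numpy as np

    def minace_filter(training_images, c):
        """Synthesize a MINACE correlation filter in the frequency domain.

        training_images: array (NT, H, W) of distorted views of the target
        c:               noise-to-DC control parameter of Eq. (5.8)
        """
        NT, Himg, Wimg = training_images.shape
        X = np.fft.fft2(training_images)               # training spectra X_i(u, v)
        D = np.abs(X) ** 2                             # signal envelopes D_i(u, v)
        N = c * D[:, 0, 0].max()                       # flat (white) noise spectrum
        T = np.maximum(D.max(axis=0), N)               # Eqs. (5.6)-(5.7)
        Xv = X.reshape(NT, -1).T                       # columns are training spectra
        Tinv = (1.0 / T).reshape(-1, 1)                # diagonal of T^{-1}
        u = np.ones(NT)                                # unit correlation peaks
        A = Xv.conj().T @ (Tinv * Xv)                  # X^+ T^{-1} X
        H = Tinv[:, 0] * (Xv @ np.linalg.solve(A, u))  # Eq. (5.5)
        return H.reshape(Himg, Wimg)

Large c drives the envelope T toward the flat noise floor and yields a low-pass, detection-oriented filter; small c lets the training spectra dominate and yields a more discriminating, identification-oriented filter. Applying the filter to a scene is then an ordinary FFT correlation followed by peak detection.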
5.2. Test Results

Figure 14 provides an example of this multifunctional filter synthesis algorithm. Figure 14(a) shows the input scene. It contains 13 objects as noted in Fig. 14(b) (the values in parentheses indicate the orientation of each object). We formed a MINACE filter trained only on 36 orientations of the ZSU object at 10° intervals with a large c = 0.1 value. This object was chosen since it is the smallest one and we initially desire to implement only detection. Figure 14(c) shows the detection correlation results. We find peaks for all 13 objects and only one false alarm (lower left). Thus, this filter can achieve detection of all interesting ROIs independent of their orientation and in considerable noise. To achieve recognition of only the larger and more dangerous objects (the SCUD and FROG missile launchers), we trained a MINACE filter on only SCUDs and FROGs and used a lower c = 0.05 value. The results (Fig. 14(d)) show correlation peaks at the locations of these five mobile missile launchers. This filter thus achieves recognition of a subclass of objects (large missile launchers) independent of their orientation and in noise. To achieve identification of only the SCUD objects, we form another MINACE filter trained only on SCUDs and with a smaller c = 0.001 value (a lower c provides more discrimination). The results (Fig. 14(e)) locate the three SCUD objects and demonstrate identification. Correlators are well known to be ideal for detection in noise and when multiple objects are present. This example demonstrates how the same basic filter synthesis algorithm can achieve detection, recognition and identification and hence can solve quite complex scene analysis problems.
Fig. 14. Advanced MINACE distortion-invariant hierarchical correlation filter results. (a) Input, (b) input, (c) detection, (d) recognition (SCUDs/FROGs), (e) identification (SCUDs).
Another noteworthy example of filter performance is now briefly noted. For identification or discrimination between two very similar objects, correlation techniques using all object pixels, rather than reduced dimensionality feature space methods, are preferable. Figure 15 shows images of the SA-13 and ZSU-23 objects with about 32 x 12 pixel resolution. As seen, they are quite similar. To identify the SA-13 and discriminate it from the ZSU-23 object when 36 different rotated versions of each object are considered, we used a MINACE filter with c = 0.001 and N_T = 19 training images of the SA-13 at 19 of the 36 distorted angles. This filter successfully recognizes all true class SA-13 objects and yields no correlation peaks above 0.5 for any of the 36 false class ZSU-23 objects. Three other properties of the MINACE filter emerge from this example. As we increase c, we can reduce the size N_T of the required training set (e.g. N_T = 19 rather than 36 here). Use of a flat spectrum for our MINACE noise model thus also effectively models object distortions (e.g. controlling the spatial frequencies used to recognize an object in noise is similar to controlling
Fig. 15. Two similar objects for identification and discrimination (a) SA-13, (b) ZSU-23.
the spatial frequencies to achieve intraclass recognition, as this example has shown). Finally, no false class training images were used (i.e. we could have, but did not, train the filter to produce a zero or low output correlation value for troublesome false class images). This is attractive since one does not generally know every false class object that is possible. In multiple correlation stages of the identification portion of scene analysis, this may be allowable (and necessary) in some cases.

6. Very High-Level Computer Vision (Neural Nets)

Many potential applications for neural nets in computer vision have been advanced [28]. These include image enhancement and feature extraction. We find other techniques (Sections 3 and 4) to be preferable and sufficient for these operations. The major reason is the large number of neurons and interconnections required when the neural net input is an iconic (pixel-based) image representation. For example, one can achieve shift invariance in a neural net with N input neurons by the use of N⁴ interconnections [29]. However, when N = 512², this is very excessive, and since the same property can be achieved with the FT etc., we find such methods to be preferable. When multiple objects are present in the field of view, no neural net can handle all objects in parallel. Conversely, a correlator (Section 5) easily achieves this. A correlator is in fact a most powerful neural net, with the filter function being the set of weights applied to input iconic neurons (an image), with the unique property that the weights are applied in parallel to every region of the input scene. Thus, for such cases, we find a correlator using advanced distortion-invariant filters to be preferable. In general, we find the use of such FT-based free-space optical interconnections to be preferable to other neural net approaches, which achieve shift invariance with much more hardware and many hard-wired forced interconnections (and are not easily achieved without optical processing techniques).
6.1. Neural Net Classifier Algorithm

In our opinion, one of the major uses of neural nets is their ability to provide an algorithm for determining nonlinear piecewise discriminant surfaces for classifiers. We now highlight our adaptive clustering neural net [30] and how it uses linear discriminant functions (linear classifiers) and neural net techniques to achieve a nonlinear classifier. As the input neuron representation space we use feature space neurons (Section 4) obtained for ROIs from a morphological detection processor. The classic three-layer neural net we use is shown in Fig. 7. The input P1 neurons are a feature space (we use wedge |FT| samples in our example). The output P5 neurons indicate the class of the input object. To determine the number of hidden layer neurons, we use standard clustering techniques [31] to select prototypes or exemplars for each class from the full training set. The number of prototypes decided upon is the number of hidden layer neurons N3. We typically use three prototypes per class. We assign each of these to a hidden layer neuron (i.e. we use 3C hidden layer neurons, where C is the number of object classes). Each prototype, and hence each P3 hidden layer neuron, corresponds to a feature vector or a point in the multidimensional feature space. As the initial weights from P1 to each P3 neuron (e.g. P3 neuron i), we use the feature vector p_i for prototype i. This results in a set of P1-to-P3 weights that are classic linear discriminant functions as used in standard pattern recognition. These are only the initial weights. They are then adapted into nonlinear discriminant functions by our neural net algorithm. To achieve this, we add an additional input neuron whose value is minus 0.5 times the sum of the squares of the other weights. Thus, with N_F features, we use N1 = N_F + 1 input P1 neurons with the weights w_ij (from P1 input neuron j to P3 neuron i) described by Eq. (6.1), where p_ij is element j of the prototype vector p_i. This ensures that the hidden layer neuron closest to the input vector at P1 will be the most active one. We use a winner-take-all selection of the most active P3 neuron during classification. We now use neural net training to adapt these initial P1-to-P3 weights (the neural net thus forms weights that are combinations of linear discriminant functions and hence produces piecewise nonlinear discriminant surfaces, as we shall show). To adapt the weights, we present each of the training set of image feature spaces to the neural net and we determine the P3 neuron values for each input g (for P3 neuron i this is simply the vector inner product g^T w_i of the training set input vector g and the weight vector w_i from all P1 neurons to P3 neuron i). We then calculate the most active P3 neuron i(c) in class c of the input vector and the most active neuron i(c̄) in any other class. We denote their weight vectors by w_i(c) and w_i(c̄). For each
input, we then determine an error E for a perceptron error function, Eq. (6.2), where S = 0.05 in our case. After each presentation of the training set, we calculate the derivative ∂E/∂w_i and use it to adapt the weights using a conjugate gradient algorithm [32]. We then present the training set again, calculate E in (6.2), and continue to adapt the weights until convergence or negligible change occurs.
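The prototype-based initialisation and winner-take-all selection described above can be sketched in a few lines of Python. This is only an illustrative reading of the text; the handling of the extra bias input follows the standard distance-to-prototype trick and is our assumption, as are all names and sizes.

import numpy as np

def init_hidden_weights(prototypes):
    """P1-to-P3 weights: one hidden (P3) neuron per class prototype. Each weight
    vector is the prototype plus a bias term of -0.5 * ||prototype||^2, so the
    most active hidden neuron is the one whose prototype is nearest the input."""
    W = []
    for p in prototypes:                       # p is an N_F-element feature vector
        W.append(np.append(p, -0.5 * np.dot(p, p)))
    return np.array(W)                         # shape (N3, N_F + 1)

def most_active_neuron(W, features):
    """Winner-take-all over the P3 activations for one input feature vector."""
    g = np.append(features, 1.0)               # augmented input (bias input fixed at 1)
    activations = W @ g                        # inner products g^T w_i
    return int(np.argmax(activations)), activations

# Hypothetical example: 3 classes x 3 prototypes of 32 wedge-|FT| features each.
prototypes = [np.random.rand(32) for _ in range(9)]
W = init_hidden_weights(prototypes)
winner, act = most_active_neuron(W, np.random.rand(32))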
6.2. Neural Net Classifier Results

To best demonstrate the power of a neural net to produce piecewise nonlinear decision surfaces from linear discriminant functions, we consider the artificial problem in Fig. 16 with 383 samples in three classes (181 in class 1, on the left and bottom, represented by triangles; 97 in class 2, in the center, represented by circles; and 105 in class 3, in the upper right, represented by diamonds). We chose this 3-class example because it uses only two features and the results can thus be displayed. We used our ACNN algorithm to solve this problem using N1 = 3 input P1 neurons (the two features plus one bias neuron), N3 = 2C = 6 hidden layer neurons and N5 = C = 3 output neurons, the number of data classes. Figure 16 shows the piecewise nonlinear decision boundaries produced (they consist of six straight lines, modified combinations of the six initial linear discriminant functions associated with the six hidden layer neurons). The results obtained gave P_c = 97% correct recognition after only 80 iterations during training (in classification our ACNN algorithm is a one-pass non-iterative algorithm). We compared this neural net performance to that of the standard but very computationally expensive multivariate Gaussian
Fig. 16. Discrimination problem showing nonlinear decision surfaces automatically produced by our neural net algorithm [23] (axes: Feature 1 and Feature 2; legend marks classes 1-3 and the decision boundaries).
classifier, which achieved only P_c = 89.5% correct recognition. Thus, as expected, a neural net is necessary to solve this problem. This 2-D (two feature) example is instructive to visually demonstrate the ability of a neural net algorithm to easily compute complex decision surfaces for difficult discrimination problems. We now consider a more complex version of the pattern recognition problem in Section 4, one that requires the use of a neural net classifier. Specifically, we consider three, not two, aircraft (F4, F104 and DC-10) and a larger ±85° range of roll and pitch distortions. Figure 17 shows several views of each aircraft. In each set of images, the top center image is top-down with no distortions in roll or pitch. Each row left-to-right corresponds to pitch angles of -80°, -40°, 0°, +40°, and +80°. From top-to-bottom, the rows correspond to roll angles of 0°, 40° and 80°. We attempted to solve this multiclass pattern recognition problem using the linear classifier of Section 4 and obtained poor results (P_c < 60%). We then used our ACNN algorithm with N1 = 33 input neurons (32 wedge |FT| features plus one bias neuron), N3 = 3C = 9 hidden layer neurons and N5 = C = 3 output neurons, one per class. The training set consisted of the feature spaces for 630 distorted objects per class (3 x 630 = 1890 in total). The test set consisted of over 1800 distorted objects at intermediate roll and pitch distortions between those used in training. The results we obtained were excellent (P_c = 98.6% correct recognition). This example vividly demonstrates the advantage of a neural net over a linear classifier for complex discrimination problems. Thus, in our general block diagram (Fig. 9), we show a neural net classifier used on feature space data calculated for regions of interest (ROIs) obtained from the detection portion of our general scene analysis system.

6.3. Production System Neural Net

Another useful very high-level function in scene analysis is a production system. In this case the various facts learned about each ROI in the scene must be analyzed to obtain further data. To achieve this, one can write a set of IF-THEN rules, e.g.

IF a → b
IF a and c and f → g
IF b → a
IF f and g → c
where the antecedents are the entries to the left, the consequents are those on the right and the arrow denotes THEN. More complex formulations with predicate calculus can be produced, but this example suffices to show the point. One can implement the above production system rules on a neural net as we now discuss [33]. We use a two-layer neural net (optical matrix-vector multiplier) with each fact (antecedent or consequent) assigned to a specific neuron and with an equal number of input and output neurons. We encode the rules in the weights (matrix) as shown in Fig. 18 for the above example.
Fig. 17. Distorted images of the three aircraft used: (a) F4, (b) F104, (c) DC-10.
There are seven input and output neurons (a to g) for this simple example. The first matrix-vector multiplication and the neuron outputs after the first iteration indicate new facts learned from the initial input facts. We feed the output neurons back to the input neurons keeping previously activated input neurons (facts) still “on”. These subsequent iterations allow the system to learn new rules not directly encoded in the original rules. The iterations continue until an output object “consequent” neuron has been activated in which case the object identification of the input scene region has been determined. The optical realization of this system is
Fig. 18. Production system neural net [23] (inputs: from sensors, correlators or feedback; outputs: control signals or feedback).
the simple matrix-vector processor of Fig. 6 with feedback (Section 2). One can extend this basic system in many ways, such as by using analog neurons proportional to the probabilities of each antecedent fact and by use of predicate calculus rather than propositional calculus formulations. We recently [34] demonstrated this system optically in real time for a set of objects composed of generic object parts (circles, rectangles, horizontal and vertical posts, etc.).
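The iterative rule-firing just described can be mimicked digitally with a binary weight matrix and thresholded matrix-vector products. The Python sketch below encodes the four example rules given above; the fact ordering, threshold scheme and starting facts are our own illustrative assumptions, not the chapter's.

import numpy as np

# Facts a..g are neurons 0..6; W[consequent, antecedent] = 1 encodes each rule.
facts = "abcdefg"
idx = {f: i for i, f in enumerate(facts)}
rules = [(["a"], "b"), (["a", "c", "f"], "g"), (["b"], "a"), (["f", "g"], "c")]

W = np.zeros((len(facts), len(facts)))
need = np.zeros(len(facts))                  # number of antecedents each consequent needs
for ants, cons in rules:
    for a in ants:
        W[idx[cons], idx[a]] = 1.0
    need[idx[cons]] = max(need[idx[cons]], len(ants))

def run(initial_facts, iterations=5):
    v = np.zeros(len(facts))
    for f in initial_facts:
        v[idx[f]] = 1.0
    for _ in range(iterations):
        fired = (W @ v) >= np.maximum(need, 1)   # a consequent fires when all its antecedents are on
        v = np.maximum(v, fired.astype(float))   # feed outputs back, keeping earlier facts "on"
    return {f for f in facts if v[idx[f]] > 0}

print(run({"a", "c", "f"}))   # the iterations derive the new facts b and g from the encoded rules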
7. Summary

We have briefly reviewed the role for optics in four levels of general computer vision. As seen, optical processing has a significant role in each area, and optical hardware to achieve these functions is rapidly maturing. The algorithms described in each area of computer vision are novel and can also be implemented digitally. Our new morphological low-level vision algorithm for detection of regions of interest (ROIs) in a scene is most attractive. Its implementation, and the realization of other low-level image enhancement operations on an optical correlator, are also most attractive. We use a high-level vision optical correlator with new distortion-invariant filters to further analyze ROIs (and in some cases for detection and even object identification). For the general case, once ROIs have been detected, initially analyzed by a correlator and enhanced (if necessary) by morphological techniques, we perform feature extraction and finally use a neural net (for complex multiclass problems) for object identification of each ROI.

References
[1] H. Stark (ed.), Applications of Optical Fourier Transforms (Academic Press, 1982).
[2] S. H. Lee (ed.), Optical Information Processing, Vol. 28, Topics in Applied Physics (Springer-Verlag, 1981).
[3] G. I. Vasilenko and L. M. Tsibul'kin, Image Recognition by Holography (Consultants Bureau, 1989).
[4] D. Casasent (ed.), Optical Pattern Recognition, Proc. SPIE, Vol. 201, 1979. P. Schenker and H. K. Liu (eds.), Optical and Digital Pattern Recognition, Proc. SPIE, Vol. 754, 1987.
[5] D. Casasent and A. Tescher (eds.), Hybrid Image Processing, Proc. SPIE, Vol. 638, 1986. D. Casasent and A. Tescher (eds.), Hybrid Image and Signal Processing, Proc. SPIE, Vol. 939, 1989. D. Casasent and A. Tescher (eds.), Hybrid Image and Signal Processing II, Proc. SPIE, Vol. 1279, 1990.
[6] B. V. K. Vijaya Kumar (ed.), Optical Engineering, Special Issue on Optical Pattern Recognition 29, 9 (1990).
[7] D. L. Flannery and J. L. Horner, Fourier optical signal processors, Proc. IEEE 77 (1989) 1511-1527.
[8] D. Casasent, Coherent optical pattern recognition: A review, Optical Engineering 24 (1985) 26-32.
[9] G. G. Lendaris and G. L. Stanley, Diffraction-pattern sampling for automatic target recognition, Proc. IEEE 58 (1979) 198-205.
[10] P. Maragos, Tutorial: Advances in morphological image processing and analysis, Optical Engineering 26 (1987) 623-632.
[11] P. C. Lindberg and C. F. Hester, The challenge to demonstrate an optical pattern recognition system, Proc. SPIE, Vol. 1297, Apr. 1990, 72-76.
[12] D. A. Gregory, J. C. Kirsch and J. A. Loudin, Optical correlators: optical computing that really works, Proc. SPIE, Vol. 1296, Apr. 1990, 2-19.
[13] J. Goodman, A. R. Dias and L. Woody, Fully parallel high-speed incoherent optical method for performing discrete Fourier transforms, Optics Letters 2 (1983) 1-3.
[14] J. Ohta, M. Takahashi, Y. Nitta, S. Tai, K. Mitsunaga and K. Kyuma, A new approach to a GaAs/AlGaAs optical neurochip with three layered structure, in Proc. IJCNN Int. Joint Conf. on Neural Networks, Washington, D.C., Jun. 1989, Vol. II, II-477-II-480.
[15] D. Casasent and E. Botha, Optical symbolic substitution for morphological transformations, Applied Optics 27 (1988) 3806-3810.
[16] D. Casasent, R. Schaefer and J. Kokaj, Morphological processing to reduce shading and illumination effects, Proc. SPIE, Vol. 1385, 1990, 152-164.
[17] D. Casasent and R. Schaefer, Optical implementation of gray scale morphology, Proc. SPIE, Vol. 1658, Feb. 1992.
[18] D. Casasent, R. Schaefer and R. Sturgill, Optical hit-or-miss morphological transform, Applied Optics 31 (1992) 6255-6263.
[19] R. Duda and P. Hart, Pattern Classification and Scene Analysis (John Wiley and Sons, New York, 1973).
[20] B. Telfer and D. Casasent, Ho-Kashyap optical associative processors, Applied Optics 29 (1990) 1191-1202.
[21] D. Clark and D. Casasent, Practical optical Fourier analysis for high-speed inspection, Optical Engineering 27, 5 (1988) 365-371.
[22] J. Richards and D. Casasent, Real-time optical Hough transform for industrial inspection, Proc. SPIE, Vol. 1192, 1989, 2-21.
[23] D. Casasent, Optical processing and hybrid neural nets, Proc. SPIE, Vol. 1469, Apr. 1991.
[24] D. Casasent, Unified synthetic discriminant function computational formulation, Applied Optics 23 (1984) 1620-1627.
[25] B. V. K. Vijaya Kumar and E. Pochapsky, Signal-to-noise ratio considerations in modified matched spatial filters, J. Opt. Soc. Am. A 3 (1986) 777-786.
[26] A. Mahalanobis, B. V. K. Vijaya Kumar and D. Casasent, Minimum average correlation energy (MACE) filters, Applied Optics 26 (1987) 3633-3640.
[27] G. Ravichandran and D. Casasent, Minimum noise and correlation energy (MINACE) optical correlation filter, Applied Optics 31 (1992) 1823-1833.
[28] H. Wechsler (ed.), Neural Networks for Human and Machine Perception (Academic Press, 1991).
[29] C. L. Giles, R. D. Griffen and T. Maxwell, Encoding geometric invariances in higher-order neural networks, in D. Anderson (ed.), Neural Information Processing Systems, Denver, CO (AIP, 1988) 301-309.
[30] D. Casasent and E. Barnard, Adaptive clustering optical neural net, Applied Optics 29 (1990) 2603-2615.
[31] T. M. Cover and P. E. Hart, Nearest neighbor pattern classification, IEEE Trans. Inf. Theory 13 (1967) 21-27.
[32] M. J. D. Powell, Restart procedures for the conjugate gradient method, Mathematical Programming 12 (1977) 241-254.
[33] E. Botha, D. Casasent and E. Barnard, Optical production systems using neural networks and symbolic substitution, Applied Optics 27 (1988) 5185-5193.
[34] D. Casasent and E. Botha, Optical correlator production system neural net, Applied Optics 31 (1992) 1034-1040.
[35] D. Casasent, Optical morphological processors, Proc. SPIE, Vol. 1350, 1990, 380-394.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 891-924
Eds. C. H. Chen, L. F. Pau and P. S. P. Wang
© 1998 World Scientific Publishing Company

CHAPTER 5.3

INFRA-RED THERMOGRAPHY: TECHNIQUES AND APPLICATIONS*
M. J. VARGA and P. G. DUCKSBURY
Image Processing and Interpretation, Defence Evaluation and Research Agency,
St Andrews Road, Malvern, WR14 3PS, UK

Infra-red (IR) technology is applied in a wide range of application domains, e.g. military, medical and others. All objects, live or dead and of any colour, emit infra-red radiation by virtue of their temperature; the exact degree of radiation is determined by the absolute temperature and the thermal characteristics of the material from which it is made. The radiation is present day or night, with or without external illumination. Infra-red technology is concerned with the detection and imaging of the emitted infra-red radiation. Infra-red imaging is therefore a method for producing an image of the heat emitted from any object's surfaces. A thermogram is a calibrated graphic record of the temperature distribution obtained by thermography.

Keywords: Infra-red thermography, medical imaging, Bayesian networks, target detection.

1. Introduction

Infra-red (IR) technology is applied in a wide range of application domains, e.g. military, medical and others. All objects, live or dead and of any colour, emit infra-red radiation by virtue of their temperature; the exact degree of radiation is determined by the absolute temperature and the thermal characteristics of the material from which it is made. The radiation is present day or night, with or without external illumination. Infra-red technology is concerned with the detection and imaging of the emitted infra-red radiation. Infra-red imaging is therefore a method for producing an image of the heat emitted from any object's surfaces. A thermogram is a calibrated graphic record of the temperature distribution obtained by thermography. In this chapter, both the hardware and some of the processing techniques associated with infra-red technology are discussed. Examples of military and medical applications will demonstrate the versatility of infra-red imagery.

*The chapter is published with the permission of the Controller of Her Britannic Majesty's Stationery Office. © British Crown Copyright 1998/DERA.
1.1. Infra-Red Wavebands

Spectrally, infra-red radiation is located between the visible and radio frequencies. Infra-red radiation is generally thought of in three spectral bands: Short Wavelength Infra-red (SWIR), also called near infra-red, lying between 0.7-2.0 μm; Medium Wavelength Infra-red (MWIR), ranging from 3.0-5.0 μm; and Long Wavelength Infra-red (LWIR), between 8.0-14.0 μm. Both the MWIR and the LWIR are strongly absorbed by water and organic compounds. Infra-red sources are either thermal (i.e. emitted by matter in the temperature range 100-3000 K) or electronic (i.e. emitted by high-energy electrons interacting with magnetic fields) [1-3].

1.2. Infra-Red Detectors

There are two types of infra-red detectors: those that require cooling (cooled) and those that do not (uncooled). Cooled infra-red detector systems are bigger and more expensive than uncooled IR detector systems, but are more sensitive and can produce higher resolution images. Uncooled IR detectors, on the other hand, are cheaper, lighter, more compact and hence more portable. At present, however, they tend to be less sensitive and are commonly used for 8-14 μm only. The heart of a cooled thermographic camera is an infra-red photo-detector, typically made of indium antimonide (InSb) or cadmium-mercury-telluride. This lies in the focal plane of the camera and is cooled by liquid nitrogen. Uncooled infra-red detectors typically use pyroelectric bolometer techniques. The infra-red radiation from an object is focused on the detector; the focusing system can be based on either refractive or reflective optics. The thermogram can be produced by an array of detectors which converts the infra-red radiation directly into an image. Alternatively, a scanning system can be used in which an image is built up by mechanically scanning the image onto a single detector or a linear or two-dimensional array. The signal can be represented as a grey-level or colour coded image. Long linear array detectors can be used with simple scanning mechanisms to generate high performance imaging over wide fields of view. Two-dimensional focal plane arrays increasingly provide the basis of systems which require no scanning and offer high sensitivity. There is a wide and developing range of infra-red focal plane array sensors using different detector technologies. Common to the development of all these arrays is the continual increase in thermal sensitivity. This enhanced sensitivity may be used directly or compromised to provide different operating designs. The available optical system and the required frame rate determine the choice of read-out process required to achieve a given sensitivity [4].

2. Military Infra-Red Imaging

Infra-red technology is applied in a variety of military applications and there is a need for both cooled and uncooled systems. Cooled IR detectors offer high performance when required, for example in weapon sights, surveillance systems, remote ground sensors, imaging IR seekers, non-co-operative target recognition,
mine sensors, driving aids, fire fighting and rescue. Where lower performance is acceptable, un-cooled infra-red detectors reduce the logistic burden and their low power and compactness is particularly useful for remote and autonomous operation.
3. Military Surveillance: Downward Looking IR Imagery

The fundamental task here is generically defined as the location of some region of interest in an image; for example, the fast and automatic detection of urban regions [6]. This could be used as a cueing aid for more detailed processing, such as the detection of road networks, junctions and buildings, in order to allow registration of imagery to maps for navigation, for change and target detection, image distortion correction as well as map update. It could also be an attention cueing device for human image interpreters. This section describes the use of a Pearl Bayes Network (PBN) for the automatic extraction of knowledge about regions from infra-red linescan imagery, i.e. surveillance imagery.
3.1. Infra-Red Linescan Imaging and Correction

The aerial infra-red linescan imagery used in this application is produced by a sensor which has a single detector. Scanning in the x direction is achieved via a rotating mirror which has a uniform angular velocity and gives a 120° field of view. Scanning in the y direction is achieved by the aircraft motion. The scanner arc introduces a characteristic sec² distortion into the imagery at either extreme of the 120° arc. This can be corrected with a relatively simple trigonometric transformation. Figure 1 illustrates the relationship, in which h is the height of the aircraft, β is the bank angle of the aircraft, Δx is the ground resolution of a single pixel and Δθ is the corresponding swathe angle for that pixel. The equation for the correction is

Δx = Δθ h / cos²(θ + β).

Fig. 1. Linescan Sensor Distortion (slant range r = h / cos(θ + β)).
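As a quick numerical illustration of this sec² correction, the Python sketch below converts a swathe-angle increment into the ground footprint of a pixel across the scan. The symbol names follow the description above; the sample numbers are arbitrary.

import math

def ground_resolution(height_m, bank_rad, scan_angle_rad, delta_theta_rad):
    """Ground footprint of one pixel: delta_x = delta_theta * h / cos^2(theta + beta)."""
    return delta_theta_rad * height_m / math.cos(scan_angle_rad + bank_rad) ** 2

# Aircraft at ~1200 m, level flight (zero bank), 1 mrad angular step:
for theta_deg in (0, 30, 60):
    dx = ground_resolution(1200.0, 0.0, math.radians(theta_deg), 1e-3)
    print(f"theta = {theta_deg:2d} deg -> pixel footprint ~ {dx:.2f} m")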
3.2. Pearl Bayes (Belief) Networks

Pearl Bayes networks [7] are directed acyclic graphs; see Fig. 2. In this graph, nodes B, C and E represent different statistical information extracted from the image, whilst node A represents the "belief" in detecting an urban patch. A graph
Fig. 2. Pearl Bayes Network.
G is a pair of sets (V, A) for which V is non-empty. The elements of V are vertices (nodes) and the elements of A are pairs (x, y) called arcs (links), with x ∈ V and y ∈ V. Consider the simple network that is shown in Fig. 2. Here the symbol π represents the causal support (or evidence) whilst λ represents the diagnostic support (or evidence). G⁺_BA and G⁻_BA are subgraphs as described in the next section, together with the equations for computing the belief and propagation of information.(a)

3.2.1. Belief equations

Consider the link from node B to A; then the graph G consists of the two subgraphs G⁺_BA and G⁻_BA. These two subgraphs contain the datasets D⁺_BA and D⁻_BA respectively. From Fig. 2 it can be observed that node A separates the subgraphs G⁺_BA ∪ G⁺_CA ∪ G⁺_EA and G⁻_AF. Given this fact we can write the equation

P(D⁻_AF | A_i, D⁺_BA, D⁺_CA, D⁺_EA) = P(D⁻_AF | A_i)   (3.1)
(a) The equations are derived along similar lines to those derived by Pearl in [7], where in his example node A has just two predecessors and two successors.
By using Bayes' rule, the belief in A_i can then be written as Eq. (3.2), where α is taken to be a normalizing constant. It can be seen that Eq. (3.2) is computed using three types of information:

• Causal support π (from the incoming links).
• Diagnostic support λ (from the outgoing links).
• A fixed conditional probability matrix (which relates A with its immediate causes B, C and E).
Firstly, there are the causal support equations; secondly, the diagnostic support equation, Eq. (3.6); and finally, the conditional probability matrix. The belief equation (3.2) can now be rewritten in order to obtain the belief at node A based on the observations at B, C and E, e.g. the belief that an urban region is detected (Eq. (3.8)). The belief at nodes B, C and E can be obtained from Eqs. (3.9)-(3.11).
In other words, the belief is the resultant product of causal support information, diagnostic support information (belief) and prior knowledge. The propagation equations described below are iterated to support belief of a certain event.

3.2.2. Propagation equations
The propagation equations for the network are derived as follows, firstly for the diagnostic support. By analogy with Eq. (3.6) we can write Eq. (3.12); by partitioning D⁻_BA into its component parts, namely A, D⁻_AF, D⁺_CA and D⁺_EA, we obtain Eq. (3.13), and likewise for λ_A(C_j) and λ_A(E_k), Eqs. (3.14) and (3.15).
3.2.3. Causal equations
These are defined using a similar analogy, giving Eq. (3.16), and from this we then derive Eq. (3.17). An important point to realise is that Eqs. (3.13)-(3.15) and (3.17) demonstrate that the parameters λ and π are orthogonal to each other, i.e. perturbation of one will not affect the other. Hence evidence propagates through a network and there is therefore no reflection at boundaries.
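Although the explicit forms of Eqs. (3.2)-(3.17) are not reproduced here, the belief fusion they describe follows Pearl's standard scheme: the belief at A is the normalised product of the diagnostic support λ(A) and the causal support obtained by marginalising the fixed conditional probability matrix over the π messages from B, C and E. The Python sketch below illustrates that computation for a node with three discrete-valued parents; the array shapes and example numbers are our own assumptions, not values from the chapter.

import numpy as np

def belief_at_A(cpm, pi_B, pi_C, pi_E, lam_A):
    """BEL(A_i) = alpha * lam(A_i) * sum_{j,k,l} P(A_i|B_j,C_k,E_l) pi(B_j) pi(C_k) pi(E_l).
    cpm has shape (nA, nB, nC, nE); the pi/lam arguments are 1-D message vectors."""
    causal = np.einsum('ijkl,j,k,l->i', cpm, pi_B, pi_C, pi_E)  # causal support for each A_i
    bel = lam_A * causal                                        # combine with diagnostic support
    return bel / bel.sum()                                      # alpha normalises to a distribution

# Hypothetical 2-state node A (urban / not urban) with three 3-level parents.
rng = np.random.default_rng(0)
cpm = rng.random((2, 3, 3, 3))
cpm /= cpm.sum(axis=0, keepdims=True)            # each parent configuration sums to 1 over A
pi_B = np.array([0.0, 0.7, 0.3])                 # causal messages from B, C, E
pi_C = np.array([0.1, 0.6, 0.3])
pi_E = np.array([0.2, 0.5, 0.3])
lam_A = np.array([1.0, 1.0])                     # no diagnostic evidence from below yet
print(belief_at_A(cpm, pi_B, pi_C, pi_E, lam_A))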
Fig. 3. Multi-resolution approach (s1-s3: edges, extrema and distribution type from the fine (full) resolution image; s4-s6: the same statistics from the coarse (half) resolution image; the beliefs from the fine and coarse resolutions are combined into an overall belief).
3.3. Region Segmentation Using a Pearl’s Bayes Network
The above Pearl's belief network approach [7] has been adapted for the detection of urban regions [8,11], using a high powered parallel processing system for improved performance. The belief network is used in a multi-resolution sense to combine statistical measures of texture into the detection (or belief) of the required region(b); see Fig. 3. The problem is approached by taking several statistical measures from small patches of an image, which are treated as a set of judgements about the content of the patches. These statistics are the number of edges, the number of extrema and the grey level distribution type. These statistics are quantised down into a small number of levels. The number of edges and extrema are both reduced to five levels, whilst the distribution type has four possibilities. It is important to stress that any suitable measure that provides the required textural discrimination could have been used. The statistics are then used to produce a set of judgements; for example, an expert might, upon looking at a particular window, issue a report of the form (0.0, 0.7, 0.9, 0.6, 0.0). This means that he believes there is a 70% chance that level 2 describes the number of edges, a 90% chance that level 3 does, and a 60% chance for level 4. But he believes there to be no chance of it being levels 1 or 5.

(b) The authors have recently carried out some initial work into using the belief network approach at a higher level of abstraction, i.e. for the combination of several region finding algorithms.
For the purpose of the system described here, the belief at nodes B_f, B_c and B in Fig. 3 is quantised to three levels, namely (low, medium, high). The fixed conditional probability matrices (i.e. P(B_f | s1, s2, s3) etc.), which are the prior information and relate the given node with its causal information, are created along similar lines to the approach used in [9] and [10]. They are based upon the assumption that the probability of an event at a given node should be greater if its causal information is more tightly clustered together than it should be if the causal information is further apart. For the P(B | B_f, B_c) matrix (which relates the beliefs from the fine and coarse resolutions), slightly more emphasis is given to the causal information received from the coarse resolution belief. P(B_fi | s1_j, s2_k, s3_l) is described formally as

P(B_fi | s1_j, s2_k, s3_l) =
  0.75      if i = j = k = l
  0.25/α    if (i ≠ j = k = l) AND (0 < |i − j| ≤ C)
  1.0/β     if (max(j, k, l) − min(j, k, l) ≤ 2C) AND (min(j, k, l) ≤ i ≤ max(j, k, l))
  0.0       otherwise                                                   (3.18)

such that Σ_{j,k,l} P(B_fi | s1_j, s2_k, s3_l) ≤ 1 ∀i, where C = 1 and i, j, k, l range over the number of levels in B_f, s1, s2 and s3 respectively; α and β represent the number of different values of i satisfying the corresponding constraint. P(B | B_f, B_c) is defined as
P(B_i | B_fj, B_ck) =
  0.9   if i = j = k
  0.7   if (i = j) AND (|i − k| ≤ 1)
  0.3   if (i = k) AND (|i − j| ≤ 1)
  0.6   if (i = j) AND (|i − k| > 1)
  0.1   if (i = k) AND (|i − j| > 1)
  0.0   otherwise                                                       (3.19)

such that Σ_{j,k} P(B_i | B_fj, B_ck) ≤ 1 ∀i, where i, j, k range over the number of levels in B, B_f and B_c respectively.
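A small Python sketch of how such a clustering-based conditional probability matrix can be generated is given below. It follows the reconstruction of Eq. (3.18) above (three belief levels, statistics quantised to five or four levels, C = 1); the exact case values are as reconstructed and should be treated as illustrative only.

import numpy as np

def build_cpm(n_b=3, n_s=(5, 5, 4), C=1):
    """P(B_fi | s1_j, s2_k, s3_l) built from the clustering rules of Eq. (3.18).
    Levels are 1-based in the text; here indices are 0-based."""
    P = np.zeros((n_b,) + n_s)
    for j in range(n_s[0]):
        for k in range(n_s[1]):
            for l in range(n_s[2]):
                lo, hi = min(j, k, l), max(j, k, l)
                for i in range(n_b):
                    if i == j == k == l:
                        P[i, j, k, l] = 0.75
                    elif j == k == l and i != j and 0 < abs(i - j) <= C:
                        alpha = sum(1 for m in range(n_b) if m != j and 0 < abs(m - j) <= C)
                        P[i, j, k, l] = 0.25 / alpha
                    elif hi - lo <= 2 * C and lo <= i <= hi:
                        beta = sum(1 for m in range(n_b) if lo <= m <= hi)
                        P[i, j, k, l] = 1.0 / beta
                    # otherwise the entry stays 0.0
    return P

cpm = build_cpm()
print(cpm.shape, cpm.sum(axis=0).max())   # per-configuration mass is at most 1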
3.4. Performance

An example result is shown in Fig. 4, which demonstrates the location of urban regions in a near infra-red image. The image was taken during the night at approximately 4000 feet. An outline has been drawn around the areas labelled as most likely to be urban. It is possible to post-process such a result to remove small isolated regions which are likely to be erroneous classifications; this has not been done here. A probability surface could also have been integrated into the image to produce a smoother transition between the areas of different classification.
Fig. 4. Outline of Potential Urban Region.
This system can be easily adapted to alternative applications of a broadly similar nature, i.e. classifying or clustering regions. The only change necessary may be a different set of statistics which more accurately describe the detail required in the image. In addition, if the number of input nodes alters then the prior knowledge in the fixed conditional probability matrix will need to change. However, the set of basic equations given previously can be used to automatically generate this information. This approach has been demonstrated successfully for the texture based segmentation of driveable regions for autonomous land vehicles, and more recently for urban region segmentation in both SPOT and Russian satellite imagery [8,12,13]. The algorithm was developed initially on a SUN workstation prior to implementation on a parallel processor architecture developed within the UK Defence Research Agency [14].

4. Target Detection and Tracking in Forward Looking IR Imagery

4.1. Introduction

One of the most powerful features in any modern battlefield surveillance system is the ability to carry out automatic detection and tracking of targets. The amount of information being presented to human operators is increasing at an alarming
rate and needs to be reduced. Any system that can simply filter the data to present results of possible targets without all the intermediate information will be of significant benefit. The task here is generically defined as the detection of small point-sized "hot" targets in IR imagery. These potential targets will be within some operator specified sizes, typically 2 x 2 to 10 x 10 pixels. The resulting detection can be used to aid subsequent target recognition through cueing for a narrow field of view imager and/or a human operator. The requirement for the system described here was not necessarily to locate all targets, or indeed just the targets, but rather to locate possible areas of interest for further analysis. The wide field of view sensor used produces very small potential targets. These targets have low contrast and low signal-to-noise ratios, which makes their detection difficult.

4.2. System Overview

The system combines some "conventional" image processing techniques with morphological analysis to perform automatic cueing and tracking of small objects/targets. Most stages of the process have been deliberately chosen because of their suitability for future implementation in special DSP hardware modules. The process is shown schematically in Fig. 5. Only some of the main elements of this system are considered below.
Fig. 5. Algorithm Block Diagram (destriping, median filter, thresholding, morphological closing, connected component labelling, target elimination, tracking/prediction with local morphological dilation feedback, display annotation).
4.2.1. Destriping

Sensor imperfections mean that the imagery contains marked fixed pattern stripe artefacts. A destriping stage is therefore required to reduce or even to eliminate the banding effects that occur in this type of imagery. It is likely that future generations of infra-red imagers for this application will produce images of improved quality with imperceptible striping artefacts, and hence such a destriping algorithm will become unnecessary. The destriping algorithm [15] removes bias variation between adjacent scan lines. Two adjacent scan lines i and i + 1 are equalised by determining the distribution of differences between adjacent pixels along the scans. The majority of entries should reflect differences in the baseline between the two scan lines. An additive correction to scan line i + 1 is obtained from the median of the difference distribution, the median being used as it is a statistically robust measurement. This process is then repeated using scan lines i + 1 and i + 2 and so on. The disadvantage of this approach is that a software implementation is relatively slow. So an alternative scheme was developed as an intermediate measure (prior to hardware design and implementation of the full destriper). The approach is basically to model the sensor responses in order to estimate a set of corrective factors. If it is assumed that the image is uniform over a number of neighbouring scan lines, then any differences should be due to the sensor itself. The median of each image row is obtained and the maximum of all medians is taken as the maximum sensor response. The difference of the maximum response and the median of each row can then be used as an additive amount for that row. The analysis can be done on the first frame in the sequence and then at successive frames the corrective amounts are simply added to each row.
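A compact Python version of the simpler row-median scheme described above might look as follows; the frame is assumed to be a 2-D array with one scan line per row, and all names are ours.

import numpy as np

def destripe_offsets(frame):
    """Per-row additive corrections: difference between the maximum row median
    (taken as the maximum sensor response) and each row's median."""
    row_medians = np.median(frame, axis=1)
    return row_medians.max() - row_medians          # one offset per scan line

def destripe_sequence(frames):
    """Estimate the offsets on the first frame, then apply them to every frame."""
    offsets = destripe_offsets(frames[0])
    return [f + offsets[:, None] for f in frames]

# Example: synthetic striped imagery.
rng = np.random.default_rng(1)
clean = rng.normal(100.0, 5.0, size=(240, 320))
striped = clean - rng.uniform(0, 10, size=(240, 1))  # per-line bias
corrected, = destripe_sequence([striped])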
4.2.2. Median filter

The median filter [16,17] is used to remove the salt-and-pepper type of noise. In a small window over an image the pixel values are likely to be homogeneous, with only a small number of them being attributable to noise. These noisy pixels tend to be at either extreme of the grey level distribution and therefore are unlikely to be selected as the output from a median filter (the output being the median of the ranked inputs). This filter has the advantage of reducing noise without having a smoothing effect on the image. In this instance, since the possible targets are very small, only a 2 x 2 median filter(c) is applied; despite this small window size the filter is successful in removing noise.

(c) An important point to be considered is that a median filter, when used for noise suppression, can in fact be replaced by grey scale morphology. This point does not really apply to this particular algorithm, as the filter we use only has a small kernel. It is however important and worth mentioning. A morphological opening followed by a closing operator can achieve the same effect as a median filter. Morphology has two distinct noise suppression stages: the opening suppresses positive noise impulses whilst the closing suppresses negative noise impulses.
4.2.3. Thresholding

Thresholding [17,18] is used to segment objects from the background. It is useful, whenever possible, to calibrate detected feature values (e.g. grey level) so that a given amplitude interval represents a unique object characteristic. There are various useful adaptive thresholding schemes, for instance, based on the examination of local neighbourhood histograms or other measures. Unfortunately these approaches can produce a significant amount of noise, and in this application such noise would pose a major problem due to the small target sizes. In an effort to minimise the problem for this application, the thresholding scheme used is based upon the global mean and variance of the portion of the imagery being processed. The threshold is set at μ + 3σ. This proved to be an acceptable level for this application domain, but an option has been provided in the algorithm for this to be varied interactively during run time (see also the section below on local morphological dilation).
4.2.4. Morphological closing

The term morphology originally comes from the study of the forms of plants and animals. In image processing it means the study of the topology or structure of objects from their images. Morphological processing refers to certain operations where an object is "hit" with a structuring element and hence "reduced" to a more revealing shape. Most morphological operations are defined in terms of two basic operations, namely erosion and dilation. Erosion is a shrinking operation, whereas dilation is an expansion operation; erosion of an object is accompanied by enlargement or dilation of the background. If X, the object, and K, the morphological structuring element, are thought of as sets in two-dimensional Euclidean space, then the erosion of X by K is the set of all points x such that K_x is included in X, where K_x is the translation of K so that its origin is located at x. The dilation of X by K is the set of all points x such that K_x intersects with X. The morphological closing operator is defined as a dilation followed by an erosion. Closing aims at blocking up narrow channels and thin lakes and is ideal for the study of inter-object distance. The reasons for applying a morphological closing operator in this application are twofold. Consider the thresholding of an image; this can and does result in the fragmentation of objects. Firstly, an object which is a target could be fragmented into several parts, thus leading to the possibility of several targets being detected instead of one. Secondly, an object which is not a target (perhaps by virtue of its size) could be fragmented into small parts which are then likely to be identified as possible targets. To resolve this problem a morphological closing operator is applied in an attempt to piece the fragments back together. The structuring element kernel used is deliberately kept small to try to avoid merging several genuine targets. A fuller and more detailed description of morphology can be found in numerous papers in the literature, see, for example, [19-23].
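For reference, a minimal NumPy-only sketch of binary dilation, erosion and closing on the thresholded image is given below (a library such as scipy.ndimage or OpenCV would normally be used instead); the 3 x 3 structuring element is an arbitrary illustrative choice.

import numpy as np

def dilate(img, k):
    """Binary dilation: output is 1 wherever the structuring element k hits the object."""
    kh, kw = k.shape
    padded = np.pad(img, ((kh // 2,), (kw // 2,)), mode="constant")
    out = np.zeros_like(img)
    for dy in range(kh):
        for dx in range(kw):
            if k[dy, dx]:
                out |= padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def erode(img, k):
    """Binary erosion, via duality with dilation on the complemented image."""
    return 1 - dilate(1 - img, k[::-1, ::-1])

def close(img, k):
    """Closing = dilation followed by erosion; fills small gaps between fragments."""
    return erode(dilate(img, k), k)

binary = (np.random.rand(64, 64) > 0.995).astype(np.uint8)   # sparse thresholded "hot" pixels
closed = close(binary, np.ones((3, 3), dtype=np.uint8))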
4.2.5. Connected component labelling
The objective of connected component labelling (CCL) is to take a binary image and to apply a segmentation in order to obtain a set of connected regions, each of these disjoint regions being labelled with a unique identifier. Although this stage is currently performed in software, a design and hardware module exists for future use [24]. This fundamental process is important in many applications and can be used as an input to shape recognition tasks.

4.2.6. Target elimination

Once a labelled image has been obtained, regions can be discarded according to a number of criteria. The human operator will initially have specified a set of bounds for targets of interest. Regions which have a width, height or total area outside these constraints are discarded. It is not possible to discard objects based upon shape without knowledge of the type of targets; such information is unavailable in this application domain.

4.2.7. Tracking/prediction
Once an acceptable set of regions has been obtained, the co-ordinates of the centre points are passed to the tracking process. Tracking introduces an element of temporal consistency into the system. This is used to resolve a number of issues such as false targets (due to segmentation errors or genuine noise in the imagery), targets appearing and disappearing, and overlapping targets. Once these issues have been resolved a prediction stage is performed to estimate the target's position in the next frame. Targets develop a "history" over n frames and therefore isolated noise which appears for n - 1 frames or less will not be tracked and can be eliminated. The initial part of the tracking is actually an association stage where observations are associated with tracks. This uses a standard assignment optimisation algorithm [25] which was modified by [26] to deal with targets which appear and disappear. It was also modified by the authors to resolve the problem of several observations being identical distances from a given track but outside the permitted (gated) regions for all other tracks. This condition appeared to cause the standard algorithm to fail to converge to an optimum assignment. Kalman filtering [27] is the classical approach for the prediction of a target's new position. It is the optimal predictor for tracking. It has been shown [28,29] that if the x and y target co-ordinates can be decoupled, the Kalman filter can be reduced to the so-called α-β filter, which is much simpler and requires no matrix multiplication.
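A one-dimensional α-β tracker of the kind alluded to above can be written in a few lines of Python; applying it independently to the decoupled x and y target co-ordinates gives the full 2-D predictor. The gain values and sample measurements below are illustrative assumptions.

class AlphaBetaFilter:
    """Simple alpha-beta tracker for one decoupled co-ordinate (x or y)."""

    def __init__(self, x0, alpha=0.85, beta=0.005, dt=1.0):
        self.x, self.v = x0, 0.0
        self.alpha, self.beta, self.dt = alpha, beta, dt

    def update(self, measurement):
        # Predict forward one frame, then correct with the measured residual.
        x_pred = self.x + self.v * self.dt
        r = measurement - x_pred
        self.x = x_pred + self.alpha * r
        self.v = self.v + (self.beta / self.dt) * r
        return self.x

    def predict(self):
        # Predicted position for the next frame (used to gate the association stage).
        return self.x + self.v * self.dt

track_x = AlphaBetaFilter(100.0)
for z in (101.0, 103.2, 104.9, 107.1):     # measured column of a target centre, frame by frame
    track_x.update(z)
print(track_x.predict())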
4.2.8. Local morphological dilation (Adaptive threshold feedback)

An important point arising from the thresholding is the difficulty in setting a threshold level at just the correct value for detection of all targets. As has been
mentioned previously, it is possible for noise to be included and also for a genuine target (but one which is very small and/or emitting low thermal radiation) to be excluded from the thresholded image. The effect of this has been reduced by incorporating feedback from the tracking algorithm and essentially using the track confidence to adapt the output from the thresholded image. This is achieved simply by performing a local morphological dilation in an area around known targets (targets that have developed a history). This attempts to enlarge the thresholded output to a point where it would be accepted by the target elimination stage and effectively reduces the number of dropouts due to weak targets. If a target genuinely disappears then this approach will have no effect.

4.2.9. Performance

A target (or rather an area of possible interest) is indicated by a diagonal line in Fig. 6, the lower part of which points towards the target centre whilst the upper part has a unique target identifier.
Fig. 6. Target Detection.
The algorithm was initially developed on a SUN workstation prior to implementation on a parallel architecture. In order to increase performance to real-time the algorithm has now been ported onto a version of the architecture which includes a number of dedicated DSP modules.
5. Medical Applications

5.1. Infra-Red Imaging in Medicine

Infra-red thermography has been used clinically for over thirty years. It provides a means of obtaining high quality images of thermal patterns on the surface of the human body. The IR systems used for medical applications should have a sensitivity that varies from 0.01°C to 1°C and respond to temperatures in the range 5°C to 40°C, depending on the particular system and the part of the body to be examined [30]. In these systems any error caused by variation in the detector's response can be calibrated out, for example, by alternately detecting the radiation from the object and the radiation from a reference source.

The correlation between the skin temperature and underlying malignant disease was first realised in the case of breast cancer. This inevitably resulted in initial infra-red thermographic studies being concentrated on breast diseases [31]. However, there were problems, for instance, with limited sensitivity to deep lying tumours and poor control of environmental conditions during examination and recording. The majority of the detected radiation comes from the topmost layers of the skin, i.e. from the body surface to 300 microns depth. The surface temperature of the skin is affected by both internal and external factors. The internal factors can be pathological or physiological, while the external factors are a function of ambient conditions, such as temperature, humidity and air flow. Indeed ambient air flow is very important in medical thermography, and a uniform environment without any cooling draught, direct warmth of sunlight, or radiators etc., is essential. In general an ambient air temperature of between 18°C and 20°C has been found to be appropriate. Temperatures below 18°C may induce a cold stress response and shivering, resulting in "flicker" thermograms, while temperatures above 20°C may cause sweating and create other anomalies and noise on the image. It is also important for the area of the body under examination to reach a steady state in controlled environmental conditions. In some cases, it is necessary for the patient to partially undress so as to expose the area of the body to be examined directly to the ambient temperature for a short stabilisation period (10-15 minutes is usually enough). Loose clothing and underwear are required to avoid altering the local blood flow (and thus the overlying skin temperature) through the pressure and restriction caused by tight-fitting garments. Dressings, ointment or any other surface moisture will affect, to a certain degree, the infra-red emission from the skin. These must all be eliminated prior to equilibration if thermography is to be used in a controlled manner. This sensitivity is due to the fact that infra-red radiation in the wavelengths typically used (i.e. between 3-5 or 8-12 μm) is strongly absorbed by water and organic compounds [32]. If environmental conditions are adequately controlled, heat emission from the skin is largely determined by the underlying blood supply. In the absence of deeper lying organic disease or other factors which may indirectly alter skin blood flow, the thermographic image of the heat emitted by the skin may be interpreted in terms of the status of the underlying peripheral circulation [33-36]. Thermography can therefore be used for detecting peripheral arterial disease, diabetic angiopathy, Raynaud's phenomenon and related conditions, inflammatory conditions, and for the determination of amputation level. For deeper-seated pathological conditions radio-isotope imaging, ultrasound or radiography are more suitable. At present IR thermography is most widely used in applications associated with the vascular system [37,38], peripheral and cutaneous circulations, as well as relatively superficial tissue lesions. In some cases thermography provides a beneficial preliminary or complementary aid to examination; in others it fills in gaps in the existing armoury of assessments. However, its use in clinical assessment is still considered by some to be controversial, partly due to the wide range of temperatures
of lesions or diseases (this is particularly true in breasts) and also due to the lack of understanding of the basic principles of thermography (i.e. its characteristics and limitations). The work reported in Secs. 6 and 7 addresses some of these recognised problems.
5.2. Static and Temporal Sequence of Thermograms

Various methods have been used for the presentation, analysis and classification of thermograms. These include functional images, spatial signatures and texture signatures. Historically these methods have been applied to individual static thermograms for diagnosis. Many conditions, however, are not evident from such single static images. Historically, also, the assessments from a static thermogram would normally be based on an individual patient's data only, much like the common use of X-ray pictures. Diagnostic results would be in the form of an index or some form of written report, with or without graphical explanation. These approaches do not fully utilise the information available from thermography. It has been found that useful information can be obtained by observing the thermal behaviour of the body over time. In order to do this the technique of temperature stress testing has been developed, whereby a temporal thermal response is induced in the body under controlled conditions. A sequence of thermograms is taken over time to record the body's thermal behaviour. Two diagnostic systems will be described, both of which use analysis of the temporal thermal response and wider statistical data for automatic classification and diagnosis. The first, in Sec. 6, is concerned with diagnosing joint disease, namely arthritis and rheumatism as well as Raynaud's syndrome. It is based on analysing the body's thermal behaviour after a "cold stress". The second system, described in Sec. 7, is concerned with the diagnosis of breast cancer after a "heat stress".

5.3. Advantages and Disadvantages of Thermography in Medical Applications

There are numerous advantages to applying infra-red thermography in clinical and medical investigations. They include:

• The recording of a patient's thermogram is inexpensive to perform, although initial equipment costs are high.
• The technique is simple, and can be repeated at frequent intervals, allowing real time assessment (especially compared with radiological images), and hence results in good patient co-operation.
• It is non-invasive and involves no radiation hazard.
• Thermography may indicate lesions too small to be seen on a roentgenogram.
• Each patient has a unique thermogram, useful in a follow-up analysis.
• Thermographic images are opaque and do not contain the overlapping objects present in radiographic images.
• Thermographic images are inherently quantitative in nature due to the direct representation of the physical temperature.
• Only a rudimentary knowledge of anatomy and pathology is necessary to interpret a thermogram. Diagnostic criteria involve essentially a measure of the symmetry and relative temperature distribution and pattern.
• Because the temperature signal is available directly in an electrical form, it is simple to connect the thermographic equipment directly to a computer via an analogue-to-digital converter for automatic image acquisition and storage.

There are, of course, also disadvantages, which include:

• Lack of specificity.
• Provides a limited range of image brightness.
• Low spatial resolution.
• Not all abnormalities exhibit observable thermal phenomena.
• Thermal variations do not necessarily have a spatial relationship to the disease investigated.
• Thermographic signs can occur in benign as well as malignant conditions (e.g. in breast thermography).
• Occasionally, an anatomical aberration gives a false positive reading.
• Simultaneous, bi-lateral symptoms could be diagnosed as negative.
6. Joint Disease Diagnosis Using Dynamic Infra-Red Thermography
6.1. System Overview

The main objective of this work was to automate the analysis and classification of a temporal series of thermograms (recording the response of the hand to a cold stress) into different classes, namely normal, Raynaud's and inflammatory. Various other quantitative imaging systems have been used for assessment of such conditions, for example the differential thermistor thermometer, the infra-red thermometer, radiography and arteriography. Their limitations lie in the technical difficulties, expense and/or their invasive nature. In this system statistical pattern recognition techniques are used to analyse the results of a temperature stress test. The temperature stress is induced by immersing the hand in a cold water bath at 20°C for 1 minute (cold stress). The body's response to the stress is recorded by thermograms taken at regular two minute intervals over about twenty minutes. In order to study the thermal response of the hand it is, in general, necessary to track the thermal behaviour over time of every point on the hand. It is therefore necessary to have knowledge of the correspondence between points in each of the series of images. The simplest way to do this would be to ensure that there was no movement of the hand over the twenty minute period; thus the thermal response of each object pixel could be tracked in a straightforward manner. Unfortunately, it is not possible to restrain the hand in a way that does not affect its thermal response. The patient must simply hold the hand in front of
the camera and the correspondence must be built up by aligning the images of the hand with an aligning and stretching algorithm [39]. The thermograms of hands used in this work were taken at a distance of between 0.75-1.5 metres, depending on the size of the hand in question. The thermograms were digitised into a 128 x 128 square image with 256 grey-levels. In this work only the anterior view of the hand (back of the hand) was used, and no attempt was made to analyse thermograms taken from posterior or lateral views. The feature extraction method developed for this application is based on the Kittler & Young transformation. The function of this transformation is to extract those features that appear most important for classification. A 7-Nearest Neighbour (7-NN) classifier, built using the Condensed Nearest Neighbour (CNN) technique, is applied to the features. It was recognised that it would be desirable for the resultant classification (diagnosis) to be presented as a colour coded diagnostic image showing both the classified disease category and the location of the "hot spots" (inflammatory condition) or "cold spots" (Raynaud's). The severity of the disease should also be indicated through the intensity of the colour in the diagnostic image, for example using green for normal cases, red for inflammatory cases and blue for Raynaud's. In the case of uncertainty, the resultant image would have a non-primary colour. The system achieved about 96% accuracy at pixel level.
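The classification stage described above can be approximated with a simple per-pixel k-nearest-neighbour rule on the temporal temperature profile of each pixel. The Python sketch below is purely illustrative: the feature step stands in for the Kittler & Young transformation with a plain linear projection, and the condensed training set, class colours and array shapes are our own assumptions.

import numpy as np
from collections import Counter

CLASS_COLOUR = {0: (0, 255, 0), 1: (255, 0, 0), 2: (0, 0, 255)}  # normal, inflammatory, Raynaud's

def classify_pixels(profiles, train_feats, train_labels, W, k=7):
    """profiles: (n_pixels, n_frames) re-warming curves; W: feature projection
    (stand-in for the Kittler & Young transform); returns labels and per-pixel colours."""
    feats = profiles @ W                                   # project each temporal profile
    labels = np.empty(len(feats), dtype=int)
    for n, f in enumerate(feats):
        d = np.linalg.norm(train_feats - f, axis=1)        # distances to the (condensed) training set
        votes = train_labels[np.argsort(d)[:k]]            # the k nearest neighbours vote
        labels[n] = Counter(votes.tolist()).most_common(1)[0][0]
    colours = np.array([CLASS_COLOUR[l] for l in labels], dtype=np.uint8)
    return labels, colours

# Hypothetical sizes: 10 thermograms per sequence, 3 projected features.
rng = np.random.default_rng(2)
W = rng.normal(size=(10, 3))
train_feats, train_labels = rng.normal(size=(60, 3)), rng.integers(0, 3, 60)
labels, colours = classify_pixels(rng.normal(size=(500, 10)), train_feats, train_labels, W)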
6.2. Thermographic Colour Coding Schemes

6.2.1. Gradual natural scale

The use of grey-scale (Fig. 7) or pseudo-colour in medical images has been the subject of debate. Some believe that colour-coding is artificial and misleading and can create confusion which leads to misinterpretation, while others find that the use of grey-scale in some images makes it difficult to differentiate pathological areas from normal areas. This controversy could partly be due to the use of inappropriate pseudo-colouring systems which are insensitive to the particular information required, and to the special expertise needed to interpret the colour codes.
Fig. 7. A Grey-Scale Hand Thermogram.
Fig. 8. Gradual Natural Scale.
Two different colour coding schemes are considered here [40]. The first, the Gradual Natural coding scheme, is shown in Fig. 8. It is a smooth, gradual scale ranging from black (cold) to white (hot), with blue, green and red gradually intermixing with one another over the intermediate range. Such a scheme conveys the overall temperature range of an image in an easily identified and recognised colour spectrum, giving a general idea of where the temperatures lie. The semi-circular disc shown on the left of the image is the temperature standard. The encoding is based on three primary colours - red, green and blue - and has 8 bits to represent all the colours, i.e. 256 levels. These levels are split into 3 ranges, one for each colour, and within each range the intensity can be varied uniformly and gradually. At the boundary between any two primary colours (e.g. green and red), a gradual mixing of the two colours (e.g. a decrease in green intensity accompanied by an increase in red intensity) results in the perception of a non-primary colour (e.g. yellow). This non-primary colour is necessary to create smooth changes over the boundary of the two different colours, thus providing an overall gradual and smooth colour spectrum.

6.2.2. Randomised blocking scale
The second coding scheme, Randomised Blocking, is illustrated in Fig. 9; as its name suggests it uses randomised colour blocks. As before, black and white are at either end, but the intermediate colours are small repetitive blocks with different colours and intensities. Adjacent temperatures are represented by significantly different colours, so that slight temperature differences will be accentuated; such differences would otherwise go undetected. The coding scheme is constructed as follows. The 8-bit control byte is split into three fields, one for each primary colour: red, green and blue. Thus, in the example of Table 1, bits 0, 3 and 6 are associated with red (R), bits 1, 4 and 7 with green (G), and bits 2 and 5 with blue (B). This 3-3-2 combination means that there are 7 possible intensities for both red and green but only 3 for blue. An example is given in Table 1: here level 7 of green is combined with level 7 of red and no blue, resulting in a yellow colour. This coding scheme is also useful for coding disease classification categories because in such applications there is not necessarily a uniform continuum of information to be encoded (i.e. just blue of some degree, just red of some degree or just green) but rather, as in this case, any possible combination of red and green or blue and green.

Fig. 9. Randomised Blocking Scale.

Table 1. Coding Scheme.

Bit      | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
Colour   | G | R | B | G | R | B | G | R
Example  | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1

6.3. Observed Hand Thermal Characteristics

During the course of data collection, some thermal characteristics or behaviour patterns were noted. It is these patterns that the system must be able to extract, analyse and quantify.

6.3.1. Normal condition

It is known that in a normal hand all the fingers remain at similar temperatures (27°C ± 2°C) and display a positive gradient, i.e. temperature increases towards the finger tips. Most hands showed hypothermia at the start of the 15-20 minute stabilisation period, during which the hand temperature increased; this decreased the temperature differences between a normal and an inflamed hand.
Fig. 10. Example of a Normal Hand.
After the cold stress the hand rewarmed quickly, and throughout the test the hand maintained its temperature distribution with hotter fingers. An example of a normal hand is shown in Fig. 10. In some normal cases a diagonal gradient pattern could be noted, while in other cases isolated cold fingers were found which did not necessarily relate to the symptoms.

6.3.2. Raynaud's conditions

In Raynaud's phenomenon (primary and secondary) the mean temperature was always lower than 27°C (approximately 22°C-23°C), and during the cold stress the temperature range across the hand was typically up to 10°C, whereas in normal hands it was no more than 6°C. The symptoms were characterised by a well banded isothermal pattern with negative gradient (i.e. colder towards the finger tips). In some patients with severe cases of secondary Raynaud's the gradient could be as much as 12°C. Similar, but less marked, banded patterns over the hand were found in patients with primary Raynaud's. The hands of patients with Raynaud's condition tended to cool down during the stabilisation period prior to the cold stress. In patients with primary Raynaud's, the hands rewarmed after the stress and in the end the gradient differences reduced. In patients with severe secondary Raynaud's the hands would cool down further in response to the cold stress. An example is given in Fig. 11.

6.3.3. Inflammatory conditions
Higher temperatures are often recorded (29°C-34°C) for arthritic hands due to the inflammatory mechanism, the classic symptoms being swelling, slight deformity and the presence of a "hot spot". The temperature rise on the overlying skin at the affected joint can be up to 5°C. The precise nature and extent of such areas of raised temperature are determined by the underlying pathology. For example, synovitis may cause a localised area of increased temperature, while chronic rheumatoid arthritis
Fig. 11. An Example of a Raynaud’s Hand.
Fig. 12. An Example of an Inflammatory Hand.
may result in a generalised hypothermia over the whole joint. Gout also causes a dramatic and characteristic increase in temperature over affected joints. Inflamed areas remained at higher temperatures after the cold stress. During the "warm up" period the affected hand warmed up as the normal hand did, but less markedly. In fact, due to vasodilation, the temperature difference between the two classes was less prominent in the early stage of the cold stress response [38]. An example is given in Fig. 12.

6.4. Application of the Kittler & Young Method to Sequences of Hand Thermograms

6.4.1. The Kittler & Young method
Ideal classification criteria should be based on the discriminatory potential of both the class means and class variances, scaling them according to their relative significance. Such criteria are, however, complex and difficult to formulate. In many practical situations the contribution of the differential ability of class variances is neglected. This simplicity is assumed by all the standard variants of the
Fig. 13. Series of Thermograms.
Karhunen-Loeve transformation method [41]. The Kittler & Young method [42] is a feature extraction method based on two Karhunen-Loeve transformations, and is intended to overcome this problem.

6.4.2. Application of the Kittler & Young method
The Kittler & Young method was applied to a series of thermograms (e.g. those in Fig. 13) to compress the differential thermal information of the thermograms, based on both the thermal means and variances (of the different classes), into a series of transformed "Eigen" images. In order to accentuate the features of the first three resultant transformed Eigen images for visual inspection and for further use as diagnostic images, the following colour coding scheme, based on the RB scale, was applied:

• The first transformed image (Fig. 14) was coded into 7 different levels of green, where the maximum value corresponded to the darkest green and vice versa. The levels of green are given the following values: 0, 2, 16, 18, 128, 130, 144, 146 respectively. These values coincided with the colour "bits" assignment in Table 1.
• The second transformed image (Fig. 15) was coded into 7 different levels of red, as described above but given the following values: 0, 1, 8, 9, 64, 65, 72, 73.
• The third transformed image (Fig. 16) was coded into 3 different levels of blue, with values 0, 4, 32, 36.

Fig. 14. First Transformed Image Coded in Green.
Fig. 15. Second Transformed Image Coded in Red.
Fig. 16. Third Transformed Image Coded in Blue.
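As a concrete illustration of this colour assignment, the sketch below quantises three transformed images onto the green, red and blue bit-fields of Table 1, using the level values quoted above, and sums them into the composite byte image discussed next. This is an illustrative reconstruction, not the authors' code; it assumes the transformed images are supplied as floating-point arrays.

```python
import numpy as np

# Level values quoted in the text for the green, red and blue codings (Table 1 bit-fields).
GREEN_LEVELS = np.array([0, 2, 16, 18, 128, 130, 144, 146], dtype=np.uint8)
RED_LEVELS   = np.array([0, 1, 8, 9, 64, 65, 72, 73], dtype=np.uint8)
BLUE_LEVELS  = np.array([0, 4, 32, 36], dtype=np.uint8)

def quantise_to_levels(image, levels):
    """Scale a transformed (Eigen) image to the available number of levels and
    pick the corresponding coded byte value for every pixel."""
    lo, hi = image.min(), image.max()
    norm = (image - lo) / (hi - lo + 1e-12)              # map to 0..1
    idx = np.round(norm * (len(levels) - 1)).astype(int)
    return levels[idx]

def composite_diagnostic_image(eig1, eig2, eig3):
    """Sum the three colour-coded transformed images into one coded byte image."""
    g = quantise_to_levels(eig1, GREEN_LEVELS)   # first eigenimage -> green field
    r = quantise_to_levels(eig2, RED_LEVELS)     # second eigenimage -> red field
    b = quantise_to_levels(eig3, BLUE_LEVELS)    # third eigenimage -> blue field
    # The three bit-fields of Table 1 do not overlap, so the byte values can simply be summed.
    return (g.astype(np.uint16) + r + b).astype(np.uint8)
```

The composite byte image can then be displayed through whatever palette maps each byte value onto its red/green/blue intensities.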
Individually the three colour coded transformed images with different intensities of the corresponding colour accentuate the variations in thermal behaviour of different parts of the hand, with the highest intensity corresponding to the maximum variation. Among them the green image conveys the most discriminatory information about the variation in the corresponding parts of the hand over the time series, as well as within the hand anatomy. The coding matched quite well with the diagnosis of the physician and could be used as a general guidance diagnostic image. In order to make the most of the
Fig. 17. Composite Transformed Image.
three transformed images, however, a composite transformed diagnostic image was developed (Fig. 17). This is created by summing the three colour-coded Eigen-images together. From this composite image, the difference in thermal responses in the hand could be seen more clearly than in any of the three colour-coded transformed images individually. The "hot spots" (in inflammatory conditions) and "cold spots" (in Raynaud's conditions) could be identified more easily than using the three individual transformed images. This composite transformed Eigen-image was found to match most closely the physician's diagnosis.

6.5. Classification
6.5.1. Classifier training and testing data

The available data came from three diagnostic classes: inflammatory, normal and Raynaud's. Some of the affected areas were localised, though mixed classes commonly occur in the same patient (for example, only parts of the hand might be affected by a disease while the rest of the hand is normal). The training data therefore consisted of appropriate classes only and not of all the pixels in a classified hand. This meant, for example, that only the inflamed areas of a hand were used in the inflammatory training set, and similarly for the normal and Raynaud's classes. The training data thus contained only representative data of its class and was not mixed with other classes. The selection of different class representative vectors was carried out by visual inspection of the diagnostic composite Eigen-images. Square or rectangular areas were located manually on the displayed composite Eigen-image and the corresponding regions in the original series of thermograms were then extracted. These selected areas formed the training data. The optimal co-ordinate system was then obtained by applying the Kittler & Young analysis to this data. In order to reduce the storage and computational requirements of the classification a Condensed Nearest Neighbour (CNN) classifier [43] was used.
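A minimal sketch of the training-data selection just described might look as follows. The rectangle coordinates, array shapes and the helper name are assumptions made for illustration; the chapter does not give an implementation.

```python
import numpy as np

def extract_training_vectors(thermogram_stack, regions):
    """Collect per-pixel temperature time-series from manually chosen rectangles.

    thermogram_stack : array of shape (T, H, W), the aligned series of thermograms.
    regions          : list of (label, row0, row1, col0, col1) rectangles marked on the
                       composite Eigen-image by visual inspection.
    Returns (X, y): feature vectors of shape (N, T) and their class labels.
    """
    vectors, labels = [], []
    for label, r0, r1, c0, c1 in regions:
        patch = thermogram_stack[:, r0:r1, c0:c1]          # (T, h, w)
        # Each pixel inside the rectangle becomes one training vector of length T.
        vectors.append(patch.reshape(patch.shape[0], -1).T)
        labels.extend([label] * (r1 - r0) * (c1 - c0))
    return np.vstack(vectors), np.array(labels)

# Example rectangles for inflammatory (hot spot), normal and Raynaud's areas.
# The coordinates are purely illustrative.
regions = [("inflammatory", 40, 50, 60, 72),
           ("normal",       80, 92, 30, 42),
           ("raynauds",     10, 22, 90, 100)]
```

The resulting vectors would then be passed through the Kittler & Young analysis before being condensed for the nearest-neighbour classifier described below.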
6.5.2. Condensed nearest neighbour (CNN) classifier
In an ideal minimal subset, the 1-NN classification of any new pattern based on such a subset would be the same as the 1-NN classification with the complete set. Although it is simple to formulate such procedures, the computational complexity involved when the pattern space has a moderately large number of dimensions makes this impractical. As an alternative, Hart [43] described the CNN technique, which provides a consistent subset of the original set for the NN rule, one that correctly classifies all the remaining points in the sample set. However, the CNN rule will not in general find a minimal consistent subset, i.e. a consistent subset with the smallest possible number of elements. The CNN algorithm is defined as follows. Assume that the original set is arranged in some order, and that two bins named STORE and GARBAGE are set up.

• Place the first sample in STORE.
• The second sample is classified by the NN rule, using the current contents of STORE as the reference set. If the second sample is classified correctly it is placed in GARBAGE, otherwise in STORE.
• By induction, the ith sample is classified by the current contents of STORE. If classified correctly it is placed in GARBAGE, otherwise in STORE.
• After one pass through the original sample set the procedure continues to loop through GARBAGE until termination, which occurs when one of the following conditions is encountered:
  - GARBAGE is exhausted, with all its members transferred to STORE, or
  - one complete pass is made through GARBAGE with no transfer to STORE, as the underlying decision surface has become stationary.
• The final contents of STORE are used as the reference points for the NN rule; the contents of GARBAGE are discarded.
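A compact implementation of Hart's condensation procedure, following the STORE/GARBAGE description above, might look like the sketch below. The Euclidean distance and the array layout are assumptions; the chapter does not specify an implementation.

```python
import numpy as np

def condense(X, y):
    """Hart's Condensed Nearest Neighbour (CNN) rule.

    X : (n, d) array of training vectors; y : (n,) array of class labels.
    Returns the indices of the consistent subset kept in STORE.
    """
    store = [0]                          # place the first sample in STORE
    garbage = list(range(1, len(X)))     # remaining samples, in their original order

    def nn_label(i):
        # 1-NN classification of sample i using the current contents of STORE.
        d = np.linalg.norm(X[store] - X[i], axis=1)
        return y[store[int(np.argmin(d))]]

    changed = True
    while changed and garbage:           # keep passing through GARBAGE until stable
        changed = False
        remaining = []
        for i in garbage:
            if nn_label(i) == y[i]:
                remaining.append(i)      # correctly classified: stays in GARBAGE
            else:
                store.append(i)          # misclassified: transferred to STORE
                changed = True
        garbage = remaining
    return np.array(store)
```

The retained subset can then serve as the reference set for the 7-NN majority-vote classifier described in the next subsection.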
When the Bayes risk is small, i.e. the underlying densities of the various classes have small overlap, the algorithm will tend to select points near the boundary between the classes. Points deep inside a class therefore need not be transferred to STORE, as they will be classified correctly. Consequently, when the Bayes risk is high, STORE will contain most of the points in the original sample set and no substantial reduction in sample size is possible. CNN tends to pick the initial STORE set randomly in the data space, and becomes more selective subsequently as the decision boundary becomes better defined. This results in vectors chosen later in the process tending to lie close to the decision boundaries.

6.5.3. Classification results
Three CNN classifiers were built based respectively on the Eigen images corresponding to the first 2, 3 and 4 Eigen vectors. These were tested using the 7-NN
classification technique with a Euclidean distance measure and a majority vote system. The testing samples were transformed using the Eigen-vector matrix derived from the training data set. It was found that the best performance was achieved when a three-dimensional CNN classifier was used, giving an error rate of only 4.3% with respect to pixels as classified by a physician. It was found that the majority of errors came from the mis-classification of the Raynaud's class as normal. This is possibly due to the similarity in behaviour of a "cold hand complaint" and that of a mild Raynaud's condition. It can be concluded from the above that only 3 dimensions are needed because the Kittler & Young method has succeeded in compressing most of the discriminatory information into the first three components of the transformed feature vectors, making the rest of the components in the transformed space redundant. The performance of the system compares favourably with a clinician's diagnosis. Moreover, the majority of the errors came from the 4-3 tie condition of the 7-NN classifier, where the correct category was in the 3-vote minority position. Hence, although these pixels were mis-classified, their correct classification could still be identified when presented with the colour representation technique developed in this study and described below.

6.6. Presentation of Classification Results for Diagnosis

The resultant classified thermal features were used to produce diagnostic images, in which colour-coded "classification pixels" were used to replace the corresponding pixels in the thermogram. The degree of confidence in the classification was denoted by the intensity of the assigned disease class colour. For example, in the 4-3 tie condition of a normal-Raynaud's classification, the fourth degree intensity of green (normal) and the third degree intensity of blue (Raynaud's) were summed, giving a "greenish-blue" colour as the final "diagnosis colour" of the pixel in question. In fact this "greenish-blue" colour was found to be associated closely with those patients with complaints of cold hands, while "yellowish" cases were found to be patients with mild inflammatory conditions, i.e. when there was a tie between normal (green) and inflammatory (red). The resultant diagnostic images indicated the locations of the affected areas on the hands as well as the degree of "truthfulness" (severity and certainty) of the classification by means of different colours and varying degrees of intensity (Figs. 18-20). The diagnostic results compared extremely well with the physician's diagnosis. The effect of the error rate was that the exact dimensions of the affected area might not be precisely defined, but the locations were identified. The classification errors were most likely to be at the boundary between two different classes on the hand.

Fig. 18. A Diagnostic Image of Normal Class.
Fig. 19. Classified Raynaud's Hand.
Fig. 20. A Diagnostic Image of Inflammatory Class.
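The colour presentation described above (class colour with intensity governed by the vote count, so that a 4-3 tie yields a mixed, non-primary colour) can be sketched as follows. The class colour assignments follow the text; the mapping from votes to intensity is one plausible interpretation, not the authors' exact code.

```python
import numpy as np

# Primary colour per diagnostic class, as in the text.
CLASS_COLOUR = {"normal":       np.array([0.0, 1.0, 0.0]),   # green
                "inflammatory": np.array([1.0, 0.0, 0.0]),   # red
                "raynauds":     np.array([0.0, 0.0, 1.0])}   # blue

def pixel_colour(votes, k=7):
    """votes: dict mapping class -> number of votes among the k nearest neighbours.

    Each class contributes its colour weighted by its share of the votes, so a
    4-3 normal/Raynaud's tie gives a greenish-blue, while a clear 7-0 decision
    gives a saturated primary colour."""
    rgb = np.zeros(3)
    for cls, v in votes.items():
        rgb += (v / k) * CLASS_COLOUR[cls]
    return np.clip(rgb, 0.0, 1.0)

# Example: the 4-3 normal/Raynaud's tie discussed in the text.
print(pixel_colour({"normal": 4, "raynauds": 3}))   # approx. [0, 0.57, 0.43]: greenish blue
```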
7. Breast Cancer Detection

Every year 1500 women in the UK alone die of breast cancer. Studies have shown that early detection is crucial to survival in breast cancer, and it is believed that the use of appropriate image processing can make screening and diagnosis easier. Currently X-ray mammography is the most commonly used imaging technique [45]. Due to their invasive nature, X-ray mammograms cannot be acquired regularly. An alternative, complementary technique using non-invasive infra-red thermography together with low-level microwave radiation is described here.
7.1. System Overview

This work is concerned with the analysis of normal and abnormal infra-red mammograms. The analysis is based on a series of infra-red mammograms of the breast subjected to a warming temperature stress. The temperature stress is induced using a very low level of microwave radiation. This produces a higher heating rate in tumourous tissue than in normal healthy tissue. The technique uses a sequence of mammograms taken during the cooling process. The image analysis and processing is basically the same as that described for the diagnosis of joint diseases in Sec. 6 above.
7.2. Temperature Stress Technique and Data Collection

It has been found that the static infra-red mammogram provides insufficient information for diagnosis due to the limited transmission of IR radiation through fatty tissues. Therefore in this study a temperature stress was applied to induce a thermal response of the breasts to aid diagnosis. Here the body is exposed to
Fig. 21. Infra-red Mammograms.
microwave heating (frequency 0.45 GHz, power density 80-100 mW/cm²) for two minutes. The irradiation penetration is typically 1.7 cm in muscle and 10 cm in fatty tissue, so the heating process can penetrate the breast fairly deeply. In order to reduce excessive heating of the subcutaneous fat layer, active convective cooling of the skin is necessary. The use of this cooling means that the ambient temperature control is more relaxed in this application than in the previous study. After the stress there is a temperature transient in the breasts; it is the nature of this transient upon which the analysis is based. Thermograms are taken at 30-second intervals for about 8 minutes in order to record this transient, see Fig. 21. Tumours appear as hot spots due to differences in the dielectric constants, vascularization, density and specific heat. The result of this process is that small and/or deep tumours can be detected. Furthermore, these infra-red mammograms are easier to interpret than X-ray mammograms; radiographers' interpretation of X-ray mammograms is known to be highly variable. Another advantage of this system is that it takes less time to record the series of thermograms (approximately 8 minutes) than the 30 minutes typically required to set up a single conventional thermogram (allowing time for
acclimatisation etc.).

7.3. Observed Thermal Characteristics of the Breast

During the course of data collection some thermal characteristics or behaviour patterns were observed. It is these patterns that the system must be capable of extracting, analysing and quantifying [44].

• Normal tissues - the temperature does not rise as high as in the abnormal tissues, and drops linearly with time.
• Veins - the temperature rises slightly higher than in the abnormal tissues, and drops more rapidly with time.
• Nipple - the temperature does not rise as high as in the abnormal tissues, and drops quickly with time.
• Tumours - the temperature is generally 0.3°C-1.5°C higher than in normal tissues, and remains high, though it sometimes drops a little.
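These qualitative behaviours suggest simple per-pixel features, for example the peak temperature reached after heating and the slope of the subsequent decay. The toy rule below is only an illustration of how such behaviours might be quantified; the thresholds are invented for the example and are not taken from the study.

```python
import numpy as np

def transient_features(series, dt=0.5):
    """series: temperatures of one pixel sampled every 30 s (dt = 0.5 min)."""
    peak = series.max()
    t = np.arange(len(series)) * dt
    slope = np.polyfit(t, series, 1)[0]          # linear cooling rate, degC per minute
    return peak, slope

def toy_label(peak, slope, baseline):
    """Crude illustration of the behaviours listed above (thresholds are invented)."""
    if peak > baseline + 0.3 and slope > -0.05:
        return "suspicious (stays hot)"          # tumour-like: rises higher, cools slowly
    if slope < -0.3:
        return "vein/nipple-like (cools fast)"
    return "normal-like (cools roughly linearly)"
```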
7.4. Experimental Results
Initial results from this work showed that in some severe cases the "hot spots" (tumours) can be identified using colour-coded composite Eigen images from the Kittler & Young transformation. The vascular pattern can also be identified to a certain extent; this too can help diagnosis because it indicates whether micro-calcification exists. However, this approach can only be used as a supplementary diagnostic aid for breast cancer detection and the follow-up treatment. At present X-ray mammography must still be used as the main diagnostic aid. It must be stressed that these are only preliminary results. Before such a technique can be applied in practice a far larger evaluation would be required, cf. the 30 patients used in this study.
8. Other Infra-Red Applications

Infra-red technology is used in many application domains other than defence and medicine, for example infra-red security systems that detect intruders through their body heat and movement, fire fighting, rescue, and driving aids [46].
8.1. Fire Fighting

In fire fighting, airborne imaging systems can often provide high-resolution data in a more timely fashion than space-based systems. Aircraft with on-board infra-red and multi-spectral instruments are used, for example, in fighting forest fires to acquire images and information on the fire's perimeter, its hot spots and its direction of travel. Information from this type of imagery can help fire fighters to suppress the fire significantly more efficiently than through the use of space-based imaging systems. This type of infra-red and multi-spectral imaging system can also be applied in law enforcement (e.g. marijuana detection) and forestry (e.g. identification of diseased trees within a forest). Often the collected images or video sequences acquired at the site are transmitted to a home base for use. Image processing, computer vision and pattern processing techniques are then used to analyse the images to develop strategies and action plans.

8.2. Monitoring Radioactive Waste
One way of dealing with radioactive waste is to pour a mixture of molten glass and waste into canisters for storage. An infra-red detection system has been used to monitor the mixture level inside the canisters during filling and thus prevent spills. The system feeds live thermal video to a remote control room where operators can monitor the hot-glass level during the filling operation. The main advantage of using infra-red technology in this application is that it does not require radiation reference sources. This allows non-radioactive start-up testing of the process and facilitates safe worker entry into the vitrification cell before actual processing of the radioactive waste.

9. Future Infra-Red Sensor Technology

Most current thermal imaging systems require complex cooling techniques, with concomitant penalties in system size and weight and a significant logistic burden. Therefore, much research focuses on un-cooled, compact, low-power and cheaper solutions. Ferroelectric detectors are perhaps the most promising devices for un-cooled or ambient temperature IR imaging. Good performance can be achieved with large arrays where there is a single ferroelectric detector element for each image pixel. The performance can be further enhanced by reductions in detector noise bandwidth through advanced element and integrated circuit readout design. For the present, however, cooled technology is still required for applications which need the highest performance.
A major problem associated with current infra-red sensors is the non-uniformity inherent in the detector array. This requires correction prior to any subsequent image processing. At present correction is achieved using off-focal-plane processing. However, the continuing advances in silicon integrated circuit technology will now allow more functionality, including non-uniformity correction, to be included within each pixel of the focal plane sensor array itself. Transferring this function onto the focal plane will result in more cost-effective solutions, giving improved performance and reliability together with reduced size and weight. In addition, the background pedestal current can be removed, on a pixel-by-pixel basis, resulting in system benefits such as improved range and image quality. Moreover, there are opportunities to implement novel and advanced image processing and pattern recognition techniques on the focal plane array. This will result in new capabilities such as motion detection and clutter rejection which will significantly enhance the performance of IR systems.
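Non-uniformity correction of the kind referred to here is commonly implemented as a per-pixel gain/offset ("two-point") correction. The sketch below shows that standard form, with the calibration frames as assumed inputs; it is given only to make the idea concrete and is not a description of any particular focal-plane design discussed in the chapter.

```python
import numpy as np

def two_point_nuc(raw, cold_frame, hot_frame, t_cold, t_hot):
    """Standard two-point non-uniformity correction.

    raw        : uncorrected detector frame.
    cold_frame : mean response of each pixel viewing a uniform source at temperature t_cold.
    hot_frame  : mean response of each pixel viewing a uniform source at temperature t_hot.
    Each pixel gets its own gain and offset so that all pixels agree on the two
    calibration points; the background pedestal is removed pixel by pixel.
    """
    gain = (t_hot - t_cold) / (hot_frame - cold_frame + 1e-12)
    offset = t_cold - gain * cold_frame
    return gain * raw + offset
```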
Acknowledgements

British Crown Copyright 1996/DERA. Published with the permission of Her Britannic Majesty's Stationery Office.
References
[1] W. L. Wolfe and G. J. Zissis (eds.), The Infrared Handbook, revised edition, 3rd printing, The Infrared Information Analysis (IRIA) Center, Environmental Research Institute of Michigan, USA, 1989.
[2] A. W. Vere, L. L. Taylor, M. Saker and B. Cockayne, Nonlinear optical material for the 3-5 μm waveband - available performance and technical problems, Technical report, unpublished, October 1994, Defence Research Agency, Farnborough, Hants GU14 6TD, UK.
[3] L. F. Pau and M. Y. El Nahas, An Introduction to Infra-red Image Acquisition and Classification (Research Studies Press Ltd., Letchworth, Hertfordshire, England, 1983).
[4] R. A. Ballingall, I. D. Blenkinsop, I. M. Baker and Parsons, Practical design considerations in achieving high performance from infrared hybrid focal plane arrays, Proc. SPIE Vol. 819, Infrared Technology XIII (1987) 239-249.
[5] A. J. Myatt, D. A. Spragg, R. A. Ballingall and I. D. Blenkinsop, Flexible electronic control and correction system for use with IR focal plane arrays, Proc. SPIE Vol. 891, Infrared Technology XIII (1987) 239-249.
[6] P. W. Foulkes, Towards Infrared Image Understanding, Ph.D. Thesis, Engineering Department, Oxford University, 1991.
[7] J. Pearl, Fusion, propagation, and structuring in belief networks, Artificial Intelligence 29 (1986) 241-288.
[8] P. G. Ducksbury, Parallel texture region segmentation using a Pearl Bayes network, British Machine Vision Conference, University of Surrey (1993) 187-196.
[9] P. A. Devijver, Real-time modeling of image sequences based on hidden Markov mesh random field models, in Decision Making in Context: A Course on Statistical Pattern Recognition, P. A. Devijver and J. Kittler (Surrey University, 1989).
[10] P. G. Ducksbury, Evidential Reasoning - A Review and Demonstration, UK DTI IED project 1936: Vision by Associative Reasoning, Report no. VAR-TR-RSRE-92-4, July 1992.
[11] J. M. Brady, P. G. Ducksbury and M. J. Varga, Image Content Descriptors - Concepts, unpublished technical report, DRA Malvern, UK, March 1996.
[12] P. G. Ducksbury, Driveable region detection using a Pearl Bayes network, IEE Colloquium on Image Processing for Transport Applications (London, Dec. 1993).
[13] P. G. Ducksbury, Image Content Descriptors: Feature Detection Stage, unpublished technical report, DRA Malvern, UK, March 1996.
[14] R. W. M. Smith, Conceptual Hierarchical Image Processor (CHIP): System Design, Issue 1.0, unpublished technical report, DRA Malvern, UK, October 1992.
[15] D. M. Booth and C. J. Radford, The Detection of Features of Interest in Surveillance Imagery: Techniques Evaluation, unpublished technical report, DRA Malvern, UK, Nov. 1992.
[16] B. I. Justusson, Median filtering: statistical properties, in T. S. Huang (ed.), Two-Dimensional Digital Signal Processing II, Top. Appl. Phys. (Springer-Verlag, Berlin, 1981) 161-196.
[17] A. Rosenfeld and A. C. Kak, Digital Picture Processing, Vols. 1 and 2 (Academic Press, New York, 1982).
[18] P. K. Sahoo, S. Soltani, A. K. C. Wong and Y. C. Chen, A survey of thresholding techniques, Computer Vision, Graphics and Image Processing 41 (1988) 233-260.
[19] P. Maragos and R. W. Schafer, Morphological systems for multidimensional signal processing, Proc. IEEE 78, 4 (April 1990).
[20] Z. Hussain, Digital Image Processing: Practical Applications of Parallel Processing Techniques (Ellis Horwood, 1991).
[21] R. M. Haralick, Mathematical morphology and computer vision, 22nd Asilomar Conf. Signals, Systems and Computers, Pacific Grove, CA, USA, 31 Oct.-2 Nov. 1988.
[22] J. Serra and P. Soille (eds.), Computational Imaging and Vision, Mathematical Morphology and its Applications to Image Processing, Vol. 2 (Kluwer Academic, 1994).
[23] F. K. Sun and S. L. Rubin, Algorithm development for autonomous image analysis based on mathematical morphology, Proc. SPIE 845: Visual Communications and Image Processing II (1987).
[24] R. W. M. Smith and C. J. Radford, Development of a connected component labeller DSP module for CHIP, unpublished technical report, DRA Malvern, UK, Oct. 1993.
[25] S. S. Blackman, Multiple Target Tracking with Radar Applications (Artech House, 1986).
[26] G. Brown, R. W. M. Smith and C. J. Radford, Target Acquisition and Tracking of a Staring Array Sequence on CHIP, unpublished technical report, DRA Malvern, UK, Sep. 1993.
[27] R. E. Kalman, New results in linear prediction and filtering, J. Basic Engin. 83-D (1961) 95-108.
[28] D. J. Salmond, The Kalman Filter, the α-β filter and smoothing filters, Royal Aircraft Establishment, report TM-AW-48, February 1981.
[29] A. W. Bridgewater, Analysis of 2nd and 3rd order steady state tracking filters, AGARD Conf. Proc. 252 (1978).
[30] Yu. Gulyaev, V. Marov, L. G. Koreneva and P. V. Zakharov, Dynamic infrared thermography in humans, IEEE Engineering in Medicine and Biology (November/December 1995) 766-771.
[31] J. E. Goin and J. D. Haberman, Automated breast cancer detection by thermography: performance goal and diagnostic feature identification, Pattern Recogn. 16, 2 (1983) 125-129.
[32] M. J. Loh (nee’ Varga), Application of Statistical Pattern Recognition techniques to Analysis of Thermograms, 1986, Department of Community Medicine, Cambridge University, 1986. [33] R. N. Lawson, Implications of surface temperatures in the diagnosis of breast cancer. Can. Med. Assoc. J. 75 (1956) 309-301. [34] B. H. Phillips and K. Lloyd-Williams, The clinical use of thermography. Brit. J. Hosp. Med. Equip. Suppl. 1974. [35] L. M. Carter, The clinical role of thermography. J. Med. Eng. Tech. 2 (1978) 125. [36] M. V. Kyle, G. Pam, R. Sallisbury, P. Page-Thomas and B. L. Hazelman, Prostalgandin E l Vasospastic Disease and Thermography, presented at Heberden Round, 1982. [37] P. A. Bacon, A. J. Collins, F. J. Ring and J. A. Cosh, Thermography in the assessment of inflammatory arthritis, Clinical Rheumatology, Dis Vole. 2, M. I. V. Jayson (ed.) Philadelphia, W.B. Saunders and Co., 51-65. [38] C. Rajapakse, D. M. Grennan, C. Jones, L. Wilkinson and M. Jayson, Thermography in the assessment of peripheral joint inflammation - a re-evaluation, Rheumatology and Rehabilitation 20 (1981) 81-87. [39] M. J. Varga and R. Hanka, Dynamic elastic image stretching applied to thermographic images, Special Issues: IEE Proceedings-I, Communications, Speech €4 Vision (1990). [40] M. J. Varga and R. Hanka, Pseudo-colouring systems for thermographic images, 11th European Conf. Visual Perception, Bristol, UK, (1988). [41] P. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach (Prentice-Hall International, 1982). [42] J. Kittler, Mathematical methods for feature selection in pattern recognition, Znt. J. Man. Manch. Stud. 7 (1975) 603-637. [43] P. E. Hart, The Condensed Nearest Neighbour Rule (CNN), ZEEE Trans. Information Theory 14 (1968) 515-516. [44] I. M. Ariel and J. B. Cleary, Breast Cancer Diagnosis and Treatment (McGraw-Hill Book Company, 1987). [45] M. J. Varga and P. De Muynck, Thermal analysis of infra-red mammography, 11th Int. Conf. Pattern Recogn., The Hague, The Netherlands, 1992. [46] R. Highman and J. M. Brady, Model-based image enhancement for infra-red images, IEEE Physics Based Modelling in Computer Vision Workshop, Boston, USA, 1995. [47] J. E. Shamblin and G. T. Stevens, Operations Research : A Fundamental Approach (McGraw-Hill, 1974).
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 925-944
Eds. C. H. Chen, L. F. Pau and P. S. P. Wang
© 1998 World Scientific Publishing Company
CHAPTER 5.4
VIEWER-CENTERED REPRESENTATIONS IN OBJECT RECOGNITION: A COMPUTATIONAL APPROACH
RONEN BASRI
Department of Applied Mathematics, The Weizmann Institute of Science, Rehovot 76100, Israel

Visual object recognition is a process in which representations of objects are used to identify the objects in images. Recent psychophysical and physiological studies indicate that the visual system uses viewer-centered representations. In this chapter a recognition scheme that uses viewer-centered representations is presented. The scheme requires storing only a small number of views to represent an object. It is based on the observation that novel views of objects can be expressed as linear combinations of the stored views. This method is applied to rigid objects as well as to objects with more complicated structure, such as rigid objects with smooth surfaces and articulated objects.
Keywords: Alignment, linear combinations, 3-D object recognition, viewer-centered representations, visual object recognition.
1. Introduction

Visual object recognition is a process in which images are compared to stored representations of objects. These representations, their content and use, determine the outcome of the recognition process. The features stored in an object's model determine those properties that identify the object and overshadow other properties. It is not surprising, therefore, that the issue of object representation has attracted considerable attention (reviews of different aspects of object representation can be found in [1-5]). For many objects, shape (as opposed to other cues, such as color, texture, etc.) is their most identifiable property. In shape-based recognition a model contains properties of the object that distinguish it from objects with different shapes. The wide range of possible shape representations is divided into two distinct categories, object-centered representations and viewer-centered ones. Object-centered representations describe the shape of objects using view-independent properties, while viewer-centered representations describe the way this shape is perceived from certain views. Recent psychophysical studies indicate that viewer-centered representations are used in a number of recognition paradigms (see details in Section 2).
Recognition of 3-D objects from 2-D images is difficult partly because objects look significantly different from different views. A common approach to recognition, which received the name alignment, aligns the object's model to the image before they are actually compared [5] (see also [6-8]). We present a scheme that combines the use of viewer-centered representations with the alignment approach. The scheme, referred to as the "Linear Combinations" scheme (originally developed in [9]), represents an object by a small set of its views. Recognition is performed by comparing the image to linear combinations of the model views. The scheme handles rigid objects as well as more complicated objects, such as rigid objects with smooth bounding surfaces and articulated objects.
2. Viewer-Centered Representations

The issue of object representation is critical to recognition. It determines the information that makes an object stand out and the circumstances under which it can be identified. In addition, it divides the computational process into its on-line components, the "recognition" part, and off-line components, the "learning" or "model acquisition" part. Object-centered representations describe the shape of objects using view-independent properties. These representations usually include either view-invariant properties of the object (e.g. [10,11]) or structural descriptions defined within some intrinsic coordinate system, such as generalized cylinders [12,13], constructive solid modeling [14], and the vertices and edges in polyhedra [8,15]. Object-centered models in general are relatively concise. A single model is used to recognize the object from all possible views. Viewer-centered representations describe the appearance of objects in certain views. Typically, a viewer-centered model consists of a set of one or more views of an object, possibly with certain 3-D shape attributes, such as depth or curvature (in a similar fashion to the 2½-D sketch suggested by Marr and Nishihara [13]). Often, a viewer-centered representation covers only a restricted range of views of an object. A number of models is then required to represent the object from all possible views. Viewer-centered representations are in general easier to acquire, to store, and to handle than object-centered ones. For instance, with viewer-centered models there is no need to perform elaborate computations to account for self occlusion, since such occlusion is implicit in the model views. Recent psychophysical and physiological studies indicate that in certain recognition paradigms the visual system uses viewer-centered representations. A number of experiments establish that the response time in recognition tasks varies as a function of the angular distance of the object to be recognized from either its upright position or a trained view. This effect, known as the mental rotation effect (originally shown in view-comparison tasks by Shepard and Metzler [16]), was found in naming tasks of both natural and artificially made objects [17-23]. These effects considerably
diminish with practice [18,20,23,24]. Namely, as subjects become more familiar with the objects, their response time becomes more and more uniform. Practicing the task on views of one object does not alter the performance for other objects [18,25], indicating that this is not a side effect resulting from the subjects' learning to perform the experiment better, but that subjects indeed attain richer representations of the objects with practice. Findings by Tarr and Pinker [23,26,27] suggest that massive exposure to different orientations of objects does not necessarily result in the formation of object-centered representations. They showed cases where the response time was linear with the angular separation between the observed object and its closest view in the training set. Additional support to these findings was found in measuring the error rates in naming tasks. A few studies show that the number of incorrect namings increases with the angular separation between tested views and either trained views or the object's upright position [19,28,29]. Edelman and Bulthoff [28] found that error rates increase not only as a function of the distance of the tested view from the training set, but also depend on the specific relation between the tested view and the trained views. In their experiment subjects were trained on two views of an object. It was found that intermediate views, views that lie within the range between the trained views, were correctly recognized more often than extrapolated views, that is, views that lie outside this range. Interestingly, they also found that, unlike response time, error rates do not diminish with practice [30], indicating that even after practice subjects did not attain complete view-invariant representations. Evidence consistent with the use of multiple viewer-centered descriptions was also found in single-cell activity recordings. Perret et al. [31] have investigated the response properties of face-sensitive cells in area STS of the macaque's visual cortex. They have found that cells typically respond to a wide range of 3-D orientations, but not to all viewing directions. A face-selective cell that responds to, e.g., a face-on view will typically not respond to a profile view, but will respond to a wide range of intermediate orientations. The authors concluded that "High level viewer-centered descriptions are an important stage in the analysis of faces" ([31] p. 314). It is important to remember that these experiments can be interpreted in more than a single way, and that the tested paradigms may not reflect the general recognition process. (See for example [32] where a case of dissociation of mental rotation from recognition is presented.) It seems, however, that a large number of experiments are consistent with the notion of viewer-centered representations.
3. Alignment
A major source of difficulty in object recognition arises from the fact that the images we see are two-dimensional, while the objects we try to recognize are three-dimensional. As a result, we always see only one face of an object at a time. The images of the same object may differ significantly from one another even when these
Fig. 1. Deformation of an image following a 15° rotation of a car. An overlaid picture of the car before and after rotation. Although the rotation is fairly small, the discrepancies between the two images are fairly large.
views are separated by a relatively small transformation (see for example Fig. 1). Cluttered scenes introduce additional complexity due to partial occlusion. One approach to overcome these difficulties is to first recover the underlying 3-D shape of the observed object from the image (using cues like shading, stereopsis, and motion) and then to compare the result with the 3-D model (e.g. [13,33]). Although in recent years there has been tremendous progress in understanding early visual processes, current shape recovery algorithms still seem to be limited in their ability to provide accurate and reliable depth information. Moreover, people's ability to recognize objects seems to be fairly robust to elimination of depth cues (e.g. [30,34]). The ability to recognize objects from line drawings, which contain only sparse information about the shape of objects, demonstrates that shape recovery may not be essential for recognition. The alignment approach avoids recovering the underlying 3-D shape of the observed object by comparing the object's model to the image in 2-D ("template matching"). To account for orientation differences between the stored model and the observed image, these differences are compensated for before the model and the image are compared. The transformation that compensates for these differences is called "the alignment transformation." Alignment is therefore a two-stage process. First, the position and orientation (pose) of the observed object is recovered, and then the model is transformed to this pose, projected to the image plane, and compared with the actual image. A large number of studies use alignment-like algorithms to recognize 3-D objects from 2-D images [5-8,35-37]. These studies vary in the representations used and the method employed to recover the alignment transformation. Most of these studies use object-centered representations. When viewer-centered representations are used, the naive approach usually is taken; namely, the system can recognize only the stored views of an object (e.g. [37-39]). For example, in [37] an object is modeled by a large number of views (the representation includes a table of 72 x 72 = 5184 views). A view is recognized only if the image is related to one of these views by a rotation in the image plane, in which case this view and the image share the same appearance.
In the rest of this chapter we present an alternative to these approaches: an alignment scheme that recognizes objects using viewer-centered representations. The method requires only a small number of views to represent an object from all its possible views.

4. The Linear Combinations (LC) Scheme
The variability and richness of the visual input is overwhelming. An object can give rise to a tremendous number of views. It is not uncommon for humans to forget familiar views, perhaps because the visual system is incapable of storing and retrieving such huge amounts of information. Consequently, the visual system occasionally comes across novel views of familiar objects, whether these views have been forgotten, or they are entirely new. The role of the recognition process when a novel view is observed is to deduce the information that is necessary to recognize the object from its previously observed images. This relationship between the novel and the familiar views of objects is (implicitly) specified by the representation used by the recognition system. The linear combinations (LC) scheme relates familiar views and novel views of objects in a simple way. Novel views in this scheme are expressed by linear combinations of the familiar views. This property can be used to develop a recognition system that uses viewer-centered representations: an object is modeled in this scheme by a small set of its familiar views. Recognition involves comparing the novel views to linear combinations of the model views. For such a representation to be feasible, the correspondence between the model views should first be resolved. Correspondence between views of objects is a source for understanding how the objects change between views. This information allows the system to track the location of feature points in the model images and predict their location in novel views of the object. A view in the LC scheme is represented by the locations of feature points (such as corners or contour points) in the image. A model is a set of views with correspondence between the points. As already mentioned, novel views are expressed by linear combinations of the model views. When opaque objects are considered, due to self occlusion, different faces ("aspects") of the object appear in different views. A number of models (not necessarily independent) would then be required to predict the appearance of such objects from all possible viewpoints. The LC method applies to rigid objects as well as to more complicated objects, such as objects that undergo affine transformations, rigid objects with smooth bounding surfaces, and articulated objects. In this section we describe the main properties of the LC scheme. A more thorough presentation can be found in [9].

4.1. Rigid Objects
In this section we show that for rigid objects novel views can be expressed as linear combinations of a small number of views. We begin with the following
definitions. Given an image I with feature points p1 = (x1, y1), ..., pn = (xn, yn), a view V is a pair of vectors x, y ∈ R^n, where x = (x1, ..., xn)^T and y = (y1, ..., yn)^T contain the locations of the feature points p1, ..., pn in the image. A model is a set of views {V1, ..., Vk}. The location vectors in these views are ordered in correspondence, namely, the first point in V1 is the projection of the same physical point on the object as the first point in V2, and so forth. The objects we consider undergo rigid transformations, namely, rotations and translations in space. We assume that the images are obtained by weak perspective projection, that is, orthographic projection together with uniform scaling. The proof of the linear combinations property proceeds in the following way. First, we show (Theorem 1) that the set of views of a rigid object is contained in a four-dimensional linear space. Any four linearly independent vectors from this space can therefore be used to span the space. Consequently, we show (Theorem 2) that two views suffice to represent the space. Any other view of the object can be expressed as (two) linear combinations of the two basis views. Next, we show (Theorem 3) that not every point in this 4-D space necessarily corresponds to a legal view of the object. The coefficients satisfy two quadratic constraints. These constraints depend on the transformation between the model views. A third view can be used to derive the constraints.

Theorem 1. The views of a rigid object are contained in a four-dimensional linear space.

Proof. Consider an object O with feature points p1 = (x1, y1, z1), ..., pn = (xn, yn, zn). Let I be an image of O obtained by a rotation R, translation t, and scaling s, followed by an orthographic projection Π. Let q1 = (x'1, y'1), ..., qn = (x'n, y'n) be the projected locations in I of the points p1, ..., pn respectively. For every 1 ≤ i ≤ n,

qi = s Π(R pi) + t.

More explicitly, this equation can be written as

x'i = s(r11 xi + r12 yi + r13 zi) + tx
y'i = s(r21 xi + r22 yi + r23 zi) + ty

where {rij} are the components of the rotation matrix, and tx, ty are the horizontal and the vertical components of the translation vector. Since these equations hold for every 1 ≤ i ≤ n, we can rewrite them in vector notation. Denoting x = (x1, ..., xn)^T, y = (y1, ..., yn)^T, z = (z1, ..., zn)^T, 1 = (1, ..., 1)^T, x' = (x'1, ..., x'n)^T, and y' = (y'1, ..., y'n)^T, we obtain that

x' = a1 x + a2 y + a3 z + a4 1
y' = b1 x + b2 y + b3 z + b4 1

where

a1 = s r11,  a2 = s r12,  a3 = s r13,  a4 = tx
b1 = s r21,  b2 = s r22,  b3 = s r23,  b4 = ty.

The vectors x' and y' can therefore be expressed as linear combinations of the four vectors x, y, z, and 1. Notice that changing the view would result merely in a change in the coefficients. We can therefore conclude that

x', y' ∈ span{x, y, z, 1}

for any view of O. Notice that if translation is omitted the views space is reduced to a three-dimensional one. □
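Theorem 1 is easy to verify numerically: generate weak-perspective views of a random rigid point set and check that the stacked x' and y' vectors, together with x, y, z and 1, never exceed rank four. The snippet below is only an illustration of this check (random rotations are built from QR decompositions; all names are my own).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
P = rng.normal(size=(n, 3))                    # object points (x, y, z)

def random_view(P):
    """Weak perspective: rotate, scale, translate, keep the first two coordinates."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:                   # make it a proper rotation
        Q[:, 0] = -Q[:, 0]
    s, t = rng.uniform(0.5, 2.0), rng.normal(size=2)
    proj = s * (P @ Q.T)[:, :2] + t            # (n, 2): the x' and y' coordinates
    return proj[:, 0], proj[:, 1]

basis = np.column_stack([P[:, 0], P[:, 1], P[:, 2], np.ones(n)])   # x, y, z, 1
views = []
for _ in range(10):                            # many novel views
    xp, yp = random_view(P)
    views.extend([xp, yp])

stacked = np.column_stack([basis] + [v[:, None] for v in views])
print(np.linalg.matrix_rank(stacked))          # -> 4, as Theorem 1 predicts
```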
Theorem 2. The views space of a rigid object O can be constructed from two views of O. (This lower bound was independently noticed by Poggio [40].)

Proof. Theorem 1 above establishes that the views space of a rigid object is four-dimensional. Any four linearly independent vectors in this space can be used to span the space. The constant vector, 1, belongs to this space. Therefore, only three more vectors remain to be found. An image supplies two vectors. Two images supply four, which already is more than enough to span the space (assuming the two images are related by some rotation in depth, otherwise they are linearly dependent). Let V1 = (x1, y1) and V2 = (x2, y2) be two views of O. A novel view V' = (x', y') of O can be expressed as two linear combinations of the four vectors x1, y1, x2, and 1. The remaining vector, y2, already depends on the other four vectors. □

Up to this point we have shown that the views space of a rigid object is contained in a four-dimensional linear space. Theorem 3 below establishes that not every point in this space corresponds to a legal view of the object. The coefficients of the linear combination satisfy two quadratic constraints.
Theorem 3. The coefficients satisfy two quadratic constraints, which can be derived from three images.

Proof. Consider the coefficients a1, ..., a4, b1, ..., b4 from Theorem 1. Since R is a rotation matrix, its row vectors are orthonormal, and therefore the following equations hold for the coefficients:

a1² + a2² + a3² = b1² + b2² + b3²
a1 b1 + a2 b2 + a3 b3 = 0.

Choosing a different basis to represent the object (as we did in Theorem 2) will change the constraints. The constraints depend on the transformation that separates the model views. Denote by α1, ..., α4, β1, ..., β4 the coefficients that
represent a novel view with respect to the basis described in Theorem 2, namely

x' = α1 x1 + α2 y1 + α3 x2 + α4 1
y' = β1 x1 + β2 y1 + β3 x2 + β4 1

and denote by U the rotation matrix that separates the two model views. By substituting the new coefficients we obtain the new constraints

α1² + α2² + α3² + 2(α1 α3 u11 + α2 α3 u12) = β1² + β2² + β3² + 2(β1 β3 u11 + β2 β3 u12)
α1 β1 + α2 β2 + α3 β3 + (α1 β3 + α3 β1) u11 + (α2 β3 + α3 β2) u12 = 0.
To derive the constraints the values of u11 and u12 should be recovered. A third view can be used for this purpose. When a third view of the object is given, the constraints supply two linear equations in u11 and u12, and therefore, in general, the values of u11 and u12 can be recovered from the two constraints. This proof suggests a simple, essentially linear structure-from-motion algorithm that resembles the method used in [41,42], but the details will not be discussed further here. □

The scheme is therefore the following. An object is modeled by a set of views, with correspondence between the views, together with the two constraints. When a novel view of the object is observed the system computes the linear combination that aligns the model to the object. The object is recognized if such a combination is found and if, in addition, the constraints are verified. Figure 2 shows the application of the linear combinations scheme to an artificially made object.
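In practice the alignment step amounts to a least-squares solve: given two stored views in correspondence and a novel view, the coefficients of the two linear combinations can be recovered as below, and the residual (together with the quadratic constraints, for rigid objects) then decides recognition. This is an illustrative sketch of the scheme, not the original implementation; the tolerance parameter is an assumption.

```python
import numpy as np

def lc_align(x1, y1, x2, x_new, y_new):
    """Express a novel view as linear combinations of the basis (x1, y1, x2, 1).

    x1, y1 : feature-point coordinates in the first model view.
    x2     : x-coordinates in the second model view (y2 is redundant, Theorem 2).
    Returns (alpha, beta, residual): the two coefficient vectors and the fit error.
    """
    n = len(x1)
    B = np.column_stack([x1, y1, x2, np.ones(n)])        # the 4-D basis
    alpha, _, _, _ = np.linalg.lstsq(B, x_new, rcond=None)
    beta, _, _, _ = np.linalg.lstsq(B, y_new, rcond=None)
    residual = np.linalg.norm(B @ alpha - x_new) + np.linalg.norm(B @ beta - y_new)
    return alpha, beta, residual

def is_match(residual, n_points, tol=1.0):
    """Accept the object if the per-point alignment error is small (tol in pixels)."""
    return residual / n_points < tol
```

For rigid objects the recovered α and β would additionally be tested against the two quadratic constraints above; for the affine case of Section 4.3 the residual test alone suffices.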
Fig. 2. Application of the linear combinations scheme to a model of a pyramid. Top: two model pictures of a pyramid. Bottom: two of their linear combinations.
For transparent objects a single model is sufficient to predict their appearance from all possible viewpoints. For opaque objects, due to self occlusion, a number of models is required to represent the objects from all aspects. These models are not
5.4 Viewer-Centered Representations in Object Recognition 933 necessarily independent. For example, in the case of a convex object as few as four images are sufficient to represent the object from all possible viewpoints. A pair of images, one from the “front” and another one from the “back” contains each object point once. Two such pairs contain two appearances of all object points, which is what is required to obtain a complete representation of all object points. Note that positive values of the coefficients (“convex combinations”) correspond to interpolation between the model views, while extrapolation is obtained by assigning one or more of the coefficients with negative values. This distinction between intermediate views and other views is important, since if two views of the object come from the same aspect then intermediate views are likely to also come from that aspect, while in other views other aspects of the objects may be observed. 4.2. Additional Views
In the previous section we have shown that two views of a rigid object are sufficient to represent the object from all possible viewpoints. All other views are linear combinations of the two views. In practice, however, because of noise and occlusion one may seek to use additional views to improve the accuracy of the model. In this section we present a method to build models from more than two views. The idea is as follows. Each view provides two vectors, one for the x-coordinates and the other for the y-coordinates. These vectors can be viewed as points in R^n. The space of views of the object is known to be four-dimensional. The objective, then, is to find the four-dimensional subspace of R^n that best approximates the input views. This subspace can be found using singular value decomposition. More formally, given l vectors v1, ..., vl, we denote F = [v1, ..., vl]; F is an n x l matrix. The best k-dimensional space through these vectors (in a least-squares sense) is spanned by the k eigenvectors of FF^T that correspond to its k largest eigenvalues. (This is shown in [9], Appendix B.) This method resembles the algorithm used by Tomasi and Kanade [43] to track features in motion sequences, with the exception that in our case the motion parameters do not need to be recovered, since we are only interested in finding the linear space from which these views are taken. A method that approximates the space of views of an object from a number of its views using Radial Basis Functions [44] was recently suggested [45]. Similar to the LC method, the system represents an object by a set of its familiar views with the correspondence between the views. The number of views used for this approximation, between 10 and 100, is much larger than the number required under the linear combinations scheme. The system, however, can also approximate perspective views of the objects.

4.3. Affine Objects
4.3. Affine Objects
In this section we extend the LC scheme to objects that undergo general affine transformations in space. In addition to the rigid transformations affine
transformations include stretching and shearing. They are important since tilted pictures of objects appear to be stretched [46]. This effect is known as the La Gournerie Paradox (see [47]). In order to extend the LC method to include affine transformations the same scheme can be used, but with the quadratic constraints ignored. Namely, the four-dimensional linear space contains all and only the affine views of the object. Two views are therefore sufficient to span the space with no further constraints.
4.4. Rigid Objects with Smooth Surfaces
In this section we extend the LC scheme to rigid objects with smooth bounding surfaces. These objects are considerably more difficult to recognize from their contour images than are objects with sharp edges (such as polyhedral objects). When objects with sharp edges are considered, the contours are always generated by those edges. With objects with smooth bounding surfaces, however, the silhouette (the boundary of the object) does not correspond to any particular edges on the object. That is, the rim (the set of object points that generates the contours) changes its position on the object with viewpoint, and its location therefore is difficult to predict (see Fig. 3).
Fig. 3. The change of the rim of an object with smooth bounding surface due to rotation. Left: a horizontal section of an ellipsoid. p is a point on the rim. Right: the section rotated. p is no longer on the rim. Instead p' is the new rim point. The method described in Section 4.4 approximates the position of p' using the curvature circle at p. (See [48] for details.)
The position change of the rim depends largely on the 3-D curvature at the rim points. When this curvature is high the position change is relatively small. (In the case of a sharp edge, the curvature is infinite and the position change vanishes.) When the curvature is low the position change is relatively large. Following this observation a method to approximate the position change of the rim using the surface curvature was developed [48]. In the original implementation a model contained a single contour image of the object. Each point along the contour was associated with its depth coordinate and its radial curvature (the curvature
at the section defined by the surface normal and the line of sight). It was shown that a small number of images (at least three) is sufficient to recover this curvature. Using this information the system could approximate the appearance of objects with smooth bounding surfaces for relatively large transformations. In a later paper Ullman and Basri [9] showed that this approximation method is linear in the model views. They concluded that objects with smooth bounding surfaces can be represented by linear combinations of their familiar views. The space of views in this case is six-dimensional (rather than four), and at least three views (rather than two) are required to span the space. Additional quadratic constraints apply to the coefficients of the linear combinations. It should be noted that in order to handle objects with smooth bounding surfaces the definition of correspondence should be modified since contour points no longer represent the same physical points on the object from all views. Under the modified version, silhouette points in one image are matched to silhouette points in the second image that lie along the epipolar line. Ambiguities are resolved in a straightforward manner. Note also that advance knowledge of the type of the object, whether it has sharp edges or smooth bounding surfaces, is not required. The views of a curved object span a larger space than the views of a polyhedral object. Thus, singular value decomposition can be used to distinguish between the two (see Section 4.2). Figure 4 shows the application of the method to real edge images of a car. It can be seen that the predictions obtained are fairly accurate even though the bounding contours are smooth.
4.5. Articulated Objects
An articulated object is a collection of links connected by joints. Each link is a rigid component; it can move independently of the other links, with only its joints constraining its motion. The space of views of an articulated object with l links is at most (4 x l)-dimensional. The joints contribute additional constraints. Some of these constraints may be linear, in which case they reduce the rank of the space; others are non-linear, in which case they are treated in the same way the quadratic constraints are treated in the rigid case. Consider, for example, an object composed of two links connected by a rotational joint (e.g. a pair of scissors). The views space of a two-link object is at most eight-dimensional (four for each of the links). The rotational joint constrains the two links by forcing them to share a common axis. Denote by p and q two points along this axis, and denote by T1 and T2 the rigid transformations applied to the first and second links respectively; then the following two constraints hold:

T1 p = T2 p,    T1 q = T2 q.
Fig. 4. Application of the linear combination scheme to a VW car. Top: three model pictures of the car. Middle: matching the model to a picture of the VW car. A linear combination of the three model images (left), an actual edge image (middle), and the two images overlaid (right). The prediction image and the actual one align almost perfectly. Bottom: matching the VW model to an image of another car. A linear combination of the three model images (left), an actual image of a Saab car (middle), and the two images overlaid (right). In this case, although the coefficients of the linear combination were chosen such that the prediction would match the actual image as much as possible, the obtained match is relatively poor.
These two constraints are linear, and therefore they reduce the dimension of the space from eight to six. In addition, there is one quadratic constraint that implies the two links are scaled by the same amount. To summarize, the space of views of an articulated object that is composed of two links connected by a rotational joint is contained in a six-dimensional linear space. Five additional quadratic constraints (two follow the rigidity of each of the two links and one follows the common scaling) apply to the coefficients. As in the case of objects with smooth bounding surfaces, advance knowledge of the number of links and the type of the joints is not required. When sufficiently many views are presented, the correct rank of the views space can be recovered using singular value decomposition. Figure 5 shows the application of the linear combinations scheme to a pair of scissors. The images in this figure were obtained by different rigid transformations as well as articulations. It can be seen that the predictions match the real images also in the presence of articulations.
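Since the dimension of the view space (four for a rigid object, six for a smooth or two-link articulated one) can be read off from a sufficiently rich set of views, a simple numerical-rank estimate suffices. The short sketch below shows one way to do this; the tolerance value and the function name are illustrative assumptions, not prescriptions of the chapter.

import numpy as np

def view_space_dimension(F, rel_tol=1e-2):
    """Estimate the dimension of the space spanned by the view vectors in
    the columns of F by counting singular values that are significant
    relative to the largest one; rel_tol absorbs measurement noise."""
    s = np.linalg.svd(F, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))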
5. Recognition Using the LC Scheme

In the previous section we have presented a viewer-centered representation for object recognition. An object is modeled in this scheme by a small set of its views with the correspondence between the views. Novel views of the object are expressed
Fig. 5. Application of the linear combination method to a pair of scissors. Left: two linear combinations of the model views. Middle: actual edge images. Right: overlay of the predictions with the real images.
by linear combinations of the model views. In addition, the coefficients of these linear combinations may follow certain functional constraints. In this section we discuss how this representation can be used in a recognition system. The task assigned to the recognition system is to determine, given an incoming image, whether the image belongs to the space of views of a particular model. We discuss two principal methods to reach such a decision. The first involves alignment of the model to the image by explicitly recovering the coefficients of the linear combination, and the second involves the application of a “recognition operator”.
5.1. Recovering the Alignment Coefficients

The alignment approach to object recognition identifies objects by first recovering the transformation that aligns the model with the incoming image, and then verifying that the transformed model matches the image. In the LC scheme, the observed image is expressed by linear combinations of the model views. The task is therefore to recover the coefficients of these linear combinations. In other words, given a view v' and a model {v1, ..., vk} we seek a set of coefficients a1, ..., ak for which

v' = a1 v1 + a2 v2 + ... + ak vk
holds. (In practice, to overcome noise, we may seek to minimize the difference between the two sides of this equation.) To determine the coefficients that align a model to the image, either one of the two following methods can be employed. The first method involves recovering the correspondence between the model and the image, and the second method involves a search in the space of possible coefficients. In the first method correspondence is
established between sufficiently many points so as to recover the coefficients. For a model that contains k views, at least k correspondences are required to solve a system of 2k linear equations (k equations for recovering the coefficients for the x-values, and another k equations for recovering the coefficients for the y-values). In this way, for example, four correspondences between model and image points are required to recover the coefficients for a rigid object by solving a linear system. If in addition we consider the quadratic constraints, this number is reduced to three. This is similar to the three-point alignment suggested by Huttenlocher and Ullman [5,7]. Applications of this method usually try to match triplets of model points to all combinations of triplets of image points to guarantee recognition. An alternative approach to determining the coefficients involves a search in the space of possible coefficients. This method does not require correspondence between the model and the image. The idea is the following. Using global properties of the observed object, such as axes of elongation, an initial guess for the values of the coefficients can be made. This initial guess can then be improved by an iterative process. At every step in this process a new set of coefficients is generated. The model is transformed using these coefficients, and the result is compared to the actual image. If the two match, the observed object is recognized; otherwise the process is repeated until it converges. Minimization techniques such as gradient descent may be employed to reduce the complexity of the search. Such techniques, however, involve the risk of converging to a local minimum, which occasionally may be significantly worse than the desired solution. It is interesting to note that the phenomenon of mental rotation seems to be consistent with the idea of search. The evidence for mental rotation suggests that recognition is not attained in an instant, but rather the response time increases with the angular separation between the observed object and its stored representation (see the discussion in Section 2).
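A least-squares version of the first method can be sketched as follows. This is only an illustration under stated assumptions: the routine solves the overdetermined linear system given by the corresponded points, and the argument layout and names are choices of this sketch rather than of the chapter.

import numpy as np

def alignment_coefficients(model_at_points, image_values):
    """model_at_points: m x k array; column i holds the values of model
    vector v_i at the m corresponded points.  image_values: length-m vector
    of the corresponding values in the novel view v'.  Returns the
    least-squares coefficients a such that model_at_points @ a approximates
    image_values; applied once to x-coordinates and once to y-coordinates."""
    a, residuals, rank, sv = np.linalg.lstsq(model_at_points, image_values, rcond=None)
    return a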
5.2. Recognition Operator

A second approach to identifying novel views of objects involves the application of “recognition operators” to these views. Such operators are essentially invariants for a given space of views; that is, they return a constant value for all views of the object, and different values for views of other objects. This method does not require the explicit recovery of the alignment coefficients. Still, it does require correspondence between the model and the image. In the LC scheme a view is treated as a point in Rn. A view contains the appearance of an object if it belongs to the space of views spanned by the object's model. A natural way to identify the object would be to determine how far apart the incoming view is from the views space of the object. The result of such a test would be zero if and only if the given view is a possible view of the object. By projecting the given view onto the views space of the object we can generate a distance metric between the model and the view to be recognized.
Let v1, ..., vk be the model views. Denote M = [v1, ..., vk]; M is an n x k matrix. Theorem 4 below defines a recognition operator L. L measures the distance of a view v' from the linear space spanned by the model views, v1, ..., vk, and ignores the nonlinear constraints.
Theorem 4. Let
L = I - MM+

where M+ = (M^T M)^(-1) M^T denotes the pseudo-inverse of M. Then Lv' = 0 if and only if v' is a linear combination of v1, ..., vk.
Proof. Lv' = 0 if and only if v' = MM+v'. MM+ is a projection operator; it projects the vector v' onto the column space of M. Therefore, the equality holds if and only if v' belongs to the column space of M, in which case it can be expressed by a linear combination of v1, ..., vk. The matrix L is therefore invariant for all views of the object; it maps all its views to zero. □
Note that L only considers the linear envelope of the views space of the object. It does not verify any of the quadratic constraints. To verify in addition the quadratic constraints a quadratic invariant can be constructed. Weinshall [49] has recently presented a quadratic invariant for four points. This invariant can be modified to handle more points in a straightforward manner, but the details will not be discussed here. The recognition operator can be made associative. The idea is the following. Suppose L is a linear operator that maps all model views to the same single vector, that is, q = Lv1 = ... = Lvk. Since L is linear it maps combinations of the model views to the same vector (up to a scale factor). Let v' be a novel view of the object, v' = a1 v1 + ... + ak vk; then

Lv' = a1 Lv1 + ... + ak Lvk = (a1 + ... + ak) q.
q serves as a name for the model, and it can be either zero (in which case we obtain an operator that is identical to the operator in Theorem 4 above) or it can be a familiar view of the object (e.g. v1). Note, however, that special care should be given to the case in which the sum of the coefficients vanishes. A constructive definition of the associative operator is given below. Let {v1, ..., vn} be a basis for Rn such that the first k vectors are composed of the model views. Denote

P = [v1, ..., vk, vk+1, ..., vn]  and  Q = [q, ..., q, vk+1, ..., vn],

where the first k columns of Q all equal q.
(We filled the matrix Q with the vectors vk+1, ..., vn so that the operator L would preserve the magnitude of noise if such is added to the novel view. These vectors can be replaced by any vectors that are linearly independent of q.) We require that
LP = Q.

Therefore

L = QP^(-1).

(Notice that since P is a basis for Rn, its inverse exists.) We have implemented the associative version of the recognition operator and applied it to the pyramid from Fig. 2. The results are given in Fig. 6. It can be seen that when this operator is applied to a novel view of the pyramid it returns a familiar view of the pyramid, and when it is applied to some other object it returns an unknown view.
Fig. 6. Top: applying an associative “pyramidal” operator to a pyramid (left) returns a model view of the pyramid (right, compare with Fig. 2, top left). Bottom: applying the same operator to a cube (left) returns an unfamiliar image (right).
Both versions of the recognition operator can be implemented by a simple neural network with one layer [50]. The network contains only input and output layers with no hidden units. The weights are set to be the components of L (see Fig. 7). The network operates on novel views of some object and returns either zero or a familiar view of the object, according to the operator it implements. It should be noted that for such an operator to be applicable the correspondence between the image and model must first be resolved.
6. Summary
Visual object recognition is a process in which images are compared to stored representations of objects. While recent psychophysical and physiological studies indicate that the visual system uses viewer-centered representations, most computational approaches to recognition use object-centered representations. The few existing methods that use viewer-centered representations require a large number of views to represent an object from all possible viewpoints.
Fig. 7. A neural network architecture that implements the recognition operator L. The input to this network is composed of the elements of the novel view v', and the output is the “name” vector q (up to a scale factor).
A scheme was presented in which objects are recognized using viewer-centered models. The scheme is based on the observation that the novel views of an object can be expressed as linear combinations of a small set of its familiar views. An object is modeled by a set of views with correspondence between the views and possibly with some functional constraints. A novel view is recognized if there exists a linear combination of the model views that aligns the model to the image, and if the coefficients of this combination satisfy the functional constraints. The method was applied to rigid objects as well as to objects that undergo affine transformations, rigid objects with smooth bounding surfaces (in which case the method only approximates the appearance of these objects), and articulated objects. The number of views required to represent an object depends on the shape of the object, whether it has sharp edges or smooth surfaces, in the case of a rigid object, and on the type of joints that connect the links in the case of an articulated one. This number can be deduced from the set of views of the object. To recover the alignment coefficients, a small number of points in the image and their corresponding points in the model can be used, or a search can be conducted in the space of possible coefficients. Alternatively, if the complete correspondence between the model and the image can be recovered, a “recognition operator” can be applied to the image. This operator obtains as its input a novel view of the object and returns a constant value, either the zero vector or a familiar view of the object. Furthermore, the operator can be implemented in a neural network with a simple structure. Finding the correspondence between the model and the image is the difficult problem in recognition. The phenomenon of apparent motion, however, demonstrates that the visual system can successfully solve the correspondence problem.
Acknowledgements
I wish to thank Shimon Ullman without whom this work would not have been possible, and T. D. Alter, S. Edelman, W. E. L. Grimson, T. Poggio, and A. Yuille for helpful comments at different stages of this work. This report describes research done at the Weizmann Institute of Science and at the Massachusetts Institute of Technology within the Artificial Intelligence Laboratory and the McDonnell-Pew Center for Cognitive Neuroscience. Support for the laboratory's artificial intelligence research is provided in part by the Advanced Research Projects Agency of the Department of Defense under Office of Naval Research contract N00014-915-4038. Ronen Basri is supported by the McDonnell-Pew and the Rothschild postdoctoral fellowships.
References
[1] I. Biederman, Recognition by components: a theory of human image understanding, Psychol. Rev. 94 (1987) 115-147.
[2] R. T. Chin and C. R. Dyer, Model-based recognition in robot vision, Comput. Surv. 18, 1 (1986) 67-108.
[3] P. Jolicoeur, Identification of disoriented objects: A dual-systems theory, Mind and Language 5 (1990) 387-410.
[4] S. E. Palmer, Fundamental aspects of cognitive representation, in E. Rosch and B. B. Lloyd (eds.), Cognition and Categorization (Lawrence Erlbaum, Hillsdale, NJ, 1978) 259-303.
[5] S. Ullman, Aligning pictorial descriptions: An approach to object recognition, Cognition 32, 3 (1989) 193-254.
[6] M. A. Fischler and R. C. Bolles, Random sample consensus: A paradigm for model fitting with application to image analysis and automated cartography, Commun. ACM 24, 6 (1981) 381-395.
[7] D. P. Huttenlocher and S. Ullman, Object recognition using alignment, in Proc. Int. Conf. on Computer Vision (ICCV), London, UK, 1987, 102-111.
[8] D. G. Lowe, Perceptual Organization and Visual Recognition (Kluwer Academic Publishers, Boston, MA, 1986).
[9] S. Ullman and R. Basri, Recognition by linear combinations of models, IEEE Trans. Pattern Anal. Mach. Intell. 13, 10 (1991) 992-1006.
[10] R. C. Bolles and R. A. Cain, Recognizing and locating partially visible objects: The local feature focus method, Int. J. Robot. Res. 1, 3 (1982) 57-82.
[11] M. K. Hu, Visual pattern recognition by moment invariants, IRE Trans. Inf. Theory 8 (1962) 169-187.
[12] T. O. Binford, Visual perception by computer, in Proc. IEEE Conf. on Systems and Control, Miami, FL, 1971.
[13] D. Marr and H. K. Nishihara, Representation and recognition of the spatial organization of three dimensional shapes, Proc. Royal Society B200 (1978) 269-291.
[14] A. Requicha and H. Voelcker, Constructive solid geometry, Production Automation Project TM-26, University of Rochester, NY, 1977.
[15] L. G. Roberts, Machine perception of three-dimensional solids, in J. T. Tippett et al. (eds.), Optical and Electro-Optical Information Processing (MIT Press, Cambridge, MA, 1965).
[16] R. N. Shepard and J. Metzler, Mental rotation of three dimensional objects, Science 171 (1971) 701-703.
[17] L. A. Cooper, Demonstration of a mental analog to an external rotation, Perception and Psychophysics 1 (1976) 20-43.
[18] P. Jolicoeur, The time to name disoriented natural objects, Memory and Cognition 13, 4 (1985) 289-303.
[19] P. Jolicoeur and M. J. Landau, Effects of orientation on the identification of simple visual patterns, Canadian J. Psychol. 38, 1 (1984) 80-93.
[20] R. Maki, Naming and locating the tops of rotated pictures, Canadian J. Psychol. 40 (1986) 368-387.
[21] R. N. Shepard and J. Metzler, Mental rotation: effects of dimensionality of objects and type of task, J. Exper. Psychol.: Human Perception and Performance 14, 1 (1988) 3-11.
[22] S. P. Shwartz, The perception of disoriented complex objects, in Proc. 3rd Conf. on Cognitive Sciences, Berkeley, CA, 1981, 181-183.
[23] M. J. Tarr and S. Pinker, Mental rotation and orientation-dependence in shape recognition, Cognitive Psychology 21 (1989) 233-282.
[24] M. C. Corballis, Recognition of disoriented shapes, Psychol. Rev. 95 (1988) 115-123.
[25] P. Jolicoeur and B. Milliken, Identification of disoriented objects: Effects of context of prior representation, J. Exper. Psychol.: Learning, Memory, and Cognition 15 (1989) 200-210.
[26] M. J. Tarr and S. Pinker, When does human object recognition use a viewer-centered reference frame? Psychol. Sci. 1 (1990) 253-256.
[27] M. J. Tarr, Orientation Dependence in Three-Dimensional Object Recognition, Ph.D. thesis, Massachusetts Institute of Technology, 1989.
[28] S. Edelman and H. H. Bulthoff, Viewpoint-specific representations in three-dimensional object recognition, Technical Report A.I. Memo 1239, The Artificial Intelligence Lab., M.I.T., 1990.
[29] I. Rock and J. DiVita, A case of viewer-centered object perception, Cognitive Psychology 19 (1987) 280-293.
[30] S. Edelman and H. H. Bulthoff, Orientation dependence in the recognition of familiar and novel views of 3D objects, Vision Research 32 (1992) 2385-2400.
[31] D. I. Perret, P. A. J. Smith, D. D. Potter, A. J. Mistlin, A. S. Head, A. D. Milner, and M. A. Jeeves, Visual cells in the temporal cortex sensitive to face view and gaze direction, Proc. Royal Society B223 (1985) 293-317.
[32] M. J. Farah and K. M. Hammond, Mental rotation and orientation-invariant object recognition: Dissociable processes, Cognition 29 (1988) 29-46.
[33] R. J. Douglass, Interpreting three dimensional scenes: A model building approach, Comput. Graph. Image Process. 17 (1981) 91-113.
[34] J. E. Hochberg and V. Brooks, Pictorial recognition as an unlearned ability: A study of one child's performance, Am. J. Psychol. 75 (1962) 624-628.
[35] C. H. Chien and J. K. Aggarwal, Shape recognition from single silhouette, in Proc. Int. Conf. on Computer Vision (ICCV), London, UK, 1987, 481-490.
[36] O. D. Faugeras and M. Hebert, The representation, recognition and location of 3-D objects, Int. J. Robot. Res. 5, 3 (1986) 27-52.
[37] D. W. Thompson and J. L. Mundy, Three dimensional model matching from an unconstrained viewpoint, in Proc. IEEE Int. Conf. on Robotics and Automation, Raleigh, NC, 1987, 208-220.
[38] Y. S. Abu-Mostafa and D. Psaltis, Optical neural computing, Sci. Am. 256 (1987) 66-73.
[39] P. Van Hove, Model based silhouette recognition, in Proc. IEEE Computer Society Workshop on Computer Vision, 1987.
[40] T. Poggio, 3D object recognition: On a result by Basri and Ullman, Technical Report TR 9005-03, IRST, Povo, Italy, 1990.
[41] S. Ullman, The Interpretation of Visual Motion (MIT Press, Cambridge, MA, 1979).
[42] T. S. Huang and C. H. Lee, Motion and structure from orthographic projections, IEEE Trans. Pattern Anal. Mach. Intell. 2, 5 (1989) 536-540.
[43] C. Tomasi and T. Kanade, Factoring image sequences into shape and motion, in Proc. IEEE Workshop on Visual Motion, Princeton, NJ, 1991, 21-29.
[44] T. Poggio and F. Girosi, Regularization algorithms for learning that are equivalent to multilayer networks, Science 247 (1990) 978-982.
[45] T. Poggio and S. Edelman, A network that learns to recognize three-dimensional objects, Nature 343 (1990) 263-266.
[46] D. W. Jacobs, Space efficient 3D model indexing, in Proc. CVPR Conference, Urbana, IL, 1992.
[47] J. E. Cutting, Perception with An Eye for Motion (MIT Press, Cambridge, MA, 1986).
[48] R. Basri and S. Ullman, The alignment of objects with smooth surfaces, in Proc. 2nd Int. Conf. Computer Vision, Florida, 1988, 482-488.
[49] D. Weinshall, Model based invariants for 3-D vision, Int. J. Computer Vision 10(1) (1993) 27-42.
[50] R. Basri and S. Ullman, Linear operator for object recognition, in J. E. Moody, S. J. Hanson and R. P. Lippmann (eds.), Advances in Neural Information Processing Systems 4 (Morgan Kaufmann, San Mateo, CA, 1991).
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 945-977 Eds. C. H. Chen, L. F. Pau and P. S. P. Wang © 1998 World Scientific Publishing Company
CHAPTER 5.5
VIDEO CONTENT ANALYSIS AND RETRIEVAL
HONGJIANG ZHANG
Hewlett-Packard Laboratories, 1501 Page Mill Road, Palo Alto, California 94304, USA
E-mail:
[email protected]
This chapter discusses a number of available techniques and state-of-the-art research issues in video content analysis and retrieval. It focuses on basic algorithms for video structure parsing, content representation, content-based abstraction and application tools for content-based video indexing, retrieval and browsing.
Keywords: Video content analysis, video partition, visual abstraction, content-based indexing, content-based retrieval and browsing, digital video, multimedia, digital library.
1. Background and Motivations
With rapid advances in communication and multimedia computing technologies, accessing a vast amount of visual data is becoming a reality on information superhighways and in digital libraries. Though the most important investments have been targeted at the information infrastructure, it has been realized that video content analysis and processing are key issues in putting together successful applications [1,2]. The fact of the matter is that interacting with multimedia data, video in particular, requires much more than just connecting users with data banks and delivering data via networks to customers' homes or offices. It is simply not enough to just store and display video as in commercial video-on-demand services. The need, from the point of view of content, is that the visual data resources managed by such systems have to be structured and indexed based on the content before being accessed. However, even when it is possible, the human production of such content descriptive data is so time consuming - and thus costly - that it is almost impossible to generate it for the vast amount of visual data available. Further, human-produced descriptive data is often subjective, inaccurate and incomplete. Thus, fundamentally, what we need are new technologies for video content analysis and representation to facilitate organization, storage, query and retrieval of mass collections of video data in a user-friendly way. When text is indexed, words and phrases are used as index entries for sentences, paragraphs, pages or documents. Similarly, video indexing will require partitioning of video documents into shots and scenes and extracting key frames or key sequences as entries for scenes or stories. Therefore, automated indexing of video will require
the support of tools that can detect such meaningful segments, and extract content features of any video source. Figure 1 shows a system diagram of such a video content analysis process consisting of three major steps.
• Parsing: This process will partition a video stream into generic clips of different levels of granularity and extract structural information of the video. These clips will be the units for representation and indexing.
• Content analysis, abstraction and representation: Ideally, individual clips will be decomposed into semantic primitives based on which a clip can be represented and indexed with a semantic description. In practice, the abstraction process will generate a visual abstract of clips, and low-level visual features of clips will be used to represent their visual content.
• Retrieval and browsing: Indices are built based on content primitives or meta-data through, for instance, a clustering process which classifies shots into different visual categories. Schemes and tools are needed to utilize these content representations and indices to query, search and browse large video databases for retrieving desired video clips.
Fig. 1. Process diagram for video content analysis and retrieval.
The temporal segmentation process is analogous to sentence segmentation and paragraphing in parsing textual documents, and many effective algorithms are now available for temporal segmentation [4-13], as will be described in detail later in Section 2. Since fully automated visual content understanding or mapping low level visual features to high level semantic content is not feasible in the near future, video content representation will be mostly based on low-level content features. Such retrieval approaches using low-level visual features as content representation have shown a great potential for retrieval of images in large image databases [14]. While we tend to think of indexing for supporting retrieval, browsing is equally significant for video source material since the volume of video data also requires techniques to present an information landscape or structure to give an idea of what is out there. The task of browsing is actually very intimately related to retrieval. On the one hand, if a query is too general, browsing is the best way to examine the results. Hence, browsing also serves as an aid to formulate queries, making it easier for the user to just ask around in the process of figuring out the most appropriate
query to pose. A truly content-based approach to video browsing also requires some level of analysis of video content, both structural and semantic, rather than simply providing a more sophisticated view of temporal context. The need for video content analysis tools as summarized above poses many research challenges to scientists and engineers across all multimedia computing disciplines. In this chapter, we will try to discuss the approaches to visual content analysis, representation and their applications and to survey some open research problems. Section 2 covers video content parsing algorithms and schemes, including temporal segmentation, video abstraction, shot comparison and soundtrack analysis. Section 3 presents briefly video content representation schemes in terms of visual features, objects and motions. Section 4 describes tools for content-based video retrieval and browsing and Section 5 reviews some research prototypes and application systems. Section 6 summarizes some current research issues.

2. Temporal Partition of Video Sequences
The basic temporal unit for indexing and manipulation of video is shots, consisting of one or more frames recorded contiguously and representing a continuous action in time and space. A collection of one or more adjoining shots that focus on an object or objects of interest may comprise a scene. Temporal partitioning is the process of detecting boundaries between every two consecutive shots, so that a sequence of frames belonging to a shot will be grouped together. Hence, temporal partitioning is the first step in parsing video, and has been one of the first issues addressed by many researchers in video content analysis. The temporal partitioning problem is also important for applications other than video indexing and editing. It also figures in performing motion compensated video compression, where motion vectors must be computed within segments, rather than across segment boundaries [15]. However, accuracy is not as crucial for either of these applications; for example, in compression, false positives only increase the number of reference frames. On the other hand, in video segmentation for indexing, such false positives would have to be corrected by manual intervention. Therefore, high accuracy is a more important requirement in automating the partitioning process for video content analysis. There are a number of different types of transitions or boundaries between shots. The simplest transition is a cut, an abrupt shot change which occurs between two consecutive frames. More sophisticated transitions include fades, dissolves and wipes, etc. [16]. A fade is a slow change in brightness of images usually resulting in or starting with a solid black frame. A dissolve occurs when the images of the first shot get dimmer and the images of the second shot get brighter, with frames within the transition showing one image superimposed on the other. A wipe occurs when pixels from the second shot replace those of the first shot in a regular pattern such as in a line from the right edge of the frames. A robust partitioning algorithm should be able to detect all these different boundaries with good accuracy.
The basis of detecting shot boundaries is the fact that consecutive frames on either side of a boundary generally display a significant change in content. Therefore, what is required is some suitable quantitative measure which can capture the quantitative difference between such a pair of frames. Then, if that difference exceeds a given threshold, it may be interpreted as indicating a shot boundary. Hence, establishing suitable metrics is the key issue in automatic partitioning. The optimal metric for video partitioning should be able to detect the following three different factors of image change:
• Shot change, abrupt or gradual;
• Motion, including that introduced by both camera operation and object motion;
• Luminosity changes and noise.
In this section, a number of algorithms for temporal partitioning of video data either in its original format or in compressed domain representations will be presented.
2.1. Shot Cut Detection Metrics using Original Format of Digital Video Data

Figure 2 illustrates a sequence of three consecutive video frames with a cut occurring between the second and third frames. The significant difference in content is readily apparent. If that difference can be expressed by a suitable metric, then a segment boundary can be declared whenever that metric exceeds a given threshold. The major difference among the variety of automatic video partitioning algorithms is the difference metrics used to quantitatively measure changes between consecutive frames and the schemes used to apply these metrics. Difference metrics used in partitioning can be divided into two major types: those based on local pixel feature comparison, such as pixel values and edges, and those based on global features such as pixel histograms and statistical distributions of pixel-to-pixel change. These types of metrics may be implemented with a variety of different modifications to accommodate the idiosyncrasies of different video sources and have been successfully used in shot boundary detection.
Fig. 2. Three frames across a sharp cut.
2.1.1. Pixel value comparison
A simple way to detect a qualitative change between a pair of frames is to compare the spatially corresponding pixels in the two frames to determine how many have changed. This approach is known as pair-wise pixel comparison. In the simplest case of monochromatic images, a pixel is judged as changed if the difference between its intensity values in the two frames exceeds a given threshold t. This algorithm simply counts the number of pixels changed from one frame to the next. A shot boundary is declared if more than a given percentage of the total number of pixels (given as a threshold T) have changed. A potential problem with this metric is its sensitivity to camera movement. For instance, in the case of camera panning, a large number of objects will move in the same direction across successive frames; this means that a large number of pixels will be judged as changed even if the pan entails a shift of only a few pixels. To make the detection of camera breaks more robust, instead of comparing individual pixels, we can compare corresponding regions (blocks) in two successive frames. One such approach applies motion compensation at each block [8]. That is, each frame is divided into a small number (e.g. 12 in [8]) of non-overlapping blocks. Block matching within a given search window is then performed to generate a motion vector and match value, normalized to lie in the interval [0, 1] with zero representing a perfect match. A common block matching approach used in video coding is that for each block in frame j, find the best fitting region in frame (j + 1) in a neighborhood of the corresponding block according to the matching function
Diff_i(dx, dy) = (1 / (m n F_max)) Σ_{k=1}^{m} Σ_{l=1}^{n} |F_j(k, l) - F_{j+1}(k + dx, l + dy)|        (2.1)

where:
F_j(k, l) represents the value at pixel (k, l) of an (m x n) block in the current frame j; F_{j+1}(k, l) represents the value at pixel (k, l) of the same (m x n) block in the next frame (j + 1); (dx, dy) is a vector representing the search location, and the search space is dx = {-p, ..., +p} and dy = {-p, ..., +p}; F_max is the maximum pixel value of frames and is used to normalize the matching function. The set of (dx, dy) that produces the minimum value of Diff defines the motion vector from the center of block i in the current frame to the next frame. Then, the difference between two frames can be defined as

D_p(j, j + 1) = Σ_{i=1}^{K} c_i R_i        (2.2)
where i is the block number in a frame, K is the total number of blocks, R_i is the match value for block i, which equals the minimum Diff in the given search space, and c_i is a set of predetermined weights for each block. To eliminate the effect of noise, the 2 highest and 2 lowest match values are discarded in calculating D_p. A cut is declared if D_p exceeds a given threshold.

2.1.2. Histogram comparison

An alternative to comparing corresponding pixels or blocks in successive frames is to compare some statistical and global features of the entire image. One such feature is the histogram of intensity levels. The principle behind this algorithm is that two frames having an unchanging background and objects will show little difference in their respective histograms. The histogram comparison algorithm should be less sensitive to object motion than the pair-wise pixel comparison algorithm, since it ignores the spatial changes in a frame. Let H_f(i) denote the histogram value for frame f, where i is one of the G possible pixel levels; then, the difference between frame f and its successor (f + 1) may be given by the following χ²-test formula:
D_h(f, f + 1) = Σ_{i=1}^{G} (H_f(i) - H_{f+1}(i))² / H_{f+1}(i)        (2.3)
If the overall difference D_h is larger than a given threshold T, a segment boundary is declared. To be more robust to noise, each frame is divided into a number of regions of the same size, e.g. 16 regions. That is, instead of comparing global histograms, histograms for corresponding regions in the two frames are compared and the 8 largest differences are discarded to reduce the effects of object motion and noise [4].
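A compact Python version of the global histogram metric of Eq. (2.3) is sketched below; the bin count, the 8-bit value range and the guard against empty bins are choices of the sketch rather than of the original algorithm.

import numpy as np

def chi2_histogram_difference(frame_a, frame_b, bins=64):
    """Chi-square-style inter-frame difference of Eq. (2.3) between the
    grey-level histograms of two frames (2-D arrays of 8-bit pixel values)."""
    ha, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
    denom = np.where(hb > 0, hb, 1)  # avoid division by zero for empty bins
    return float(np.sum((ha - hb) ** 2 / denom))

# A segment boundary is declared whenever the returned value exceeds the
# threshold T chosen for the video source; for the regional variant, the same
# function is applied to each of the 16 corresponding regions.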
2.1.3. Edge pixel comparison

Edges in images provide useful information about the image content, and changes in edge distributions between successive frames are a good indication of content changes. When a cut or a gradual transition occurs, new intensity edges appear far from the locations of old edges and, similarly, old edges disappear far from the locations of new edges. Based on this observation, an effective video partitioning algorithm which can detect both cuts and gradual transitions has been developed as described below [9]. First, we define an edge pixel in the current frame that appears far from (at a distance r) any existing edge pixel in the last frame as an entering edge pixel, and an edge pixel in the last frame that disappears far from any existing edge pixel in the current frame as an exiting edge pixel. Then, by counting the fraction of entering edge pixels (ρ_in) and that of exiting edge pixels (ρ_out) over the total number of pixels in a frame, respectively, we can detect transitions between two shots. That is, the difference metric between frames f and f + 1 can be defined as:

D_e(f, f + 1) = max(ρ_in, ρ_out)        (2.4)
D_e(f, f + 1) will assume a high value across shot boundaries and generate peaks in the time sequence. Once a peak is detected, it can be further classified as corresponding to a cut or a gradual transition, since cuts usually correspond to sharp peaks occurring over 2 or 3 frames, while gradual transitions usually correspond to low but wide peaks over a larger number of consecutive frames. Experiments have shown that this algorithm is very effective in detecting both sharp cuts and gradual transitions. However, since it requires more computation, this algorithm is slower than others.

2.2. Gradual Transition Detection using Original Format of Digital Video Data

Cuts are the simplest shot boundary and are easy to detect using the difference metrics described above. Figure 3 illustrates a sequence of inter-frame differences resulting from the histogram comparison. It is easy to select a suitable cutoff threshold value (such as 50) for detecting the two cuts represented by the two high pulses. However, sophisticated shot boundaries such as dissolve, wipe, fade-in, and fade-out are much more difficult to detect since they involve more gradual changes between consecutive frames than does a sharp cut. Furthermore, changes resulting from camera operations may be of the same order as those from gradual transitions, which further complicates the detection. Figure 4 shows five frames from a typical dissolve: the last frame of the current shot just before the dissolve begins, three frames within the dissolve, and the frame in the following shot immediately after the dissolve. The actual dissolve occurs across about 30 frames, resulting in small changes between every two consecutive frames in the dissolve. The sequence of inter-frame differences of this dissolve defined by the histogram calculation is displayed in the inset of the graph shown in Fig. 3; its values are higher than those of their neighbors but significantly lower than the cutoff threshold. This sequence illustrates that gradual transitions will downgrade the power of a simple difference metric and a single threshold for camera break detection algorithms.
Fig. 3. A sequence of histogram based inter-frame differences.
Fig. 4. An example of dissolve sequences.
The simplest approach to this problem would be to lower the threshold. Unfortunately, this cannot be effectively employed, because noise and other sources of changes often introduce the same order of difference between frames as gradual transitions do, resulting in "false positives." In this subsection, we discuss four video partition algorithms which are capable of detecting gradual transitions with acceptable accuracy while achieving very high accuracy in detecting sharp cuts. They are: Zhang's twin-comparison algorithm, Aigrain's algorithm based on the distribution of pixel changes, Zabih's edge comparison algorithm and Hampapur's editing model based algorithm.

2.2.1. Twin-comparison approach

This algorithm was the first published one which achieves high accuracy in detecting both cuts and gradual transitions. As shown in Fig. 4 it is obvious that the first and the last frame across the dissolve are different, even if all consecutive frames are very similar. In other words, the difference metric with the threshold as shown in Fig. 3 would still be effective if it were applied to the comparison between the first and the last frames directly. Thus, the problem becomes one of detecting these first and last frames. If they can be determined, then the period of gradual transition can be isolated as a segment unto itself. If we look at the inset of Fig. 3, it can be noticed that the difference values between most of the frames during the dissolve (as well as wipes and fades) are higher, although only slightly, than those in the preceding and following segments. What is required is a threshold value which can detect this sequence and distinguish it from an ordinary camera shot. Based on this observation, the twin-comparison algorithm was developed, which introduces two comparisons with two thresholds. The algorithm uses two thresholds: Tb for sharp cut detection in the same manner as was described in the last subsection, and a second and lower threshold Ts introduced for gradual transition detection. As illustrated in Fig. 5, whenever the
Fig. 5. Twin-comparison approach.
difference value exceeds Tb, a camera cut is declared, e.g. Fs in Fig. 5. However, the twin-comparison also detects differences which are smaller than Tb but larger than Ts. Any frame which exhibits such a difference value is marked as the potential start (Fs) of a gradual transition. Such a frame is labeled in Fig. 5. This frame is then compared against subsequent frames, which is called an accumulated comparison. The end frame (FE) of the transition is detected when the difference between consecutive frames falls below threshold Ts while the accumulated difference exceeds Tb. Note that the accumulated comparison needs only to be computed when the difference between consecutive frames exceeds Ts. If the consecutive difference value drops below Ts before the accumulated comparison value exceeds Tb, then the potential start point is dropped and the search continues for other gradual transitions. A potential problem with this algorithm is that camera panning and zooming and large object motion may introduce gradual changes similar to those of gradual transitions, which will result in "false positives". This problem can be solved by global motion analysis as presented later in Section 2.4. That is, every potential transition sequence detected will be passed to a motion analysis process to further verify whether it is actually a global motion sequence [5]. Experiments show that the twin-comparison algorithm is very effective and achieves a very high level of accuracy.

2.2.2. Pixel change classification

Another algorithm for detecting both cuts and gradual transitions is derived based on the statistics of pixel value changes between two consecutive frames [6]. It is assumed that the inter-frame pixel values change due to a combination of sources: first, a small amplitude additive zero-mean Gaussian noise, modeling camera, tape and digitization noise sources; second, changes of pixels resulting from object or camera operation and lighting within a given shot; and third, changes caused by cuts and gradual transitions. According to analytical models for each of these
changes, cuts can be found by looking at the number of pixels whose change of value between two consecutive frames falls in the range {128, 255}. Dissolves and fades to/from white/black can be identified by the number of pixels whose change of value between two consecutive frames falls in the range {7, 40} for 8-bit coded gray-level images. However, changes of pixel value resulting from wipes are also in the range {128, 255}; thus, wipes may not be detected reliably. This may be solved by further looking at the spatial distributions of pixels whose changes are in the range {128, 255}, since during wipes each frame will have a portion of the current shot and a portion of the new shot; thus, the changes usually occur in the boundary areas of the two portions. Based on these statistics, sharp cuts and gradual transitions can be detected by examining the difference ranges of corresponding pixels in two consecutive frames. This algorithm is not designed to detect cuts or transitions based on only two consecutive frames, but incorporates a temporal filtering over a sequence of consecutive frames. Also, histogram equalization needs to be applied to frames for wipe and cut detection, which slows down the detection process.
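The core counting step of this classification is easy to state in code. The sketch below counts pixels whose absolute inter-frame change falls in the two ranges quoted above and leaves out the temporal filtering and histogram equalization steps; the range boundaries follow the text, while everything else is an illustrative assumption.

import numpy as np

def classify_pixel_changes(frame_a, frame_b):
    """For two 8-bit grey-level frames, return the number of pixels whose
    change lies in {7..40} (typical of dissolves and fades) and in
    {128..255} (typical of cuts and wipes)."""
    d = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
    n_gradual = int(np.sum((d >= 7) & (d <= 40)))
    n_cut = int(np.sum(d >= 128))
    return n_gradual, n_cut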
2.2.3. Edge pixel comparison

The edge pixel comparison algorithm as defined by Eq. (2.4) can also be applied in detecting gradual transitions. This is because gradual transitions usually introduce relatively low but wide peaks of D_e(f, f + 1) values over a number of consecutive frames, different from the sharp and narrow peaks for cuts. Fades and dissolves can be distinguished from each other by looking at the relative values in local regions. During a fade-in, ρ_in will be much higher than ρ_out since there should be more entering edge pixels and fewer exiting edge pixels while a new shot progressively appears in the frames from black. In contrast, during a fade-out, ρ_out will be much higher than ρ_in since the current shot progressively disappears into black frames. A dissolve, on the other hand, consists of an overlapping fade-in and fade-out: during the first half of the dissolve, ρ_in will be greater while during the second half, ρ_out will be greater. Wipes can be distinguished from dissolves and fades by looking at the spatial distribution of entering and exiting pixels, since frames of a wipe sequence usually have a portion of the current shot and a portion of the new shot. Therefore, if we take the location into account when calculating the fraction of changed edge pixels, wipes can be detected and distinguished from other types of transitions.
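A rough Python sketch of the underlying metric of Eq. (2.4) on precomputed edge maps is given below; the choice of edge detector, the binary dilation used to implement "within distance r", and the default r are assumptions of this sketch.

import numpy as np
from scipy.ndimage import binary_dilation

def edge_change_fractions(edges_prev, edges_curr, r=4):
    """edges_prev, edges_curr: boolean edge maps of two consecutive frames.
    rho_in  = fraction of pixels that are edges in the current frame but lie
              farther than r from any edge of the previous frame;
    rho_out = fraction of pixels that are edges in the previous frame but lie
              farther than r from any edge of the current frame.
    Returns (rho_in, rho_out, max(rho_in, rho_out))."""
    near_prev = binary_dilation(edges_prev, iterations=r)
    near_curr = binary_dilation(edges_curr, iterations=r)
    n = edges_prev.size
    rho_in = float(np.sum(edges_curr & ~near_prev)) / n
    rho_out = float(np.sum(edges_prev & ~near_curr)) / n
    return rho_in, rho_out, max(rho_in, rho_out)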
2.2.4. Editing model fitting

Hampapur et al. have studied algorithms for detecting different types of gradual transitions by fitting sequences of inter-frame changes to editing models, one for each given type of gradual transition [10]. However, the potential problem with such model based algorithms is that as more and more different types of editing effects (which still fall mainly into three basic classes: dissolve, wipe and fade) become available, it is hard to model each one of them. Furthermore, transition sequences may
not follow any particular editing model, possibly due to noise and/or combinations of editing effects. Such problems may exist in other detection algorithms as well, though this particular algorithm may be more prone to this problem.

2.3. Video Partitioning Algorithm using Compressed Domain Representation
As JPEG, MPEG and H.26X [15] have become industrial standards, more and more video data have been and will continue to be stored and distributed in one of these compressed formats. It would, therefore, be advantageous for the tools we envisage to operate directly on compressed representations, saving the computational cost of decompression. More importantly, the compressed domain representations of video defined by these standards provide features which may be more effective in detecting content changes. DCT (Discrete Cosine Transform) coefficients are the basic compressed domain features encoded in JPEG, MPEG and H.26X. Another important feature encoded in the latter two standards is motion vectors. These are the two main features to be utilized in algorithms for video partitioning, with the most effective algorithms being those that combine both features. In this subsection, we discuss in detail three basic types of algorithms for video partitioning using compressed video data.
In general, compression of a video frame begins by dividing a frame into a set of 8 x 8 pixel blocks [15], as shown in Fig. 6. The pixels in the blocks are then transformed by the forward DCT (FDCT) into 64 DCT coefficients. That is,
where C ( k ) = 1 / a if k = 0 and 1 otherwise. F ( u , w )are DCT coefficients and p(2, y) is value of pixel (2,y) in a block. F(0,O) is the DC term or DC coefficient of a 8x8 block
+ X
Y Original pixels
DCT coefficients
Fig. 6. 8 x 8 block based DCT and zig-zag encoding.
956
H. J. Zhang
block, which is the average value of the 64 pixels, and the remaining 63 coefficients are termed the AC coefficients. These DCT coefficients are then quantized and encoded in a zig-zag order by placing the low frequency coefficients before the high frequency coefficients as shown in Fig. 6 . The coefficients are finally Huffman entropy encoded. The process can then be reversed for decompression. Since the DCT coefficients are mathematically related to the spatial domain and represent the content of each frame, they can be used to detect the difference between two video frames. Based on this idea, the first DCT comparison metric for partitioning JPEG videos was developed by Arman et al. [ll].In this algorithm, a subset of the blocks in each frame (e.g. take out the boundary blocks of a frame and use every other one of the rest) and a subset of the DCT coefficients for each block were used as a vector representation (V’) for each frame. That is,
V’ = (co,c1,. . . , c i , . .. ,cm)
(2.6)
where ci is the ith coefficients of the selected subset. The members of V’, are randomly distributed among all AC coefficients. One way to choose the subset is to use every other coefficient. The members of V’ remain the same throughout the video sequence to be segmented. The difference metric between frames is then defined by content correlation in terms of a normalized inner product:
DDCTC= 1 - IV’ .VS+rpl/lVfl lVj+rpl
(2.7)
where cp is the number of frames between the two frames being compared. It has been observed that for detecting shots boundaries, DC components of DCT’s of video frames provide sufficient information [18]. Based on the definition of DCT, this is equivalent to a low resolution version of frames, averaged over 8 x 8 non-overlap blocks [15]. Applying this idea makes calculation of (2.5) much faster while maintaining similar detection accuracy. Using DC sequences extracted from JPEG or MPEG data also makes it easy to apply histogram comparison. That is, each block is treated as a pixel with its DCT value as the pixel value, then, histograms of DCT-DC coefficients of frames calculated and compared using metrics (2.3). This algorithm has been approved to be very effective, achieving both high detection accuracy and speed in detecting sharp cuts [19]. The DCT based metrics can be directly applied to JPEG video, where every frame is intra-coded. However, in MPEG, temporal redundancies are reduced by applying block-based motion compensation techniques, while spatial redundancies are reduced by block-based intra-coding as in JPEG. Therefore, as shown in Fig. 6, in MPEG, there are three types of frames used: intra(1)-framed, predicted (P) frames and bi-directional predicated and interpolated (B) frames. Only the DCT coefficients of I frames are transformed directly from original images, while for P and B frames DCT coefficients are in general residual errors from motion compensated prediction. This means that DCT based metrics can only be applied in
5.5 Video Content Analysis and Retrieval
957
comparing I frames in MPEG video. Since only a small portion of frames in MPEG axe I frames, this significantly reduces the amount of processing which goes into computing differences. On the other hand, the loss of temporal resolution between I frames will introduce a large fraction of false positives in video partitioning which have to be handled with subsequent processing. 2.3.2. Motion vector based segmentation Apart from pixel value, motion resulting from either moving objects, camera operations or both represents another important visual content in video data. In general, the motion vectors should show continuity between frames within a camera shot and show discontinuity between frames across two shots. Thus, a continuity metric for a field of motion vectors should serve as a n alternative criterion for detecting segment boundaries. The motion vectors between video frames are in general obtained by block matching between consecutive frames, which is a n expensive process. However, if the video data are compressed using either MPEG standards, motion vectors can be obtained from the bit streams of the compressed images. In MPEG data streams, as shown in Fig. 7, there is one set of motion vectors associated with each P frame representing the prediction from the last or next I frame; and there may be two sets of motion vectors associated with each B frame, forward and backward. That is, each B frame is predicted and interpolated from its preceding and succeeding I/P frames by motion compensation. If there is a significant change (discontinuity) in content between two frames, either two B frames, a B frame and an I/P frame or a P frame and an I frame, there will be many blocks in the frame in which the residual error from motion compensation is too high to tolerate. For those blocks, MPEG will not apply the motion compensation prediction but instead intra-code them using DCT. Subsequently, there will be either no or only a few motion vectors associated with those blocks in the B or P frames. Therefore, if there is a Forward Predktion
Fig. 7. Frame types in MPEG video.
Fig. 8. Camera break detection based on motion vectors - A sequence of numbers of motion vectors associated with B frames from a documentary video compressed in MPEG.
shot boundary falling in between two frames, there will be a smaller number of inter-coded but a larger number of intra-coded blocks, due to the discontinuity of the content between the frames. Based on this observation, we can detect a shot boundary by counting the numbers of inter-coded and intra-coded blocks in P and B frames [12]. That is, if

N_{inter}/N_{intra} < T_b    (2.8)

i.e. if the ratio of inter-coded to intra-coded blocks is lower than a given threshold, then a shot boundary is declared between the two frames. Figure 8 illustrates the values of N_{inter}/N_{intra} for a video sequence; in this case, the camera cuts are accurately represented as valleys below the threshold level. However, this algorithm may fail to detect gradual transitions, because the number of inter-coded blocks in frames during such a transition sequence is often much higher than the threshold.
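A minimal sketch of this criterion, assuming the per-frame counts of inter-coded and intra-coded blocks have already been obtained from an MPEG bitstream parser (not shown), could look as follows; the function name and the default threshold are illustrative assumptions.

def motion_based_boundaries(block_counts, t_b=0.25):
    # block_counts: list of (n_inter, n_intra) pairs, one per B or P
    # frame.  A shot boundary is declared at every frame whose ratio
    # N_inter / N_intra falls below the threshold T_b, as in (2.8).
    boundaries = []
    for i, (n_inter, n_intra) in enumerate(block_counts):
        ratio = n_inter / n_intra if n_intra > 0 else float('inf')
        if ratio < t_b:
            boundaries.append(i)
    return boundaries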
2.3.3. Hybrid approach

Combining the DCT based and motion based metrics into a hybrid algorithm will improve the detection accuracy as well as the processing speed in partitioning MPEG compressed video. That is, DCT-DC histograms of every two consecutive I frames are first compared to generate a difference sequence [12]. Since there is a large temporal distance between two consecutive I frames, it is assumed that if we set the threshold relatively low, all shot boundaries, including gradual transitions, will be detected by looking at the points where there is a high difference value. Of course, this first pass will also generate some false positives. Then, the B and P frames between two consecutive I frames which have been detected as potentially containing shot boundaries will be examined in a second pass using the motion-based metric (2.8). That is, the second pass is only applied to the neighborhood of the potential boundary frames. In this way, both high processing speed and detection accuracy
are achieved at the same time. The only potential problem with this hybrid algorithm is that it may detect false motion sequences as transition sequences, just like the twin-comparison algorithm, which requires a motion analysis based filtering process as discussed later. In summary, experimental evaluations have shown that compression domain feature based algorithms perform with at least the same order of accuracy as those using video data in the original format, though the detection of sharp cuts is more reliable than that of gradual transitions. On the other hand, compression feature based algorithms achieve much higher processing speed, which makes software-only real-time video partitioning possible.

2.4. Camera Operation and Object Motion Analysis
Camera operation and object motion are another important source of information and attributes for shot content analysis and classification. It is well known that camera operations and the framing of shots are elaborately done by directors/camera operators to present certain scenes or objects and to guide the viewer's attention. Object motions usually represent human activity and major events in video shots. Detection of camera work and object motion needs to be performed in at least the following five processes in video content analysis and representation:

- Motion-based temporal partitioning;
- Filtering false positives resulting from motion in gradual transition detection;
- Recovering global motion to construct salient stills for representing video contents;
- Selection of key-frames;
- Video content representation and motion based shot retrieval.
The first two applications have already been discussed in detail in the description of motion based video segmentation algorithms in the last section. The other three applications will be described in the next two sections. The scientific problem of camera work analysis resides in the discrimination between camera work-induced apparent motion and object motion-induced apparent motion, followed by analysis to identify particular camera works and describe object motion. These are classical and unsolved problems in computer vision. However, for our needs in video content analysis and representation, several algorithms can solve this problem with satisfactory accuracy and speed. Camera works include panning and tilting (horizontal or vertical rotation of the camera) and zooming (focal length change), in which the camera position does not change; and tracking and booming (horizontal and vertical transverse movement of the camera) and dollying (movement of the camera along its optical axis), in which the position of the camera does change, as well as combinations of these operations. The specific feature which serves to classify camera works is the motion field, as each particular camera operation will result in a specific pattern of the motion field [6]. Based on this, Zhang et al.
have developed a simple, yet effective approach to camera operation analysis which distinguishes the gradual transition sequences and classifies camera pan and zoom operations [6]. A more sophisticated quantitative approach to detecting camera operations uses the transformation relation between a point in space and its coordinates in image space [20]. The transformation is then used to derive the pixel coordinate changes when there is a pan or zoom. A simple zoom is caused by a change in the camera's focal length and there is no camera body movement. Camera pan, on the other hand, is caused by rotating the camera about an axis parallel to the image plane. The combined effect of panning and zooming can be expressed as
u' = f_z u + p    (2.9)

where f_z is called the zoom factor, p the pan vector, and u, u' the image coordinates of a point in the two frames being compared. We can derive f_z and p using a motion field calculated from two frames and an iterative algorithm [20], which yields the estimates given in Eqs. (2.10) and (2.11).
This approach not only detects camera pans, zooms, and their combinations but also describes them quantitatively. The price paid for this information is a significant increase in computation time. We can combine the above two algorithms, using the first algorithm to detect a potential camera operation while applying the second algorithm only to the potential frames. A limitation of the two algorithms or their combination is that when a sequence of frames is covered by a single moving object, panning will be falsely detected. More sophisticated motion detection algorithms for video content parsing include those based on discrete tomography for camera work identification [21] and visual icon construction [22]. The distribution of the angles of edges in 3-D tomography images resulting from video can be matched to camera work models, and camera motion classification and temporal segmentation can be obtained directly. Discrimination between pan and lateral traveling and between zoom and booming can be achieved only through a complete projective model including parallax analysis [21].
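To make the pan/zoom model of Eq. (2.9) concrete, the following sketch fits f_z and p to a set of matched point coordinates by least squares; this direct solution stands in for the iterative algorithm of [20] (whose exact form, Eqs. (2.10) and (2.11), is not reproduced here), and the function name and interface are assumptions.

import numpy as np

def estimate_zoom_pan(points, displaced):
    # points, displaced: (N, 2) arrays holding u and u' for matched
    # image positions (e.g. block centres and their motion-compensated
    # positions).  Least-squares fit of u' = f_z * u + p.
    u = np.asarray(points, dtype=float)
    v = np.asarray(displaced, dtype=float)
    u_mean, v_mean = u.mean(axis=0), v.mean(axis=0)
    du, dv = u - u_mean, v - v_mean
    f_z = (du * dv).sum() / (du * du).sum()   # scalar zoom factor
    p = v_mean - f_z * u_mean                 # 2-D pan vector
    return f_z, p

A pure zoom then shows up as f_z clearly different from 1 with p close to zero, while a pure pan gives f_z close to 1 and a large pan vector.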
3. Visual Abstraction of Video

Considering the large amount of data in video, it is critical to offer means for quick relevance assessment of video documents. How can we spend only a few minutes to
view an hour of video and still have a fairly correct perception of its contents? In other words, how can we map an entire segment of video to a small number of representative frames or images? Obviously, what we need is a representation that presents information on the landscape or structure of a video in a more abstract manner. We call this the video content abstracting problem. In this section, we discuss three approaches to visual abstraction of video data: key-frames, video icons and skimmed highlights.
3.1. Key-Frame Extraction

Key-frames are still images extracted from original video data which best represent the content of video shots in an abstract manner. Key-frames have been frequently used to supplement the text of a video log [23], but there has been little work on identifying them automatically. Apart from browsing, key-frames can also be used in representing video in retrieval: a video index may be constructed based on visual features of key-frames, and queries may be directed at key-frames using query-by-image-content techniques [25,26]. In some prototype systems and commercial products, the first frame of each shot has been used as the only key-frame to represent the shot content. However, while such a representation does reduce the data volume, its representation power is very limited, since it often does not give a sufficient clue as to what actions are presented by a shot, except for shots with no change or motion. Key-frame based representation views video abstraction as a problem of mapping an entire segment (both static and motion content) to some small number of representative images. The challenge is that the extraction of key-frames needs to be automatic and content based, so that they maintain the important content of the video while removing all redundant information. In theory, semantic primitives of video, such as interesting objects, actions and events, should be used. However, such general semantic analysis is not currently feasible, especially when information from soundtracks and/or closed captions is not available. In practice, we have to rely on low-level image features and other readily available information instead. An approach to key-frame extraction based on low-level video features has been proposed by Zhang et al. [12,26,27]. This approach determines a set of key-frames for each shot according to the following steps.
Segmentation: Key-frames will be extracted at the shot level, based on features and content information of a shot. Given a shot, the first frame will always be selected as the first key-frame; whether more than one key-frame needs to be chosen will depend on the following two criteria.
Color feature based frame comparison: After the first key-frame is selected, following frames in the shot will be compared against the last key-frame sequentially as they are processed, based on their similarities defined by a color histogram. If a significant content change occurs between the current frame and the last
key-frame, the current frame will be selected as a new key-frame. Such a process will be iterated until the last frame of the shot is reached. In this way, any significant action in a shot will be captured by a key-frame, while static shots will result in only one key-frame.
Motion based selection: Dominant or global motion resulting from camera operations and large moving objects is the most important source of content changes and thus an important input for key-frame selection. Color histogram representation often does not capture such motion quantitatively, due to its insensitivity to motion. For key-frame extraction, it is necessary to detect sequences involving two types of global motion: panning and zooming. In reality, there are more than just these two types of motion in video sequences, such as tilting, tracking, and dollying. But, due to their similar visual effects, at least in key-frame selection, camera panning, tilting and tracking, as well as horizontal and vertical motion of large objects, are treated as one type of motion: panning-like. Similarly, camera zooming and dollying and perpendicular (to the imaging plane) motion of large objects are treated as another type: zooming-like. To select key-frames representing these two types of motion, there are two criteria. For a zooming-like sequence, at least two frames will be selected - the first and last frames - since one will represent a global view, while the other will represent a more focused view. For a panning-like sequence, the number of frames to be selected will depend on the scale of panning: ideally, the spatial context covered by each frame should have little overlap, or each frame should capture a different, but sequential, part of the object activities. The overlap ratio can be varied to determine the density of key-frame selection. Figure 9 shows an example in which three key-frames from a tilt-up shot were extracted automatically. From the three key-frames one can see clearly that it is a tilt-up sequence, which is impossible to see from any single key-frame. In this respect, extracting three key-frames is a more adequate abstraction than only a single key-frame. Although any one of the three key-frames will capture the visual content of the shot - business district buildings - it will not be able to show the camera movement, which is important for users (especially producers and editors) who want to choose shots from stock footage.
Fig. 9. Examples of key-frames extracted automatically from a shot.
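The color-based criterion above can be sketched as follows, assuming frames are decoded RGB arrays; the coarse 8 x 8 x 8 histogram, the L1 difference and the threshold are illustrative choices, and motion-based selection is omitted.

import numpy as np

def color_histogram(frame, bins=8):
    # Joint RGB histogram with bins**3 cells, normalised to sum to 1.
    hist, _ = np.histogramdd(frame.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def extract_key_frames(frames, threshold=0.4):
    # Sequential key-frame selection for one shot: the first frame is
    # always a key-frame; a new key-frame is declared whenever the
    # histogram difference to the last key-frame exceeds the threshold.
    key_frames = [0]
    last_hist = color_histogram(frames[0])
    for i in range(1, len(frames)):
        hist = color_histogram(frames[i])
        if np.abs(hist - last_hist).sum() > threshold:
            key_frames.append(i)
            last_hist = hist
    return key_frames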
3.2. Video Icon
Another effective way of representing the shot content of video for browsing is to use static icons, which has attracted much research work, with two major approaches:

- Construction of a visual icon based on a key-frame, supplemented with pseudo-depth for the representation of the duration of the shot, and perhaps arrows and signs for the representation of object and camera motion;
- Synthesis of an image representing the global visual contents of the shot.

The first approach has been favored when the emphasis is on building a global structured view of a video document, suited for quick visual browsing, such as in the IMPACT system [5]. Some researchers have used icon spacing or image size instead of pseudo-depth for representing the duration of the shot, but this does not result in an efficient use of screen space. Teodosio and Bender [26] have proposed methods for the automatic construction of an overview image representing the visible contents of an entire shot. Using camera work analysis and the geometrical transformations associated with each camera motion, successive images are mapped into a common frame, and the synthetic image is progressively built. This image is not generally rectangular. Recently, Irani et al. have perfected this type of method on two points [27]:

- They use a more complete projective model, including parallax;
- They have shown that it is possible to compute what they call dynamic mosaic images, with emphasis being given to the moving parts of the image (action) instead of background oriented images.
The resulting images have been termed salient stills [26], mosaic images [27], micons (motion icons) or VideoSpaceIcons [22].

3.3. Video Skimming

Video skimming is the scheme for answering the request of abstracting an hour of video into, for instance, 5-minute highlights that give a fair perception of the video contents. This is a relatively new research area and requires a high level of content analysis. A successful approach is to utilize information from text analysis of the video soundtrack. Researchers working on documents with textual transcriptions [e.g. 30] have suggested producing video abstracts by first abstracting the text using classical text skimming techniques and then looking for the corresponding parts in the video. A successful application of this type of approach has been the Informedia project, in which text and visual content information are fused to identify video sequences that highlight the important contents of the video [29]. More specifically, low-level and
mid-level visual features, including shot boundaries, human faces, camera and object motion and subtitles of video shots, are integrated with keywords, spotted from text obtained from closed captions and speech recognition, using the following procedures:

- Keyword selection using the well-known TF-IDF technique to skim the audio (a scoring sketch is given after this list);
- Sequence characterization by low-level and mid-level visual features;
- Selection of a number of keywords according to the required skimming factor;
- Prioritization of image sequences located in close proximity to each selected keyword: (1) frames with faces or text; (2) static frames following camera motion; (3) frames with camera motion and human faces or text; (4) frames at the beginning of the scene;
- Composition of a skimmed highlight sequence with the selected frames.
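The keyword-selection step can be sketched as follows, assuming the transcript has already been segmented and tokenized; the function name, the scoring details and the use of top_k as the skimming factor are assumptions, not the Informedia implementation.

import math
from collections import Counter

def tf_idf_keywords(transcript_segments, top_k=10):
    # transcript_segments: list of word lists, one per video segment
    # (e.g. from closed captions or speech recognition).
    n = len(transcript_segments)
    df = Counter()
    for words in transcript_segments:
        df.update(set(words))          # document frequency of each word
    keywords = []
    for words in transcript_segments:
        tf = Counter(words)
        scores = {w: tf[w] * math.log(n / df[w]) for w in tf}
        ranked = sorted(scores, key=scores.get, reverse=True)
        keywords.append(ranked[:top_k])
    return keywords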
Experiments using this skimming approach have shown impressive results on limited types of documentary video which have very explicit speech or text (closed caption) contents, such as educational video, news or parliamentary debates [31]. However, satisfactory results may not be achievable when applying such a text (keyword) driven approach to other videos with a soundtrack containing more than just speech, or to stock footage without a soundtrack.

4. Video Content Representation and Similarity
After partitioning and abstraction, the next step in video analysis is to identify and compute representation primitives, based on which the content of shots can be indexed, compared, and classified. Ideally these should be semantic primitives that a user can employ to define interesting or significant events. These semantic primitives include the constituent objects' names, appearance and motion, as well as relationships among different objects at different times and the contributions of all these attributes and relationships to the story being presented in a video sequence [32]. However, automatic extraction of such primitives is not feasible, so we have to build content representation based on low-level features, such as color, texture and motion statistics of shots. The first set of low-level visual primitives for video content representation should be extracted from key-frames. However, such a representation alone will be insufficient to support event-based classification and retrieval, since key-frame based features capture mostly the spatial information, while motion is an essential and unique feature of video. Therefore, the second type of primitives should be based on temporal variation and motion information in shots. With these two types of representations, we can then index video shots and define the shot similarity used in video retrieval.
4.1. Key-Frames Based Features for Shot Content Representation
Key-frame based representation of video content uses the same features as those for content based still image retrieval. These features include color, texture, and shape, which may be defined in different formats and extracted by different operations [3,14,31].
Color features: Color has been one of the first choices for image content representation and similarity measurement, since it has excellent discrimination power in measuring image similarity (see Chap. 2.4 for a detailed discussion): it is very rare that two images of totally different objects will have similar colors [32]. The popular representation schemes for color include histograms [14,32,33], dominant colors [32,33], and statistical moments [34]. To make the representation effective and invariant to illumination conditions, different color spaces have been evaluated; it has been concluded that the L*u*v* color space tends to be the best one [33]. Also, it has been noticed that a small number of color ranges capture the majority of pixels in most images; thus, a few dominant colors lead to a good approximate representation of the color distribution.

Texture features: Texture has long been recognized as being as important a property of images as color, if not more so, since textural information can be conveyed as readily with gray-level images as it can in color. A detailed discussion of the definition and calculation of a variety of texture measures is given in Chap. 2.1. Among many alternatives, the most popular and effective texture models used in image retrieval are the Tamura features (contrast, directionality, and coarseness) [35] and the Simultaneous Auto-Regressive (SAR) model [36]. To define a rich perceptive space, Picard and Liu [37] have shown that it is possible to do so by using the Wold decomposition of the texture considered as a luminance field. One gets three components (periodic, evanescent and random) corresponding to the bi-dimensional periodicity, mono-dimensional orientation, and complexity of the analyzed texture.

Shape features: Dominant objects in key-frames represent important semantic content and are best represented by their shapes, if they can be identified by either automatic or semi-automatic spatial segmentation algorithms. A proper definition of shape similarity calls for a distinction between shape similarity in images (similarity between the actual geometrical shapes appearing in the images) and shape similarity between the objects depicted by the images, i.e. similarity modulo a number of geometrical transformations corresponding to changes in viewing angle, optical parameters and scale. In some cases, one wants to include even deformations of non-rigid bodies [38]. Even for the first type of similarity, it is desirable to use shape representations which are scale independent, based on curvature, angle statistics and contour complexity. Systems such as QBIC [14] use circularity, eccentricity, major axis orientation (not angle-independent) and algebraic moments.
As for color and texture, the present schemes for shape similarity modeling face serious difficulties when images include several objects or background. A
preliminary segmentation as well as modeling of spatial relationships between shapes is then necessary (are we interested in finding images where one region represents a shape similar to a given prototype, or to some spatial organization of several shapes?).

4.2. Temporal Features for Representation of Shot Content
Though a set of key-frames will represent the temporal content of shots to some extent, a more precise representation of shot content should incorporate motion features, apart from the features of static images discussed above. In response to such a requirement, a set of statistical measures of motion features of shots has been proposed and applied in news anchor detection and in shot clustering for browsing and annotation [3,39]. However, defining more quantitative measures of shot similarity that capture the nature of motion video still remains a challenging research topic.
Temporal variation and camera operations: The means and variances of the average brightness and of a few dominant colors calculated over all frames in a shot may be used to represent quantitative temporal variations of brightness and colors. For example, such temporal variations have been used to classify news video clips into anchorperson shots and news shots [41]. Motion information resulting from the algorithms presented in Section 3 can be used to classify video sequences into static and motion sequences.

Statistical motion features: Since motion features have to roughly match human perception and it is still not clear how humans describe motions, one may have to base the motion representation on statistical motion features, rather than object trajectories [3]. More specifically, these features include directional distributions of motion vectors and average speeds in different directions and areas, which may be derived from the optical flow calculated between consecutive frames [3,39].
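As an illustration, the following sketch computes one such statistic, the average speed in M angular sectors of a dense optical-flow field; the sector count and the layout of the flow array are assumptions.

import numpy as np

def motion_statistics(flow, m_directions=8):
    # flow: optical-flow field of shape (H, W, 2) between two frames.
    # Returns the average speed in each of m_directions angular sectors.
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    speed = np.hypot(dx, dy)
    angle = np.arctan2(dy, dx)                       # in [-pi, pi]
    sector = ((angle + np.pi) / (2 * np.pi) * m_directions).astype(int)
    sector = np.clip(sector, 0, m_directions - 1)
    avg_speed = np.zeros(m_directions)
    for k in range(m_directions):
        mask = (sector == k)
        avg_speed[k] = speed[mask].mean() if mask.any() else 0.0
    return avg_speed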
To obtain localized motion information, we can also calculate the average speed and its variance in blocks into which the frames are uniformly divided. That is, instead of calculating average speeds in M directions for the entire frame, we calculate a set of motion statistics in M blocks of each frame. Then, the motion based comparison of shot contents is based on comparing the motion statistics of corresponding blocks in consecutive frames.

4.3. Shot Similarity
The visual features presented above provide content representations of shots, but the goal is to define shot similarity based on these representations, to enable shot comparison or clustering for video retrieval and browsing, as discussed in Section 5. When key-frames are used as the representation of each video shot, we can define video shot similarity based on the similarities between the two key-frame sets. If two shots are denoted as S_i and S_j, and their key-frame sets as K_i = {f_{i,m}, m = 1, ..., M}
and K_j = {f_{j,n}, n = 1, ..., N}, then the similarity between the two shots can be defined as

S_k(S_i, S_j) = \max_{m,n} s_k(f_{i,m}, f_{j,n}),  m = 1, ..., M,  n = 1, ..., N,    (4.1)

where s_k is a similarity metric between two images defined by any one or a combination of the image features; there are in total M x N similarity values, from which the maximum is selected. This definition assumes that the similarity between two shots can be determined by the pair of key-frames which are most similar, and it will guarantee that if there is a pair of similar key-frames in two shots, they are considered similar. Another definition of key-frame based shot similarity is

S_k(S_i, S_j) = (1/M) \sum_{m=1}^{M} \max[ s_k(f_{i,m}, f_{j,1}), s_k(f_{i,m}, f_{j,2}), ..., s_k(f_{i,m}, f_{j,N}) ].    (4.2)
This definition states that the similarity between two shots is the sum of the most similar pairs of key-frames. When only one pair of frames matches, this definition is equivalent to the first one. The key-frame based similarity measures as defined above can be further combined with motion feature based measures to make the comparison more meaningful for video.
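Both definitions are straightforward to implement once a key-frame similarity s_k is available; the sketch below uses histogram intersection as s_k, which is only one possible choice.

import numpy as np

def frame_similarity(h1, h2):
    # Histogram intersection between two key-frame feature histograms;
    # any image similarity metric s_k could be substituted here.
    return float(np.minimum(h1, h2).sum())

def shot_similarity_max(ki, kj):
    # Eq. (4.1): maximum similarity over all M x N key-frame pairs.
    return max(frame_similarity(f, g) for f in ki for g in kj)

def shot_similarity_avg(ki, kj):
    # Eq. (4.2): for each key-frame of shot i, take its best match in
    # shot j, then average over the M key-frames of shot i.
    return sum(max(frame_similarity(f, g) for g in kj) for f in ki) / len(ki)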
4.4. Summary

This section has discussed a set of visual features for shot content representation and similarity computation for shot comparison. It should be noted that each of the features presented above represents a particular property of images or image sequences and is effective in matching those in which that particular feature is the salient one. Therefore, it is important to identify salient features for a given image or shot and to apply the appropriate similarity metric. Developing such algorithms remains a long term research topic. Also, how to optimally integrate the different features into a feature set that has the combined representation power of each feature is another challenging problem. Finally, how to link the low level features to high level semantics (presence of objects and faces, and their actions) remains, and will continue to remain for a long time, an open problem in computer vision.

5. Video Scene Analysis and Shot Clustering

There can be hundreds of shots in one hour of a typical video program. Thus, the production of a synoptical view of the video contents usable for browsing or for quick relevance assessment calls for the recognition of meaningful time segments of longer duration than a shot, or for grouping similar shots. In media production, the level immediately higher than shots is called a sequence or scene: a series of consecutive shots constituting a unit from the narrative point of view, because they are shot
in the same location, or they share some thematic visual content. The process of detecting these video scenes is analogous to paragraphing in text document analysis and requires higher level content analysis.

5.1. Scene Analysis
Two different kinds of approaches have been proposed for the automatic recognition of the sequences of programs. Aigrain et al. have used rules formalizing medium perception in order to detect local (in time) clues of macroscopic change [40]. These rules refer to transition effects, shot repetition, shot setting similarity, appearance of music in the soundtrack, editing rhythm and camera work. After detection of the local clues, an analysis of their temporal organization is done in order to produce the segmentation into sequences and to choose one or two representative shots for each sequence. Zhang et al. have used structure models of specific types of programs such as TV news [41]. They recognize specific shot types, such as shots with an anchor person, and then use the model to analyze the succession of shot types and produce a segmentation into sequences. Such model or knowledge based approaches can also be applied to, for instance, sports video parsing [42]. However, when we extend the application domain, we face the same difficulties as in computer vision. In summary, video scene analysis requires higher level content analysis, and one cannot expect that it can be fully automated based on visual content analysis using current image processing and computer vision techniques. Fusion of information from video, audio and closed caption or transcript text analysis may be a solution; a successful example is the Informedia Project [29].

5.2. Shot Clustering
Clustering shots into groups, each of which contains similar content, is essential to building an index of shots for content-based retrieval and browsing. There are mainly two types of clustering approaches: partitional clustering arranges data in separate clusters, while hierarchical clustering leads to a hierarchical classification tree [44]. For the purpose of clustering a large number of shots to allow class-based video indexing and browsing with different levels of abstraction, partitional methods are more suitable, since they are capable of finding optimal clusterings at each level and are better suited to obtaining a good abstraction of the data items. An approach of this type was proposed for video shot grouping; it is very flexible in that different feature sets, similarity metrics and iterative clustering algorithms can be applied at different levels [39]. One implementation of this approach uses an enhanced K-means clustering algorithm incorporating fuzzy classification, which allows assignment of data items at the boundary of two classes to both of them, according to the membership function of the data item with respect to all the classes. This is especially useful at higher levels of hierarchical browsing, where users expect all similar data items to be under a smaller number of nodes. The fuzzy clustering algorithm is as follows.
(1) Get N classes using the K-means algorithm.
(2) For every data item v_i, i = 1, ..., M:
- calculate its similarity S_ik with each class k (k = 1, ..., N), where d_ij is the distance between data item v_i and the reference vector of class j and φ is the fuzzy exponent (φ > 1.0);
- if S_ik ≥ ρ, where ρ is the threshold set by users (0 < ρ < 1), add item v_i to class k;
- if v_i is not assigned to any class in the above step, assign it to the miscellaneous class.

The clustering can be based on key-frames and motion features of video shots.
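Since the exact membership formula is not reproduced above, the sketch below uses the standard fuzzy C-means membership as a stand-in for S_ik; the formula, function name and default parameters are therefore assumptions rather than the published algorithm.

import numpy as np

def fuzzy_assign(distances, phi=2.0, rho=0.3):
    # distances: (M, N) array with d_ij = distance of item v_i to the
    # reference vector of class j (e.g. from an initial K-means pass).
    # Assumed membership: S_ik = 1 / sum_j (d_ik / d_ij)^(2/(phi-1)).
    d = np.maximum(np.asarray(distances, dtype=float), 1e-12)
    power = 2.0 / (phi - 1.0)
    s = 1.0 / ((d[:, :, None] / d[:, None, :]) ** power).sum(axis=2)
    assignments = []
    for i in range(d.shape[0]):
        classes = np.nonzero(s[i] >= rho)[0].tolist()
        # items matching no class go to a miscellaneous class (index -1)
        assignments.append(classes if classes else [-1])
    return assignments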
A comprehensive evaluation of this approach using color and motion features can be found in [3,39]. Implementation and evaluation of this clustering approach using the Self-Organizing Map (SOM) method can be found in [39,45]. The advantages of SOM are its learning ability without prior knowledge and its good classification performance, which have been shown by many researchers. Another benefit of using SOM is that the similarities among the extracted classes can be seen directly from the two-dimensional map. This allows horizontal exploration as well as vertical browsing of the video data, which is very useful when we have a large number of classes at the lower levels.

6. Content-Based Video Retrieval and Browsing Tools

Once a video sequence has been segmented and a scheme for representation of video content has been established, tools for content-based retrieval and browsing can be built upon that representation. This section briefly discusses a set of such tools.
6.1. Retrieval Tools

With the representation and similarity measures described in Section 4, querying a video database to retrieve shots of interest can be performed based on the metadata, including key-frame features and motion features, or a combination of the two. The retrieval process using these features needs to be interactive and iterative, with the system accepting feedback to narrow or reformulate searches or change its link-following behavior, and to refine any queries that are given. When key-frame based features are used for retrieval, the index schemes and the query and retrieval tools developed for image databases, such as QBIC [14], can be applied directly to video retrieval. That is, each shot is linked to its key-frames, and the search for particular shots becomes a matter of identifying those key-frames from the
database which are similar to the query, according to the features and similarities defined in Section 4. As in the QBIC image database, to accommodate different user requests, three basic query approaches should be supported: query by template manipulation, query by object feature specification, and query by visual example. Also, the user can specify a particular feature set to be used for a retrieval [3,14,15]. The retrieved video shots in such systems are visually represented by their key-frames. The user may then view the actual shots by clicking a “Video” button, as implemented in the SWIM system [3]. That is, a video player, or another type of viewer, will be initialized and will play the selected segment to allow the user to verify the retrieval results. Retrieval may thus be followed by browsing as a means to examine the broader context of the retrieved key-frame. On the other hand, a query can also be initiated from browsing. That is, a user may select a key-frame while browsing and offer it as a query: a request to find all key-frames which resemble that image. Figure 10 shows an example of sketch based retrieval of key-frames and a player window launched by clicking a key-frame, allowing the user to view the shot represented by the first key-frame.
Fig. 10. Example of key-frame-based retrieval: (a) a sketch based query; (b) top five candidates from a database of key-frames.
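A query-by-visual-example search of this kind reduces to ranking the stored key-frames against the query features; the following sketch ranks shots by the histogram intersection of their best-matching key-frame, with all names and the feature choice being illustrative.

import numpy as np

def query_by_example(query_hist, key_frame_index, top_n=5):
    # key_frame_index: list of (shot_id, feature_histogram) pairs for
    # all key-frames in the database.  Each shot is scored by its
    # best-matching key-frame; the top_n shots are returned.
    best = {}
    for shot_id, hist in key_frame_index:
        score = float(np.minimum(query_hist, hist).sum())
        if score > best.get(shot_id, -1.0):
            best[shot_id] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]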
Shot based temporal features provide another set of primitives that captures the motion nature of video and are used in our system as another retrieval tool to improve the retrieval performance. For detailed examples of shot based retrieval, see [3]. It should be pointed out that shot example based video retrieval is far less reliable than image example based image retrieval. This fact calls for more robust and effective representation schemes for video. Though there have been many research efforts, the development of image feature based video and image retrieval algorithms is still in its early stage and will need much more progress before they can be applied to the full search of a complete video archival. Among other things, developing efficient indexing schemes based on
similarity features for managing large data volumes is a critical problem for retrieval. It has been shown that traditional database indexing techniques like R-trees, etc. fail in the context of content based image search, and currently there is no technique that allows retrieval of similar objects in multi-dimensional space. Ideas from statistical clustering, multi-dimensional indexing, and dimensionality reduction may be useful in this area.

6.2. Browsing Tools
Interactive browsing of full video contents is probably the most essential feature of new forms of interactive access to digital video. Content-based video browsing tools should support two different nonlinear approaches to accessing video source data: sequential and random access. In addition, these tools should accommodate two levels of granularity, overview and detail, along with an effective bridge between the two levels. Such browsing tools can only be built by utilizing the structure and content features obtained in the video parsing, abstraction and feature extraction processes discussed in the previous sections. There are mainly five types of browsing tools, built on different structural and content features of video:

(1) time-line display of frames;
(2) content-based video player;
(3) light table of video icons;
(4) hierarchical browser;
(5) graph based story board.
Time-line based browsers have been favored by users in video production and editing systems, for which time-line interfaces are classical. Some browsers rest on a single shot-based image component line [4,46], but the multidimensional character of video, calling for a multi-line representation of the contents, has been stressed by researchers working in the frame of the Muse toolkit [47,48]. This has been systematized in the strata models proposed by Aguierre-Smith and Davenport [49]. The limitation of time-line browsers is that, since it is difficult to zoom out while keeping good image visibility, the time-scope of what is actually displayed on screen at a given moment is relatively limited. Another sequential browser takes an improved VCR-like interface, illustrated in Fig. 11. Apart from the conventional video play functions, this browser makes use of the meta-data of video structure and content extracted by the parsing process [26]. That is, a user can select to view key-frames at a selected rate, skip between shots ("go to the next similar shot"), and play a skimmed version of the original video. In this way, the user is no longer constrained to the linear and sequential viewing, such as fast forward, available in most conventional VCRs.

Fig. 11. Smart video player.

The light-table kind of video browser is often called a clip window, in which a video sequence is spread in space and represented by video icons which function rather like a light table of slides [4,42]. In other words, the display space is traded for time
to provide a rapid overview of the content of a long video. The icons are constructed with the methods presented in Section 3.3. A window may contain sequentially listed shots or scenes from a video program, a sequence of shots from a scene, or a group of similar or related shots from a stock archive. A user can get a sense of the content of a shot from its icon, especially when salient stills are used. Clip windows can also be used to display the results of retrieval operations. The clip window browser can also be constructed hierarchically, just like the windowed file systems used in PC operating systems nowadays. That is, each icon in a clip window can be zoomed in to open another clip window, in which each icon represents a next-level, finer segment of the video sequence [42]. A first attempt at building hierarchical browsers, called the Video Magnifier [50], simply used successive horizontal lines, each of which offered greater time detail and narrower time scope by selecting images from the video program. To improve the content accessibility of such browsers, the structural content of video obtained in video parsing is utilized [3,26]. As shown in Fig. 12, videos are accessed as a tree. At the top of the hierarchy, an entire program is represented by five key-frames, each corresponding to a sequence consisting of an equal number of consecutive shots. Any one of these segments may then be subdivided to create the next level of the hierarchy. As we descend through the hierarchy, our attention focuses on smaller groups of shots, single shots, the representative frames of a specific shot, and finally a sequence of frames represented by a key-frame. We can also move to a more detailed granularity by opening a video player to view sequentially any particular segment of video selected from this browser at any level of the hierarchy. As shown in Fig. 12, the icons displayed at the top level of the hierarchy are selected based on a clustering process using shot similarities. That is, programs or clips are first clustered into groups of similar shots or sequences as described in Section 5.2; then, each group is represented by an icon frame determined by the centroid of the group, which is then displayed at the higher levels of the hierarchical browser. As one can see from this example, with such similarity based hierarchical browsing,
Fig. 12. Similarity-based hierarchical browser.
Fig. 13. Class-based transition graph.
the viewer can roughly get a sense of the content of all the shots in a group even without moving down to a lower level of the hierarchy [2]. An alternative approach to hierarchical browsers is the class based transition graph, proposed by Yeung et al. [51]. Using the clustering of visually similar shots, a directed graph whose nodes are clusters of shots is constructed, as shown in Fig. 13. Cluster A is linked to cluster B if one of the shots in A is immediately followed by a shot in B. The resulting graph is displayed for browsing, each node being represented by a key-frame extracted from one of the shots in the node. The graph can be edited for simplification by a human operator. The drawbacks of this approach lie in the difficulty of the graph layout problem, resulting in poor screen space use, and in the fact that the linear structure of the document is no longer perceptible. With the browsing tools, another advantage of using key-frames is that we are able to browse the video content down to the key-frame level without necessarily storing the entire video. This is particularly advantageous if our storage space is limited. Such a feature is very useful not only in video databases and information systems but also for supporting previewing in VOD systems. What is particularly important is that the network load for transmitting small numbers of static images is far less than that required for transmitting video. Through the hierarchical browser, one may also identify a specific sequence of the video which is all that one may wish to "demand". Thus, the browser not only reduces network load during browsing but may also reduce the need for network services when the time comes to request the actual video.
7. Video Content Analysis - A Challenging Area
In this chapter, we have discussed a number of techniques for video content analysis for computer-assisted video parsing, abstraction, retrieval and browsing. A majority of these techniques address problems in recovering structure and low level information of video sequences, though these techniques are basic and very useful in facilitating intelligent video browsing and search. Successful efforts in extracting semantic content are still very limited and there is a long way to go before we can achieve our goal of automatic video content understanding and content-based retrieval. Many of the bottlenecks and challenges in video content analysis are the classic ones in pattern recognition and computer vision. On the other hand, it is believed that low level feature-based retrieval algorithms and tools will provide us a bridge to a more intelligent solution and will improve our productivity in image and video retrieval, though the current tool set is far from satisfactory. When we strive for visual content analysis and representation, it should be pointed out that integration of different information sources, such as speech, sound and text is as important as visual data itself in understanding and indexing visual data. Keywords and conceptual retrieval techniques are and will always be an important part of visual information systems.
An application-oriented approach is critical to the success of visual data representation and retrieval research and will prevent us from being too theoretical. By working on strongly focused applications, the research issues reviewed in this chapter can be addressed in the context of well-defined applications and will facilitate those applications, while achieving general solutions remains a long term research topic. On the other hand, it should be pointed out that one of the most important points is that video content analysis, retrieval and management should not be thought of as a fully automatic process. We should focus on developing video analysis tools that help human analysts, editors and end users to manage video more intelligently and efficiently. It is clear that the number of research issues and their scope are rather large and expanding rapidly with advances in computing and communication [1,2]. As a result, visual data is becoming the center of multimedia computing and image analysis research, and more and more researchers from different fields are attracted and have started to explore these issues.

References

[1] R. Jain, A. Pentland and D. Petkovic (eds.), Workshop report: NSF-ARPA workshop on visual information management systems, Cambridge, MA, USA, Jun. 1995.
[2] P. Aigrain, H. J. Zhang and D. Petkovic, Content-based representation and retrieval of visual media: A state-of-the-art review, Int. J. Multimedia Tools and Applications 3, 3 (1996).
[3] H. J. Zhang, J. H. Wu, D. Zhong and S. W. Smoliar, Video parsing, retrieval and browsing: An integrated and content-based solution, Pattern Recogn., 1996.
[4] A. Nagasaka and Y. Tanaka, Automatic video indexing and full-search for video appearances, in E. Knuth and I. M. Wegener (eds.), Visual Database Systems, Vol. II (Elsevier Science Publishers, Amsterdam, 1992) 113-127.
[5] H. Ueda, T. Miyatake and S. Yoshisawa, IMPACT: An interactive natural-motion-picture dedicated multimedia authoring system, Proc. CHI'91, ACM, 1991, 343-350.
[6] H. J. Zhang, A. Kankanhalli and S. W. Smoliar, Automatic partitioning of full-motion video, Multimedia Systems, ACM-Springer 1, 1 (1993) 10-28.
[7] P. Aigrain and P. Joly, The automatic real-time analysis of film editing and transition effects and its applications, Computers & Graphics 18, 1 (1994) 93-103.
[8] B. Shahraray, Scene change detection and content-based sampling of video sequences, SPIE Proc. Digital Video Compression: Algorithms and Technologies, San Jose, 2419 (1995) 2-13.
[9] R. Zabih, K. Mai and J. Miller, A robust method for detecting cuts and dissolves in video sequences, Proc. ACM Multimedia'95, San Francisco, Nov. 1995.
[10] A. Hampapur, R. Jain and T. E. Weymouth, Production model based digital video segmentation, Multimedia Tools and Applications 1, 1 (1995) 9-46.
[11] F. Arman, A. Hsu and M. Y. Chiu, Feature management for large video databases, Proc. SPIE Conf. Storage and Retrieval for Image and Video Databases I, SPIE 1908, Feb. 1993, 2-12.
[12] H. J. Zhang, C. Y. Low, Y. Gong and S. W. Smoliar, Video parsing using compressed data, Proc. SPIE'94 Image and Video Processing II, San Jose, CA, USA, Feb. 1994, 142-149.
[13] J. Meng, Y. Juan and S.-F. Chang, Scene change detection in an MPEG compressed
video sequence, IS&T/SPIE'95 Digital Video Compression: Algorithms and Technologies, San Jose, 2419, Feb. 1995, 14-25.
[14] M. Flickner et al., Query by image and video content, IEEE Computer, Sept. 1995, 23-32.
[15] B. Furht, S. W. Smoliar and H. J. Zhang, Image and Video Processing in Multimedia Systems (Kluwer Academic Publishers, 1995).
[16] D. Bordwell and K. Thompson, Film Art: An Introduction (McGraw-Hill, New York, 1993).
[17] Dynamic vision, in R. Kasturi and R. Jain (eds.), Computer Vision: Principles (IEEE Computer Society Press, Washington) 469-480.
[18] B.-L. Yeo and B. Liu, A unified approach to temporal segmentation of motion JPEG
and MPEG compressed video, Proc. IEEE Int. Conf. Multimedia Computing and Networking, Washington D.C., May 1995, 81-88.
[19] I. S. Sethi and N. Patel, A statistical approach to scene change, Proc. SPIE Conf. Storage and Retrieval for Video Databases III, San Jose, CA, USA, Feb. 1995.
[20] Y. T. Tse and R. L. Baker, Global zoom/pan estimation and compensation for video compression, Proc. ICASSP'91, Vol. 4, May 1991.
[21] A. Akutsu and Y. Tonomura, Video tomography: An efficient method for camerawork extraction and motion analysis, Proc. ACM Multimedia Conference, San Francisco, Oct. 1993.
[22] Y. Tonomura, A. Akutsu, K. Otsuji and T. Sadakata, VideoMAP and VideoSpaceIcon: Tools for anatomizing video content, Proc. InterCHI'93, ACM (1994) 131-136.
[23] B. C. O'Connor, Selecting key frames of moving image documents: A digital environment for analysis and navigation, Microcomputers for Information Management 8, 2 (1991) 119-133.
[24] H. J. Zhang, S. W. Smoliar and J. H. Wu, Content-based video browsing tools,
Proc. IS&T/SPIE'95 Multimedia Computing and Networking, San Jose, 2417, Feb. 1994.
[25] H. J. Zhang, C. Y. Low, S. W. Smoliar and J. H. Wu, Video parsing, retrieval and
browsing: An integrated and content-based solution, Proc. ACM Multimedia'95, San Francisco, Nov. 1995, 15-24.
[26] L. Teodosio and W. Bender, Salient video stills: Content and context preserved, Proc. ACM Multimedia'93, Anaheim, CA, USA, Aug. 1993.
[27] M. Irani, P. Anandan, J. Bergen, R. Kumar and S. Hsu, Mosaic based representations of video sequences and their applications, Image Communication, special issue on Image and Video Semantics: Processing, Analysis and Application, 1995.
[28] A. Takeshita, T. Inoue and K. Tanaka, Extracting text skim structures for multimedia browsing, in M. Maybury (ed.), Working Notes of IJCAI Workshop on Intelligent Multimedia Information Retrieval, Montreal, Aug. 1995, 46-58.
[29] A. G. Hauptmann and M. Smith, Text, speech and vision for video segmentation: The Informedia project, Working Notes of IJCAI Workshop on Intelligent Multimedia Information Retrieval, Montreal, Aug. 1995, 17-22.
[30] M. Davis, Media streams: An iconic visual language for video annotation, Proc. Symp. Visual Languages, Bergen, Norway, 1993.
[31] A. Pentland, R. W. Picard and S. Sclaroff, Photobook: Content-based manipulation of image databases, Proc. Storage and Retrieval for Image and Video Databases II, 2185, San Jose, CA, USA, Feb. 1994.
[32] M. J. Swain and D. H. Ballard, Color indexing, Int. J. Computer Vision 7 (1991) 11-32.
[33] H. J. Zhang et al., Image retrieval based on color features: An evaluation study, Proc. SPIE Photonics East, Conf. Digital Storage & Archiving, Philadelphia, Oct. 1995.
[34] M. Stricker and M. Orengo, Similarity of color images, Proc. SPIE Conf. Storage and Retrieval for Image and Video Databases III, SPIE Conf. Proc. 2420, San Jose, CA, USA, Feb. 1995, 381-392.
[35] H. Tamura, S. Mori and T. Yamawaki, Texture features corresponding to visual perception, IEEE Trans. Syst. Man Cybern. 6, 4 (1979) 460-473.
[36] J. Mao and A. K. Jain, Texture classification and segmentation using multiresolution simultaneous autoregressive models, Pattern Recogn. 25, 2 (1992) 173-188.
[37] R. Picard and F. Liu, A new word ordering for image similarity, Proc. Int. Conf. Acoustic Signals and Signal Processing, Adelaide, Australia, 5, Mar. 1994, 129.
[38] S. Sclaroff and A. Pentland, Modal matching for correspondence and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 17, 6 (1995) 544-561.
[39] D. Zhong, H. J. Zhang and S.-F. Chang, Clustering methods for video browsing and annotation, Proc. Storage and Retrieval for Image and Video Databases IV, San Jose, CA, USA, Feb. 1995.
[40] P. Aigrain, P. Joly and V. Longueville, Medium-knowledge-based macrosegmentation of video into sequences, Working Notes of IJCAI Workshop on Intelligent Multimedia Information Retrieval, Montreal, Aug. 1995, 5-14.
[41] H. J. Zhang, S. Y. Tan, S. W. Smoliar and Y. Gong, Automatic parsing and indexing of news video, Multimedia Systems 2, 6 (1995) 256-265.
[42] Y. Gong, L. T. Sin, H. C. Chuan, H. J. Zhang and M. Sakauchi, Automatic parsing of TV soccer programs, Proc. Second IEEE Int. Conf. Multimedia Computing and Systems, Washington D.C., May 1995, 167-174.
[43] H. J. Zhang and S. W. Smoliar, Developing power tools for video indexing and retrieval, Proc. SPIE'94 Storage and Retrieval for Video Databases, San Jose, CA, USA, Feb. 1994.
[44] A. K. Jain and R. Dubes, Algorithms for Clustering Data (Prentice Hall, 1988).
[45] H. J. Zhang and D. Zhong, A scheme for visual feature based image indexing, Proc. IS&T/SPIE Conf. Image and Video Processing III, San Jose, CA, 1995, 36-46.
[46] P. Aigrain and V. Longueville, A connection graph for user navigation in a large image bank, Proc. RIAO'91, Barcelona, Spain, 1991, 1, 67-84.
[47] M. E. Hodges, R. M. Sassnett and M. S. Ackerman, A construction set for multimedia applications, IEEE Software (1989) 37-43.
[48] W. E. Mackay and G. Davenport, Virtual video editing in interactive multimedia applications, Communications of the ACM 32, 9, July 1989.
[49] T. G. Aguierre-Smith and G. Davenport, The stratification system: A design environment for random access video, Proc. 3rd Int. Workshop on Network and Operating System Support for Digital Audio and Video, La Jolla, CA, USA, Nov. 1992, 250-261.
[50] M. Mills, J. Cohen and Y. Y. Wong, A magnifier tool for video data, Proc. INTERCHI'92, ACM, May 1992, 93-98.
[51] M. M. Yeung, B.-L. Yeo, W. Wolf and B. Liu, Video browsing using clustering and scene transitions on compressed sequences, IS&T/SPIE'95 Multimedia Computing and Networking, San Jose, 2417, Feb. 1995, 399-413.
Handbook of Pattern Recognition and Computer Vision (2nd Edition), pp. 979-1002
Eds. C. H. Chen, L. F. Pau and P. S. P. Wang
© 1998 World Scientific Publishing Company
CHAPTER 5.6

VLSI ARCHITECTURES FOR MOMENTS AND THEIR APPLICATIONS TO PATTERN RECOGNITION
HENG-DA CHENG, CHEN-YUAN WU and JIGUANG LI
Computer Science Department, Utah State University, Logan, Utah 84321, USA

Moments are one of the most popular techniques for image processing, pattern classification and computer vision, and have many applications. In this chapter, we apply moments to extract the features of breast cancer biopsy images and pavement distress images. The resulting features are input into neural networks for classification. The satisfactory outcomes demonstrate the usefulness of moments. However, the high computational time complexity of moments limits their application to pattern recognition, especially for real-time tasks. The rapid advance of hardware technology makes VLSI implementation of pattern recognition algorithms more feasible and attractive. We propose a one-dimensional systolic array and two two-dimensional systolic arrays for computing regular and central moments. The operation of the proposed architectures, which are much faster than existing ones, and their computational time complexities are studied. The important issue of VLSI design, algorithm partition, is also discussed. The basic idea of this article can easily be extended to the design of VLSI architectures for other kinds of moments as well.
Keywords: Moments, breast cancer detection, pavement distress detection, VLSI, algorithm partition.
1. Introduction
The task of recognizing an object independent of its position, size, or orientation is very important for many applications of pattern recognition, image processing, and computer vision. In every aspect of developing a pattern recognition system, we should always carefully determine and extract the characteristics of the pattern to be recognized. When the pattern undergoes rotation, translation, or scaling, the extracted features become even more crucial for the recognition result. Many methods have been proposed to describe and extract the features of digital images [1], and among them, the moment is one of the most popular techniques for extracting rotation-, scaling- and translation-invariant features. In the early 1960s, Hu [2] published the first paper on moment invariants for two-dimensional pattern recognition, based on the method of algebraic invariants. Since then, many researchers have successfully applied moment invariants to pattern classification, image processing, and image description. Teh and Chin [3]
evaluated a number of moments for pattern recognition, such as regular moments, Legendre moments, Zernike moments, pseudo-Zernike moments, rotational moments, and complex moments. Teague [4] summarized some well-known properties of the zeroth-order, first-order and second-order moments, discussed the problems of image reconstruction from the inverse moments, and suggested using the orthogonal moments to recover an image. Abu-Mostafa and Psaltis [5] discussed the image recognitive aspects of moment invariants, and focused on the information loss, suppression, and redundancy encountered in the complex moments. Dudani et al. [6] used the same set of moment invariants to recognize different types of aircraft. Chou and Chen [7] proposed a two-stage pattern matching method called "moment-preserving quantization" that reduced the complexity of computation with quite good accuracy, and proposed a low cost VLSI implementation. In [8], Wong and Hall used a set of moments which are invariant to translation, rotation, and scale changes to perform the scene matching of radar to optical images. Casey [9] used the second-order moments to specify the transformation of the coordinate wave forms. By using moments, the original pattern was mapped into a variation-free domain, and the linear pattern variation of the hand-printed characters was removed. Cash and Hatamian [10] used two-dimensional moments to extract the pattern features of optical characters, and showed that the pattern features extracted from the moments provided good discrimination between characters, and 98.5% to 99.7% recognition rates were achieved for the tested fonts. Ghosal and Mehrotra proposed a subpixel edge detection method based on a set of orthogonal complex moments of the image called Zernike moments [11]. Khotanzad and Hong [12] also proposed an invariant image recognition method using Zernike moments. Saradana et al. [13] applied the second-order moments to extract feature vectors that can describe objects efficiently in an n-dimensional space. Liu and Tsai [14] proposed a corner detection method based on moments. They showed that the moments of images were important factors in choosing the threshold value. Belkasim et al. gave a detailed study of the efficiency of different moment invariants in pattern recognition applications [15]. They proposed a new method for deriving Zernike moments with a new normalization scheme, and obtained a better overall performance even when the data contained noise. Reeves et al. [16] presented a procedure using moment-based feature vectors to identify a three-dimensional object from a two-dimensional image recorded at an arbitrary angle and range. They compared two methods used for this task, moments and Fourier descriptors, and showed that the standard moments gave a slightly better result than the Fourier descriptors.
2. VLSI Architectures for Moments

Many VLSI architectures have been developed to implement algorithms for image processing and pattern recognition [17-28]. By using a VLSI architecture, we can speed up the computation tremendously. In order to improve the performance using both parallelism and pipelining, a VLSI architecture should have the following characteristics [23,32]:
(1) There are only a few different types of processing elements, and the communication between them is local, simple and regular.
(2) Data flow is simple and regular. In the best case, the data flow is linear (only one direction).

In this chapter, we will study VLSI architectures for computing regular and central moments, and the important issue for VLSI design, algorithm partition.

2.1. Regular Moments and Central Moments

The regular or geometric two-dimensional moment of order (p + q) of an area A, for a continuous image f(x, y), is defined as

    M_{pq} = \iint_A x^p y^q f(x, y) \, dx \, dy,   p, q \in \{0, 1, 2, \ldots\}.   (2.1)

For a digital image of area A, the moments are

    M_{pq} = \sum_x \sum_y x^p y^q f(x, y).   (2.2)

In order to obtain translation invariance, we can first locate the centroid of the image, such that

    \bar{x} = M_{10} / M_{00}, \quad \bar{y} = M_{01} / M_{00}.   (2.3)

We can then define the central moments, which are translation invariant, as

    \mu_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q f(x, y).   (2.4)
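As an aside for readers who wish to check the definitions numerically, the following small NumPy sketch evaluates Eqs. (2.2)-(2.4) directly on a digital image; the function names and the convention that x indexes rows starting at 1 are our own illustrative assumptions, not part of the hardware architectures described below.

    import numpy as np

    def regular_moment(f, p, q):
        # M_pq = sum_x sum_y x^p * y^q * f(x, y), with coordinates starting at 1
        rows, cols = f.shape
        x = np.arange(1, rows + 1).reshape(-1, 1)   # x taken along rows (an assumption)
        y = np.arange(1, cols + 1).reshape(1, -1)
        return float(np.sum((x ** p) * (y ** q) * f))

    def central_moment(f, p, q):
        # mu_pq = sum_x sum_y (x - xbar)^p * (y - ybar)^q * f(x, y)
        m00 = regular_moment(f, 0, 0)
        xbar = regular_moment(f, 1, 0) / m00
        ybar = regular_moment(f, 0, 1) / m00
        rows, cols = f.shape
        x = np.arange(1, rows + 1).reshape(-1, 1)
        y = np.arange(1, cols + 1).reshape(1, -1)
        return float(np.sum(((x - xbar) ** p) * ((y - ybar) ** q) * f))

    f = np.random.rand(8, 8)          # toy 8 x 8 gray-level image
    print(regular_moment(f, 1, 1), central_moment(f, 2, 0))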
From the above, we can see that it takes many additions and multiplications to compute the regular and central moments. Several efficient methods have been proposed to speed up the computation. Some researchers [29,30] discussed the superiority of boundary-based computation and proposed a simpler algorithm based on Green's theorem, converting the double integral into a line integral around the boundary of the polygon. Both algorithms are only suitable for binary images, i.e. f(x, y) is either 0 or 1. Jiang and Bunke [29] did not include the time for finding the edges (vertices), which could make up the majority of the total computation time. Li and Shen [30] considered the entire process of the
computation and proposed the use of an upper triangular systolic structure to speed up the computation, but did not include the time for calculating the lower triangular matrix required by the algorithm. Chen [31] developed a parallel algorithm for computing moments based on decomposing a two-dimensional moment into vertical and horizontal moments. He used a so-called cascade-partial-sum method to compute the summation of the partial results. The time complexity for calculating the moments is O((p + 1)(q + 1 + c1) log(n) + (q + 1 + c2)n) for a linear array, where c1 and c2 are the times spent on intermediate summations of the cascade-partial-sum method, and their values increase irregularly as p and q increase. Chen [31] assumed the data were preloaded in the processor array; it would take at least n time units to load the data, which was not included in the time analysis. The cascade-partial-sum method takes more time for the intermediate summations of higher-order moments, and the memory of each processing element has to be varied for different orders of moments. Furthermore, the data flow and control flow are irregular and complicated, causing difficulty in achieving correct timing and in designing the array structure. Therefore, this algorithm is not suitable for implementation using a VLSI architecture [23,32].
2.2. One-Dimensional VLSI Architecture for Computing Regular Moments

From Eq. (2.2), for an n x n image, we can calculate the moments of order (p + q) using the following algorithm. Given that (i) f(x, y) is the gray level of the image; (ii) x and y are the corresponding coordinates of the pixel; and (iii) M_pq is the moment of order (p + q):

    let x := 1; M_pq := 0
    while (x <= n)
    begin
        y := 1
        Out_pre(x, y) := 0
        while (y <= n)
        begin
            Out_next(x, y) := Out_pre(x, y) + x^p * y^q * f(x, y)
            Out_pre(x, y + 1) := Out_next(x, y)
            y := y + 1
        end /* while (y) */
        y := y - 1
        M_pq := M_pq + Out_next(x, y)
        x := x + 1
    end /* while (x) */
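The following Python sketch is a plain software model of the loop above (not a description of the hardware itself); the running variable out mimics the partial sums Out_pre/Out_next passed between neighbouring processing elements, and the final accumulation corresponds to the accumulator element introduced below. The names are our own.

    def moment_1d_model(f, p, q):
        # f is an n x n list of lists with f[x-1][y-1] = f(x, y)
        n = len(f)
        m_pq = 0
        for x in range(1, n + 1):
            out = 0                                              # Out_pre(x, 1) = 0
            for y in range(1, n + 1):
                out += (x ** p) * (y ** q) * f[x - 1][y - 1]     # Out_next(x, y)
            m_pq += out                                          # accumulator adds each row's result
        return m_pq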
The sequential algorithm with loops can be implemented using VLSI architectures [23,32]. Based on the above algorithm, we can use a one-dimensional systolic array with two types of processing elements to calculate the regular moments. The first type of processing element has:

Five inputs -- p, q, f(x, y), the control signal, and the output from the left processing element;
Four outputs -- p, q, the current output to the right processing element, and a signal which controls each processing element to start the calculation;
Three multipliers -- one for calculating x^p, one for calculating y^q, and another for computing x^p x y^q x f(x, y);
Two registers -- to store the corresponding coordinates of the processing element;
One adder -- to add the output from the left neighbor processing element and the currently calculated value;
Some logical gates which are necessary for controlling the timing of the calculation.

Figure 1 shows the structure of the processing element, and its symbolic representation is shown in Fig. 2. From Fig. 1, it is clear that within a processing element, the operation for calculating x^p x y^q x f(x, y) + Out_pre must be done concurrently with the other operations.
Fig. 1. Structure of PE for One-Dimensional VLSI Architecture. (Annotations: the initial values stored in the x and y registers are 1; the multiplexer selects p and q at the initial time and thereafter selects the data from the registers; signals to the next processor pass through a one time unit delay.)
Fig. 2. Symbolic Representation of Processing Element. (p' = p, q' = q and start' = start, each with one time unit delay.)
The computation time needed for each processing element is max(p, q) + 2. The second kind of processing element is an accumulator, which is the (n + 1)th processing element.
2.2.1. The operations of PEs

Each processing element receives p, q, f(x, y), and the output from its previous processing element as inputs. The data f(x, y) are input to the one-dimensional array row by row (or column by column). At the first time unit, processing element (1,1) receives p and q, stores these values in its registers, and starts to calculate 1^p and 1^q. At the same time, p and q are sent to the delays connected with the next processing element. Since 1^p and 1^q can be computed simultaneously, it takes max(p, q) time units to finish the computation. Meanwhile, the start signal takes a max(p, q) time unit delay before f(1,1) is input and the calculation Out_next = Out_pre + 1^p x 1^q x f(1,1) begins. It takes two more time units to perform the multiplication and addition (refer to Fig. 1). Thus, the total time to calculate Out_next = Out_pre + 1^p x 1^q x f(1,1) is max(p, q) + 2 time units. The start signal is input to processing element (1,2) after one time unit, i.e. at time unit max(p, q) + 1. Note that for PE(1,1), the value of Out_pre is zero. At time unit max(p, q) + 2, processing element (1,1) produces its output and passes it to processing element (1,2). The values of p and q arrive at processing element (1,2) at the second time unit, and it starts to compute 1^p and 2^q; this also takes max(p, q) time units. At the (max(p, q) + 1)th time unit, the calculation of 1^p and 2^q is finished, and f(1,2) is input to processing element (1,2). Therefore, at time unit max(p, q) + 1, the required data for processing element (1,2) are ready, and the start signal arrives to initiate the calculation. When the output from processing element (1,1), Out_pre, is received by processing element (1,2) at the next time unit, processing element (1,2) performs the multiplication and addition, and produces the output Out_next = Out_pre + 1^p x 2^q x f(1,2) at the (max(p, q) + 3)th time unit. According to the above description, the adjacent data of the same row need a one time unit delay in order to match the timing requirement. Now let us consider the data of the next row.
Fig. 3. One-Dimensional VLSI Architecture. (Data travel between processing elements with a one time unit delay.)
In order to calculate Out_next = Out_pre + 2^p x 1^q x f(2,1), the x-coordinate of processing element (1,1) has to be increased by one after 1^p is calculated. At the pth time unit, x is increased from 1 to 2. Because max(p, q) time units are needed to calculate 2^p and 1^q, a max(p, q) time unit delay is necessary before inputting the first data of the second row. Here we want to point out that the data f(x, y) in the same column have the same y^q value, which was already computed for f(1, y); we could input f(x, y) without computing y^q again, for x = 2, 3, ..., n. However, we have to make a trade-off between the regularity and uniformity of the processing elements and the simplicity of the structure. Since for most VLSI designs regularity is more important, we adopt this principle here. Figure 3 shows the VLSI structure and the data arrangement. For an n x n image we need n + 1 processing elements to form the one-dimensional structure. The operations of this structure are summarized as follows (a small scheduling sketch is given after the list):

(1) p and q are input to and stored in the first processing element at the first time unit. Then they are passed to and stored in the next processing element after one time unit, and so on.
(2) A start signal with a max(p, q) time delay is needed for inputting the first data of the first row, and it is passed to the next processing element after one time unit.
(3) A max(p, q) time delay is needed to input the data of adjacent rows.
(4) The adjacent data in the same row are input after a one time unit delay, i.e. the data are skewed.
(5) The x-coordinate of each processing element is increased by one after x^p is computed, according to the corresponding data arrangement, and the y-coordinate is fixed. Certainly, if we want to input the data column by column, the roles of x and y are switched.
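Under one plausible reading of rules (2)-(4) above (a max(p, q) unit gap between rows, a one unit gap between adjacent columns, and f(1,1) entering after the initial max(p, q) delay), the feeding time of each pixel can be tabulated as in the hypothetical sketch below; the exact schedule of the original design may differ by small offsets.

    def input_schedule(n, p, q):
        # time unit at which f(x, y) is fed to the linear array (assumed schedule)
        d = max(p, q)
        return {(x, y): x * d + (y - 1)
                for x in range(1, n + 1) for y in range(1, n + 1)}

    for (x, y), t in sorted(input_schedule(3, 2, 1).items(), key=lambda kv: kv[1]):
        print(f"f({x},{y}) enters at t = {t}")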
Based on the above description, the time complexity is O(max(p, q) + 2 + (n - 1) + max(p, q) x (n - 1) + 1) = O(max(p, q) x n + n + 2) for computing the (p + q)th-order moment of the entire image. If there are k images, the time complexity will be k[max(p, q) x n + n + 2].
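As a quick sanity check of the arithmetic (an aside, not part of the chapter), the terms max(p, q) + 2 for the first processing element, n - 1 skew steps, max(p, q) x (n - 1) row delays and one final unit do indeed sum to max(p, q) x n + n + 2:

    from sympy import symbols, simplify

    m, n = symbols('m n')                        # m stands for max(p, q)
    total = (m + 2) + (n - 1) + m * (n - 1) + 1
    print(simplify(total - (m * n + n + 2)))     # prints 0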
2.3. Two-Dimensional VLSI Architecture for Computing Regular Moments

Although the one-dimensional VLSI architecture needs only n + 1 processing elements to compute the moments of order (p + q) for an n x n image, it needs max(p, q) x n + n + 2 time units to complete the entire calculation. We can use a two-dimensional VLSI architecture to perform the calculation of the moments with a smaller time complexity. The basic algorithm for the two-dimensional architecture is as follows:
    let M_pq := 0
    while not finished
        if start = 0 then
            for y := n downto 1 do
                for x := 1 to n do
                begin
                    store f(x, y) in the register of the (x, y)th processing element
                    calculate x^p and y^q
                end /* for x */
            end /* for y */
        if start = 1 then
            x := 1
            while (x <= n)
            begin
                y := 1
                Out_pre(x, y) := 0
                while (y <= n)
                begin
                    Out_next(x, y) := Out_pre(x, y) + x^p * y^q * f(x, y)
                    Out_pre(x, y + 1) := Out_next(x, y)
                    y := y + 1
                end /* while y */
                y := y - 1
                M_pq := M_pq + Out_next(x, y)
                x := x + 1
            end /* while x */
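A software model of this two-phase control flow is sketched below (again with our own illustrative names, assuming f[x-1][y-1] = f(x, y)); phase 1 corresponds to start = 0, in which every cell stores its pixel and precomputes its powers, and phase 2 corresponds to start = 1, in which each row accumulates a partial sum and the partial sums are added into M_pq.

    def moment_2d_model(f, p, q):
        n = len(f)
        # Phase 1 (start = 0): each (x, y) cell stores f(x, y) and precomputes x^p, y^q.
        xp = {(x, y): x ** p for x in range(1, n + 1) for y in range(1, n + 1)}
        yq = {(x, y): y ** q for x in range(1, n + 1) for y in range(1, n + 1)}
        # Phase 2 (start = 1): chain the partial sums along each row, then sum the rows.
        m_pq = 0
        for x in range(1, n + 1):
            out = 0
            for y in range(1, n + 1):
                out += xp[(x, y)] * yq[(x, y)] * f[x - 1][y - 1]
            m_pq += out
        return m_pq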
For this architecture we need n x (n + 1) processing elements.
Fig. 4. Structure of PE for Two-Dimensional VLSI Architecture. (Annotations: the initial values stored in the x and y registers are 1; the f(x, y) in the register is output to the next processor when start = 0; the multiplexer selects p and q at the initial time and thereafter selects the data from the corresponding registers.)
Fig. 5. Symbolic Representation of Processing Element. (p' = p, q' = q, start' = start and f'(x, y) = f(x, y), each with one time unit delay.)
The structure of each processing element is about the same as that in the one-dimensional VLSI architecture, but more inputs and outputs are needed. Figure 4 shows the structure of each processing element, and Fig. 5 is its symbolic representation.

2.3.1. The operations of PEs

In this architecture, we need to input the data starting from the last column (or row) to make sure that when the calculation is started, the data are already in the corresponding processing elements. A start signal is needed to initiate the calculation, and it is not issued until f(1,1) is input to the (1,1)th processing element. It takes n time units for f(1,1) to arrive at processing element (1,1). Meanwhile, p, q, and f(x, y) are input to and stored at the corresponding processing elements. Once a processing element receives p, q, and the corresponding f(x, y), it starts to calculate x^p and y^q, and stores f(x, y) in its register.
After the start signal is issued, processing element (1,1) needs two more time units to perform the calculation of Out_next = Out_pre + x^p x y^q x f(x, y). In the next time unit, the start signal is passed to the next processing element on its right hand side, i.e. processing element (1,2), and to the processing element below, processing element (2,1), and so on. Since the processing elements calculate x^p and y^q immediately after p and q arrive, when the start signal is input, both x^p and y^q have already been calculated. The processing elements then perform the calculation and produce the output using the f(x, y) stored in the register. It is clear that after the start signal arrives, the time delay for the output data is equal to one time unit. To calculate the moments for an n x n image, we need an n time unit delay for issuing the start signal. Then, another 2 + n + (n - 1) + 1 = 2n + 2 time units are needed to complete the entire calculation. In total, it takes 3n + 2 time units to calculate the moments of order (p + q). Figure 6 shows the entire two-dimensional VLSI architecture. From Fig. 6 we can see that the last column of data, f(1, n), f(2, n), ..., f(n, n), is input first while the first column of data, f(1, 1), f(2, 1), ..., f(n, 1), is input last. To ensure the correct data flow and timing, the data are skewed as shown in Fig. 6. Cheng and Tong [22] give a brief description of how to arrange the data to ensure the correct data flow and timing.
Fig. 6. Two-Dimensional VLSI Architecture. (The start signal is initialized with a max(p, q) + n time delay, as shown in the figure. Assume the initial value of M_pq is zero and that the data travel between processing elements in one time unit.)
2.4. Two-Dimensional VLSI Architecture for the Computation of Central Moments

We propose a two-dimensional VLSI architecture for calculating the central moments. From Eq. (2.3), to calculate x̄ and ȳ we have to compute M00, M01
and M10. From the definition of moments, Eq. (2.2), we know

    M_{00} = \sum_x \sum_y f(x, y), \quad M_{10} = \sum_x \sum_y x f(x, y), \quad M_{01} = \sum_x \sum_y y f(x, y).

These sums can be accumulated column by column:

    let M00(x) := M01(x) := M10(x) := 0 for all x
    let M00 := M01 := M10 := 0
    for y := n downto 1 do
        for x := 1 to n do
        begin
            M00(x) := M00(x) + f(x, y)
            M01(x) := M01(x) + y * f(x, y)
            M10(x) := M10(x) + x * f(x, y)
        end /* for x */
    end /* for y */
    for x := 1 to n do
    begin
        M00 := M00 + M00(x)
        M01 := M01 + M01(x)
        M10 := M10 + M10(x)
    end /* for x */
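A direct software rendering of this listing is given below (illustrative only); each entry of the arrays c00, c01, c10 plays the role of the per-column registers M00(x), M01(x), M10(x), and the final sums give M00, M01, M10 and hence the centroid of Eq. (2.3).

    def centroid_columnwise(f):
        n = len(f)
        c00 = [0.0] * (n + 1)                  # indices 1..n used, as in the listing
        c01 = [0.0] * (n + 1)
        c10 = [0.0] * (n + 1)
        for y in range(n, 0, -1):              # data enter column-major, last column first
            for x in range(1, n + 1):
                v = f[x - 1][y - 1]
                c00[x] += v
                c01[x] += y * v
                c10[x] += x * v
        m00, m01, m10 = sum(c00), sum(c01), sum(c10)
        return m10 / m00, m01 / m00            # (xbar, ybar)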
A two-dimensional VLSI structure that calculates the central moments, shown in Fig. 7, consists of two subsystems: subsystem A calculates x̄ and ȳ, and subsystem B calculates the central moments. Subsystem A consists of 3 x n processing elements. The first column computes M00, and the structure of its processing element is shown in Fig. 8. The second column computes M10, and the structure of its processing element is shown in Fig. 9; the value of x is pre-stored in each processing element. The third column computes M01, and the structure of its processing element is shown in Fig. 10; we need to store the value of y, which is decreased by one after each calculation, in each processing element of this column. In order to calculate the central moments, we combine this architecture with a two-dimensional architecture similar to the one in Section 2.3 to form a new architecture, as shown in Fig. 7.
Fig. 7. Two-Dimensional VLSI Structure for Central Moment. (Subsystem A computes x̄ and ȳ; subsystem B of the two-dimensional VLSI structure calculates the central moments; the annotated delays are 3n + 3 and max(p, q) + 2n + 3 time units.)
Fig. 8. Structure of PE of the First Column of Subsystem A.
Fig. 9. Structure of PE of the Second Column of Subsystem A.
Fig. 10. Structure of PE of the Third Column of Subsystem A.
2.4.1. The operations of PEs

The data f(x, y) are input in a column-major manner. At the first time unit, the data are input to subsystem A, and at the third time unit, the data are input to subsystem B. In order to perform the calculation correctly, we have to skew the input data to meet the timing requirement. The data arrive at the corresponding processing elements of subsystem B at the (n + 3)th time unit, are stored in the registers, and wait for the start signal to perform the calculation. According to the data arrangement and Fig. 7, at the 2nth time unit, M00 is calculated and sent to the next column to calculate x̄. At the next time unit, M10 is calculated and used for computing x̄, and M00 is sent to the next column to perform the calculation of ȳ. At the next time unit, the (2n + 2)th time unit, M01 is calculated and used for computing ȳ. Then x̄ and ȳ are input to subsystem B. We can start to calculate (x - x̄)^p and (y - ȳ)^q at the (2n + 3)th time unit, and it takes max(p, q) time units to calculate them. We have to make the data and the start signal ready at the (2n + 3 + max(p, q))th time unit. Hence, the start signal initiates the calculation of subsystem B at the (2n + 3 + max(p, q) + 1)th time unit. The structure of the processing element of subsystem B and its symbolic representation are shown in Fig. 11 and Fig. 12, respectively. From the data arrangement and Fig. 7, we can see that it takes 2n + 3 + max(p, q) time units for issuing the start signal, and 2n - 1 + 2 + 1 = 2n + 2 time units to perform the moment computation. Thus, it takes a total of 2n + 3 + max(p, q) + 2n + 2 = 4n + max(p, q) + 5 time units to finish the entire calculation.
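The cycle counts quoted in Sections 2.2-2.4 can be compared directly; the small helper below simply evaluates the three closed-form expressions for a given image size and moment order (it does not model the hardware itself).

    def time_units(n, p, q):
        d = max(p, q)
        return {
            "1-D regular moments": d * n + n + 2,
            "2-D regular moments": 3 * n + 2,
            "2-D central moments": 4 * n + d + 5,
        }

    print(time_units(n=512, p=2, q=3))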
2.5. Algorithm Partition

When the size of a computational task is larger than the size of the VLSI architecture, we have to partition the task into smaller subtasks in order to calculate the moments on a fixed-size VLSI architecture. If we have an image of size k x l and a two-dimensional VLSI architecture of size m x n, and k and l are divisible by m and n respectively, we can easily partition the image and calculate the moments using
Fig. 11. Structure of PE of Subsystem B. (Annotations: the initial values stored in the x and y registers are 1; the f(x, y) in the register is output to the next processor when start = 0; the multiplexer selects p and q at the initial time and thereafter selects the data from the corresponding registers.)
Fig. 12. Symbolic Representation of Processing Element. (p' = p, q' = q, start' = start, f'(x, y) = f(x, y), x̄' = x̄ and ȳ' = ȳ, each with one time unit delay.)
the m x n architecture. If k and l cannot be divided by m and n, we can fill zeros into the last columns and the last rows of the image so that it can be divided by m and n. Now let us discuss how to implement the partition algorithm using a two-dimensional VLSI architecture of size m x n to calculate the regular moments. Assume the dimension of the image is k x l and ⌈k / m⌉ = s, ⌈l / n⌉ = t. We can divide the image into s x t subimages of size m x n, indexed by their row and column position from (1,1) to (s,t). Each subimage can be input to the two-dimensional array to perform the calculation. In order to perform the calculation, two sets of sequential pulses are needed. The first set of sequential pulses, with n + m + max(p, q) time units delay, is needed to increase the y index of the coordinates (x, y) of the
processing elements of the two-dimensional VLSI structure. Another set of sequential pulses, with n + m + max(p, q) + t delay, is needed to reset the y index and to increase the x index of the coordinates (x, y) of the corresponding processing elements. Therefore, at the first time unit, subimage (1,1) of the image is input to the VLSI structure. While the two-dimensional structure is calculating the regular moment of the (1,1)th subimage, the indices of the coordinates of the processing elements are: (1,1), (1,2), ..., (1,n), (2,1), ..., (2,n), ..., (m,1), ..., (m,n). When the processing elements have finished calculating x^p and y^q, y can be increased by n to match the corresponding indices of the coordinates for the next subimage (1,2), which are: (1, n+1), (1, n+2), ..., (1, n+n), (2, n+1), ..., (2, n+n), ..., (m, n+1), ..., (m, n+n). Clearly, when the subimage (1,t) is input to the VLSI structure, the indices of the coordinates (x, y) are: (1, (t-1)n+1), ..., (1, (t-1)n+n), (2, (t-1)n+1), ..., (2, (t-1)n+n), ..., (m, (t-1)n+1), ..., (m, (t-1)n+n). At the next time unit, the first pulse of the second set of sequential pulses is input; the processing elements reset the y index to 1, 2, ..., n and increase the x index by m. This procedure divides the k x l image into s x t subimages of size m x n. The VLSI architecture produces the regular moment of each subimage, and an accumulator adds the s x t partial moments, which gives the regular moment of the k x l image. Similarly, the method can be used to calculate the central moments on a fixed-size VLSI architecture.
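The bookkeeping performed by the two sets of pulses -- offsetting the (x, y) coordinates of each processing element so that every subimage is evaluated in global image coordinates and then accumulated -- can be mimicked in software as follows (a NumPy sketch under the stated zero-padding assumption; the ceiling divisions give s and t).

    import numpy as np

    def moment_partitioned(f, p, q, m, n):
        k, l = f.shape
        s, t = -(-k // m), -(-l // n)              # s = ceil(k/m), t = ceil(l/n)
        padded = np.zeros((s * m, t * n))
        padded[:k, :l] = f                         # zero-fill the last rows/columns
        total = 0.0
        for i in range(s):                         # subimage row
            for j in range(t):                     # subimage column
                tile = padded[i * m:(i + 1) * m, j * n:(j + 1) * n]
                x = np.arange(i * m + 1, (i + 1) * m + 1).reshape(-1, 1)   # global x
                y = np.arange(j * n + 1, (j + 1) * n + 1).reshape(1, -1)   # global y
                total += np.sum((x ** p) * (y ** q) * tile)                # subimage moment
        return float(total)                        # equals the moment of the k x l image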
3. Experimental Results
We have employed central moments as the features for breast cancer biopsy images and pavement distress images. The features are input into neural networks for classification. After training, a neural network can operate extremely fast, so the feature extraction (moment computation) becomes the bottleneck of the entire process. VLSI architectures for moments can remove this bottleneck.
3.1. Breast Cancer Detection

Breast cancer continues to be a significant public health problem in the United States. It was estimated that 182,000 new cases of breast cancer would be diagnosed and 46,000 women would die of breast cancer in 1995. One out of eight women in this country will develop breast cancer at some point during her lifetime. The incidence of breast cancer has been rising steadily, with the increase largely in smaller, earlier-stage tumors. This pattern suggests that the apparent increase in incidence may reflect earlier detection during a period when screening mammography was becoming more widely popularized. Earlier-stage tumors are more easily and less expensively treated. Because of this, and because of the high incidence of breast cancer in this country, any improvement in the process of diagnosing the disease may have a significant impact on years of life saved and on the cost to the health care system.
Here, we grade breast cancer in biopsy images. It is necessary for a physician to distinguish between benign lesions and the various degrees of malignant lesions from mammography or biopsy images. Bloom and Richardson (BR) [33] proposed a breast cancer grading method which provides a useful and reliable indicator of breast cancer. Physicians visually inspect microscope-slide biopsy images to assign the different grades of breast cancer. In this method, the number of tubular structures indicates the degree of breast cancer. Three criteria are used to detect the tubular structures in a biopsy image: (1) brightness -- tubular structures are brighter than the surrounding tissues and the background; (2) bright homogeneity -- the density inside the tubule boundary is uniform; and (3) a dark boundary surrounds the tubule area [20]. Normally, a classification into three degrees is used for the biopsy images of breast cancer. When there is a majority of tubular structures in the image, a high score of three is given. When there is little or no tubular structure in the image, a score of one is given. Intermediate differentiation obtains a score of two. To speed up the process and increase the accuracy, we use 25 central moments (p, q = 0, 1, 2, 3, 4) as the features of the biopsy images, and a neural network as a classifier to grade the tubules of the images. We have performed experiments on tubule grading using breast cancer biopsy images, and on pavement distress detection using digital pavement images. We first perform feature extraction on 591 breast cancer images.
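For concreteness, a 25-element central-moment feature vector (p, q = 0, ..., 4) such as those listed in Tables 1 and 3 can be computed as sketched below; the chapter does not state how the tabulated values were normalized into roughly [-1, 1], so no normalization is applied here, and the neural network classifier itself is omitted.

    import numpy as np

    def central_moment_features(f, max_order=4):
        rows, cols = f.shape
        x = np.arange(1, rows + 1).reshape(-1, 1)
        y = np.arange(1, cols + 1).reshape(1, -1)
        m00 = f.sum()
        xbar = (x * f).sum() / m00
        ybar = (y * f).sum() / m00
        feats = [float(((x - xbar) ** p * (y - ybar) ** q * f).sum())
                 for p in range(max_order + 1) for q in range(max_order + 1)]
        return np.array(feats)    # ordered m00, m01, ..., m04, m10, ..., m44 as in Table 1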
Fig. 13. Case 907408A with Score 1.
Fig. 14. Case 8912809B with Score 2.
Fig. 15. Case 8912809A with Score 3.
Table 1. The First 25 Central Moments of Figs. 13, 14 and 15.

Order of Moment    Fig. 13     Fig. 14     Fig. 15
m00                 0.8889      0.9791      0.9004
m01                 0.0000      0.0000      0.0000
m02                 0.8644      0.9967      0.9163
m03                -0.0190     -0.0190     -0.5169
m04                 0.8284      1.0000      0.9188
m10                 0.0000      0.0000      0.0000
m11                -0.0352     -0.4442     -0.5646
m12                 0.0005     -0.2889      0.0200
m13                -0.0815     -0.6459      0.0038
m14                 0.0010     -0.0468     -0.0022
m20                 0.8707      0.9858      0.8697
m21                -0.0255      0.3696      0.1468
m22                 0.8429      1.0000      0.8822
m23                -0.0306      0.0488      0.0626
m24                 0.8042      1.0000      0.8815
m30                -0.2534     -0.3456     -0.2952
m31                -0.1086     -0.6451     -0.1421
m32                -0.2380     -0.3734     -0.2880
m33                -0.2498     -0.8744     -0.0522
m34                -0.2280     -0.2954     -0.3043
m40                 0.8468      0.9909      0.8343
m41                -0.0074      0.1853      0.1603
m42                 0.8151      1.0000      0.8424
m43                -0.0034     -0.1613      0.0679
m44                 0.7772      1.0000      0.8416
Table 2. Comparison of the Results by Computer and Physicians.

Case number    Score rated by physicians    Score rated by computer
8912809A       3                            3
8912809B       2                            2
8922992A       2                            2
90336T3A       1                            1
90336T3B       1                            1
905840A        3                            3
905840B        3                            3
907408A        1                            1
907408B        1                            1
909511A        2                            2
Figures 13, 14 and 15 are sample breast cancer images with scores of 1, 2 and 3, respectively, and their corresponding moments are listed in Table 1. We construct a training set containing the central moments of 472 images from the entire breast cancer image set, and use the other 119 images as a testing set. After two hours of training on a Pentium 200 PC, the artificial neural network successfully converged. We then use the resulting neural network to classify the testing images into three classes without any misclassification. Table 2 shows part of the results from the experiments.

3.2. Pavement Distress Detection
Statistics published by the Federal Highway Administration indicate that maintenance and rehabilitation of highway pavements in the United States requires over $17 billion a year. Currently, maintenance and repairs account for nearly one-third of all federal, state, and local government road expenditures. According to the U.S. Department of Transportation (DOT), more than $315 billion will be needed through the year 2000 to maintain current road conditions [34]. Pavement analysis, as an essential element of pavement management, plays a very important role in reducing the cost and increasing efficiency and effectiveness. Conventional pavement distress analysis approaches, which rely on manual visual inspection, are costly, time-consuming, dangerous, labor-intensive, tedious and subjective; they have a high degree of variability, are unable to provide
Fig. 16. Pavement Image without Cracks.
Fig. 17. Pavement Image with Vertical Crack.
Fig. 18. Pavement Image with Horizontal Crack.
meaningful quantitative information, and almost always lead to inconsistencies in distress detail over space and across evaluations. Automated pavement distress analysis systems, especially real-time ones, are attracting more and more attention from transportation agencies and researchers in the United States and abroad. We use 25 central moments as the features to represent the images and input them into a neural network for classification. Figures 16-18 are pavement images without cracks, with a vertical crack, and with a horizontal crack, respectively, and their corresponding central moments are shown in Table 3. We randomly select 70% of the images from 280 pavement images as the training set, and the remaining 30% as the testing set. We only want to determine whether an image contains crack(s) or not, and do not classify the crack as vertical, horizontal or diagonal, etc.; this is only one part of a large automated real-time pavement distress classification system we are currently investigating. After about an hour, the neural network converged, and it was used to classify the pavement images in the testing set.

Table 3. The First 25 Central Moments of Figs. 16, 17 and 18.

Order of Moment    Fig. 16     Fig. 17     Fig. 18
m00                 0.9304      0.9617      0.9677
m01                 0.0000      0.0000      0.0000
m02                 0.9303      0.9615      0.9670
m03                -0.4166      0.4368      0.3944
m04                 0.9302      0.9609      0.9660
m10                 0.0000      0.0000      0.0000
m11                -0.4457     -0.3401     -0.3494
m12                -0.1817     -0.0182     -0.2615
m13                -0.3429     -0.2638     -0.2367
m14                 0.1144      0.0015     -0.0632
m20                 0.9132      0.9669      0.9613
m21                 0.7900     -0.3263     -0.5389
m22                 0.9129      0.9661      0.9586
m23                 0.5850     -0.0995     -0.2778
m24                 0.9130      0.9651      0.9575
m30                 0.3263     -0.7318      0.2632
m31                -0.5276     -0.4280     -0.4248
m32                 0.2939     -0.7532      0.2015
m33                -0.4144     -0.3493     -0.3211
m34                 0.3222     -0.7284      0.2000
m40                 0.8956      0.9654      0.9524
m41                 0.8350     -0.3355     -0.6792
m42                 0.8953      0.9644      0.9481
m43                 0.6112     -0.1539     -0.4235
m44                 0.8953      0.9631      0.9464
Table 4. Comparison of the Results by Humans and Computer.

Case number    Crack detection by Human    Crack detection by Computer
Picture 1      0                           0
Picture 2      1                           1
Picture 3      0                           0
Picture 4      0                           0
Picture 5      0                           0
Picture 6      1                           1
Picture 7      1                           1
Picture 8      1                           1
Picture 9      1                           1
Picture 10     1                           1
Table 4 shows part of the results from the experiments; a 100% classification rate was achieved.

4. Conclusions
Moments have many applications in pattern recognition. We have employed moments as the features for breast cancer images and pavement distress images, and input these features into neural networks for classification. Satisfactory results have been achieved. In order to speed up the process, we proposed a one-dimensional structure, which needs fewer processing elements and has a total calculation time of max(p, q) x n + n + 2 time units for computing the moments of order (p + q). The two-dimensional structure needs more processing elements, but its total calculation time is 3n + 2 time units. If there are k images, the computational time for the one-dimensional structure is k[max(p, q) x n + n + 2]. For the two-dimensional architecture, it takes (k - 1) x 2n + 3n + 2 = 2nk + n + 2 time units. If a uniprocessor is used, the time complexity is (p + q)n^2. We also presented a two-dimensional VLSI architecture for calculating the central moments; it takes 4n + max(p, q) + 5 time units to compute the central moments of order (p + q). The comparison of the time complexities of the existing faster algorithms with those of our proposed approaches is shown in Table 5.
Table 5. Time Complexities for Faster Algorithms.

                         Single        Multiprocessor
Method                   Processor     Linear array    2-D array    VLSI
Jiang's Method [29]      O(mn)         X               X            Y
Li's Method [30]         O(mn)         X               X            Y
Chen's Method [31]       X             O(n)            X            N
Our Method               X             O(n)            O(n)         Y
The important issue in VLSI design, algorithm partition, has also been studied. The proposed VLSI architectures will find wide application in pattern recognition, image processing and computer vision, especially for real-time processing.
References

[1] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd edn. (Addison-Wesley, MA, 1992).
[2] M. K. Hu, Visual pattern recognition by moment invariants, IRE Trans. Information Theory IT-8 (1962) 179-187.
[3] C. H. Teh and R. T. Chin, On image analysis by the methods of moments, IEEE Trans. Pattern Anal. Mach. Intell. 10 (1988) 496-512.
[4] M. R. Teague, Image analysis via the general theory of moments, J. Opt. Soc. Am. 70 (1980) 920-930.
[5] Y. S. Abu-Mostafa and D. Psaltis, Recognitive aspects of moment invariants, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-6 (1984) 698-706.
[6] S. A. Dudani, K. J. Kenneth and R. B. McGhee, Aircraft identification by moment invariants, IEEE Trans. Comput. 26 (1977) 39-46.
[7] C. H. Chou and Y. C. Chen, Moment-preserving pattern matching, Pattern Recogn. 23 (1990) 461-474.
[8] R. Y. Wong and E. L. Hall, Scene matching with invariant moments, Computer Graphics and Image Processing 8 (1978) 16-24.
[9] R. G. Casey, Moment normalization of handprinted characters, IBM J. Res. Develop. (1970) 548-557.
[10] G. L. Cash and M. Hatamian, Optical character recognition by the method of moments, Computer Vision, Graphics, and Image Processing 39 (1987) 291-310.
[11] S. Ghosal and R. Mehrotra, Orthogonal moment operators for subpixel edge detection, Pattern Recogn. 26 (1993) 295-306.
[12] A. Khotanzad and Y. H. Hong, Invariant image recognition by Zernike moments, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 489-497.
[13] H. K. Saradana, M. F. Daemi, A. Sanders and M. K. Ibrahim, A novel moment-based shape description and recognition technique, in IEE 4th Int. Conf. Image Processing and Its Applications, Apr. 1992, 147-150.
[14] S. T. Liu and W. H. Tsai, Moment-preserving corner detection, Pattern Recogn. 23 (1990) 441-460.
[15] S. O. Belkasim, M. Shridhar and M. Ahmadi, Pattern recognition with moment invariants: A comparative study and new results, Pattern Recogn. 24 (1991) 1117-1138.
[16] A. P. Reeves, R. J. Prokop and S. E. Andrews, Three-dimensional shape analysis using moments and Fourier descriptors, IEEE Trans. Pattern Anal. Mach. Intell. 10 (1988) 937-943.
[17] K. S. Fu, VLSI for Pattern Recognition and Image Processing (Springer-Verlag, New York, 1984).
[18] T. Y. Young and K. S. Fu, Handbook of Pattern Recognition and Image Processing (Academic Press, New York, 1986).
[19] H. D. Cheng and C. Xia, A novel parallel approach to character recognition and its VLSI implementation, Pattern Recogn. 29 (1996) 97-119.
[20] H. D. Cheng, X. Q. Li, D. Riordan, J. N. Scrimger, A. Foyle and M. A. MacAulay, Parallel approach for tubule grading in breast cancer lesions, Information Sciences, An International Journal, Applications 4 (1995) 119-141.
[21] H. D. Cheng and X. Cheng, Parallel shape recognition and its implementation on a fixed-size VLSI architecture, Information Sciences, An International Journal 2 (1994) 35-59.
[22] H. D. Cheng and C. Tong, Clustering analyzer, IEEE Trans. Circuits Syst. 38 (1991) 124-128.
[23] H. D. Cheng, Y. Y. Tang and C. Y. Suen, Parallel image transformation and its VLSI implementation, Pattern Recogn. 23 (1990) 1113-1129.
[24] H. D. Cheng, C. Tong and Y. J. Lu, VLSI curve detector, Pattern Recogn. 23 (1990) 35-50.
[25] H. D. Cheng, H. Don and L. Kou, VLSI architecture for digital picture comparison, IEEE Trans. Circuits Syst., Special Issue on VLSI Implementation for Digital Image and Video Processing Applications, 36 (1989) 1326-1335.
[26] H. D. Cheng and K. S. Fu, VLSI architectures for string matching and pattern matching, Pattern Recogn. 20 (1987) 121-141.
[27] H. D. Cheng and K. S. Fu, Algorithm partition and parallel recognition of general context-free languages using fixed-size VLSI architecture, Pattern Recogn. 19 (1986) 361-372.
[28] H. D. Cheng and K. S. Fu, VLSI architecture for dynamic time-warp recognition of handwritten symbols, IEEE Trans. Acoust., Speech, and Signal Proc. 34 (1986) 603-613.
[29] X. Y. Jiang and H. Bunke, Simple and fast computation of moments, Pattern Recogn. 24 (1991) 801-806.
[30] B. C. Li and J. Shen, Fast computation of moment invariants, Pattern Recogn. 24 (1991) 807-813.
[31] K. Chen, Efficient parallel algorithms for the computation of two-dimensional image moments, Pattern Recogn. 23 (1990) 109-119.
[32] H. T. Kung, Why systolic architectures? Computer (1982) 37-46.
[33] H. J. G. Bloom and W. W. Richardson, Histological grading and prognosis in breast cancer: A study of 1409 cases of which 359 have been followed for 15 years, Br. J. Cancer 11 (1957) 359-377.
[34] S. Tsao, N. Kehtarnavaz, P. Chan and R. Lytton, Image-based expert-system approach to distress detection on CRC pavement, J. Transport. Eng. 120 (1994) 52-64.
INDEX
(MCE/GPD), 489 (ROC), 171 2-D, 390, 408, 409 3-D, 388, 390, 397, 398, 400, 410, 411, 413-415 animation, 445 continuous, 341 motion capturing, 426 motion estimation, 341 motion sensors, 427 object recognition, 431 positioning, 328 reconstruction, 428, 804 shape representation, 805
multivariate, 630 pattern recognition approach, 630 remotely sensed image, 625, 627 structural, 643, 659 terrain, 635, 658 aperiodic, 208 application, 668 approximate transformations, 780 areal entity, 631 areas of interest (AOI), 697 arity, 741 array grammar, 184, 185 universal, 185 array shaped sensor, 690 ART, 109 articulated object, 935 Artificial Neural Net (ANN) computing, 105 artificial neural networks, 479 aspects, 929, 932 associative memory, 125, 535, 536 associative operator, 939 attribute, 625, 633, 639, 654 aspatial, 632, 633, 654 locational, 625, 626, 631, 632, 638, 651, 652 nonlocational, 625, 626, 631, 632, 638, 651, 652 attributed grammar, 72 attributed string, 73 auto-associative memories, 129 auto-component selection (ACS) method, 617 autocorrelation features, 221 automated face recognition (AFR), 668 automated inspection, 213 Automatic Vehicle Guidance, 818 autonomous driving, 839 Autonomous Land Vehicle In a Neural Network (ALVINN), 824 autonomous mobile robots, 765 AVIRIS, 518 data, 520
aberration, 693 acoustic emission NDE, 463 feature vectors, 474 model, 474 signal spectral ordering, 114 active vision, 864 adaptive clustering neural net, 884 adaptive threshold feedback, 903 address block location, 813 af€ine objects, 933 transformations, 933 agglomerative merging, 722 AGVs (Autonomous Guided Vehicles), 807 algorithm, 112 partition, 991 alignment, 926, 927, 937 coefficients, 937, 941 transformation, 928 amplitude ratio, 459 analog map, 629, 634 analogue/digital interface, 699 analysis autocorrelation, 654, 657 combined, 626, 634 image, 627, 630, 637-639, 641, 647, 652, 659 map data, 625, 628, 651 map-guided image, 630
backlighting, 691, 701 backpropagation, 515, 516 1003
1004 Index algorithm, 480 backpropagation-of-error learning algorithm, 116 backtracking, 748 Bailey profile, 25 bar code, 217 Bar Code Sorter (BCS), 801 base knowledge, 625, 628, 630, 631, 634, 639, 651, 661 baseline distributions, 25 basic model, 581, 582 basic object, 584, 585 basic projective invariant, 317 basis, 316 Bayes classifier, 36 for minimum cost, 36 decision rule, 474, 476, 648 theory, 475 error, 36, 38, 50 risk, 36 Bayesian estimate, 255 Bayesian rule-based classifier, 649 Bayesian statistical foundation, 783 bending, 401, 404 between-class scatter matrix, 41 Bhat tacharyya bound, 40 distance, 47 bias of the voting kNN classification, 57 bigram model, 475 binary, 6 binocular stereo, 778 biological neuronal spatial structures, 114 signals, 613 biophysical parameters, 627, 638 bloodspots, 705 body rotation angular frequency, 377, 379 Boltzmann machine, 117, 435 bone detection, 692 boundary, 642, 646, 652 detection, 213, 265 length, 654 oriented approaches, 646 Boundary Contour System, 250 box dimension, 230
plots, 14 breast cancer detection, 919, 993 browsing tools, 971
C index, 28 CAD model, 790 Calinski-Harabasz index, 28 camera, 690 calibration, 321 problem, 771 operation, 959 cart , 778 cartography, 627-629, 633 cartometry, 651 categorical data, 658, 659 causal graph, 741, 745 links, 759 processes, 742 causality, 738 CCD camera, 537 cell structure models, 240 certainty grid-based methods, 782 change detection, 630, 635, 638, 659, 660 character recognition, 812 characteristic dimension, 241 charge density, 390 distribution, 390, 395 charge distribution, 394 Charge Transfer Devices (CTD), 690 Chasles, 320 Chernoff faces, 14 chromosomes classification, 568 circadian cortisol pattern, 618 class, 754, 758, 762 classification, 54, 249, 474, 698 accuracy, 629, 641, 649, 661 confusion matrix, 649 algorithms, 36 of pulmonary disease, 215 classifier, 474, 629, 648, 649, 659 classifier design, 33, 41 clauses, 762 cluster analysis, 3, 239 prototype, 112 validity, 27 clustering, 4, 16, 630, 645 algorithms, 16
Index 1005 co-occurrence matris, 219, 220, 249, 250, 640 coarse grid, 347 coefficient equation, 369, 370, 372 collineation, 316 collisions, 364 color, 283-285, 635, 639, 643, 649, 689 camera, 704 coding schemes, 908 constancy, 294, 296 filtering, 694 grader, 704 images, 694 indexing, 302 inspection, 703 colorimeter, 704 colorimetry, 288 combinations of classifiers, 567 combinatorial optimization problem, 132 compactness, 639, 642, 661 comparison map, 627, 657 complete-link method, 19 completeness method (see complete-link method) complexity, 413 composite materials, 462 object, 584, 585 compression, 633 computation, 325 computer generated hologram (CGH), 870 computer vision, 283, 296, 391, 464, 625, 627, 638, 650, 765, 857 system, 647 techniques, 858 computing optical flow, 252 condensed nearest neighbour (CNN) classifier, 916 conditional probability, 475, 477 risk, 476 Condorcet’s jury theorem (CJT), 574 conformal motion, 438 confusion matrix, 649 conic, 320, 333 conjugate-gradient optimization, 515 connected component labelling (CCL), 903
Connected Components Analysis (CCA), 602 connectedness method (see single link method) consensustheory, 516 consistency with unknown samples, 477, 478 constraint, 742, 750, 751, 762, 763, 859 logic programming, 742, 745, 762 resolution, 753 solving, 763 context, 762 adaptation, 737, 743, 746 analysis module, 812 causal graph, 745 description, 740 graph, 743 model, 742, 751 modeling, 739 recognition, 737 separation, 737, 739, 743, 745, 751 unstationarity, 738, 744 continuous, 6 contour function, 642 contrast, 221 convex combinations, 933 object, 933 cooperative processing, 251 cophenetic dissimilarity, 17 corner, 748 detection, 755 cornerness, 342, 343 corrective training, 487 correlated voting, 574 correlation, 221, 537 classifier, 41 filters, 879 matrix, 11 correlator, 879, 881, 883 solid optics, 872 correspondence, 929 cortisol time series, 619 covariance matrix, 11 coverage models, 240 criterion, 11 cross ratio, 317, 320, 333, 335 crossovers, 21 crystal, 539 cues, 739
1006 Index Culler Facer Canceller (CFC) machine, 801 curvature, 934 computation, 390, 407 Cushing’s syndrome patterns, 620 Dammann grating, 556 data, 4, 518 association, 436 coherence, 634 fusion, 516 layer, 632, 634 mining, 631 quality, 633, 634 sources, 529 type, 6 database, 628, 631, 632, 661 updating, 630 Decision Boundary Feature Extraction, 513 deconvolution, 458 decorrelation methods, 250 defect characterization, 456 detection, 213, 729 deformable models, 438, 805 Delaunay graph, 224 triangles, 654 triangulation, 805 delegation, 758 Delivery Bar Code Sorter (DBCS), 801 Demster-Shafer classifier, 649 dendrogram, 16, 19 density changes, 241 function, 52 depth of view (DOV), 693 Derin-Elliot model, 228 description language, 594 design, 477 objective, 500 destriping, 901 detection probability, 37 deterministic relaxation, 250, 256 Dickmanns’ 4-D approach, 834 difference of Gaussian (DOG) filter, 225 differential geometry, 431 Digital Elevation Map (DEM), 790
Digital Elevation Model (DEM), 629, 630, 634, 638, 651, 652, 658, 659 digital mammography, 166 map, 626, 627, 634, 644 radiography, 468 dilation, 875 dimensionality, 10 Direct Connection Graph, 399 disambiguation, 760 discrete, 6 3-D motion estimation, 340 discriminant analysis, 512 discriminant function, 491, 494, 499, 630, 648 Discriminative Feature Extraction (DFE), 490, 497 Discriminative Metric Design (DMD), 490 discriminative training, 475, 477, 479, 499 disk shaped sensors, 690 displacement field, 342, 345, 352 dissimilarity matrix, 8 distance classifier, 41, 510 metric, 938 distanced labelled skeleton, 698 distortion-invariant filters, 879 distributed content-addressable memories, 126 distribution-based classification, 713 document analysis, 581, 582 processing, 217 structures, 583 understanding, 581, 582, 593, 594 dynamic, 474 programming, 250, 482, 486, 491 range, 690 scene analysis, 828 ECHO classifier, 510 eddy current NDE, 462 edge density, 231, 640, 644 detection, 389, 839 detectors, 697 thresholding, 644, 646
Index Edge Visibility Regions (EVRs), 792 junction graphs, 390 edgeness, 342, 349 effect of design samples, 48 effect of sample size on estimation, 46 effect of test samples, 48 eigen-problem, 440 eigenvalues, 11 eigenvector transformation, l(t 12 elastic matching, 673, 675 electrical equilibrium, 395 electrical potential, 395, 396 electromagnetic radiation, 639 electrostatics , 393 elongatedness, 642 energy, 221 density, 615, 617 entity, 633, 647 epipolar lines, 428 constraint, 361 geometry, 323, 325 equivocation, 40 erosion, 875 Error Correction Training, 478 error-reject curve, 38 essential energy density, 620 estimation, 52 of classification errors, 46 sensitivity, 477 Euclidean distance, 110, 391, 392, 404 metric, 109 Euler's equations, 365 Expectation Maximization (EM), 161 method, 475 Expert Prospector, 638 expert system, 638, 659 exploratory data analysis, 10 extended Kalman filtering technique, 780 exterior orientation parameters, 771 external, 24 extrapolation capabilities, 123 Extrema, 640 face databases, 675 finder, 676 finding, 670
1007
recognition, 667 sensitive cells, 927 facial expression, 430 false alarm, 37 feature, 6, 154, 302, 306, 747 correspondence, 442 distributions, 712, 716 extraction, 474, 459, 511, 639, 641, 650, 877, 993, 994 extractor, 474, 497 interactions, 251 feeding system, 700 Field-Of-View (FOV), 692, 697 figureground separation, 211 filter, 880 synthesis, 879 filtering, 694 finite element, 397 finite representation, 187 first-order neighbors, 228 fish industry, 688, 701 Fisher linear discriminant, 462 Fisher's criterion, 42 flats sorting machines, 802 floor map, 778 focal length, 692 food industry, 687 environment, 695 foreshortening effect, 241 form description language, 596 form document processing, 595, 597 form registration, 597 formal logic, 649 knowledge, 638 representation, 637 rule-based classifier, 649 formal specification, 749 formatting knowledge, 594 forward looking IR imagery, 899 Fourier domain, 408, 409 filtering, 233 Fourier filtering method, 240 Fourier power spectrum, 250 Fourier transform, 43, 222, 408, 537, 538, 545, 614, 618, 870 features, 249 space, 644 fractal, 229, 631, 643 Brownian motion, 216 dimension, 229, 230
1008 Indez autoregressive Markov random field model, 218 model, 210 texture features, 216 Fractional Difference Models (FDM), 275 fractional models, 250 frame buffer, 699 representation, 637, 638 frequency analysis, 231 and orientation selective filter, 235 ratio, 459 front lighting, 691 functional mapping, 116, 120 functional-link net, 119, 123 approach, 120 fusion technique, 464 fuzzy classifier, 649 logic, 658 Fuzzy-C-Means (FCM) clustering, 159 Gabor and wavelet models, 234 Gabor filter, 214, 219, 235 Gabor function, 235 gathering geographic data, 625 Gauss’s law, 395 Gaussian Markov random field (GMRF), 151 model, 210, 240, 249, 253 Gaussian maximum likelihood classifier, 509 Gaussian random field models, 250 Gaussian surface, 431 gelatin filters, 695 general expanded finite-state grammar, 66 generalization, 117 capability, 477 generalized delta rule (GDR) multilayer feedforward net, 117 Generalized Probabilistic Descent (GPD), 485 method, 488 geographic data, 625627, 629-634, 638, 651, 658, 659, 662 analysis, 625, 627 mapping, 627
geographic database, 625, 626, 630-634, 638 geographic feature, 625, 630, 631, 634, 635, 637-639, 643 geographic information, 625-628, 658 Geographic Information Systems (GIs), 625, 626 geographic knowledge, 630, 631, 634, 635, 637, 638, 650, 658, 661 geographic space, 627, 628, 630-633 categorisation, 625, 638, 662 segmentation, 625, 628, 638, 662 geometric, 581, 585 complexity, 585 feature, 432, 641 geometrical methods, 223 geomorphology slope, 658 watershed basin, 644, 658 geomorphometric variables, 658 geons, 390 Gibbs distribution, 153 energy function, 250 random field, 228 sampler, 259 glass filters, 695 global constraint, 388, 391, 393 goal, 742, 745, 750 Goodman-Kruskal 7 statistic, 28 gradient descent algorithm, 13 gradient search task, 132 gradual transition detection, 951 grammatical inference, 65 graph matching, 698 graph-based approaches, 780 graphic semiology-rule, 652 graphical representation of data, 14 gray level, 639, 640, 644, 645, 661 group average, 20 grouping, 265 hand thermal characteristics, 910 handwritten address reading, 801 heuristic rule, 637 Hidden Markov Models (HMMs), 475 hidden-layer node, 116 hierarchical browsers, 972 clustering, 4, 16, 17
Index filter, 882 space representation, 633 splitting, 721 hill-climbing, 133 search, 784 histogram, 14, 639-641, 645, 648, 660, 661 hit-or-miss, 875 transform, 698 holdout, 50 homogeneity, 221 Hopfield net, 126 auto-associative memory, 131 Hopfield type network, 251 Horizon Line Contour (HLC), 791 Hough transform, 599, 870, 877, 878 Hughes phenomenon, 514, 521, 524 human perception, 392 human-computer interaction, 672 hybrid, 516 hyperdimensional data, 508, 515, 531 hyperspherical regions, 109 hypothesis tests, 36 ideal texture, 227 identification codes, 789 ill-definedness, 861 ill-posedness, 860 illumination independent, 674 illusory contours, 266 image acquisition, 689, 701, 704 analysis, 217 and classification, 629, 637, 640, 647-650, 658, 659 attributes, 342, 345 classification, 627, 634, 647 compression, 227 coordinates, 353 data, 625, 628, 630, 631, 633, 634 database, 625, 631, 633, 668, 670 enhancement, 696 feature, 639, 640, 648, 649 intensity, 342 interpretation, 630, 637, 638, 642, 643, 649-651 Matching, 342 Modeling, 151 pixel, 647 processing, 627, 629, 705
1009
for food products, 696 hardware, 698 rectification, 325 restoration, 227, 252, 253 segment, 639, 640, 642, 647 segmentation, 4, 227, 144, 645 space, 630, 634, 640, 646 standard, 674 structural features detection, 643 texture, 208 understanding, 629, 637, 650, 651, 738, 739, 741, 742, 759 system, 627, 629, 630, 645, 650, 651, 660 vector, 353 inequality triangle, 8 independent invariants, 333 voting, 574 index of proximity, 8 overlay, 658 indexing, 302 indices of validity, 26 inference of attributed grammar, 74, 82 of expansive tree grammar, 92 of finite-state grammar, 65 influence, 742-744 influencing domain, 741, 753 informat ion extraction, 625, 626 processing, 105 visualization, 627 infra-red, 672 (IR) technology, 891 detectors, 892 hescan imaging, 893 inheritance, 755, 757, 759 inspection, 213 instance, 762 instantaneous field of view (IFOV), 639 power, 615 signal power, 615, 617 instantiation of rule, 639 interactions, 738 interactive video, 672 Interconnect Matrix Hologram (I.M.H.), 539, 540, 546, 561, 562
1010 Indez interest operator, 778 interference filters, 695 interferometry, 638 interior orientation parameters, 771 intermediate views, 927, 933 internal, 24 interpolation, 123 interpretation of geographic data, 630 of maps, 217 interval, 5 intra-layer node-tc-node interactions, 118 intractability, 862 intrinsic dimensionality, 54 property, 294 invariant, 334, 335 feature, 641, 642 theory, 332 inverse problem, 456 ISODATA algorithms, 111 isoline, 633 isotropic metric, 109 Iterated Conditional Mode algorithm, 257 ith body vector, 376 joint disease diagnosis, 907 Ic nearest neighbor (kNN), 44 approach, 52 classifier, 55 voting, 55 density estimate, 52 K-means clustering algorithm, 22, 111, 462 pass, 23 Kalman filter, 779 approach, 383 Kappa coefficient of agreement, 649 Karhunen-Loeve expansion, 618 transformation, 10 KEN, 680 kernel function, 52 key-frame extraction, 961 Kittler & Young method, 912 knowledge based approach, 740
discovery, 631 representation, 637 models, 651 -based interpretation, 630 Kriging, 654 labeling problem, 430 lacunarity, 230 land cover, 633, 637, 638, 640, 648 use, 632, 635, 637, 642, 659, 661 classification, 218 landmark-based methods, 766, 767 disadvantages, 771 Laplacian operator, 232 Laplacian-of-Gaussian (LOG) filter, 224 laser range finder, 778 lattice, 632, 633 Laws features, 249 LC Scheme, 936 LCAM model, 364, 367, 378 leakage, 37 learning a discriminant function, 115 Learning Vector Quantization (LVQ), 484 algorithm, 112 least squares method, 254 leave-one-out (L) method, 50 lens quality, 693 leukemic malignancy, 215 lexical knowledge, 802 lighting techniques, 691, 701 line detection, 644, 660 line drawings, 391 linear classifier, 41, 648 combinations, 926, 941 scheme, 929 entity, 631, 633 equations, 413 opinion pool, 516 projections, 10 linked regions, 781 Local Binary Pattern (LBP) texture operator, 715 dimensionality, 54 morphological dilation, 903 localisation, 811 road modeling, 821
Zndez
locally constant angular momentum (LCAM) model, 341, 364 logarithmic opinion pool, 517 logical, 586 structure, 586, 581 loss, 478, 480, 491 luminous sensitivity, 690 magnification, 692 mail processing, 798 sorting, 797 majority vote, 570 man-machine interaction, 804 map, 781 data, 630 edition, 627 overlay, 626, 658 updating, 626, 627, 651, 660 guided (M-G) approach, 659 making, 766, 778, 783 sonar, 782 MAP solution, 257 mapping, 626 Markov filters, 145 Markov random field model, 227, 228 Match Primitive Measure (MPM), 79 matched filter, 537 matching, 670, 677 algorithm, 352 mathematical features, 459 morphology, 698 matrix associative memory, 126 matrix-vector multiplier, 872 maximum a posteriori (MAP) estimate, 250 a posteriori segmentation, 163 frequency difference (MFD), 238 likelihood (ML) method, 475 classifier, 462, 648, 658, 659 estimate, 242 method, 19 Mutual Information (MMI), 487 neuron,133 Posterior Marginal (MPM) algorithm, 251 MAXNET, 111 MCE/GPD, 496
1011
measure of the similarity, 110 measurement, 626 error, 697 median filter, 901 medical image analysis, 215 processing, 213 mental rotation, 938 effect, 926 merger compatibility, 744 Merger Importance (MI), 722 metadata, 634 Method, 50, 478, 516, 755, 762 metric, 52, 639 Minkowski, 8 microcalcification detection, 167 midpoint displacement method, 240 military infra-red imaging, 892 military surveillance: downward looking IR imagery, 893 minimax test, 37 minimum, 510 classification error (MCE), 485, 489, 495 distance classifier, 462 distance GECTA, 98 distortion principle, 475 error rate classification, 476 Euclidean distance classifier, 648 noise and average correlation energy (MINACE), 880 square error method, 19 Distance Error-Correcting Earley’s Parsing, 72, 76 Finite-State Parsing, 70 parser (MDECP), 63 length descriptions, 861 misclassification measure, 492, 501 mixture scatter matrix, 41 model, 585, 586 based methods, 227 calibration, 627 computer, 632 computer-aided design (CAD), 632 edge fitting, 644 environment, 627 fitting, 393, 406 geographic data, 631 mosaic, 240 multiple spatial resolution, 633
1012 Indez object-oriented, 632 pyramid raster, 633 raster, 631-633 recovery, 389, 391, 413 relational, 632 topological, 632 validation, 38 vector, 631433 based compression, 669 modeling, 284, 626 image radiance, 638 predictive spatial, 630 prescriptive spatial, 630 spatially dynamic processes, 630 Modified Maximum-Likelihood SPECTA, 96 modular hardware structure, 699 moments, 639, 642, 979, 981 of area, 225 of the kNN distance, 53 monitoring, 626, 627, 630, 638, 658 monocular obstacle detection, 836 vision, 380 monotonicity, 21 morphological closing, 902 operators, 644 processors, 874 transform, 875 motion, 364 estimation, 353 algorithm, 355 modeling, 364 prediction, 341 nonrigid, 437 of a rigid body, 364 synchroniser, 690 tracking, 431 with precession, 373 without precession, 367 MPM algorithm, 259 MRF, 253 mug shot, 670 multi-channel filtering approaches, 213 multi-grid, 347 Multi-Layer Perceptron (MLP), 480, 515 multi-resolut ion, 347 analysis, 631
imagery, 633 processing, 234 multi-scale blob detection, 226 multidimensional scaling, 13 multifocus Fourier transform device, 545 multilayer backpropagation network, 462 multilevel image context representation, 746 multilevel image understanding, 738, 740 multimedia, 668 multiple channel, 233 multiple scales, 242 multiplestate TDNN (MS-TDNN), 483 multiresolution wavelet analysis (MWA), 144 multisensory Optoelectronic Feature Extraction, 536 neural associative retriever, 538, 555 multisource data, 508, 515 multispectral, 304 multivariate analyses, 626 multiview, 399, 400, 416 naming tasks, 926 error rates, 927 natural language, 759 navigation, 767, 859 line, 787 nearest neighbor analysis, 630, 652 decision rule, 462 method, 19 network drainage, 630, 643, 654 representation, 632 road, 632, 643, 652, 654, 659, 660 tracking, 634 neural associative retriever, 551 neural net, 883 classifier, 884 production system, 886 neural network, 252, 255, 514, 515, 630, 648, 649, 658, 660, 698, 940 classifiers, 462, 463 neutral density filtering, 694 neutron radiography testing, 467 Neyman-Pearson test, 37
noise, 696, 861 filtering, 758 nominal, 5 nondestructive evaluation (NDE), 455 testing (NDT), 455 nonlinear projections, 12 nonparametric procedures, 51 normalization, 9, 349 normalized camera model, 353 novel view, 929 object characteristics, 689 decomposition, 399 feature-based methods, 778 motion, 959 oriented context representation, 754 oriented design, 738, 740 pattern representation, 183 recognition, 387, 391, 667 representation, 925, 926 segmentation, 389 size, 689 centered representations, 925, 926, 940 context, 738 interaction, 753 object interaction, 738, 743, 751 oriented representation, 637 objective, 477 function, 403, 404, 406, 415 objects with sharp edges, 934 objects with smooth bounding surfaces, 934 occlusion, 351, 744 map, 342, 351 octrees, 805 one-class classifier, 38 ontology, 759 opaque objects, 932 operating characteristics, 37 operation arithmetic, 658 Boolean, 658 operator, 779 global, 644 gradient, 644 Laplace, 645
local, 644, 645, 660
LOG, 644 Nagao, 644 Non-maximum suppression, 644 Sobel, 644 optical correlator, 870, 871 flow, 341 networks, 674 pattern recognition, 535, 536, 869 processing, 869 architecture, 870, 872, 873 optics, 692 optimization, 130, 382, 478, 515 method, 477 problem, 435 ordering process, 113 ordinal, 5 ordination, 10 orthographic projection, 930 overall risk, 476 overdetermination, 373, 378 parabolic line, 389 parallel implementation, 258 processing, 183 parallelism, 189 parametric geons, 388, 400, 401, 403, 406, 413 parasites detection, 692 part boundaries, 395 identification, 389, 390 localisation, 389 model, 388, 403 segmentation, 388, 389, 410 -based descriptions, 389 partial description, 863 partition, 16 partitional clustering, 4, 21 Parzen approach, 52 classifier, 54 density estimate, 52 pattern, 6 areal, 654 areal analysis, 654 classification, 250, 459, 462 line, 654
analysis, 654 matrix, 6, 9 recognition, 509, 979 approaches, 658, 659 of plasma cortisol signal, 618 statistical, 629, 630, 648 syntactic, 61, 630 representation, 185 visual, 639 pavement distress detection, 997 Pearl Bayes networks, 893 pencil of lines, 318 perception, 283, 290, 739 engineering, 865 perceptual, 288 context modeling, 758 context separation, 740 periodic, 208 perspective imaging system, 322 projection, 326, 335 transformation matrix, 773 n-point, 773 phase conjugate beam, 539 mirror, 539, 540, 541, 547, 561 photogrammetry, 314, 771 photography, 314 photometric stereo, 865 photorefractive, 539 crystal, 535, 542, 547, 549, 554, 556, 557 physics-based vision, 305, 306 piecewise classifiers, 44 pixel, 741 classifiers, 510 image, 639, 640 -oriented approaches, 645 pixelwise classification, 722 placement rule, 207, 223, 226, 227 planar rigid motion invariant (PRMI), 344 point patterns, 652 processes, 744 polarisation, 639 filtering, 694 polygon, 654, 657, 661 Thiessen, 654 polyhedron, 397
pose, 766 position, 766 estimation, 766 postal address recognition, 217 automation, 797 mechanisation, 798 posture estimation, 670 potential functions, 228 power spectrum, 222 transforms, 40 pre-processing, 696 preattentive segmentation, 249, 261 preattentively discriminable, 211 precession, 364, 371, 374 precessional angular frequency, 375, 377, 379 motion, 366 vector, 375, 377 predicate, 741, 747, 749, 750 binary, 639 prediction, 364 preloaded world model, 767 preprocessing, 349 presentation error, 697 principal component analysis, 511, 670 transformation, 10 probabilistic context-free grammar, 475 Descent Method (PDM), 489 descent theorem, 492 relaxation labeling, 238 probability a posteriori, 474, 477 a priori, 477 maps, 527 of rejection, 38 processing attribute, 651 high-level, 650 intermediate-level, 650 low-level, 650 projection, 600 matrix, 323 projective coordinates, 319 system, 319 geometry, 313-315, 335
invariant, 332 plane, 319 space, 315, 316 transformation, 316, 326 Prolog, 741, 758, 762 proximity matrix, 6, 8 psychophysics, 211 punctual entity, 633 pyramid, 192 quad-tree, 633 quadrat, 652 quadratic classifier, 43, 648 constraints, 931 invariant, 939 qualitative method, 783 shape, 393 quality control, 704 quantization, 639-641 quasi-rigid, 431, 437 radial alignment constraint, 774 basis functions, 933 curvature, 934 radiant sensitivity, 690 radiographic NDE, 464 RALPH, 825 Rand index, 25 random field models, 227, 249 range image, 390, 410 map, 431 method, 9 RANSAC paradigm, 772 ratio scales, 5 RBC, 391, 392 real-time, 668 processing, 980 receiver operating characteristic, 171 recognition, 294, 302-304, 474, 859 of facial expressions, 446 operator, 937-941 Recognition-by-Components (RBC), 390 recovery, 859 recursive blurring, 350 recursive parameter estimation, 828
recursive-batch approach, 383 reflectance, 284-287, 296, 301, 689 reflection, 285, 300, 306 region based random fields, 250 growing, 389 reject, 38 segmentation, 897 -oriented approaches, 645 regions, 781 of interest (ROIs), 877 regularization, 860 relative positioning, 328, 335 relaxation labeling, 645 remote sensing, 4, 213, 218, 507, 626, 637, 639, 641, 644, 646, 648, 651, 657 image analysis, 638 imagery, 627-630 remotely sensed data, 625, 628 Remotely Sensed Image Analysis System (RSIAS), 625, 627 resection space, 772 resolution, 763 spatial, 625, 628, 632, 633, 638, 640-643, 649 response time in recognition tasks, 926 restricted expanded finite-state grammar, 68 resubstitution methods, 50 retrieval tools, 969 reversals, 21 rigid motion, 437 objects, 929 with smooth surfaces, 934 transformations, 930 road following, 828 Roberts operator, 232 robot navigation, 766 vision, 859 robustness, 477, 496 rotation center motion, 367 matrix, 774 -scaling-translation-invariant features, 979 rotational invariant texture classification, 267 rotational joint, 935
run-length encoding, 633 Run-Length Smoothing Algorithm (RLSA), 601 running fix, 767 S program, 10 SAHN algorithm, 18 Sammon projection, 13 SAR images, 218 satellite, 625, 627-629, 631, 634, 638, 640, 660-662 scalar feature, 639 scale-space, 643 scaling effect, 241, 242 scatter measures, 40 scene analysis, 968 modeling, 833 monitoring, 746 scheme, 926 search, 742, 744, 745, 748 security, 671 segmental k-means clustering, 475, 497 segmentation, 213, 304, 305, 605, 638, 641, 643, 645, 647, 650, 659, 812 boundary-oriented, 646 map-guided, 629, 634 method, 720 pixel-oriented, 645 region-oriented, 646 self-location, 766, 790 self-similarity, 229 sensitivity, 406, 690 sensor, 690 fusion, 745, 749, 756, 760 resolution, 690 sequential classifiers, 45 shape description, 641, 642 discrimination, 392 extraction, 638 from texture, 236, 240 problem, 209 information-nonpreserving representation, 642 information-preserving representation, 642 measurement, 433, 641 object, 689 verification, 393
-based recognition, 925 shift-tolerant LVQ, 485 short-time Fourier transform (see window Fourier transform) shot clustering, 968 similarity, 966 sigmoidal nonlinearity, 235 signal processing functions, 457 methods, 231 signature spatial, 641, 649 spectral, 641 temporal, 648 silicon retina, 673 similarity matrix, 8 video content representation, 964 simulated annealing, 13, 117, 251, 257, 406, 436, 478 simulation, 759 single-link method, 19 size, 632, 633, 640-642, 652, 661 parameters, 701 sorting of fish, 701 skeleton algorithms, 698 slant tilt angles, 241 slider stereo, 779 smoothness, 344, 345 of motion, 364 solutions, 370, 372 sonar-based range finding, 778 spaceborne image, 631 sparsely sampled short-time series, 618 spatial aggregation entity, 633 analysis, 625-627, 651, 652 circuit, 654 Geary and Moran coefficients, 654 general tendency, 654 joint-count, 654 classifier, 511 data, 627, 628, 630, 632, 658 domain filters, 231 entity, 630, 632, 633 feature, 630, 631, 640, 641 light modulator, 536, 537, 539-541, 543, 545, 546, 556, 560
moments, 232 pattern, 627, 643, 652 analysis, 627 description, 625, 627 detection, 627 representation, 777 system, 781 vector, 353 special purpose arithmetic units, 700 spectral band, 641, 644, 649 ranges, 507 sensitivity, 690 speech, 474 recognition, 446, 485 Spot, 634, 638, 642-644, 646, 659-661 square-error, 11, 22, 27 standard pattern, 784 standard reference pattern, 766 static, 474 statistical, 509 enhancement, 513 methods, 219 pattern classification, 698 pattern recognition, 33 statistics, 639, 652 central tendency, 639 difference, 250 dispersion, 639, 641 enhanced, 508, 514, 521 first-order, 211, 639 original, 523 run-length, 250 second-order, 212, 639, 640 stereo correspondence, 428 matching, 865 triangulation, 778 stereopsis, 252, 839 stereovision, 323, 428 stochastic algorithms, 257 approximation, 478 relaxation algorithms, 251 stopping rule, 27 strength of structure, 584 stress criterion, 12 structural approaches, 658 features, 250, 644
methods, 226 structure, 581 mechanics, 440 structured light, 429, 691 subclass, 762 subsumption, 744 supervised/unsupervised training, 514 super-quadric ellipsoid, 439 superellipsoid, 391, 392 globally-deformed, 391 supervised classification, 113 learning net, 122 recognition approach, 271 surface concavity, 388 features, 388 geometry, 242, 431 inspection, 213, 691, 726 mean, 431 normal, 432 smoothness, 388 tessellation, 398 trend, 654 surveillance, 672 symbolic data, 645, 650 models, 387 symbols, 633 symmetrical polynomials, 334 synthetic discriminant function (SDF) filters, 879 system cartographic, 627, 629, 631 error, 116 integrated image and map analysis, 627 integrated RSIAS/GIS, 625, 631, 633 knowledge-based, 638 pattern recognition, 645, 647 tapering, 388, 401, 404 target, 116 elimination, 903 tracking, 746 temperature parameter, 118 temperature stress technique, 919 template matching, 928
temporal features, 966 image sequence processing, 828 integration, 850 partition, 947 terrain, 627 tessellation, 240, 632 regular or semiregular, 227 testing, 674 text segmentation, 604 textile inspection, 213 texture, 216, 219, 249, 304, 308 analysis, 207, 711 applications, 213 classification, 209, 227, 236, 239, 712 coarseness, 639-641, 643 definitions, 207 directionality, 640, 641 discrimination, 250 elements, 223, 242 extraction, 242 feature, 215, 221, 231, 236, 250 measures, 713 models, 219 perception, 211 segmentation, 209, 249, 720 boundary-based, 239 approaches, 236 region-based, 239 approach, 236 spectrum (TS), 640, 641 synthesis, 209, 227, 236, 240 unit, 640 theory of textons, 212, 250 thermograms, 906 threat assessment, 746 three-dimensional vision, 183 three-point alignment, 938 threshold, 400 thresholding, 645, 648, 697, 902 Time Delay Neural Network (TDNN), 480 time-bandwidth product, 234 time-frequency analysis, 458 tissue scattering parameters, 216 Toeplitz form, 43 token tracking, 435 tone, 631, 639, 643, 648, 649 top-down approach, 590
top-down parsing using MPM, 83 topologic model, 632 topologically-correct mapping, 113 trace, 755 training, 514, 515 phase, 702 trajectory integration and dead reckoning, 766, 777 transformation data, 645 error, 63, 73 translation vector, 774 transparency, 689 transparent objects, 932 transversality, 393, 399 traveling salesman problem, 131 tree automata, 84, 86 grammar, 84 transformation, 593 -structured wave-packets, 144 triangular mesh, 390, 397-399 Triangulated Irregular Networks (TIN), 633 triangulation, 398 trinocular stereo, 779 tube cameras, 690 tumbling, 364 two-dimensional display, 43 two-view angular frequency, 379 motion analysis, 368 rotation angle, 376 rotation axis vector, 376 translation vector, 376 ultrametric inequality, 17 ultrasonic methods, 456 NDE Systems, 456 spectroscopy, 457 unification, 748, 763 uniformity, 689 unsupervised learning, 3, 109 segmentation, 161 unweighted pair group method using arithmetic averages, 20 UPGMA, 20
UPGMC, 20 upper and lower bounds, 50 validation, 23, 24 validity, 117 paradigm, 25 variogram, 631 vector data, 652 Very Fast Simulated Re-annealing (VFSR), 393, 406 very large number of classes, 54 video icon, 963 skimming, 963 viewer-centered representations, 925, 926, 928, 940 viewframes, 783 viewing technique, 691 visibility list, 792 vision, 818 engineering, 865, 866 visual object, 940 recognition, 925 visual texture, 207 VLSI algorithm, 983 architectures, 979, 981 volumetric kNN procedure, 56 model, 391
shapes, 391, 400 Voronoi tessellation features, 223 voxel, 390, 399 wafer inspection, 177 Ward’s method, 19 wavelet, 251, 263, 675 decomposition, 154 transform, 234, 643, 644 weak perspective projection, 930 wedge ring detector sampled FT, 877 weight selection schemes, 517 weight space, 117 weighted distance, 64 links, 743 minimum-distance SPECTA, 93 Widrow-Hoff, 112 Wigner distribution, 614 discrete pseudo, 619 discrete-time, 615 pseudo, 613, 616 window Fourier transform, 234 winner-takes-all circuit, 256 within-class scatter matrix, 41 world, 746, 749 model, 766 z-score, 9 normalization, 9