JANUARY 2011
VOLUME 22
NUMBER 1
ITNNEP
(ISSN 1045-9227)
EDITORIAL
One Year as EiC, and Editorial-Board Changes at TNN ........................................................................ 1
REGULAR PAPERS
Signature Neural Networks: Definition and Application to Multidimensional Sorting Problems .................. R. Latorre, F. de Borja Rodríguez, and P. Varona 8
Adaptive Dynamic Programming for Finite-Horizon Optimal Control of Discrete-Time Nonlinear Systems with ε-Error Bound .................. F.-Y. Wang, N. Jin, D. Liu, and Q. Wei 24
Solving Nonstationary Classification Problems with Coupled Support Vector Machines .................. G. L. Grinblat, L. C. Uzal, H. A. Ceccatto, and P. M. Granitto 37
Optimum Spatio-Spectral Filtering Network for Brain–Computer Interface .................. H. Zhang, Z. Y. Chin, K. K. Ang, C. Guan, and C. Wang 52
24-GOPS 4.5-mm² Digital Cellular Neural Network for Rapid Visual Attention in an Object-Recognition SoC .................. S. Lee, M. Kim, K. Kim, J.-Y. Kim, and H.-J. Yoo 64
An Augmented Echo State Network for Nonlinear Adaptive Filtering of Complex Noncircular Signals .................. Y. Xia, B. Jelfs, M. M. Van Hulle, J. C. Príncipe, and D. P. Mandic 74
Learning Pattern Recognition Through Quasi-Synchronization of Phase Oscillators .................. E. Vassilieva, G. Pinto, J. A. de Barros, and P. Suppes 84
ELITE: Ensemble of Optimal Input-Pruned Neural Networks Using TRUST-TECH .................. B. Wang and H.-D. Chiang 96
Approximate Confidence and Prediction Intervals for Least Squares Support Vector Regression .................. K. De Brabanter, J. De Brabanter, J. A. K. Suykens, and B. De Moor 110
Super-Resolution Method for Face Recognition Using Nonlinear Mappings on Coherent Features .................. H. Huang and H. He 121
Minimum Complexity Echo State Network .................. A. Rodan and P. Tiňo 131
Bounded H∞ Synchronization and State Estimation for Discrete Time-Varying Stochastic Complex Networks Over a Finite Horizon .................. B. Shen, Z. Wang, and X. Liu 145
BRIEF PAPERS
Extended Input Space Support Vector Machine .................. R. Santiago-Mozos, F. Pérez-Cruz, and A. Artés-Rodríguez 158
Robust Stability Criterion for Discrete-Time Uncertain Markovian Jumping Neural Networks with Defective Statistics of Modes Transitions .................. Y. Zhao, L. Zhang, S. Shen, and H. Gao 164
ANNOUNCEMENTS
Call for Papers—The IEEE TRANSACTIONS ON NEURAL NETWORKS Special Issue: Online Learning in Kernel Methods .................. 171
Call for Participation—The 2011 International Joint Conference on Neural Networks .................. 172
IEEE TRANSACTIONS ON NEURAL NETWORKS

IEEE TRANSACTIONS ON NEURAL NETWORKS is published by the IEEE Computational Intelligence Society. Members may subscribe to this TRANSACTIONS for $22.00 per year. IEEE student members may subscribe for $11.00 per year. Nonmembers may subscribe for $1,750.00. For additional subscription information visit http://www.ieee.org/nns/pubs. For information on receiving this TRANSACTIONS, write to the IEEE Service Center at the address below. Member copies of Transactions/Journals are for personal use only. For more information about this TRANSACTIONS see http://www.ieee-cis.org/pubs/tnn.
Editor-in-Chief
DERONG LIU
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Dept. of Electrical and Computer Engineering, University of Illinois, Chicago, IL 60607 USA
Email: [email protected]
Associate Editors
HOJJAT ADELI, Ohio State Univ., USA
CESARE ALIPPI, Politecnico di Milano, Italy
MARCO BAGLIETTO, DIST-Univ. of Genova, Italy
LUBICA BENUSKOVA, Univ. of Otago, New Zealand
AMIT BHAYA, Federal Univ. of Rio de Janeiro, Brazil
IVO BUKOVSKY, Czech Technical Univ. in Prague, Czech Republic
SHENG CHEN, Univ. of Southampton, U.K.
TIANPING CHEN, Fudan Univ., China
PAU-CHOO (JULIA) CHUNG, National Cheng Kung Univ., Taiwan
MING DONG, Wayne State Univ., USA
EL-SAYED EL-ALFY, King Fahd Univ. of Petroleum & Minerals, Saudi Arabia
PABLO A. ESTEVEZ, Univ. of Chile, Chile
HAIBO HE, Univ. of Rhode Island, USA
TOM HESKES, Radboud Univ. Nijmegen, The Netherlands
AKIRA HIROSE, Univ. of Tokyo, Japan
ZENG-GUANG HOU, The Chinese Acad. Sci., China
SANQING HU, Hangzhou Dianzi Univ., China
AMIR HUSSAIN, Univ. of Stirling, U.K.
KAZUSHI IKEDA, Nara Inst. of Sci. & Technol., Japan
HOSSEIN JAVAHERIAN, General Motors R&D Center, USA
YAOCHU JIN, Honda Research Inst., Germany
FAKHRI KARRAY, Univ. of Waterloo, Canada
RHEE MAN KIL, Korea Advanced Inst. of Science and Technology, Korea
IRWIN KING, Chinese Univ. of Hong Kong
LI-WEI (LEO) KO, National Chiao-Tung Univ., Taiwan
JAMES KWOK, Hong Kong Univ. of Sci. & Technol.
ROBERT LEGENSTEIN, Graz Univ. of Technology, Austria
FRANK L. LEWIS, Univ. of Texas at Arlington, USA
ARISTIDIS LIKAS, Univ. of Ioannina, Greece
GUO-PING LIU, Univ. of Glamorgan, U.K.
JINHU LU, The Chinese Acad. Sci., China
YUNQIAN MA, Honeywell International Inc., USA
MALIK MAGDON-ISMAIL, Rensselaer Polytechnic Institute, USA
DANILO P. MANDIC, Imperial College London, U.K.
SEIICHI OZAWA, Kobe Univ., Japan
MIKE PAULIN, Univ. of Otago, New Zealand
ROBI POLIKAR, Rowan Univ., USA
DANIL PROKHOROV, Toyota Research Institute NA, USA
MARCELLO SANGUINETI, Univ. of Genoa, Italy
ALESSANDRO SPERDUTI, Univ. of Padova, Italy
STEFANO SQUARTINI, Univ. Politecnica delle Marche, Italy
DIPTI SRINIVASAN, National Univ. of Singapore
SERGIOS THEODORIDIS, Univ. of Athens, Greece
MARC M. VAN HULLE, Katholieke Univ. Leuven, Belgium
DRAGUNA VRABIE, Univ. of Texas at Arlington, USA
ZIDONG WANG, Brunel Univ., U.K.
MARCO WIERING, Univ. of Groningen, The Netherlands
ZHANG YI, Sichuan Univ., China
VICENTE ZARZOSO, Univ. of Nice-Sophia Antipolis, France
ZHIGANG ZENG, Huazhong Univ. of Sci. & Technol., China
G. PETER ZHANG, Georgia State Univ., USA
HUAGUANG ZHANG, Northeastern Univ., China
NIAN ZHANG, Univ. of District of Columbia, USA
LIANG ZHAO, Univ. of Sao Paulo, Brazil
NANNING ZHENG, Xi'an Jiaotong Univ., China
IEEE Officers
MOSHE KAM, President
GORDON W. DAY, President-Elect
ROGER D. POLLARD, Secretary
HAROLD FLESCHER, Treasurer
PEDRO A. RAY, Past President
TARIQ S. DURRANI, Vice President, Educational Activities
DAVID A. HODGES, Vice President, Publication Services and Products
HOWARD E. MICHEL, Vice President, Member and Geographic Activities
STEVE M. MILLS, President, Standards Association
DONNA L. HUDSON, Vice President, Technical Activities
RONALD G. JENSEN, President, IEEE-USA
VINCENZO PIURI, Director, Division X
IEEE Executive Staff
DR. E. JAMES PRENDERGAST, Executive Director & Chief Operating Officer
THOMAS SIEGERT, Business Administration
MATTHEW LOEB, Corporate Activities
DOUGLAS GORHAM, Educational Activities
BETSY DAVIS, SPHR, Human Resources
CHRIS BRANTLEY, IEEE-USA
ALEXANDER PASIK, Information Technology
PATRICK MAHONEY, Marketing
CECELIA JANKOWSKI, Member and Geographic Activities
ANTHONY DURNIAK, Publications Activities
JUDITH GORMAN, Standards Activities
MARY WARD-CALLAN, Technical Activities
IEEE Periodicals Transactions/Journals Department Staff
Director: FRAN ZAPPULLA
Editorial Director: DAWN MELLEY
Production Director: PETER M. TUOHY
Managing Editor: JEFFREY E. CICHOCKI
Journal Coordinator: MICHAEL J. HELLRIGEL

IEEE TRANSACTIONS ON NEURAL NETWORKS (ISSN 1045-9227) is published monthly by The Institute of Electrical and Electronics Engineers, Inc. Responsibility for the contents rests upon the authors and not upon the IEEE, the Society/Council, or its members. IEEE Corporate Office: 3 Park Avenue, 17th Floor, New York, NY 10016-5997. IEEE Operations Center: 445 Hoes Lane, Piscataway, NJ 08854-4141. NJ Telephone: +1 732 981 0060. Price/Publication Information: Individual copies: IEEE Members $20.00 (first copy only), nonmembers $146.00 per copy. (Note: Postage and handling charge not included.) Member and nonmember subscription prices available upon request. Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For all other copying, reprint, or republication permission, write to Copyrights and Permissions Department, IEEE Publications Administration, 445 Hoes Lane, Piscataway, NJ 08854-4141. Copyright © 2011 by The IEEE, Inc. All rights reserved. Periodicals Postage Paid at New York, NY and at additional mailing offices. Postmaster: Send address changes to IEEE TRANSACTIONS ON NEURAL NETWORKS, IEEE, 445 Hoes Lane, Piscataway, NJ 08854-4141. GST Registration No. 125634188. CPC Sales Agreement #40013087. Return undeliverable Canada addresses to: Pitney Bowes IMEX, P.O. Box 4332, Stanton Rd., Toronto, ON M5W 3J4, Canada. IEEE prohibits discrimination, harassment and bullying. For more information visit http://www.ieee.org/nondiscrimination. Printed in U.S.A.
Digital Object Identifier 10.1109/TNN.2010.2102710
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
Editorial: One Year as EiC, and Editorial-Board Changes at TNN
I am about to start my second year of service as the Editor-in-Chief (EiC) of the IEEE TRANSACTIONS ON NEURAL NETWORKS (TNN). Needless to say, my first year as the EiC has been full of excitement and challenges. The transition from my predecessor went very smoothly between September 2009 and January 2010. During the past year, our 50+ Associate Editors (AEs) handled roughly 600 new submissions (not counting resubmissions and revised submissions). With the help of these AEs and my predecessor, I was quickly able to learn the job, and as a result the transition had very few glitches.

The easy part of my job is checking whether a submission complies with our guidelines and whether it falls within the scope of the TRANSACTIONS before it is assigned to an AE for handling. The difficult part has been dealing with papers whose three or more reviewers all agreed to review them but, for some reason, failed to respond to repeated automatic review reminders. The AEs handling these papers have to take several extra steps: reminding reviewers by phone or e-mail, looking for replacement reviewers, or reviewing the papers themselves. Most authors have been appreciative of the work of the AEs and reviewers, and they accept our decisions without a problem.

The backlog of papers has been kept short over the last year. We have maintained an organized printing and paper-acceptance schedule, with papers typically printed in the journal within 2–3 months of acceptance. Our page budget has been kept constant in the past few years (roughly 2060 pages per year), and we expect to hold the same page count for next year.

Three special issues are being organized this year:
1) White-box nonlinear prediction models (organized by Bart Baesens, David Martens, Rudy Setiono, and Jacek Zurada);
2) Data-based optimization, control, and modeling (organized by Tianyou Chai, Zhongsheng Hou, Frank L. Lewis, and Amir Hussain); and
3) Online learning in kernel methods (organized by Jose C. Principe, Seiichi Ozawa, Sergios Theodoridis, Tulay Adali, Danilo P. Mandic, and Weifeng Liu).
Interested authors should refer to the individual solicitations or contact the special-issue organizers for more details.

I would like to take this opportunity to thank the hardworking AEs whose terms have ended this year: Angelo Alessandri, Fahmida Chowdhury, Bhaskar DasGupta, Rene Doursat, Deniz Erdogmus, Mark Girolami, Barbara Hammer, Giacomo Indiveri, Stefanos Kollias, Chih-Jen Lin, Mark Plumbley, Jagath Rajapakse, George A. Rovithakis, Kate Smith-Miles, Changyin Sun, and Simon X. Yang. Thank you for your excellent service to TNN. I wish you much success in your future endeavors.

I would also like to welcome the following new AEs, whose terms officially start on January 1, 2011 (K. Ikeda and J. Lu started on June 1, 2010):
• Marco Baglietto, DIST-University of Genova, Italy
• Lubica Benuskova, University of Otago, New Zealand
• Ivo Bukovsky, Czech Technical University in Prague, Czech Republic
• Tianping Chen, Fudan University, China
• Tom Heskes, Radboud University Nijmegen, The Netherlands
• Kazushi Ikeda, Nara Institute of Science and Technology, Japan
• Fakhri Karray, University of Waterloo, Canada
• Rhee Man Kil, Korea Advanced Institute of Science and Technology, Korea
• Robert Legenstein, Graz University of Technology, Austria
• Jinhu Lu, Chinese Academy of Sciences, China
• Yunqian Ma, Honeywell International Inc., USA
• Malik Magdon-Ismail, Rensselaer Polytechnic Institute, USA
• Mike Paulin, University of Otago, New Zealand
• Robi Polikar, Rowan University, USA
• Danil Prokhorov, Toyota Research Institute NA, USA
• Marco Wiering, University of Groningen, The Netherlands
• Vicente Zarzoso, University of Nice Sophia Antipolis, France
All of the above AEs are established authorities in their respective fields and have been carefully selected on the basis of their achievements, their geographical diversity, and our need for expertise across the various subject areas of TNN. I look forward to working with them to make TNN an even better journal.
Date of current version January 4, 2011. Digital Object Identifier 10.1109/TNN.2010.2099171
DERONG LIU, Editor-in-Chief

1045–9227/$26.00 © 2011 IEEE
Marco Baglietto (M’04) was born in Savona, Italy, in 1970. He received the Laurea degree in electronic engineering in 1995, and the Ph.D. degree in electronic engineering and computer science in 1999, both from the University of Genoa, Genoa, Italy. He has been an Assistant Professor of Automatic Control in the Department of Communications, Computer and Systems Science, University of Genoa, since 1999. His current research interests include neural approximations, linear and nonlinear estimation, distributed-information control systems, and control of communication networks. Dr. Baglietto is currently an Associate Editor for the IEEE Control Systems Society Conference Editorial Board. He has been a member of the guest editorial team of the Special Issue of the IEEE TRANSACTIONS ON NEURAL NETWORKS on “Adaptive Learning Systems in Communication Networks.” He was a co-recipient of the 2004 Outstanding Paper Award of the IEEE TRANSACTIONS ON NEURAL NETWORKS.
Lubica “Luba” Benuskova received the Ph.D. degree in biophysics from Comenius University, Bratislava, Slovakia, in 1994. She became an Associate Professor in the Department of Applied Informatics of the Faculty of Mathematics, Physics, and Informatics at Comenius University in 2002. In 2007, she served as Director of the Center for Neurocomputation and Neuroinformatics in the Knowledge Engineering and Discovery Research Institute, Auckland University of Technology, Auckland, New Zealand. Currently, she is a Senior Lecturer in the Department of Computer Science, University of Otago, Dunedin, New Zealand. She co-authored the book Computational Neurogenetic Modelling (New York, NY: Springer, 2007). Her current research interests include computational neuroscience, spiking neural networks, neural dynamics, neuroinformatics, bioinformatics, and consciousness/emotions. Dr. Benuskova is currently a member of the Editorial Board of the peer-reviewed journal Neural Network World. She is a member of the IEEE Computational Intelligence Society and the Otago Chapter of the Society for Neuroscience. Recently, she became a Professional Member of the Royal Society of New Zealand.
Ivo Bukovsky received the Ph.D. degree in the field of control and system engineering from Czech Technical University, Prague, Czech Republic, in 2007. He is currently the Head of the Division of Automatic Control and Engineering Informatics in the Department of Instrumentation and Control Engineering within the Faculty of Mechanical Engineering, Czech Technical University. He was a Visiting Researcher at the University of Saskatchewan, Saskatoon, SK, Canada, in 2003. His thesis on nonconventional neural units and an adaptive approach to the evaluation of complicated dynamical systems was recognized with the Werner von Siemens Excellence Award in 2007. For six months in 2009, he worked on neural networks and biomedical applications at the Cyberscience Center, Tohoku University, Miyagi, Japan. He held a short assignment at the University of Manitoba, Winnipeg, MB, Canada, in 2010. His current research interests include multiscale analyses for adaptive evaluation of complicated dynamical systems and neural networks. Dr. Bukovsky has been a member of the IEEE Computational Intelligence Society (CIS) Neural Networks Technical Committee since 2007, and the Chair of the CIS Neural Networks Technical Committee Task Force on Education since 2009. He became involved in the IEEE CIS Student Activity Subcommittee in 2010.
Tianping Chen received the Postgraduate degree from the Mathematics Department, Fudan University, Shanghai, China, in 1965. He is currently a Professor in the School of Mathematical Sciences, Fudan University. His current research interests include complex networks, neural networks, principal component analysis, independent component analysis, dynamical systems, harmonic analysis, and approximation theory. Prof. Chen has received several awards, including the second prize of the National Natural Science Award of China in 2002, the Outstanding Paper Award of the IEEE TRANSACTIONS ON NEURAL NETWORKS in 1997, and the Best Paper Award of the Japanese Neural Network Society in 1997.
Tom Heskes received the Ph.D. degree in physics from Radboud University, Nijmegen, The Netherlands, in 1993. He was a Post-Doctoral Fellow at the Beckman Institute, University of Illinois at Urbana-Champaign, Urbana. He is currently a Professor of artificial intelligence and computer science at Radboud University, where he leads the Machine Learning Group and is Principal Investigator and Director of the Institute for Computing and Information Sciences. He is also a Principal Investigator at the Donders Center for Neuroscience, Radboud University. He has published over 100 research papers and books. His current research interests include (Bayesian) machine learning and probabilistic graphical models, with applications to cognitive neuroimaging and bioinformatics. Prof. Heskes received the prestigious National Grant (Vici) for research on probabilistic artificial intelligence in 2006. He is the Editor-in-Chief of Neurocomputing and an Associate Editor of several other journals. He has served on the program committees of dozens of international conferences.
Kazushi Ikeda (M’94–SM’07) received the B.E., M.E., and Ph.D. degrees in mathematical engineering and information physics from the University of Tokyo, Tokyo, Japan, in 1989, 1991, and 1994, respectively. He joined the Department of Electrical and Computer Engineering, Kanazawa University, Kanazawa, Japan, and moved to the Department of Systems Science, Kyoto University, Kyoto, Japan, as an Associate Professor, in 1998. Since 2008, he has been a Professor in the Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, Japan. His current research interests include machine learning theory such as support vector machines and information geometry, applications to adaptive systems, and brain informatics. Dr. Ikeda is currently the Editor-in-Chief of Journal of Japanese Neural Network Society, an Action Editor of Neural Networks, and an Associate Editor of Institute of Electronics, Information and Communication Engineers Transactions on Information and Systems. He has served as a member of the Board of Governors of Japanese Neural Network Society and Institute of Systems, Control and Information Engineers.
Fakhri Karray (S’89–M’90–SM’99) received the Ph.D. degree from the University of Illinois at Urbana-Champaign, Urbana, in 1989. He is a Professor in the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada, and the Associate Director of the Pattern Analysis and Machine Intelligence Laboratory, University of Waterloo. He holds 13 U.S. patents in various areas of intelligent systems design using tools of computational intelligence. He is the coauthor of the textbook Tools of Soft Computing and Intelligent Systems Design (New York, NY: Addison-Wesley, 2004), and has published extensively in these areas. His current research interests include soft computing and tools of computational intelligence with applications to autonomous systems and intelligent man-machine interaction. Dr. Karray has served over the years as an Associate Editor for the IEEE TRANSACTIONS ON MECHATRONICS, the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B, the IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE, the International Journal of Robotics and Automation, the International Journal of Control and Intelligent Systems, and the International Journal of Image Processing. He has been a Guest Editor for the IEEE TRANSACTIONS ON MECHATRONICS and the Journal of Control and Intelligent Systems. He has received a number of professional and scholarly awards and has served as Chair/Co-Chair for more than 12 international conferences and technical programs. He is the founding General Co-Chair of the International Conference on Autonomous and Intelligent Systems, the founding Co-Chair of the IEEE Computational Intelligence Society Kitchener-Waterloo Chapter, and Chair of the IEEE Control Systems Society Kitchener-Waterloo Chapter.
Rhee Man Kil (M’94–SM’09) received the Ph.D. degree in computer engineering from the University of Southern California, Los Angeles, in 1991. He then joined the Basic Research Department of the Electronics and Telecommunications Research Institute, Daejeon, Korea. Since 1994, he has been with the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, where he is currently an Associate Professor in the Department of Mathematical Sciences. At KAIST, he has served as an Operating Committee member of the Brain Science Research Center, funded by the Korean Ministry of Science and Technology. His current research interests include theories and applications of machine learning, pattern classification, model selection in regression problems, active learning, text mining, financial data mining, noise-robust speech feature extraction, and binaural information processing. He has served as a Guest Editor for neural information processing journals and as a program committee member for several international conferences related to neural networks.
Robert Legenstein received the Ph.D. degree in telematics from Graz University of Technology (TUG), Graz, Austria, in 2002. He is currently an Assistant Professor in the Department of Computer Science, TUG. He is also the Deputy Head of the Institute for Theoretical Computer Science, TUG. He is especially interested in biologically inspired neural computation. Currently, he is coordinating the international research project “Novel Brain-Inspired Learning Paradigms for Large-Scale Neuronal Networks” of the European Commission. His current research interests include neural networks, learning in neural systems, reward-based learning, spiking neural networks, information processing in biological neural systems, and dynamics in neural networks. Dr. Legenstein has been honored as an outstanding reviewer at the 2008 conference on Advances in Neural Information Processing Systems.
Jinhu Lu (M’03–SM’06) received the Ph.D. degree in applied mathematics from the Academy of Mathematics and Systems Science (AMSS), Chinese Academy of Sciences (CAS), Beijing, China, in 2002. He is an Associate Professor at AMSS, CAS, and also a Professor and Australian Research Council (ARC) Future Fellow with the School of Electrical and Computer Engineering, Royal Melbourne Institute of Technology University, Melbourne, Australia. He has held several visiting positions in Australia, Canada, France, Germany, and Hong Kong, and was a Visiting Fellow at Princeton University, Princeton, NJ, from 2005 to 2006. His current research interests include nonlinear circuits and systems, neural networks, complex systems, and networks. Dr. Lu is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I AND II. He is the Secretary of the Technical Committee on Neural Systems and Applications of the IEEE Circuits and Systems Society. He has received several prestigious awards, including the National Science Fund for Distinguished Young Scholars in China, the Hundred Talents Program of CAS, the National Natural Science Award from the Chinese Government, the Natural Science Award of the Ministry of Education of China, and the ARC Future Fellowships Award in Australia.
Yunqian Ma (SM’07) received the Ph.D. degree in electrical engineering from the University of Minnesota, Minneapolis, in 2003. He joined Honeywell International Inc., Morristown, NJ, where he is currently a Senior Principal Research Scientist in the Advanced Technology Laboratory, Honeywell Aerospace. He holds 10 U.S. patents and 35 patent applications. He has authored 50 publications, including two books. His research has been supported by internal funds and external contracts, such as the Defense Advanced Research Projects Agency, the Homeland Security Advanced Research Projects Agency, and the Federal Aviation Administration. His current research interests include inertial navigation, integrated navigation, surveillance, signal and image processing, pattern recognition, computer vision, machine learning, and neural networks. Dr. Ma received the International Neural Network Society Young Investigator Award for outstanding contributions in the application of neural networks in 2006. He is currently on the Editorial Board of Pattern Recognition Letters, and has served on the program committees of several international conferences. He also served on a panel of the National Science Foundation in the Division of Information and Intelligent Systems. He is included in the Marquis Who's Who in Engineering and Science.
Malik Magdon-Ismail received the B.S. degree in physics from Yale University in 1993, the Masters degree in physics in 1995, and the Ph.D. degree in electrical engineering with a minor in physics from the California Institute of Technology, Pasadena, in 1998. He is currently an Associate Professor of Computer Science at Rensselaer Polytechnic Institute (RPI), Troy, NY, where he is a member of the Theory Group. His current research interests include the theory and applications of machine learning, social network algorithms, communication networks, computational finance, and theoretical and algorithmic aspects of learning from data. Dr. Magdon-Ismail has served on the program committees of several conferences, and was an Associate Editor for Neurocomputing. He has several publications in peer-reviewed journals and conferences, has been a Financial Consultant, has collaborated with a number of companies, and has several active grants from the National Science Foundation and other government funding agencies. He has been awarded the RPI Early Career Award in recognition of his research.
Mike Paulin received the B.Sc. (hons.) degree in mathematics from the University of Otago, Dunedin, New Zealand, in 1979, and the Ph.D. degree from the University of Auckland, Auckland, New Zealand, in 1985. He carried out post-doctoral research in experimental and computational neuroscience at the University of Southern California, Los Angeles, and at the California Institute of Technology, Pasadena. He has been a Scientific Programmer and a Lecturer in mathematics at the University of Auckland. For a number of years, he has been a Technical Consultant and Distinguished Visiting Scientist developing biologically inspired algorithms for robotics at NASA-Jet Propulsion Laboratory, Pasadena. He is currently an Associate Professor at the University of Otago. He teaches zoology, neuroscience, mathematics, and computational modeling. His current research interests include principles of neural computation and mechanical design for agility in animals and robots. Prof. Paulin is a member of the NZ Mathematical Society, the NZ Institute of Mathematics and its Applications, and the IEEE Computational Intelligence Society.
Robi Polikar (M’93–SM’09) received the co-major Ph.D. degree in electrical engineering and biomedical engineering from Iowa State University, Ames, in 2000. He is currently an Associate Professor with the Department of Electrical and Computer Engineering at Rowan University, Glassboro, NJ, where he directs the Signal Processing and Pattern Recognition Laboratory. He is also a long-term Visiting Scholar at the School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, PA. His work has been supported primarily by the National Science Foundation's CAREER, Power, Control, and Adaptive Networks and Collaborative Research in Computational Neuroscience programs, and various industrial partners. He is the author of over 120 publications. His current research interests include machine learning, pattern recognition, and neural networks, with specific emphasis on incremental learning, nonstationary learning, concept drift, data fusion, and applications of computational intelligence in neuroscience. Dr. Polikar is a member of the IEEE Computational Intelligence Society and its Technical Committee on Neural Networks. He was the recipient of Rowan University's Research Excellence and Achievement Award.
Danil Prokhorov (SM’02) began his technical career in St. Petersburg, Russia, in 1992, as a Research Engineer at the St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia. He became involved in automotive research in 1995, when he was a summer intern at Ford Scientific Research Laboratory, Dearborn, MI. In 1997, he became a Ford Research Staff Member involved in application-driven research on neural networks and other machine learning methods. While at Ford, he took an active part in several production-bound projects, including neural-network-based engine misfire detection. Since 2005, he has been with Toyota Technical Center, Ann Arbor, MI, overseeing important mid- and long-term research projects in computational intelligence. He has published more than 100 papers in various journals and conference proceedings, and has several inventions to his credit. Dr. Prokhorov is a frequent member of the program committees of various international conferences, including the International Joint Conference on Neural Networks and the World Congress on Computational Intelligence, and a member of several IEEE technical committees and journal editorial boards.
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
Marco Wiering received the Ph.D. degree from the University of Amsterdam, Amsterdam, The Netherlands, in 1999, after completing the Ph.D. degree program at the Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, Lugano, Switzerland. He was an Assistant Professor in the Intelligent Systems Group, Utrecht University, Utrecht, The Netherlands, from 2000 to 2007. He is currently pursuing a tenure track toward Full Professor in the Department of Artificial Intelligence, University of Groningen, Groningen, The Netherlands. He is also the Director of the Robotlab in the Department of Artificial Intelligence. He has published more than 70 peer-reviewed conference and journal papers and has supervised, or is currently supervising, seven Ph.D. students. Furthermore, he has supervised more than 70 master graduation projects on many different topics. Together with Dr. Martijn van Otterlo, he is editing the book Reinforcement Learning: State-of-the-Art, which will be published in 2011. His current research interests include reinforcement learning, neural networks, robotics, computer games, computer vision, and signal processing. Dr. Wiering was the Chair of the IEEE Computational Intelligence Society Technical Committee on Adaptive Dynamic Programming and Reinforcement Learning in 2010.
Vicente Zarzoso (S’94–M’03–SM’10) received the Graduate degree with highest distinction in telecommunications engineering from the Polytechnic University of Valencia, Valencia, Spain, in 1996. After starting the Ph.D. degree program at the University of Strathclyde, Glasgow, U.K., he received the Ph.D. degree from the University of Liverpool, Liverpool, U.K., in 1999. He obtained the Habilitation to Lead Research (HDR) from the University of Nice Sophia Antipolis, Nice, France, in 2009. From 2000 to 2005, he held a Research Fellowship from the Royal Academy of Engineering of the U.K. Since 2005, he has been with the Computer Science, Signals and Systems Laboratory of Sophia Antipolis, University of Nice Sophia Antipolis, where he was appointed as a Professor in 2010. His current research interests include statistical signal and array processing, with emphasis on independent component analysis, signal separation, and their application to biomedical problems and communications; he has authored nearly 100 publications on these topics. Dr. Zarzoso has served as a Program Committee Member for several international conferences and was a Program Committee Chair of the 9th International Conference on Latent Variable Analysis and Signal Separation in 2010.
Signature Neural Networks: Definition and Application to Multidimensional Sorting Problems Roberto Latorre, Francisco de Borja Rodríguez, and Pablo Varona
Abstract—In this paper, we present a self-organizing neural network paradigm that is able to discriminate information locally using a strategy for information coding and processing inspired by recent findings in living neural systems. The proposed neural network uses: 1) neural signatures to identify each unit in the network; 2) local discrimination of input information during the processing; and 3) a multicoding mechanism for information propagation regarding the who and the what of the information. The local discrimination implies distinct processing as a function of the neural signature recognition and a local transient memory. In the context of artificial neural networks, none of these mechanisms has been analyzed in detail, and our goal is to demonstrate that they can be used to efficiently solve some specific problems. To illustrate the proposed paradigm, we apply it to the problem of multidimensional sorting, which can take advantage of the local information discrimination. In particular, we compare the results of this new approach with traditional methods to solve jigsaw puzzles, and we analyze the situations where the new paradigm improves the performance. Index Terms—Jigsaw puzzles, local contextualization, local discrimination, multicoding, neural signatures, self-organization.
I. Introduction
RECENT experiments in living neural circuits known as central pattern generators (CPGs) show that some individual cells have neural signatures that consist of neuron-specific spike timings in their bursting activity [33], [34]. Model simulations indicate that neural signatures that identify each cell can play a functional role in the activity of CPG circuits [22]–[24]. Neural signatures coexist with the information encoded in the slow wave rhythm of the CPG. Readers of the signal emitted by the CPG can take advantage of these multiple simultaneous codes and process them one by one, or simultaneously, in order to perform different tasks [23]. The who and the what of the signals can be used to discriminate the information received by a neuron by distinctly processing the input as a function of these multiple codes. These results emphasize the importance of cell diversity for some living neural networks and suggest that local discrimination is important
Manuscript received December 31, 2009; revised May 5, 2010 and July 14, 2010; accepted July 14, 2010. Date of publication November 18, 2010; date of current version January 4, 2011. This work was supported in part by the Ministry of Science and Innovation (MICINN) under Grant BFU2009-08473 and Grant TIN2007-65989, and in part by the Comunidad Autónoma de Madrid (CAM) under Grant S-SEM-0255-2006. The authors are with the Grupo de Neurocomputación Biológica, Dpto. de Ingeniería Informática, Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid 28049, Spain (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2010.2060495
in systems where neural signatures are present. This kind of information processing can be a powerful strategy for neural systems to enhance their capacity and performance. Artificial neural networks (ANNs) are inspired to some extent by their biological counterparts. However, in the context of artificial neural computation, phenomena such as local recognition, discrimination of input signals, and multicoding strategies have not been analyzed in detail. Most traditional ANN paradigms consider network elements as indistinguishable units, with the same transfer functions and without mechanisms of transient memory in each cell. None of the existing ANN paradigms discriminates information as a function of the recognition of the emitter unit. While neuron uniformity facilitates the mathematical formalism of classical paradigms [1], [3], [13], [17], [36] (which has largely contributed to their success [39]), some specific problems could benefit from other approaches. Here, we propose a neural network paradigm that makes use of neural signatures to identify each unit of the network, and of multiple simultaneous codes to discriminate the information received by a cell. The network self-organization is based on the signature recognition and on distinct processing of input information as a function of a local transient memory in each cell, which we have called the local informational context of the unit. The efficiency of the network depends on a tradeoff between the advantages provided by the local information discrimination and its computational cost. In this paper, we discuss the application of signature neural networks (SNNs) to solve multidimensional sorting problems. In particular, to fully illustrate the use of this neural network and to evaluate its performance, we apply this formalism to the task of solving canonical jigsaw puzzles. The paper is organized as follows. In Section II, we present the general formalization of the proposed paradigm.
In Section III, we: 1) discuss its application to generic multidimensional sorting; and 2) provide an implementation for this kind of problem. To test the performance, in Section IV we: 1) review the jigsaw puzzle problem and the traditional algorithms to solve it; 2) provide a specific solution using a SNN; 3) describe the methods used to evaluate the performance; and 4) present our quantitative results comparing this new approach with traditional methods to solve jigsaw puzzles, analyzing the situations where the new paradigm improves the performance (Section IV-H). Finally, in the Appendix, we illustrate in detail the evolution of the network with another example of multidimensional sorting.
II. SNN Formalization In this section, we present the SNN paradigm. Behind this new paradigm, there are four main ideas. 1) Each neuron of the network has a signature that allows its unequivocal identification by the rest of the cells. 2) The neuron outputs are signed with the neural signature. Therefore, there are multiple codes in a message (multicoding) regarding the who and the what of the information. 3) The single neuron discriminates the input signals as a function of the following: a) the recognition of the emitter signature; b) a transient memory that keeps track of the information and its sources. This memory provides a contextualization mechanism to the single neuron processing. 4) The network self-organization relies to a large extent on the local discrimination by each unit. A. SNN Definitions The formalism requires the definition of several terms that will be used in the following sections. Some of the SNN definitions are open and depend on the specific problem to be solved. This provides a framework that can be applied to different problems by only customizing the open definitions. To illustrate the use of the SNN, we will fix these open definitions for the general multidimensional sorting problem and the particular case of the jigsaw puzzle solver in Sections III-A and IV-D, respectively. 1) Neuron or cell: the processing unit of the network. 2) Neuron signature: the neuron ID in the network. This ID is used for the local information discrimination. 3) Neuron data: information stored in each neuron about the problem. 4) Neuron information: the joint information about the who (neuron signature) and the what (neuron data) of the cell. 5) Synapse: connection between two neurons. 6) Neuron neighborhood: cells directly connected to the neuron. This concept is used to define the output channels of each neuron. The neuron neighborhood can change during the evolution of the SNN. 
7) Local informational context: transient memory of each neuron to keep track of the information and its sources. This memory consists of a subset of neuron informations from other cells received in previous iterations. The maximum size of the context (Ncontext ) is the maximum number of elements in this set, and it is an important parameter of the algorithm. The neuron signature and the local informational context are the key concepts of the SNN. 8) Local discrimination: the distinct processing of a unit as a function of the recognition of the emitter and the local informational context. 9) Message: the output or total information transmitted through a synapse between two neurons in a single iteration. The message consists of the neuron information of a subset of cells that are part of the context of the
emitter plus its own neuron information (see below). The maximum message size is equal to Ncontext. The input to a neuron consists of all messages received at a given iteration. 10) A receptor starts recognizing the signature of an emitter cell during the message processing when it detects that the neuron data of the emitter is relevant to solve the problem (emitter and receptor data are compatible). The network self-organization is based on this recognition. The meaning of “relevant” depends on the specific problem. 11) Information propagation mode: depending on the problem, the information propagation can be monosynaptic or multisynaptic. Monosynaptic means that each neuron can receive only one input message per iteration. The information propagation is bidirectional between cells. 12) A neuron belongs to a cluster if it recognizes the signatures of all the neurons in its neighborhood. Clusters make it possible to simplify the processing rules of the SNN. B. Algorithm The connectivity, the neuron data, and the local informational contexts of all the network units are initialized first. Depending on the problem, connectivity and neuron data initialization can be random or heuristic. Three different context initializations can be considered. 1) A free context initialization, where initially the context of every neuron is empty. In this way, the cells have no information about the rest of the network. 2) A random context initialization, where the context of the neurons is chosen randomly. 3) A neighborhood context initialization, where all the contexts are coherent with the neighborhood of each neuron. After the initialization, the algorithm consists of iterating three different steps for each neuron in the network until the stop condition is fulfilled. Note that the network self-organization takes place in both steps 1 and 3 by modifying the network connections.
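The per-unit state and the three context-initialization modes described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation; the class and function names are our own, and the bounded context is modeled with a fixed-length queue.

```python
import random
from collections import deque

class Neuron:
    """One SNN unit: a signature (ID), problem-specific data, and a
    bounded local informational context (transient memory)."""
    def __init__(self, signature, data, n_context):
        self.signature = signature               # the "who": unique ID used for discrimination
        self.data = data                         # the "what": problem-specific payload
        self.context = deque(maxlen=n_context)   # at most Ncontext (signature, data) pairs
        self.neighbors = set()                   # current output channels
        self.recognized = set()                  # signatures this neuron recognizes

def init_contexts(neurons, mode="free", n_context=8):
    """The three context-initialization schemes of Section II-B:
    free (empty), random, or coherent with the neighborhood."""
    for n in neurons:
        n.context.clear()
        if mode == "random":
            for m in random.sample(neurons, min(n_context, len(neurons))):
                n.context.append((m.signature, m.data))
        elif mode == "neighborhood":
            for m in n.neighbors:
                n.context.append((m.signature, m.data))
        # "free": the context stays empty
```

Because the context is a `deque` with `maxlen`, appending beyond Ncontext silently discards the oldest entry, which matches the transient-memory behavior the definitions call for.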
1) Process synaptic inputs: in this phase of the algorithm, each neuron applies the local information discrimination. a) First, the cell discriminates the input messages as a function of the emitter signature to determine which of them will pass to a second discrimination stage. If no signatures are recognized (a likely situation in the first iterations), all messages pass to the second stage. b) Second, the neuron uses the memory of the processing in previous iterations stored in its local informational context to select the set of neuron informations from the messages that will finally be processed. c) Third, the cell processes this set of neuron informations by applying its corresponding transfer functions or processing rules (which are specific to the problem to be solved). If the neuron data processed
is relevant to solve the problem, the cell starts recognizing the corresponding signature and establishes a new connection with the cell identified by this signature. d) Finally, as the last step of this phase, the local informational context of the receptor is updated using the neuron information set analyzed during the processing of the input messages. Local discrimination can lead to changes in the network connectivity. Network reconfiguration as a function of the local discrimination implies nonsupervised synaptic learning. Clusters represent partial solutions to the problem. Neurons belonging to a cluster have the same processing rules. 2) Propagate information: during this phase, neurons build and send the output messages. For this task, each neuron adds its own information to the local informational context and signs the message. If the message size reaches the Ncontext value, the neuron information from the oldest cell of the context is deleted from the message (this will be illustrated in Fig. 4 for the puzzle-solver case). The output message of a neuron is the same for all output channels. 3) Restore neighborhood: if a neuron has not reached its maximum number of neighbors, it randomly tries to connect to another neuron in the same situation (only one connection per neuron and iteration). First, it tries to connect to neurons from its local informational context and, if this is not possible, to other cells. This maximizes the information propagation in the network. Establishing synapses with cells not belonging to the local context propagates information to other regions of the network. III. SNNs for Multidimensional Sorting An ANN paradigm based on local discrimination relies on the criteria used to perform the discrimination, which necessarily depend on the problem to be solved. This implies that a SNN must be designed with the specific problem in mind.
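The message-building rule of the propagation phase, where a neuron signs its context and drops the oldest entries once the message would exceed Ncontext, can be sketched as follows (an illustrative helper of our own, not the authors' code; a message is a list of (signature, data) pairs):

```python
def build_message(context, own_info, n_context):
    """Phase 2 (propagate): append the sender's own (signature, data)
    pair to a copy of its local informational context, and, if the
    message would exceed n_context entries, delete the neuron
    information from the oldest cells first."""
    message = list(context) + [own_info]   # sign the message with the sender's info
    if len(message) > n_context:
        message = message[-n_context:]     # keep only the most recent entries
    return message
```

The same message is then sent on every output channel, as the text specifies.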
To illustrate the concept and the applicability of the SNN paradigm, we will apply it to the problem of multidimensional sorting. The ideas relating neural signatures to local information discrimination have a direct application in the wide scope of multidimensional sorting problems. This is an example in which a solver can take advantage of local information discrimination, specifically when the global solution depends on local sorting criteria. For example, if we consider a scheduling problem (i.e., a specific case of multidimensional sorting where different tasks must be ordered to optimize the time or cost spent on a global problem), the SNN will find the global solution by defining a local discrimination task for the neurons, which will use the informational context as a transient memory to achieve the final sorting goal. A general multidimensional sorting problem [16] consists of finding the correct order of a set of elements in several dimensions simultaneously. Different criteria must be met in each of the dimensions to reach the solution. In many
cases, these criteria are not global but local, which makes the problem much harder. Here we will emphasize how the local information discrimination of SNNs can lead to an efficient solution of multidimensional sorting problems with local order criteria. A. Customization of the SNN In this section, we describe how to use a SNN to solve a general multidimensional sorting problem. For this task, we have to fix some of the open definitions, conditions, and constraints for the algorithm. All definitions of Section II-A apply to build the SNN network. However, the parameters regarding the dimension of the problem, the number of neighbors, the final structure of the network, and, most of all, the recognition for the local discrimination task will depend on the specific problem at hand. Common grounds for all multidimensional sorting problems to be solved with a SNN are as follows. 1) The number of neurons of the network is equal to the number of elements to sort. There is a one-to-one relationship between the neurons and the elements to sort. 2) The neuron signature can be the neuron number or some other value that unequivocally identifies each neuron. 3) The neuron data of each cell is a structure with information about the element to sort in each dimension (e.g., in a scheduling problem, the neuron data will be a task with a cost, effort, and priority). 4) The network is d-dimensional, where d is the number of dimensions of the problem (each dimension can use a specific sorting algorithm, global or local). 5) The information propagation mode is multisynaptic. 6) The compatibility for an element is given by the sorting criterion. If the sorting criterion is local, elements can only be compatible or not compatible. If the sorting criterion is global, a best-compatibility measure can be assigned among different elements.
During the algorithm evolution, neurons can dynamically change their compatibilities and, thus, the set of signatures recognized in a given iteration, to adjust the discrimination rules. 7) The sorting criterion defines two possible neighbors in each dimension, the previous and the next element to be sorted. 8) If none of the neurons has learned a new signature for a given number of iterations, the network reaches the stop condition. B. Implementation of the SNN Here we present a brief pseudocode for the SNN paradigm to solve a general multidimensional sorting problem. The following notation is used: expression → variable means that variable takes the value of the evaluation of the expression; variable[] means that variable is a vector; and Signature(ni), Compatibility(ni, nj), and Context(ni) denote the signature, the compatibility with nj, and the local informational context
of neuron ni, respectively. P is the probability of establishing a new connection with a cell of its local informational context during the processing phase. T is the threshold for the maximum number of iterations in which a neuron is allowed to have an incomplete neighborhood. The initialization of the network consists of the following steps: 1) assign each element to sort to a neuron of the network; 2) establish random connections between cells to build the initial network architecture; 3) for each neuron of the network → ni : a) initialize Context(ni) using one of the initialization algorithms (see Section II-B). The neural network main function is repeated until the end condition is fulfilled. This function consists of the following steps in each neuron: 1) Process synaptic input: a) synaptic input messages → inputs[]. Regarding the who of the information (steps b) and c) constitute the first discrimination stage, while step e) corresponds to the second discrimination stage): b) select from inputs[] those messages sent by an emitter with a recognized signature → recognized[]; c) if recognized[] is empty, select all messages from inputs[] → recognized[]; d) for each emitter in recognized[] → emitter: i) if receptor recognizes Signature(emitter), reconfigure the network to place elements in their correct position; e) select randomly Ncontext neurons not included in Context(receptor) from messages in recognized[] → in.
Regarding the what of the information: f) for each dimension, process information of in: i) sort neurons of in with the corresponding sorting algorithm → sorted[]; ii) choose from sorted[] those neurons that are compatible with receptor → ni ; iii) if ni exists: - choose the corresponding neighbor of receptor for the corresponding dimension → neighbor; - if Compatibility(receptor, ni ) is better than Compatibility(receptor, neighbor): (i) Break connection between receptor and neighbor and connect receptor and ni ; (ii) receptor starts recognizing Signature(ni ); (iii) receptor stops recognizing Signature(neighbor). iv) Else search in Context(receptor) for a cell with incomplete neighborhood → nj . - If nj exists, connect receptor and nj with probability P. g) Update Context(receptor) with in.
2) Propagate information: a) for each neuron of the network → ni : i) for each neighbor of ni → nj , send messages between ni and nj . 3) Restore neighborhood: a) for each neuron with incomplete neighborhood → ni : i) search for a side of ni without a neighbor → empty; ii) search for a neuron in Context(ni ) without a neighbor in the opposite side to empty → nj ; iii) if nj exists, connect ni and nj through empty; iv) else, if ni has had incomplete neighborhood for a number of iterations larger than T : - choose randomly among the non-neighbors of ni a neuron different than ni → nk ; - break the connection of nk in the opposite side to empty and connect ni and nk through empty. This general implementation can be used in a wide variety of problems by customizing the discrimination rules. In the Appendix, we describe in detail a multidimensional sorting example that helps the reader to further understand the use of the local informational context and the local information discrimination. To test the performance of the SNN framework, we will first consider another example in which the discrimination rules are well known.
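The overall iteration structure of the pseudocode above, with the stop condition of item 8 (no neuron learns a new signature for a given number of iterations), can be sketched as a driver loop. This is a hypothetical harness of our own: the three phase callbacks stand in for the problem-specific rules the paper leaves open.

```python
def run_snn(neurons, process_inputs, propagate, restore_neighborhood,
            patience=50, max_iter=100000):
    """Iterate the three SNN phases over every neuron until no unit has
    learned a new signature for `patience` consecutive iterations
    (the stop condition), or until a safety cap is reached."""
    idle = iterations = 0
    while idle < patience and iterations < max_iter:
        # Phase 1: local discrimination; each callback returns True when
        # its neuron starts recognizing a new signature this iteration.
        # (The list comprehension deliberately avoids short-circuiting,
        # so every neuron is processed.)
        learned = any([process_inputs(n) for n in neurons])
        for n in neurons:               # Phase 2: sign and send messages
            propagate(n)
        for n in neurons:               # Phase 3: reconnect incomplete neighborhoods
            restore_neighborhood(n)
        idle = 0 if learned else idle + 1
        iterations += 1
    return iterations
```

The `patience` counter resets whenever any neuron learns a new signature, so the network keeps iterating as long as the self-organization is still making progress.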
IV. SNN for the Jigsaw Puzzle Problem A. Problem Definition Jigsaw puzzles are a specific case of multidimensional sorting problems in which the order criterion is local and is given by the fitting among pieces. A typical jigsaw puzzle is a 2-D picture that has to be rebuilt from different fragments or pieces. Once the pieces are mixed, the solution to the problem consists of reassembling them into the original picture. The difficulty of solving the puzzle depends mainly on the number of pieces, on their assembly complexity, and on the graphical representation of the picture. Rebuilding a jigsaw puzzle without the original image is an NP-complete problem [11]. Efficiently solving jigsaw puzzles is considered a classical fitting or pattern recognition problem, and the algorithms to solve it have potential applications in many different fields of knowledge, such as archeology, art restoration, failure analysis, steganography, and others. For example, such algorithms have been used to reassemble manuscripts from separate pieces [26], to rebuild broken objects [20], [25], [37], [40], to send secure messages over a nonsecure channel [45], to hide secret messages in seemingly innocuous carriers [12], or even to design evolutionary algorithms to solve complex problems [44]. Although some of these problems can be considered as 3-D jigsaw puzzle assembly, we focus our work on solving 2-D puzzles such as the one shown in Fig. 1. Jigsaw puzzle pieces
B. General Solver Schema
Fig. 1. Example of a canonical jigsaw puzzle with a picture of Mount Kilimanjaro. Pieces are rectangular, and the number of neighbors is four except for the border pieces. Corners have two neighbors, and the rest of the border pieces have three. The solution to the puzzle consists of reassembling the pieces into the original picture once they are mixed.
are typically rectangular and fit with their contiguous neighbors, of which there are usually four except for the border pieces. The full picture is usually square-shaped. Puzzles with these constraints are called canonical jigsaw puzzles [43]. Our method and results can be easily extended to rebuild 3-D objects from fragments with a different number of neighbors. Although the solution to the problem involves several different tasks, the research literature about jigsaw puzzles and the reconstruction of broken objects is mainly focused on algorithms to test the matching of pieces according to their shape [4], [6], [14], [15], [27], [30], [38], [42] and, more recently, also on image (texture and color) matching [8], [21], [31], [43]. Different techniques have been used for this goal: shape matching [43], image merging [43], neural networks [32], genetic algorithms [35], best-first search [5], and so on. To solve the jigsaw puzzle, pairs of pieces are chosen (randomly or with a heuristic method) to test their fitting. Several tasks related to solving the puzzle can also be considered part of a sorting or classification problem, in the sense that pieces must be sorted and clustered in different groups to reduce the search space for a correct fitting. However, the performance of the associated sorting algorithm is usually disregarded. Typically, it is thought that the efficiency of solving the problem is mainly related to the way the solver determines whether two pieces can fit, rather than to the way the pieces are sorted and classified to test this fitting. Here we focus on the sorting and classification tasks. The jigsaw puzzle problem is interesting in the context of our study because the sorting and classification algorithms are multidimensional sorting problems that can take advantage of local information discrimination to reduce the search space for correct fittings [28], [40].
If similar pieces are grouped into sets, each piece only needs to be compared with those in the same set. We have used the proposed paradigm to build a neural network that efficiently implements the fitting algorithm. With this paradigm, we improve the performance of jigsaw puzzle solving by optimizing the strategy for choosing pairs of pieces to test their fitting.
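The search-space reduction just described, comparing each piece only against those in its own group, can be illustrated with a short generator. This is a generic sketch (the grouping key, e.g., dominant color or number of straight edges, is an assumption left to the caller), not the paper's algorithm.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(pieces, key):
    """Group pieces by a similarity key and only generate fitting tests
    within each group, instead of over all pairs of pieces."""
    groups = defaultdict(list)
    for piece in pieces:
        groups[key(piece)].append(piece)
    for group in groups.values():
        # Only pieces sharing the key are candidates for a correct fitting.
        yield from combinations(group, 2)
```

With n pieces split evenly into g groups, this yields on the order of n²/(2g) fitting tests instead of n²/2, which is the benefit the grouping strategy aims for.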
Traditional puzzle-solver algorithms follow a common general schema to find the correct solution. The reconstruction of the puzzle (or the object, in the general case) is usually an exhaustive search over all pieces or fragments trying to find the best fittings. Therefore, we can consider that the general algorithm is as follows: 1) choose a piece (P1) from the set of available pieces; 2) search for one piece (P2) that fits with P1 through one of its borders; 3) assemble both pieces into a new single piece; 4) add this new piece to the set of available pieces, deleting P1 and P2; 5) return to the first step until only one piece is left. Differences between existing approaches arise both from the algorithm used to test the matching of pieces P1 and P2, and from the one used to select which pieces are to be tested. Therefore, the performance of a jigsaw puzzle solver depends mainly on these two algorithms. C. Traditional Algorithms to Choose Pieces to Compare In classical approaches to solving jigsaw puzzles, the algorithms used to select a pair of pieces to test their fitting are based on the way humans solve jigsaw puzzles. First, they search for border pieces. Later, they can choose to group the rest of the pieces into different sets (e.g., according to the number of straight edges of each piece, according to their colors, or any other similitude metric) to make the search easier by focusing only on pieces with a greater probability of fitting. Finally, they try to find the correct fitting for each piece of the puzzle. Traditional algorithms for piece selection are stochastic at different levels. In most cases, they are brute-force techniques that sort pieces until the correct solution is found [6]. Pieces are placed randomly, and if the solution is not reached there is a new random search iteration. Alternatively, each piece is compared with all the rest until the correct fittings are found [14].
In other cases, for each piece of the puzzle, the fitting is tested only for a subset of the available pieces. For example, in many approaches key pieces are identified first and then assembled independently using different heuristics. This set of approaches sorts all possible matchings according to specific measures to find the best candidates to fit as a function of the shape and/or graphical content of the piece. Here, pieces are chosen following an order (best first, highest confidence first, and so on) and not randomly [5], [10], [15], [20], [28], [42], [43]. Thus, the search space is reduced. These algorithms require the calculation of complex similitude measures to be effective. The measures that are easy to calculate do not always give a good performance. Experiments reported in the literature using this kind of algorithm use puzzles with fewer than 300 pieces. A detailed comparison between different image-feature solving methods can be found in [28]. D. SNN to Solve Jigsaw Puzzles In this section, we describe in detail how to use the SNN to solve the jigsaw puzzle problem, and in the next section we will provide a pseudocode for the implementation of
Fig. 2. Example of iteration status in a fragment of the proposed neural network. In the jigsaw puzzle case, signatures are the neuron numbers (10, 11, 12, . . . ) and the data are the specific pieces of the puzzle.
this algorithm. The SNN paradigm defines a different search than the general puzzle-solver search schema described in Section IV-B. Here, the processing units are neurons that try to find the best fitting locally. The SNN described in Section III-B for the general multidimensional sorting case can be easily adapted to the jigsaw puzzle problem. However, to allow a fair comparison between the performance of the SNN and a traditional stochastic algorithm (SA), we need to impose some restrictions on the general framework. Note that the SNN can also be applied without these restrictions, as we will discuss later. 1) The number of neurons of the network is equal to the number of pieces of the puzzle. 2) The neuron signature is the neuron number. There exist different matching algorithms that use different metrics to represent the characteristics of a piece or fragment [19], [18], [41]. For example, objects can be represented by “shape signatures,” which are strings obtained by an approximation of the boundary curve. The signature of a neuron could be the shape signature of the piece that it contains. However, as we are not interested in evaluating the fitting algorithm, to simplify our implementation we use the neuron number as the neural signature. 3) Now, the neuron data of each cell is one piece of the puzzle (this is illustrated in Fig. 2). 4) As we solve canonical jigsaw puzzles, the maximum number of neighbors is four, one for each side of the piece that the neuron represents (up, down, left, and right). The neighbor order is important: up-down and left-right are opposite sides. In a more general case, e.g., to rebuild broken objects, there could exist more than four neighbors. The SNN has periodic boundary conditions. 5) The initial structure of the network is 2-D, with each cell connected to its four nearest neighbors.
Fig. 3. Reconfiguration of the SNN when two neurons recognize their signatures. For example, neurons 18 and 25 of Fig. 2 recognize their signatures; however, their corresponding pieces are not well located. The piece corresponding to neuron 25 has to be located to the left of the piece corresponding to neuron 18, not below it (compare Fig. 1). When the network is reconfigured, 1) the connections between 17–18 and 25–26 are broken; 2) 17 and 25 are interchanged; and 3) 18–25 are connected in their correct position. In this example, as a consequence of this network reconfiguration, neurons 18 and 25 temporarily have only three neighbors. Note that neurons 17 and 26 are now connected as a consequence of the SNN reconfiguration.
6) In our example, the information propagation mode is monosynaptic, i.e., only one input message is processed per iteration. Fig. 4 shows the way messages are built and propagated with this choice of parameters.
7) When two neurons contain pieces with a complementary border (borders that match correctly), they are compatible. For example, in Fig. 2, since neurons 18 and 25 contain complementary pieces, they are compatible and recognize their signatures. In the puzzle solution (see Fig. 1), the piece that corresponds to neuron 25 is located to the left of the piece that corresponds to neuron 18. When a neuron recognizes the signature of another cell, the network is reconfigured to move pieces to their correct positions (Fig. 3). Neural signatures make it possible to identify the source of the information and achieve the correct fitting by reconfiguring the network from the starting 2-D structure to a multidimensional one. At the end of the self-organization, the network recovers a 2-D structure.
8) If a neuron belongs to a cluster: a) it does not process the part of its informational context related to its neighbors; and b) it does not add its own neuron data to its output. Neurons in a cluster are only relays of their input information.
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
Fig. 4. Synaptic transmission example for the SNN shown in Fig. 2. In this example, we consider Ncontext = 3. If a message follows the path 10−11−12−19, in iteration 1, the message only consists of information about neuron 10. In iterations 2 and 3, information about neurons 11 and 12 is added to the head of the message. Finally, in iteration 4, neuron 19 deletes the tail information of its input message (information about neuron 10) and adds its own information to the head of its output message.
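The head/tail mechanics of Fig. 4 can be sketched with a bounded double-ended queue: each hop adds the sender's information at the head of the message and discards the tail entry once the message already carries Ncontext entries. The function and variable names below (`hop`, `sender`) are our own illustrative choices.

```python
from collections import deque

def hop(message, sender, n_context=3):
    """Return the message after one synaptic hop: the sender's
    information goes to the head; when the message is full, the
    oldest (tail) entry is dropped automatically by the deque."""
    out = deque(message, maxlen=n_context)
    out.appendleft(sender)
    return list(out)

# Reproducing the path 10-11-12-19 of Fig. 4 with Ncontext = 3:
msg = []
for neuron in [10, 11, 12, 19]:
    msg = hop(msg, neuron)
# at the last hop, neuron 19 adds its own information to the head
# and the tail information (neuron 10) is deleted
```

After the fourth hop the message is `[19, 12, 11]`, exactly the behavior described in the caption of Fig. 4.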
E. Puzzle Solver Implementation
Here we present brief pseudocode for the SNN paradigm to solve jigsaw puzzles, with the same notation as in Section III-B. In the simulations discussed in this paper, the values of P and T are 0.1 and 10, respectively.
1) Process synaptic input.
a) Synaptic input message → inputs[] (note that when information propagation is monosynaptic, inputs[] only contains one input message).
Regarding the who of the information:
b) select from inputs[] those messages sent by an emitter with a recognized signature and not in its correct position → recognized[];
c) if recognized[] is empty, select randomly a message of inputs[] → recognized[];
d) for each emitter in recognized[] → emitter:
i) if receptor has recognized Signature(emitter), reconfigure the network to move pieces to their correct position (see Fig. 3);
e) select randomly Ncontext neurons not included in Context(receptor) from messages in recognized[] → in.
Regarding the what of the information:
f) for each dimension, process the information of in as follows:
i) search in in for a neuron whose signature has not been recognized by receptor but which has a complementary piece to Piece(receptor) in the corresponding dimension → ni. Note that the part of the incoming messages about neurons whose signature is recognized is not processed.
ii) If ni exists:
- connect receptor and ni;
- receptor starts recognizing Signature(ni). This means that the emitter will recognize the signature of the receptor in the next iteration.
iii) Else, search in Context(receptor) for a cell with an incomplete neighborhood → nj.
- If nj exists, connect receptor and nj with probability P.
g) Set Context(receptor) equal to the set of neuron information in in. If receptor belongs to a cluster, do not include the neuron information of any neuron in its neighborhood when building Context(receptor).
2) Propagate piece information.
a) For each neuron of the network whose corresponding piece is not in its correct position → ni:
i) search for a neighbor of ni with a signature not recognized by ni which contains a complementary piece to Piece(ni) → nj;
ii) if nj exists, send messages between ni and nj;
iii) else, choose randomly a neighbor of ni (note that information propagation is monosynaptic) → nk.
- If nk exists, send messages between ni and nk.
3) Restore neighborhood.
a) For each neuron whose corresponding piece is not in its correct position and whose neighborhood is incomplete → ni:
i) search for a side of ni without a neighbor → empty;
ii) search for a neuron in Context(ni) without a neighbor on the side opposite to empty → nj;
iii) if nj exists, connect ni and nj through empty;
iv) else, if ni has had an incomplete neighborhood for a number of iterations larger than T:
- choose randomly among the non-neighbors of ni a neuron different from ni → nk;
- break the connection of nk on the side opposite to empty and connect ni and nk through empty.
Note that the only significant change with respect to the pseudocode described in Section III-B is related to the what of the information.
F. Methodology and Validation
1) How to Evaluate the Performance of the SNN: To evaluate the SNN on jigsaw puzzles, we have compared its performance with that of a traditional SA based on the general solver schema described in Section IV-B. The SA consists of the following steps.
a) For each piece of the puzzle (Pi) repeat N times:
i) choose randomly a piece of the puzzle (Pj);
ii) if Pi and Pj have a complementary side:
- assemble both pieces into a new single piece;
- add the new piece to the set of available pieces, deleting Pi and Pj.
b) Return to the first step until only one piece is left.
The number of attempts to find a complementary piece for each piece per iteration (N) is the main parameter of the SA.
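The "who" discrimination stages 1b)-1c) of the pseudocode above can be sketched as a filter with a random fallback. This is a hedged illustration only; the message representation as a dictionary with an `emitter` field is our own assumption.

```python
import random

def select_recognized(inputs, recognized_signatures, in_correct_position):
    """Step 1b): keep messages whose emitter has a recognized signature
    but whose piece is not yet in its correct position.
    Step 1c): if none qualify, fall back to one random input message."""
    chosen = [m for m in inputs
              if m["emitter"] in recognized_signatures
              and not in_correct_position[m["emitter"]]]
    if not chosen:
        chosen = [random.choice(inputs)]
    return chosen
```

With the neurons of Fig. 2, an input from neuron 25 (recognized, not yet in place) would be selected ahead of an input from an unrecognized emitter.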
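The SA steps a)-b) above can be sketched as follows. This is a simplified illustration under two assumptions of ours: "complementary" is decided by a user-supplied predicate `fits(a, b)`, and assembling two pieces just merges the frozensets of original fragments they contain.

```python
import random

def stochastic_solver(fragments, fits, n_attempts):
    """Sketch of the SA: for each piece, try N random merges per
    iteration, until a single assembled piece remains."""
    pieces = [frozenset([f]) for f in fragments]
    iterations = 0
    while len(pieces) > 1:                # step b): repeat until one piece
        iterations += 1
        for piece in list(pieces):        # step a): visit every piece
            if piece not in pieces:       # already merged this iteration
                continue
            for _ in range(n_attempts):   # N random attempts per piece
                other = random.choice(pieces)
                if other is piece:
                    continue
                if fits(piece, other):    # step a-ii): complementary side
                    pieces.remove(piece)
                    pieces.remove(other)
                    pieces.append(piece | other)  # new single piece
                    break
    return pieces[0], iterations
```

For a 1-D toy "puzzle" whose pieces fit when they contain consecutive indices, the solver merges everything into one piece; the returned iteration count is the quantity used in the performance comparison below.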
LATORRE et al.: SIGNATURE NEURAL NETWORKS: DEFINITION AND APPLICATION TO MULTIDIMENSIONAL SORTING PROBLEMS
We consider this value equivalent to the context size in the SNN (Ncontext). In Section IV-C we reviewed the traditional algorithms for choosing which pieces to compare in order to solve the puzzles. All of them are stochastic to different degrees and can be described under this general scheme, some of them simply by setting the number of attempts to find a complementary piece for each piece per iteration (N). For the rest, specific rules need to be added to execute the SA on different sets of pieces (e.g., border pieces), or to use a priori knowledge about the matching of two pieces (e.g., their concavity and convexity). This set of rules can also easily be added to the SNN. When they are applied, the improvement is equivalent for both methods under the same conditions. Here we do not use them, in order to compare both approaches in the simplest case.
2) What Puzzles to Solve: To test the viability of the proposed algorithm and to compare it with traditional approaches, we have solved several puzzles of different sizes with the SNN and with the SA described in the previous section. In all our tests we have used computer-generated canonical square jigsaw puzzles of size n × n. To generate these puzzles, we divided pictures into n × n square fragments and mixed them randomly.
3) How to Test the Piece Matching: To test the piece matching we use information about the overall picture. Before mixing the pieces, we save the neighborhood of all the fragments. In this way, during the algorithm evolution, we can evaluate whether two pieces fit or not.
4) How to Quantify the Performance: To assess the algorithm performance we use three measurements that allow us to compare the different methods in terms of time requirements and effectiveness: the average number of iterations to solve the puzzles, the average total number of fitting tests needed, and the effective number of fitting tests (see below). These three measurements allow us to analyze our results quantitatively.
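The validation setup of points 2) and 3) can be sketched as follows: a canonical n × n puzzle is generated by fragmenting a grid, the true neighborhood is recorded before shuffling, and a fitting test then reduces to a lookup. All names here are our own; the paper does not give an implementation.

```python
import random

def make_puzzle(n):
    """Generate a canonical n x n puzzle: record each fragment's true
    neighbors (None at the picture border), then mix the fragments."""
    true_neighbors = {}
    for r in range(n):
        for c in range(n):
            piece = r * n + c
            true_neighbors[piece] = {
                "up": (r - 1) * n + c if r > 0 else None,
                "down": (r + 1) * n + c if r < n - 1 else None,
                "left": r * n + c - 1 if c > 0 else None,
                "right": r * n + c + 1 if c < n - 1 else None,
            }
    pieces = list(range(n * n))
    random.shuffle(pieces)               # mix the fragments
    return pieces, true_neighbors

def fit(a, b, side, true_neighbors):
    """Fitting test: does piece b belong on the given side of piece a?"""
    return true_neighbors[a][side] == b
```

During the algorithm's evolution, every call to `fit` counts as one fitting test toward the measurements defined in point 4).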
A method's performance is often evaluated using the average time needed to solve the jigsaw puzzle. Let us define an iteration as a cycle of the algorithm in which all the processing units (pieces in the SA and neurons in the SNN) are updated. This measure is therefore equivalent for both methods in the sense that, in each iteration, both try to find the best fittings for all the pieces of the puzzle. To have a measure of performance independent of the computing power and of the quality of the implementation, here we quantify the performance of the algorithms in terms of the average number of iterations needed to solve puzzles of different sizes. Fitting algorithms can be complex and computationally expensive. Therefore, performance improves as the total number of fitting tests is reduced, independently of the number of iterations. We consider that a fitting test takes place every time the borders of two pieces are compared during the algorithm evolution. For example, when comparing two pieces with four borders each, a maximum of 4 × 4 = 16 fitting tests is performed. Finally, the effective number of fitting tests per iteration is defined as the percentage of correct matchings between pieces in each iteration of the algorithm. This quantity is
used to assess the relationship between the two previous measurements. To illustrate the results of the performance comparison between the proposed SNN and the SA, we calculate the difference between the values of the above-defined measures for each algorithm. Thus, let us define the following "distances":
d_it = Iterations_SA − Iterations_SNN
d_tests = Tests_SA − Tests_SNN
d_eff = EffectiveTests_SNN − EffectiveTests_SA.
Negative values of these distances mean poor performance of the SNN as compared with that of the SA. Note that the larger the number of iterations and the larger the number of fitting tests, the worse the performance; conversely, the larger the number of effective tests, the better the performance.
5) Simulation Parameter: The main parameters in our simulations are the local informational context size (for the SNN) and the number of attempts to find complementary pieces in each iteration (for the SA). For each piece, these values indicate the maximum number of pieces for the fitting test per iteration. In this sense, we consider both parameters equivalent for the purpose of comparing the performance of the two algorithms. From here on, these quantities are called simulation parameters. In all our tests we set the simulation parameter to a percentage of the puzzle border length. For example, when we deal with puzzles of size 50 × 50 and we say that the simulation parameter is 10%, the size of the local informational context of the SNN and the number of fitting attempts in the SA are both equal to 5 (10% of 50). Note that the storage requirement of the SNN is O(N · Ncontext), where N is the number of neurons in the network.
G. Context Initialization
To test the dependency of the SNN on the initial conditions, we have used the three different context initializations proposed in Section II-B: a free context initialization, a random context initialization, and a neighborhood context initialization.
H. Results
To assess the viability of the SNN paradigm we have solved several canonical jigsaw puzzles of different sizes: from puzzles of 5 × 5 pieces to puzzles of 100 × 100 pieces, increasing the border size in steps of five pieces. In all cases we compare the results obtained with the SNN with those obtained by solving the puzzles with the SA described in Section IV-F1. As mentioned before, we evaluate the performance as a function of the simulation parameter: the size of the local informational context for the SNN and the number of attempts to find complementary pieces per iteration for the SA. For each size, the simulation parameter goes from 10% to 100% in steps of 5%. The first result observed in our tests is that the SNN performance does not depend on the context initialization. There are only very small differences resulting from the three
Fig. 5. Comparison between the mean number of iterations needed to solve 100 puzzles of 25 × 25 (top) and 100 × 100 pieces (bottom) with the SA and the SNN. The x-axis is the simulation parameter (see Section IV-F5). The y-axis is the mean number of iterations needed to solve 100 different puzzles. For small puzzles, the performance of the neural network improves as the value of the simulation parameter increases, but it is never better than the performance of the SA. For large puzzles, the performance of the SNN is better than the SA for small values of the simulation parameter.
methods proposed to initialize the network. These differences are significant only when the size of the local informational context and the puzzle size are large (greater than 80% and 75 × 75, respectively). Taking this result into account, we have decided to use the free context initialization in all the simulations discussed here, since this is the method that uses no a priori information to solve the problem. We start the comparison between the performance of the SNN and the SA by analyzing the results of solving jigsaw puzzles in terms of the number of iterations (Fig. 5) and the number of fitting tests (Fig. 6) required to solve 100 puzzles of small size (25 × 25) and 100 puzzles of large size (100 × 100). Figs. 5 and 6 show that there is no clear relationship between the two measures. While in terms of the number of iterations the performance of the SA is generally better, in terms of the number of fitting tests the situation is the opposite. This is an interesting result in the context of the jigsaw puzzle problem: it means that the way pieces are chosen for the fitting test is important for improving the performance of the solver. Fig. 5 shows that the performance of the SNN in terms of the mean number of iterations is better only for large puzzles when the simulation parameter is small (smaller than 30%). For example, with a simulation parameter equal to 10%, our algorithm requires a mean of 894 iterations, while the SA needs a mean of 1036, i.e., a performance improvement of 14%. As one might expect, the efficiency of the puzzle solver (for both methods) in terms of the number of iterations improves
Fig. 6. Mean number of fitting tests needed to solve 100 puzzles of 25 × 25 (top) and 100 × 100 pieces (bottom) as a function of the simulation parameter. This measure is smaller for the SA only for small puzzles with a small simulation parameter. In all other cases, the number of fitting tests needed to solve the puzzles is always smaller for the SNN.
with larger simulation parameters. With the SA, the number of iterations decreases toward a single iteration as the simulation parameter tends to 100%: the larger the number of attempts, the larger the probability of finding the right piece in each iteration. The extreme case occurs when the number of attempts to find a complementary piece is equal to the total number of pieces; in this case, puzzles can be solved in only one iteration. The SNN can never achieve this performance level because it has an adaptation period during the initial iterations, needed to fill the local informational contexts of all the neurons with information relevant to each unit. For example, the bottom panel of Fig. 5 shows that with a simulation parameter equal to 100% (x-axis), the SA needs an average of 106 iterations, while the SNN needs an average of 219 (about 52% worse). Thus, for large values of the simulation parameter, the SA always requires fewer iterations to solve the puzzle. However, for large puzzles, the computational cost of the local information discrimination is less significant compared with the total number of iterations needed to solve the puzzle. Therefore, the performance of the SNN improves as the context size becomes smaller. Fig. 6 shows the results for the evaluation of the mean number of fitting tests. In general, the performance of the SNN is better in this case. The only exception is for small puzzles with a small value of the simulation parameter. Again, this is due to the initial adaptation period of the SNN. For example, with a simulation parameter equal to 10%, the performance of the SNN is approximately 75% worse (around 6 × 10⁶ more fitting tests). Outside this region, the mean number of fitting tests depends on the puzzle size but not on the simulation parameter. For small puzzles the number of fitting tests is
Fig. 7. Left panels: comparison between the number of iterations needed to solve puzzles of a given size with the SA and the SNN for different values of the simulation parameter. Right panels: comparison between the mean number of fitting tests between pieces needed to solve the puzzles with both approaches. The x-axis is the simulation parameter (from 10% to 100% of the puzzle border size). The y-axis is the puzzle border size (the total size of the puzzle goes from 10 × 10 to 100 × 100 pieces). In the top panels, the z-axis is the difference between the corresponding average distances (d_it, left, and d_tests, right) for solving 100 different puzzles with both algorithms. The dark plane shows the zero value. Above this plane, the number of iterations or fitting tests needed with the SA is greater than with the SNN. The bottom panels show these distances as contour maps. Lighter colors denote regions where the performance of the SNN is better. In terms of the number of iterations, the larger the puzzle and the smaller the simulation parameter, the better the performance of the SNN. The worst performance of the SNN appears in the region where both the puzzle size and the simulation parameter are small. In the remaining regions the performance is slightly worse than that of the SA (compare Fig. 5). In terms of the number of fitting tests, the performance of the SNN is always better than that of the SA except for small puzzles with a small value of the simulation parameter.
very similar for both methods. However, for large puzzles (bottom panel) the difference between the two algorithms is approximately 15 × 10⁷ fitting tests, which means a performance improvement of 24% by the SNN. These results suggest that the larger the puzzle, the better the performance of the SNN in terms of the number of fitting tests, independently of the simulation parameter. To extend the analysis, we calculated the distances d_it and d_tests for solving 100 different puzzles with the SA and the SNN over a wide range of puzzle sizes and simulation parameters (Fig. 7). The results are in agreement with those shown in Figs. 5 and 6. For the mean number of iterations, the performance space can be divided into three different regions (Fig. 7, left panels).
1) For puzzles of moderate size (up to 50 × 50) and a small simulation parameter (smaller than 20%), the performance of the SNN is poor compared with that of the SA.
2) For puzzles with more than 50 × 50 pieces and a simulation parameter between 10% and 30%, the performance of the SNN is better, i.e., the value of d_it is greater than 0.
3) In the remaining cases, the performance of the SA is better, but very similar to that provided by the SNN.
Regarding the number of fitting tests (right panels of Fig. 7), the performance space can also be divided into three different regions.
a) For small puzzles and small simulation parameters, the performance of the SA is better.
b) The second region also corresponds to small puzzles, but now with the largest values of the simulation parameter. Here, the mean number of fitting tests needed to solve the puzzles is very similar for both methods.
c) For puzzles larger than 45 × 45, the SNN has the best performance independently of the value of the simulation parameter. The improvement clearly increases with the puzzle size. For example, for puzzles of size 75 × 75 the performance of the SNN improves by approximately 10% with respect to the SA. For puzzles of 100 × 100 pieces this improvement is 24%.
The results shown in Fig. 7 suggest that there is no clear relationship between the number of iterations and the number of fitting tests needed to solve a puzzle. However, both distances increase as the puzzle size increases and as the local informational context decreases. To address this point we have used a small context (10% of the border size) to solve
100 different puzzles with 100 × 100, 200 × 200, 300 × 300, 400 × 400, and 500 × 500 pieces. Fig. 8 shows the results of these tests. Both the number of iterations and the number of fitting tests increase with the puzzle size. However, the growth rate is larger for the SA. Therefore, the corresponding values of the distances d_it and d_tests also increase with the puzzle size. For example, for puzzles of 500 × 500 pieces, d_it ≈ 2000 iterations and d_tests ≈ 175 × 10⁹ fitting tests. These values represent a 40% performance improvement by the SNN in both cases. The performance comparison between the SNN and the SA using the measure d_eff is shown in Fig. 9. Effectiveness is measured as the percentage of correct matchings between pieces in relation to the total number of fitting tests in one iteration. Each panel in Fig. 9 corresponds to the contour plot of d_eff for a specific context size as a function of the iteration number and the puzzle size. Note that the SA has better performance in the initial iterations, especially for large context sizes (darker regions in the contour plots), while the SNN needs some iterations to achieve a minimum level of self-organization. Then the SNN improves and the effectiveness becomes similar for both approaches (white regions where d_eff is close to zero). For small puzzles, the SA solves the puzzle before the SNN reaches a minimum effectiveness level; the effectiveness improvement is not translated into a better performance in terms of the number of iterations. However, for large puzzles (and especially for small values of the context size, panels a and b), the opposite situation occurs: the SNN starts with a poor performance, but then reaches a high effectiveness level and solves the puzzle before the SA. Taking into account the results shown in Fig. 8, the performance advantage of the SNN over the SA will further increase for larger puzzles. A large local context does not imply optimal performance. When the context size is increased, the total number of fitting tests needed to solve the puzzle is very similar to the one needed with a small context (right panels of Fig. 7). In the limit, for a context size close to the total number of neurons in the network, the local processing becomes equivalent to the global processing of the SA. In this case, the problem can be solved in only a few iterations, but this does not mean that the number of fitting tests decreases. Based on all our measurements, we can conclude that the best performance of the SNN (in terms of both the number of fitting tests and the number of iterations) is achieved for large puzzles using a relatively small context. This combination provides the optimal balance between the number of iterations and the number of comparisons needed to efficiently solve the problem.
Fig. 8. Top panel: comparison between the mean number of iterations needed to solve puzzles of different sizes (100 × 100, 200 × 200, 300 × 300, 400 × 400, and 500 × 500) with a simulation parameter equal to 10%. The y-axis is the mean number of iterations needed to solve the puzzles. This is a linear function of the border size for both algorithms. Note that the slope for the SA is greater. The larger the puzzle, the better the performance of the SNN. Bottom panel: comparison between the mean number of fitting tests needed to solve the puzzles with a simulation parameter equal to 10%. In this case, the performance is also better for the SNN, but the number of fitting tests increases nonlinearly with the puzzle size.
V. Discussion
In this paper we have introduced a self-organizing neural network paradigm that is able to discriminate information locally using a strategy for information processing inspired by recent findings in living neural systems. The network uses neural signatures to identify each unit, a transient memory in each cell to keep track of the information and its sources, and a multicoding mechanism for information propagation. This provides each neuron with the ability to discriminate its inputs during processing. To illustrate that the proposed paradigm can use these strategies to efficiently solve a problem, we have defined a general framework for multidimensional sorting problems and applied it to a classical task: the assembly of jigsaw puzzles. We have compared the results of our new approach with those of a classical stochastic method, and we have pointed out the situations in which the new paradigm improves the performance. We have analyzed the performance of the proposed algorithm in terms of the effort needed to solve the problem according to two different measurements (number of iterations and number of fitting tests). In both cases we have found a similar result: local information discrimination has a computational cost that is evident in small puzzles. For large puzzles this computational cost is justified, as our results show that local discrimination provides better performance. Due to the nature of the jigsaw puzzle problem, we have limited our analysis to the case in which each neuron receives
Fig. 9. Evolution of d_eff as a function of the puzzle size for three different context sizes. Dark colors indicate the cases where the SA has better effectiveness (d_eff < 0). Conversely, light colors indicate better effectiveness for the SNN (d_eff > 0). Solid/dashed lines denote the average number of iterations needed to solve a puzzle of a specific size with the SA/SNN paradigm (calculated over 100 puzzles). In the initial iterations the effectiveness of the SA is always better. Then the SNN improves and the effectiveness becomes similar for both approaches (white regions where d_eff is near 0). For small puzzles the SA solves the problem before the SNN reaches a minimum effectiveness level. However, for large puzzles (and especially for small values of the context size, panels a and b), the opposite situation occurs. (a) Context size = 15. (b) Context size = 25. (c) Context size = 50.
and processes one input message per iteration. This restriction allowed us to compare the SNN performance with a classical approach under equivalent conditions. If we consider multiple messages per iteration in the SNN, neurons can process a larger amount of information in parallel. In this case, the local informational context can be built in different ways. In a multiple-message scenario, we have randomly chosen different fragments of the inputs of each cell to build the context. This strategy leads to solving the problem in fewer iterations (see top panel of Fig. 10). However, the number of fitting tests required is larger, as expected (bottom panel of Fig. 10). On the other hand, we have only used the local informational context to store neuron data from cells whose piece matching is to be tested. Alternatively, a "negative context" can be used to temporarily save information about cells whose matching has already been tested with a negative result and whose information is thus considered not useful for the neuron. This negative context reduces not only the number of fitting tests, but also the number of iterations. In the context of the jigsaw puzzle problem, the SNN defines a different search than the general puzzle-solver search schema. Local discrimination allows pieces to be grouped dynamically into clusters with no a priori information. Each cluster contains pieces that have a high probability of matching. Regarding the problem of reassembling real 3-D broken objects, this is a desirable property, because the fitting among fragments is usually more difficult than among the pieces of a commercially produced jigsaw puzzle [40]. Some of the traditional algorithms try to group similar pieces to reduce the number of fitting tests [5], [20], [15], [43]. However, these approaches require significant preprocessing.
On the other hand, the processing rules of the SNN for the jigsaw puzzle could include classical similarity metrics such as the concavity and convexity of piece borders. Used together with the local information discrimination, these metrics can significantly reduce the number of fitting tests needed for the SNN to find the puzzle solution. We would like to emphasize that the proposed paradigm has a wider use beyond the context of jigsaw puzzles. There is large flexibility in implementing the core concepts of the SNN, so these networks can be adapted to solve different problems
Fig. 10. Comparison between the performance of the SNN in monosynaptic information propagation mode (only one input channel per iteration and neuron) and in multisynaptic propagation mode (each neuron receives four input messages in parallel). Top panel: performance in terms of the mean number of iterations needed to solve the puzzle. Bottom panel: performance in terms of the mean number of fitting tests. In all cases the size of the local informational context is equal to 10%. All measures are calculated by solving 100 different puzzles for each border size. The large number of iterations for puzzles of 30 × 30 pieces is due to the small informational context for this size. Note that this effect is reduced in the multisynaptic mode.
that can benefit from local information discrimination. Depending on the specific problem, good performance may be required in terms of the number of iterations, the computational cost, or both. In many cases the SNN can provide a good balance between performance and computational cost. A straightforward application of SNNs is multidimensional sorting when the order in a particular dimension can be independent of the order in other dimensions, or when there is no global sorting criterion in any dimension. The local discrimination of the SNN can contribute to provide an
efficient solution to these problems once the right balance between cost and performance is found (through the specification of the size of the local informational context). Areas of application for this kind of sorting that are likely to benefit from the SNN approach are scheduling, planning, and optimization [2], [7]. Note that the SNN uses a self-organizing strategy that includes unsupervised learning as a function of the local discrimination. In addition, SNNs allow for a new set of learning rules that can include not only the modification of the connections, but also of the parameters that affect the local discrimination. Subcellular plasticity is also a characteristic that has recently been studied in the nervous system [9]. In the introduction we mentioned that the uniformity of neurons has facilitated the mathematical formulation of many ANN paradigms. Local discrimination makes a compact formalization of the SNN paradigm harder to achieve, since such a formalization depends on the specific problem that the network is trying to solve. This does not mean that the concepts underlying the SNN strategy cannot be used to extend classical ANNs. For example, in particular applications, we can consider having a different set of transfer functions for each unit, and make the selection of the specific function depend on the state of a local informational context. This strategy can combine synaptic and intra-unit learning paradigms and help achieve multifunctionality in the network. There is a growing number of new results on the strategies of information processing in living neural systems [29]. Beyond the specific results reported in this paper, the use of novel bio-inspired information processing strategies can contribute to a new generation of neural networks with an enhanced capacity to perform a given task.
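The suggested extension of classical ANNs, a unit that selects among several transfer functions according to the state of its local informational context, can be sketched as follows. This is purely illustrative: the two functions, the selection rule, and all names are our own assumptions, not a construction from the paper.

```python
import math

# Two candidate transfer functions a unit might carry (assumed examples).
TRANSFER = {
    "sparse": lambda x: max(0.0, x),                # rectifier-like
    "dense": lambda x: 1.0 / (1.0 + math.exp(-x)),  # sigmoid
}

def activate(x, context, threshold=3):
    """Pick a transfer function based on the local informational
    context: a richly filled context switches the unit to the
    'dense' (sigmoid) response (hypothetical selection rule)."""
    mode = "dense" if len(context) >= threshold else "sparse"
    return TRANSFER[mode](x)
```

Under this sketch, the same unit behaves differently as its context fills up, which is one concrete way intra-unit state could drive multifunctionality.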
Appendix
Multidimensional Sorting Example

To illustrate the strategy of SNNs for solving multidimensional sorting problems, and to explain in detail the evolution of the SNN implementation described in Section III-B, let us consider the multidimensional sorting problem presented in Fig. 11. For simplicity, we describe the bidimensional case first. Here, the final goal is to sort 12 elements horizontally and vertically with an SNN into the order shown in panel (a). The order criterion is given by the compatibility between the colors displayed in this panel. In this SNN example [Fig. 11(b)], the neuron signature is the number of each neuron in the network. The neuron data are the elements to sort, illustrated by the different colors. For simplicity, we choose Ncontext = 3, P = 100%, and T = 10. There is no global order criterion, so two neurons are compatible if their corresponding blocks are adjacent in Fig. 11(a). We consider a multisynaptic information propagation mode with a maximum of two active output channels per iteration. Taking this into account, we describe below the evolution of the SNN presented in Fig. 11(b) during the first two iterations. To follow this example, we recommend that the reader keep in mind the definitions of Section II-A and Fig. 12.
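The setup just described (12 neurons with signatures, randomly assigned data blocks, and a toroidal grid with four nearest neighbors) can be sketched in code. The grid dimensions, dictionary layout, and function names are assumptions of this sketch, not the paper's notation.

```python
import random

# Sketch of the example's initialization: 12 neurons on a 3x4 toroidal
# grid, each with a signature ("N01", "N02", ...), a randomly assigned
# data block, and a local informational context initialized with the
# top, right, and bottom neighbors (neighborhood context initialization).
ROWS, COLS = 3, 4
N_CONTEXT = 3

def neighbors(r, c):
    """Top, right, bottom, and left neighbors with periodic boundaries."""
    return {
        "top": ((r - 1) % ROWS, c),
        "right": (r, (c + 1) % COLS),
        "bottom": ((r + 1) % ROWS, c),
        "left": (r, (c - 1) % COLS),
    }

def init_network(blocks):
    random.shuffle(blocks)  # elements are randomly assigned to neurons
    net = {}
    for r in range(ROWS):
        for c in range(COLS):
            idx = r * COLS + c
            net[(r, c)] = {
                "signature": f"N{idx + 1:02d}",
                "data": blocks[idx],
                # context initialization: top, right, and bottom neighbors
                "context": [neighbors(r, c)[d] for d in ("top", "right", "bottom")],
            }
    return net

net = init_network(list(range(12)))
```

This mirrors steps 1)-3) of the initialization described in Section A of this Appendix; the rest of the algorithm then evolves the connectivity, not the grid positions themselves.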
Fig. 11. Basic example of multidimensional sorting. (a) The correct order of the elements to sort in a 2-D problem, represented by the colors assigned to the blocks. (b) Initialization of the SNN used to sort the elements of panel (a) (elements are randomly assigned). Each neuron of the network has a signature that identifies it and some information needed to solve the problem. In the example, the signature of each neuron is its order number (N01, N02, N03, and so on) and the data are the blocks to sort. (c) Architecture of the SNN when the solution is reached. (d) 3-D generalization of the problem; in this case, compatibility is represented by adjacent sides with the same color.
A. Initialization of the SNN

The initialization of the network consists of the following.
1) Neuron data are initialized by randomly assigning an element to each neuron.
2) The initial structure of the network is 2-D, with each cell connected to its four nearest neighbors. The SNN has periodic boundary conditions: each neuron on a border is connected to the neuron on the opposite side.
3) For each neuron ni of the network: a) initialize Context(ni) with the top, right, and bottom neighbors of ni (not shown).

B. Discrimination and Processing Rules

After the initialization, neurons build and send initial messages, and the SNN starts to search for the solution (see Fig. 12). For this task, neurons follow the algorithm described in Section III-B. This example uses the following local discrimination rules.
1) During the process synaptic input phase, in the second discrimination stage [step e) of the algorithm], the neurons to be processed are selected from the incoming messages. To simplify the description, we assume here that active channels are sampled one neuron information at a time in clockwise order (top channel first, right second, bottom third, and left fourth). For example, consider neuron N01 in the left panel of Fig. 12: it receives the message N10−N05−N09 from N09 (top channel) and the message N08−N12−N04 from N04 (left channel). The right and bottom channels are not active in this case. As the signatures of N09 and N04 are not recognized, both messages pass to the second discrimination stage. To select the neuron
LATORRE et al.: SIGNATURE NEURAL NETWORKS: DEFINITION AND APPLICATION TO MULTIDIMENSIONAL SORTING PROBLEMS
Fig. 12. Evolution of the SNN for the sorting problem illustrated in Fig. 11. Left panel: neighborhood, local informational context, and output messages of each neuron of the SNN shown in the right panel of Fig. 11 at the end of iteration 1. The details of the processing and discrimination rules are described in the text. In this example, Ncontext = 3 and the context initialization is a neighborhood context initialization (initially, all the local informational contexts contain the corresponding top, right, and bottom neuron informations; for example, for N01 the initial context corresponds to neurons N09, N02, and N05). After the initialization, the first messages are built and sent (not shown here). In our case, N01 sends messages to N09 and N04; N02 to N10 and N03; N03 to N11 and N02; N04 to N01 and N08; N05 to N06 and N08; N06 to N10 and N05; N07 to N11; N08 to N04 and N05; N09 to N01 and N12; N10 to N06 and N02; N11 to N07 and N03; and N12 to N09. During the process synaptic input phase of iteration 1, all incoming messages pass to the second discrimination stage. Each neuron then selects the set of neuron informations to process following the rules described in the text. The selected set is shown below each neuron, and this will be the local informational context for the next iteration. When a neuron starts recognizing a signature, the corresponding cell is shown filled in green. If the cells are also in their correct positions, the connections between them are green and solid instead of red and dotted. Grey filled neurons have a new connection established during the restore neighborhood phase. Arrows denote the output channels activated during the propagate information phase. Right panel: evolution of the SNN in iteration 2. The differences with respect to the network in the left panel are the yellow filled neurons: these are cells moved to their correct position because a receptor receives a message from an emitter whose signature is recognized (and the emitter is not in its correct position).
informations to process from the selected messages, N01 initially chooses N09 (first neuron information from the top channel), N04 (first neuron information from the left channel), and N05 (second neuron information from the top channel). N09 and N05 are discarded because they belong to the local informational context of N01 [remember that at the beginning of this iteration the context of N01 is the one built during the context initialization: N09, N02, and N05; see Fig. 11(a)]. Then, as Ncontext = 3 in this example, N01 chooses N12 (second neuron information from the left channel) and N10 (third neuron information from the top channel), to finally process N04, N12, and N10. 2) If a receptor starts recognizing a signature in a given iteration, it sends a message to the corresponding emitter. As there is a maximum number of active channels, these messages have a higher priority during the information propagation phase [step a) of the algorithm]. In the example, N01 starts recognizing the signature of N04 in iteration 1 (left panel of Fig. 12) and therefore sends a message to N04. The rest of the output channels are activated randomly, taking into account that communication is bidirectional and that the number of active channels is bounded. 3) During the information propagation phase, the emitter does not include information about the receptor in the output message. For example, let us consider neuron N01, whose context is N04, N12, and N10 at the end of
iteration 1 (shown below the neuron in Fig. 12). N01 sends to neuron N04 the message N10−N12−N01 instead of N12−N04−N01.

C. Evolution of the SNN

1) Iteration 1: The left panel of Fig. 12 illustrates the evolution of the SNN presented in Fig. 11(b) during the first iteration of the algorithm. The first step of the process synaptic input phase consists of the selection of the messages to process. In iteration 1 there are no messages received from an emitter with a recognized signature, because no neuron recognizes any signature yet. Therefore, all messages pass to the second stage of the discrimination. In this stage, each neuron selects the set of neuron informations to be processed. In the figure, this set is shown below each neuron, as it becomes the local informational context for the next iteration. The neurons use this set to find their adjacent blocks (a compatible neuron in a given dimension). In this way, neuron N01 starts recognizing signature N04, and neuron N08 signature N01. In the first case, N04 belongs to the neighborhood of N01 and a reconfiguration of the network is not needed. On the other hand, N01 does not belong to the neighborhood of N08. This implies a network reconfiguration (self-organization of the SNN) before N08 starts to recognize the signature. As a consequence of this reconfiguration, N08 and N01 are now neighbors, the connections between N08 and N07 and between N01 and N02 are broken, and N07 stops having a right neighbor
and N02 a left neighbor. This reconfiguration mechanism is repeated for the rest of the neurons as the algorithm evolves. Finally, during the restore neighborhood phase, neurons with an incomplete neighborhood try to connect with neurons of their local informational context in this same status, to maximize the information flow in the network. For example, at the end of iteration 1, as N04 is included in the local informational context of N02, N02 does not have a bottom neighbor, and N04 does not have a top neighbor, a new connection is established between these cells. Note that although N02 and N09 could be connected to complete their neighborhoods (N02 is included in the local context of N09), they are not connected because they are already neighbors. Note also that the neighborhood restoration takes place after the local informational context is updated. 2) Iteration 2: The second iteration starts from the situation shown in the left panel of Fig. 12. The mechanism used to process the incoming messages in the receptor is analogous to the mechanism described for iteration 1, with only one difference. Now, N01 receives a message from N04, an emitter whose signature is recognized but which is not in its correct position. Then, N01 only processes this message. Before processing the message, as N04 is not in its correct position, a reconfiguration of the network takes place to move N04 on top of N01 (right panel of Fig. 12). At this point, N04 does not recognize the signature of N01 yet. Later, when N04 processes its input message, it will start recognizing it. During the process synaptic input phase, N04 processes, in this order, the neuron information of cells N01, N03, and N12. The processing of the neuron information of N01 implies that N04 starts recognizing signature N01. As N01 belongs to the neighborhood of N04, a reconfiguration is not needed. The same occurs when the neuron information of N03 is processed.
Finally, the processing of the neuron information of N12 also implies that N04 starts recognizing signature N12. But now, N12 does not belong to the neighborhood of N04. When a new synapse is established to set N12 in its correct position, the connection between N03 and N04 must be broken. As a consequence, N04 stops recognizing signature N03, and N03 stops recognizing N04. In this simple example, in just two iterations most of the SNN neurons have reached the local solution [compare to Fig. 11(c)], and only two more iterations are needed to reach the global solution to the problem (not shown here). Note that the corresponding problem in three dimensions [illustrated in Fig. 11(d)] or more only requires repeating step f) of the algorithm described in Section III-B.
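Two ingredients of the walk-through above can be condensed into a short sketch: the compatibility test (blocks adjacent in the target arrangement, since there is no global order criterion) and the second discrimination stage (keep up to Ncontext neuron informations not already in the local context). The target positions and data structures are assumptions of this sketch.

```python
# Condensed sketch of two SNN ingredients from the example above.

# Assumed target arrangement: block b sits at position POS[b] on a grid
# (positions invented for illustration; Fig. 11(a) defines the real ones).
COLS = 4
POS = {b: (b // COLS, b % COLS) for b in range(12)}

def compatible(block_a, block_b):
    """Two neurons are compatible if their blocks are adjacent in the
    target arrangement (no global order criterion)."""
    (ra, ca), (rb, cb) = POS[block_a], POS[block_b]
    return abs(ra - rb) + abs(ca - cb) == 1

def update_context(context, incoming, n_context=3):
    """Second discrimination stage: keep only neuron informations not
    already in the local informational context, up to n_context of them."""
    return [n for n in incoming if n not in context][:n_context]

# Reproducing the N01 case: initial context {N09, N02, N05}; candidates
# arrive channel by channel as N09, N04, N05, N12, N10; N09 and N05 are
# discarded, so N01 ends up processing N04, N12, and N10.
ctx = update_context(["N09", "N02", "N05"],
                     ["N09", "N04", "N05", "N12", "N10"])
```

The resulting set `ctx` matches the local informational context shown below N01 in the left panel of Fig. 12.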
References

[1] M. Anthony, "On the generalization error of fixed combinations of classifiers," J. Comput. Syst. Sci., vol. 73, no. 5, pp. 725–734, 2007.
[2] W. G. Aref and I. Kamel, "On multi-dimensional sorting orders," in Proc. 11th Int. Conf. Database Expert Syst. Applicat., vol. 1873, 2000, pp. 774–783.
[3] P. Auer, H. Burgsteiner, and W. Maass, "A learning rule for very simple universal approximators consisting of a single layer of perceptrons," Neural Netw., vol. 21, no. 5, pp. 786–795, 2008.
[4] N. Ayache and O. D. Faugeras, "HYPER: A new approach for the recognition and positioning of two-dimensional objects," IEEE Trans. Pattern Anal. Mach. Intell., vol. 8, no. 1, pp. 44–54, Jan. 1986.
[5] H. Bunke and G. Kaufmann, "Jigsaw puzzle solving using approximate string matching and best-first search," in Proc. 5th Int. Conf. CAIP, 1993, pp. 299–308.
[6] B. Burdea and H. Wolfson, "Solving jigsaw puzzles by a robot," IEEE Trans. Robot. Autom., vol. 5, no. 6, pp. 752–764, Dec. 1989.
[7] O. Catoni, "Solving scheduling problems by simulated annealing," SIAM J. Control Optim., vol. 36, no. 5, pp. 1539–1575, 1998.
[8] M. Chung, M. Fleck, and D. Forsyth, "Jigsaw puzzle solver using shape and color," in Proc. 4th ICSP, 1998, pp. 877–880.
[9] G. W. Davis, "Homeostatic control of neural activity: From phenomenology to molecular design," Annu. Rev. Neurosci., vol. 29, no. 1, pp. 307–323, 2006.
[10] J. De Bock, R. De Smet, W. Philips, and J. D'Haeyer, "Constructing the topological solution of jigsaw puzzles," in Proc. ICSP, vol. 3, 2004, pp. 2127–2130.
[11] E. D. Demaine and M. L. Demaine, "Jigsaw puzzles, edge matching, and polyomino packing: Connections and complexity," Graph. Comb., vol. 23, no. 1, pp. 195–208, 2007.
[12] E.-J. Farn and C.-C. Chen, "Novel steganographic method based on jig swap puzzle images," J. Electron. Imag., vol. 18, no. 1, p. 013003, 2009.
[13] J. Fort, "SOM's mathematics," Neural Netw., vol. 19, nos. 6–7, pp. 812–816, 2006.
[14] H. Freeman and L. Gardner, "Apictorial jigsaw puzzles: A computer solution to a problem in pattern recognition," IEEE Trans. Electron. Comput., vol. EC-13, no. 2, pp. 118–127, 1964.
[15] D. Goldberg, C. Malon, and M. Bern, "A global approach to automatic solution of jigsaw puzzles," Comput. Geom., vol. 28, nos. 2–3, pp. 165–174, 2004.
[16] J. E. Goodman and R. Pollack, "Multidimensional sorting," SIAM J. Comput., vol. 12, no. 3, pp. 484–507, 1983.
[17] R. Ilin, R. Kozma, and P. Werbos, "Beyond feedforward models trained by backpropagation: A practical training tool for a more efficient universal approximator," IEEE Trans. Neural Netw., vol. 19, no. 6, pp. 929–937, Jun. 2008.
[18] E. Kishon, T. Hastie, and H.
Wolfson, "3-D curve matching using splines," in Proc. 1st Eur. Conf. Comput. Vision, 1990, pp. 589–591.
[19] E. Kishon and H. Wolfson, "3-D curve matching," in Proc. AAAI Workshop Spatial Reasoning Multi-Sensor Fusion, 1987, pp. 250–261.
[20] W. Kong and B. Kimia, "On solving 2-D and 3-D puzzles using curve matching," in Proc. IEEE Comput. Vision Patt. Recog., vol. 2, 2001, pp. II-583–II-590.
[21] D. A. Kosiba, P. M. Devaux, S. Balasubramanian, T. L. Gandhi, and R. Kasturi, "An automatic jigsaw puzzle solver," in Proc. 12th IAPR Int. Conf. Patt. Recog., 1994, pp. 616–618.
[22] R. Latorre, F. B. Rodríguez, and P. Varona, "Effect of individual spiking activity on rhythm generation of central pattern generators," Neurocomputing, vols. 58–60, pp. 535–540, Jun. 2004.
[23] R. Latorre, F. B. Rodríguez, and P. Varona, "Neural signatures: Multiple coding in spiking-bursting cells," Biol. Cybern., vol. 95, no. 2, pp. 169–183, 2006.
[24] R. Latorre, F. B. Rodríguez, and P. Varona, "Reaction to neural signatures through excitatory synapses in central pattern generator models," Neurocomputing, vol. 70, pp. 1797–1801, Jun. 2007.
[25] H. C. G. Leitao and J. Stolfi, "A multiscale method for the reassembly of two-dimensional fragmented objects," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1239–1251, Sep. 2002.
[26] M. Levison, "The siting of fragments," Comput. J., vol. 7, no. 4, pp. 275–277, 1965.
[27] K. Nagura, K. Sato, H. Maekawa, T. Morita, and K. Fujii, "Partial contour processing using curvature function-assembly of jigsaw puzzles and recognition of moving figures," Syst. Comput., vol. 2, pp. 30–39, 1986.
[28] T. R. Nielsen, P. Drewsen, and K. Hansen, "Solving jigsaw puzzles using image features," Patt. Recogn. Lett., vol. 29, no. 14, pp. 1924–1933, 2008.
[29] M. I. Rabinovich, P. Varona, A. I. Selverston, and H. D. I. Abarbanel, "Dynamical principles in neuroscience," Rev. Mod. Phys., vol. 78, no. 4, pp. 1213–1265, 2006.
[30] G. Radack and N.
Badler, "Jigsaw puzzle matching using a boundary-centered polar encoding," Comput. Graphics Image Process., vol. 19, no. 1, pp. 1–17, May 1982.
[31] J. T. Schwartz and M. Sharir, "Identification of partially obscured objects in two and three dimensions by matching noisy characteristic curves," Int. J. Robotics Res., vol. 6, no. 2, pp. 29–44, Jun. 1987.
[32] P. N. Suganthan, "Solving jigsaw puzzles using Hopfield network," in Proc. Int. Conf. Neural Netw., Jul. 1999, pp. 10–16.
[33] A. Szücs, H. D. I. Abarbanel, M. I. Rabinovich, and A. I. Selverston, "Dopamine modulation of spike dynamics in bursting neurons," Eur. J. Neurosci., vol. 21, no. 3, pp. 763–772, Feb. 2005.
[34] A. Szücs, R. D. Pinto, M. I. Rabinovich, H. D. I. Abarbanel, and A. I. Selverston, "Synaptic modulation of the interspike interval signatures of bursting pyloric neurons," J. Neurophysiol., vol. 89, no. 3, pp. 1363–1377, Mar. 2003.
[35] F. Toyama, Y. Fujiki, K. Shoji, and J. Miyamichi, "Assembly of puzzles using a genetic algorithm," in Proc. 16th Int. Conf. Patt. Recog., vol. 4, 2002, pp. 389–392.
[36] S. Trenn, "Multilayer perceptrons: Approximation order and necessary number of hidden units," IEEE Trans. Neural Netw., vol. 19, no. 5, pp. 836–844, May 2008.
[37] G. Üçoluk and I. Toroslu, "Automatic reconstruction of broken 3-D surface objects," Comput. Graphics, vol. 23, no. 4, pp. 573–582, 1999.
[38] R. W. Webster, P. S. LaFollette, and R. L. Stafford, "Isthmus critical points for solving jigsaw puzzles in computer vision," IEEE Trans. Syst., Man, Cybern., vol. 21, no. 5, pp. 1271–1278, 1991.
[39] D. A. White and A. Sofge, Eds., Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. New York: Van Nostrand Reinhold, 1992.
[40] A. Willis and D. Cooper, "Computational reconstruction of ancient artifacts," IEEE Signal Process. Mag., vol. 25, no. 4, pp. 65–83, Jul. 2008.
[41] H. Wolfson, "On curve matching," IEEE Trans. Patt. Anal. Mach. Intell., vol. 12, no. 5, pp. 483–489, May 1990.
[42] H. Wolfson, E. Schonberg, A. Kalvin, and Y. Lamdan, "Solving jigsaw puzzles by computer," Ann. Oper. Res., vol. 12, nos. 1–4, pp. 51–64, Feb. 1988.
[43] F.-H. Yao and G.-F. Shao, "A shape and image merging technique to solve jigsaw puzzles," Patt. Recogn. Lett., vol. 24, no. 12, pp. 1819–1835, 2003.
[44] A. Zaritsky and M. Sipper, "The preservation of favored building blocks in the struggle for fitness: The puzzle algorithm," IEEE Trans. Evol. Comput., vol. 8, no. 5, pp. 443–455, Oct. 2004.
[45] Y.-X. Zhao, M.-C. Su, Z.-L. Chou, and J. Lee, "A puzzle solver and its application in speech descrambling," in Proc. Annu. Conf. Int. Conf. Comput. Eng. Applicat., 2007, pp. 171–176.
Roberto Latorre received the B.S. degree in computer engineering and the Ph.D. degree in computer science and telecommunications from Universidad Autónoma de Madrid, Madrid, Spain, in 2000 and 2008, respectively. Since 2002, he has been a Profesor Asociado with the Escuela Politécnica Superior, Universidad Autónoma de Madrid. He has been a member of the Grupo de Neurocomputación Biológica, Escuela Politécnica Superior, since 2001. His research interests include different topics in neuroscience and neurocomputing, from the generation of motor patterns and information coding to pattern recognition and ANNs.
Francisco de Borja Rodríguez received the B.S. degree in applied physics and the Ph.D. degree in computer science from Universidad Autónoma de Madrid, Madrid, Spain, in 1992 and 1999, respectively. He then was with Nijmegen University, Nijmegen, The Netherlands, and the Institute for Nonlinear Science, University of California, San Diego. Since 2002, he has been a Profesor Titular with the Escuela Politécnica Superior, Universidad Autónoma de Madrid.
Pablo Varona received the B.S. degree in theoretical physics and the Ph.D. degree in computer science from Universidad Autónoma de Madrid, Madrid, Spain, in 1992 and 1997, respectively. He was a Post-Doctoral Fellow and later an Assistant Research Scientist with the Institute for Nonlinear Science, University of California, San Diego. Since 2002, he has been a Profesor Titular with the Escuela Politécnica Superior, Universidad Autónoma de Madrid.
Adaptive Dynamic Programming for Finite-Horizon Optimal Control of Discrete-Time Nonlinear Systems with ε-Error Bound Fei-Yue Wang, Fellow, IEEE, Ning Jin, Student Member, IEEE, Derong Liu, Fellow, IEEE, and Qinglai Wei
Abstract— In this paper, we study the finite-horizon optimal control problem for discrete-time nonlinear systems using the adaptive dynamic programming (ADP) approach. The idea is to use an iterative ADP algorithm to obtain the optimal control law which makes the performance index function close to the greatest lower bound of all performance indices within an ε-error bound. The optimal number of control steps can also be obtained by the proposed ADP algorithms. A convergence analysis of the proposed ADP algorithms in terms of performance index function and control policy is made. In order to facilitate the implementation of the iterative ADP algorithms, neural networks are used for approximating the performance index function, computing the optimal control policy, and modeling the nonlinear system. Finally, two simulation examples are employed to illustrate the applicability of the proposed method. Index Terms— Adaptive critic designs, adaptive dynamic programming, approximate dynamic programming, learning control, neural control, neural dynamic programming, optimal control, reinforcement learning.
I. INTRODUCTION

The optimal control problem of nonlinear systems has always been a key focus in the control field in the past several decades [1]–[15]. Traditional optimal control approaches are mostly implemented over an infinite time horizon [2], [5], [9], [11], [13], [16], [17]. However, most real-world systems need to be effectively controlled within a finite time horizon (finite-horizon for brief), e.g., stabilized or tracked to a desired trajectory in a finite duration of time. The design of finite-horizon optimal controllers faces a major obstacle in
Manuscript received April 16, 2010; revised August 20, 2010; accepted August 24, 2010. Date of publication September 27, 2010; date of current version January 4, 2011. This work was supported in part by the Natural Science Foundation (NSF) of China under Grant 60573078, Grant 60621001, Grant 60728307, Grant 60904037, Grant 60921061, and Grant 70890084, by the MOST 973 Project 2006CB705500 and Project 2006CB705506, by the Beijing Natural Science Foundation under Grant 4102061, and by the NSF under Grant ECS-0529292 and Grant ECCS-0621694. The acting Editor-in-Chief who handled the review of this paper was Frank L. Lewis.
F. Y. Wang and Q. Wei are with the Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]; [email protected]).
N. Jin is with the Department of Electrical and Computer Engineering, University of Illinois, Chicago, IL 60607 USA (e-mail: [email protected]).
D. Liu is with the Department of Electrical and Computer Engineering, University of Illinois, Chicago, IL 60607 USA. He is also with the Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNN.2010.2076370
comparison to the infinite-horizon one. An infinite-horizon optimal controller generally obtains an asymptotic result for the controlled system [9], [11]. That is, the system will not be stabilized or tracked until time reaches infinity, while for finite-horizon optimal control problems the system must be stabilized or tracked to a desired trajectory in a finite duration of time [1], [8], [12], [14], [15]. Furthermore, in the case of discrete-time systems, the determination of the number of optimal control steps is necessary for finite-horizon optimal control problems, while for infinite-horizon optimal control problems the number of optimal control steps is in general infinite. The finite-horizon control problem has been addressed by many researchers [18]–[23]. However, most of the existing methods consider only the stability of systems under finite-horizon controllers [18], [20], [22], [23]. Due to the lack of methodology, and because the number of control steps is difficult to determine, the optimal controller design for finite-horizon problems still presents a major challenge to control engineers. This motivates our present research. As is known, dynamic programming is very useful in solving optimal control problems. However, due to the "curse of dimensionality" [24], it is often computationally untenable to run dynamic programming to obtain the optimal solution. Adaptive/approximate dynamic programming (ADP) algorithms were proposed in [25] and [26] as a way to solve optimal control problems forward in time. There are several synonyms used for ADP, including "adaptive critic designs" [27]–[29], "adaptive dynamic programming" [30]–[32], "approximate dynamic programming" [26], [33]–[35], "neural dynamic programming" [36], "neuro-dynamic programming" [37], and "reinforcement learning" [38]. In recent years, ADP and related research have gained much attention from researchers [27], [28], [31], [33]–[36], [39]–[57].
In [29] and [26], ADP approaches were classified into several main schemes: heuristic dynamic programming (HDP), action-dependent HDP (ADHDP), also known as Q-learning [58], dual heuristic dynamic programming (DHP), action-dependent DHP (ADDHP), globalized DHP (GDHP), and action-dependent GDHP (ADGDHP). Saridis and Wang [10], [52], [59] studied the optimal control problem for a class of nonlinear stochastic systems and presented the corresponding Hamilton-Jacobi-Bellman (HJB) equation for stochastic control problems. Al-Tamimi et al. [27] proposed a greedy HDP iteration algorithm to solve the discrete-time HJB (DTHJB) equation of the optimal control problem for discrete-time nonlinear systems. Though great progress has been made for ADP in the optimal control
1045–9227/$26.00 © 2010 IEEE
WANG et al.: ADAPTIVE DYNAMIC PROGRAMMING FOR DISCRETE-TIME NONLINEAR SYSTEMS
field, most ADP methods are based on infinite horizon, such as [16], [27], [33], [36], [37], [43]–[45], [53], [56], and [57]. Only [60] and [61] discussed how to solve finite-horizon optimal control problems based on ADP and backpropagation-through-time algorithms. In this paper, we will develop a new ADP scheme for finite-horizon optimal control problems. We will study optimal control problems with an ε-error bound using ADP algorithms. First, the HJB equation for finite-horizon optimal control of discrete-time systems is derived. In order to solve this HJB equation, a new iterative ADP algorithm is developed with convergence and optimality proofs. Second, the difficulties of obtaining the optimal solution using the iterative ADP algorithm are presented, and then the ε-optimal control algorithm is derived based on the iterative ADP algorithms. Next, it will be shown that the ε-optimal control algorithm can obtain suboptimal control solutions within a fixed finite number of control steps that make the performance index function converge to its optimal value with an ε-error. Furthermore, in order to facilitate the implementation of the iterative ADP algorithms, we use neural networks to obtain the iterative performance index function and the optimal control policy. Finally, an ε-optimal state feedback controller is obtained for finite-horizon optimal control problems.

This paper is organized as follows. In Section II, the problem statement is presented. In Section III, the iterative ADP algorithm for the finite-horizon optimal control problem is derived; the convergence and optimality properties are also proved in this section. In Section IV, the ε-optimal control algorithm is developed and its properties are proved. In Section V, two examples are given to demonstrate the effectiveness of the proposed control scheme. Finally, in Section VI, the conclusion is drawn.

II.
PROBLEM STATEMENT

In this paper, we will study deterministic discrete-time systems

$x_{k+1} = F(x_k, u_k), \quad k = 0, 1, 2, \ldots$   (1)

where $x_k \in \mathbb{R}^n$ is the state and $u_k \in \mathbb{R}^m$ is the control vector. Let $x_0$ be the initial state. The system function $F(x_k, u_k)$ is continuous for all $x_k, u_k$, and $F(0, 0) = 0$. Hence, $x = 0$ is an equilibrium state of system (1) under the control $u = 0$. The performance index function for state $x_0$ under the control sequence $u_0^{N-1} = (u_0, u_1, \ldots, u_{N-1})$ is defined as

$J(x_0, u_0^{N-1}) = \sum_{i=0}^{N-1} U(x_i, u_i)$   (2)

where $U$ is the utility function, $U(0, 0) = 0$, and $U(x_i, u_i) \ge 0$ for all $x_i, u_i$. The sequence $u_0^{N-1}$ defined above is a finite sequence of controls. Using this sequence of controls, system (1) gives a trajectory starting from $x_0$: $x_1 = F(x_0, u_0)$, $x_2 = F(x_1, u_1)$, $\ldots$, $x_N = F(x_{N-1}, u_{N-1})$. We call the number of elements in the control sequence $u_0^{N-1}$ the length of $u_0^{N-1}$ and denote it as $|u_0^{N-1}|$. Then, $|u_0^{N-1}| = N$. The length of the associated trajectory $x_0^N = (x_0, x_1, \ldots, x_N)$
is $N + 1$. We denote the final state of the trajectory as $x^{(f)}(x_0, u_0^{N-1})$, i.e., $x^{(f)}(x_0, u_0^{N-1}) = x_N$. Then, for all $k \ge 0$, a finite control sequence starting at $k$ can be written as $u_k^{k+i-1} = (u_k, u_{k+1}, \ldots, u_{k+i-1})$, where $i \ge 1$ is the length of the control sequence. The final state can be written as $x^{(f)}(x_k, u_k^{k+i-1}) = x_{k+i}$.

We note that the performance index function defined in (2) does not have a term associated with the final state, since in this paper we specify the final state $x_N = F(x_{N-1}, u_{N-1})$ to be at the origin, i.e., $x_N = x^{(f)} = 0$. For the present finite-horizon optimal control problems, the feedback controller $u_k = u(x_k)$ must not only drive the system state to zero within a finite number of time steps but also guarantee the performance index function (2) to be finite, i.e., $u_k^{N-1} = (u(x_k), u(x_{k+1}), \ldots, u(x_{N-1}))$ must be a finite-horizon admissible control sequence, where $N > k$ is a finite integer.

Definition 2.1: A control sequence $u_k^{N-1}$ is said to be finite-horizon admissible for a state $x_k \in \mathbb{R}^n$ if $x^{(f)}(x_k, u_k^{N-1}) = 0$ and $J(x_k, u_k^{N-1})$ is finite, where $N > k$ is a finite integer. A state $x_k$ is said to be finite-horizon controllable (controllable for brief) if there is a finite-horizon admissible control sequence associated with this state.

Let $\underline{u}_k$ be an arbitrary finite-horizon admissible control sequence starting at $k$ and let $\mathcal{A}_{x_k} = \{\underline{u}_k : x^{(f)}(x_k, \underline{u}_k) = 0\}$ be the set of all finite-horizon admissible control sequences of $x_k$. Let

$\mathcal{A}_{x_k}^{(i)} = \{u_k^{k+i-1} : x^{(f)}(x_k, u_k^{k+i-1}) = 0, \, |u_k^{k+i-1}| = i\}$

be the set of all finite-horizon admissible control sequences of $x_k$ with length $i$. Then, $\mathcal{A}_{x_k} = \bigcup_{1 \le i < \infty} \mathcal{A}_{x_k}^{(i)}$. By this notation, a state $x_k$ is controllable if and only if $\mathcal{A}_{x_k} \ne \emptyset$.

For any given system state $x_k$, the objective of the present finite-horizon optimal control problem is to find a finite-horizon admissible control sequence $u_k^{N-1} \in \mathcal{A}_{x_k}^{(N-k)} \subseteq \mathcal{A}_{x_k}$ to minimize the performance index $J(x_k, u_k^{N-1})$. The control sequence $u_k^{N-1}$ has finite length. However, before it is determined, we do not know its length, which means that the length $|u_k^{N-1}| = N - k$ is unspecified. This kind of optimal control problem has been called a finite-horizon problem with unspecified terminal time [1] (but in the present case, with fixed terminal state $x^{(f)} = 0$). Define the optimal performance index function as

$J^*(x_k) = \inf_{\underline{u}_k} \{ J(x_k, \underline{u}_k) : \underline{u}_k \in \mathcal{A}_{x_k} \}$   (3)

Then, according to Bellman's principle of optimality [24], $J^*(x_k)$ satisfies the DTHJB equation

$J^*(x_k) = \min_{u_k} \{ U(x_k, u_k) + J^*(F(x_k, u_k)) \}$   (4)
Now, define the law of optimal control sequence starting at k by u ∗ (x k ) = arg inf J (x k , u k ) : u k ∈ Axk uk
and define the law of optimal control vector by u ∗ (x k ) = arg min U (x k , u k ) + J ∗ (F(x k , u k )) . uk
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
In other words, $\underline{u}^*(x_k) = \underline{u}_k^*$ and $u^*(x_k) = u_k^*$. Hence, we have $J^*(x_k) = U(x_k, u_k^*) + J^*(F(x_k, u_k^*))$.

III. PROPERTIES OF THE ITERATIVE ADP ALGORITHM

In this section, a new iterative ADP algorithm is developed to obtain the finite-horizon optimal controller for nonlinear systems. The goal of the present iterative ADP algorithm is to construct an optimal control policy $u^*(x_k)$, $k = 0, 1, \ldots$, which drives the system from an arbitrary initial state $x_0$ to the singularity 0 within finite time, and simultaneously minimizes the performance index function. Convergence proofs will also be given to show that the performance index function will indeed converge to the optimum.

A. Derivation

We first consider the case where, for any state $x_k$, there exists a control vector $u_k$ such that $F(x_k, u_k) = 0$, i.e., we can control the state of system (1) to zero in one step from any initial state. The case where $F(x_k, u_k) = 0$ does not hold will be discussed and solved later in this paper. In the iterative ADP algorithm, the performance index function and control policy are updated by recursive iterations, with the iteration index number $i$ increasing from 0 and with the initial performance index function $V_0(x) = 0$ for $\forall x \in \mathbb{R}^n$. The performance index function for $i = 1$ is computed as
$$V_1(x_k) = \min_{u_k} \{ U(x_k, u_k) + V_0(F(x_k, u_k)) \} \quad \text{s.t. } F(x_k, u_k) = 0 = U(x_k, u_k^*(x_k)) \qquad (5)$$
where $V_0(F(x_k, u_k)) = 0$ and $F(x_k, u_k^*(x_k)) = 0$. The control vector $v_1(x_k)$ for $i = 1$ is chosen as $v_1(x_k) = u_k^*(x_k)$. Therefore, (5) can also be written as
$$V_1(x_k) = \min_{u_k} U(x_k, u_k) \quad \text{s.t. } F(x_k, u_k) = 0 = U(x_k, v_1(x_k)) \qquad (6)$$
where
$$v_1(x_k) = \arg \min_{u_k} U(x_k, u_k) \quad \text{s.t. } F(x_k, u_k) = 0. \qquad (7)$$
For $i = 2, 3, 4, \ldots$, the iterative ADP algorithm is implemented as
$$V_i(x_k) = \min_{u_k} \{ U(x_k, u_k) + V_{i-1}(F(x_k, u_k)) \} = U(x_k, v_i(x_k)) + V_{i-1}(F(x_k, v_i(x_k))) \qquad (8)$$
where
$$v_i(x_k) = \arg \min_{u_k} \{ U(x_k, u_k) + V_{i-1}(x_{k+1}) \} = \arg \min_{u_k} \{ U(x_k, u_k) + V_{i-1}(F(x_k, u_k)) \}. \qquad (9)$$
Equations (6)–(9) form the iterative ADP algorithm. In the above, we can see that the performance index function $J^*(x_k)$ solved by the HJB equation (4) is replaced by a sequence of iterative performance index functions $V_i(x_k)$, and the optimal control law $u^*(x_k)$ is replaced by a sequence of iterative control laws $v_i(x_k)$, where $i \ge 1$ is the index of iteration. We can prove that $J^*(x_k)$ defined in (3) is the limit of $V_i(x_k)$ as $i \to \infty$.

Remark 3.1: Equations (6)–(9) in the iterative ADP algorithm are similar to the HJB equation (4), but they are not the same. There are at least two obvious differences. 1) For any finite time $k$, if $x_k$ is the state at $k$, then the optimal performance index function in the HJB equation (4) is unique, i.e., $J^*(x_k)$, while in the iterative ADP equations (6)–(9) the performance index function is different for each iteration index $i$, i.e., $V_i(x_k) \neq V_j(x_k)$ for $\forall i \neq j$, in general. 2) For any finite time $k$, if $x_k$ is the state at $k$, then the optimal control law obtained by the HJB equation (4) possesses the unique optimal control expression, i.e., $u_k^* = u^*(x_k)$, while the control laws solved by the iterative ADP algorithm (6)–(9) differ from each other for each iteration index $i$, i.e., $v_i(x_k) \neq v_j(x_k)$ for $\forall i \neq j$, in general.

Remark 3.2: According to (2) and (8), we have
$$V_{i+1}(x_k) = \min_{u_k^{k+i}} \{ J(x_k, u_k^{k+i}) : u_k^{k+i} \in A_{x_k}^{(i+1)} \}.$$
Since
$$V_{i+1}(x_k) = \min_{u_k} \{ U(x_k, u_k) + V_i(x_{k+1}) \} = \min_{u_k} \Big\{ U(x_k, u_k) + \min_{u_{k+1}} \big\{ U(x_{k+1}, u_{k+1}) + \min_{u_{k+2}} \{ U(x_{k+2}, u_{k+2}) + \cdots + \min_{u_{k+i-1}} \{ U(x_{k+i-1}, u_{k+i-1}) + V_1(x_{k+i}) \} \cdots \} \big\} \Big\}$$
where
$$V_1(x_{k+i}) = \min_{u_{k+i}} U(x_{k+i}, u_{k+i}) \quad \text{s.t. } F(x_{k+i}, u_{k+i}) = 0$$
we obtain
$$V_{i+1}(x_k) = \min_{u_k^{k+i}} \{ U(x_k, u_k) + U(x_{k+1}, u_{k+1}) + \cdots + U(x_{k+i}, u_{k+i}) \} \quad \text{s.t. } F(x_{k+i}, u_{k+i}) = 0 = \min_{u_k^{k+i}} \{ J(x_k, u_k^{k+i}) : u_k^{k+i} \in A_{x_k}^{(i+1)} \}. \qquad (10)$$
Using the notation in (9), we can also write
$$V_{i+1}(x_k) = \sum_{j=0}^{i} U(x_{k+j}, v_{i+1-j}(x_{k+j})). \qquad (11)$$

B. Properties

Theorem 3.1: Let $x_k$ be an arbitrary state vector. Suppose that $A_{x_k}^{(1)} \neq \emptyset$. Then, the performance index function $V_i(x_k)$
WANG et al.: ADAPTIVE DYNAMIC PROGRAMMING FOR DISCRETE-TIME NONLINEAR SYSTEMS
obtained by (6)–(9) is a monotonically nonincreasing sequence for $\forall i \ge 1$, i.e., $V_{i+1}(x_k) \le V_i(x_k)$ for $\forall i \ge 1$.

Proof: We prove this by mathematical induction. First, we let $i = 1$. Then, we have $V_1(x_k)$ given as in (6), and the finite-horizon admissible control sequence is $\hat{u}_k^k = (v_1(x_k))$. Next, we show that there exists a finite-horizon admissible control sequence $\hat{u}_k^{k+1}$ with length 2 such that $J(x_k, \hat{u}_k^{k+1}) = V_1(x_k)$. The trajectory starting from $x_k$ under the control of $\hat{u}_k^k = (v_1(x_k))$ is $x_{k+1} = F(x_k, v_1(x_k)) = 0$. Then, we create a new control sequence $\hat{u}_k^{k+1}$ by adding a 0 to the end of the sequence $\hat{u}_k^k$ to obtain $\hat{u}_k^{k+1} = (\hat{u}_k^k, 0)$. Obviously, $|\hat{u}_k^{k+1}| = 2$. The state trajectory under the control of $\hat{u}_k^{k+1}$ is $x_{k+1} = F(x_k, v_1(x_k)) = 0$ and $x_{k+2} = F(x_{k+1}, \hat{u}_{k+1})$, where $\hat{u}_{k+1} = 0$. Since $x_{k+1} = 0$ and $F(0, 0) = 0$, we have $x_{k+2} = 0$. So, $\hat{u}_k^{k+1}$ is a finite-horizon admissible control sequence. Furthermore,
$$J(x_k, \hat{u}_k^{k+1}) = U(x_k, v_1(x_k)) + U(x_{k+1}, \hat{u}_{k+1}) = U(x_k, v_1(x_k)) = V_1(x_k)$$
since $U(x_{k+1}, \hat{u}_{k+1}) = U(0, 0) = 0$. On the other hand, according to Remark 3.2, we have
$$V_2(x_k) = \min_{u_k^{k+1}} \{ J(x_k, u_k^{k+1}) : u_k^{k+1} \in A_{x_k}^{(2)} \}.$$
Then, we obtain
$$V_2(x_k) = \min_{u_k^{k+1}} \{ J(x_k, u_k^{k+1}) : u_k^{k+1} \in A_{x_k}^{(2)} \} \le J(x_k, \hat{u}_k^{k+1}) = V_1(x_k).$$
Therefore, the theorem holds for $i = 1$.

Assume that the theorem holds for $i = q$, where $q > 1$. From (11), we have
$$V_q(x_k) = \sum_{j=0}^{q-1} U(x_{k+j}, v_{q-j}(x_{k+j})).$$
The corresponding finite-horizon admissible control sequence is $\hat{u}_k^{k+q-1} = (v_q(x_k), v_{q-1}(x_{k+1}), \ldots, v_1(x_{k+q-1}))$. For $i = q + 1$, we create a control sequence $\hat{u}_k^{k+q} = (v_q(x_k), v_{q-1}(x_{k+1}), \ldots, v_1(x_{k+q-1}), 0)$ with length $q + 1$. Then, the state trajectory under the control of $\hat{u}_k^{k+q}$ is $x_k$, $x_{k+1} = F(x_k, v_q(x_k))$, $x_{k+2} = F(x_{k+1}, v_{q-1}(x_{k+1}))$, $\ldots$, $x_{k+q} = F(x_{k+q-1}, v_1(x_{k+q-1})) = 0$, $x_{k+q+1} = F(x_{k+q}, 0) = 0$. So, $\hat{u}_k^{k+q}$ is a finite-horizon admissible control sequence. The performance index function under this control sequence is
$$J(x_k, \hat{u}_k^{k+q}) = U(x_k, v_q(x_k)) + U(x_{k+1}, v_{q-1}(x_{k+1})) + \cdots + U(x_{k+q-1}, v_1(x_{k+q-1})) + U(x_{k+q}, 0) = \sum_{j=0}^{q-1} U(x_{k+j}, v_{q-j}(x_{k+j})) = V_q(x_k)$$
since $U(x_{k+q}, 0) = U(0, 0) = 0$. On the other hand, we have
$$V_{q+1}(x_k) = \min_{u_k^{k+q}} \{ J(x_k, u_k^{k+q}) : u_k^{k+q} \in A_{x_k}^{(q+1)} \}.$$
Thus, we obtain
$$V_{q+1}(x_k) = \min_{u_k^{k+q}} \{ J(x_k, u_k^{k+q}) : u_k^{k+q} \in A_{x_k}^{(q+1)} \} \le J(x_k, \hat{u}_k^{k+q}) = V_q(x_k)$$
which completes the proof.

From Theorem 3.1, we know that the performance index function $V_i(x_k) \ge 0$ is a monotonically nonincreasing sequence and is bounded below for iteration index $i = 1, 2, \ldots$. Now, we can derive the following theorem.

Theorem 3.2: Let $x_k$ be an arbitrary state vector. Define the performance index function $V_\infty(x_k)$ as the limit of the iterative function $V_i(x_k)$
$$V_\infty(x_k) = \lim_{i \to \infty} V_i(x_k). \qquad (12)$$
Then, we have
$$V_\infty(x_k) = \min_{u_k} \{ U(x_k, u_k) + V_\infty(x_{k+1}) \}. \qquad (13)$$

Proof: Let $\eta_k = \eta(x_k)$ be any admissible control vector. According to Theorem 3.1, for $\forall i$, we have
$$V_\infty(x_k) \le V_{i+1}(x_k) \le U(x_k, \eta_k) + V_i(x_{k+1}).$$
Letting $i \to \infty$, we have
$$V_\infty(x_k) \le U(x_k, \eta_k) + V_\infty(x_{k+1})$$
which is true for $\forall \eta_k$. Therefore,
$$V_\infty(x_k) \le \min_{u_k} \{ U(x_k, u_k) + V_\infty(x_{k+1}) \}. \qquad (14)$$

Let $\varepsilon > 0$ be an arbitrary positive number. Since $V_i(x_k)$ is nonincreasing for $i \ge 1$ and $\lim_{i \to \infty} V_i(x_k) = V_\infty(x_k)$, there exists a positive integer $p$ such that $V_p(x_k) - \varepsilon \le V_\infty(x_k) \le V_p(x_k)$. From (8), we have
$$V_p(x_k) = \min_{u_k} \{ U(x_k, u_k) + V_{p-1}(F(x_k, u_k)) \} = U(x_k, v_p(x_k)) + V_{p-1}(F(x_k, v_p(x_k))).$$
Hence,
$$V_\infty(x_k) \ge U(x_k, v_p(x_k)) + V_{p-1}(F(x_k, v_p(x_k))) - \varepsilon \ge U(x_k, v_p(x_k)) + V_\infty(F(x_k, v_p(x_k))) - \varepsilon \ge \min_{u_k} \{ U(x_k, u_k) + V_\infty(x_{k+1}) \} - \varepsilon.$$
Since $\varepsilon$ is arbitrary, we have
$$V_\infty(x_k) \ge \min_{u_k} \{ U(x_k, u_k) + V_\infty(x_{k+1}) \}. \qquad (15)$$
Combining (14) and (15), we prove the theorem.

Next, we will prove that the iterative performance index function $V_i(x_k)$ converges to the optimal performance index function $J^*(x_k)$ as $i \to \infty$.
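Before proceeding, the recursion (6)–(9) and the monotone convergence just established can be checked numerically. The sketch below is an assumed toy example, not a system from this paper: the scalar linear plant $x_{k+1} = a x_k + b u_k$ with utility $U(x, u) = x^2 + u^2$, for which every iterate is quadratic, $V_i(x) = p_i x^2$, and the minimizations in (6) and (8)–(9) have closed forms.

```python
# Toy check of the iterative ADP recursion (6)-(9): scalar system
# x_{k+1} = a*x_k + b*u_k, utility U(x,u) = x^2 + u^2 (assumed example).
# With the quadratic ansatz V_i(x) = p_i * x^2, step (6) gives
# p_1 = 1 + (a/b)^2 (the unique u with a*x + b*u = 0 is u = -a*x/b),
# and step (8) gives p_i = 1 + p_{i-1}*a^2 / (1 + p_{i-1}*b^2).

def iterative_adp(a=1.0, b=1.0, num_iters=40):
    """Coefficients p_1, ..., p_{num_iters} with V_i(x) = p_i * x**2."""
    p = [1.0 + (a / b) ** 2]                              # V_1 from (6)
    for _ in range(num_iters - 1):
        q = p[-1]
        p.append(1.0 + q * a ** 2 / (1.0 + q * b ** 2))   # V_i from (8)
    return p

def gain(p_prev, a=1.0, b=1.0):
    """v_i(x) = -gain * x, the closed-form minimizer in (9)."""
    return p_prev * a * b / (1.0 + p_prev * b ** 2)

p = iterative_adp()
# Theorem 3.1: V_{i+1}(x) <= V_i(x) at every x, i.e., p_{i+1} <= p_i.
assert all(p[i + 1] <= p[i] for i in range(len(p) - 1))
print(p[0], round(p[-1], 6))
```

For $a = b = 1$ the sequence starts at $p_1 = 2$ and decreases toward the fixed point of $p = 1 + p/(1+p)$, namely $(1+\sqrt{5})/2 \approx 1.618$, which is exactly the Bellman relation (13) specialized to this example.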
Theorem 3.3: Let $V_\infty(x_k)$ be defined in (12). If the system state $x_k$ is controllable, then the performance index function $V_\infty(x_k)$ equals the optimal performance index function $J^*(x_k)$
$$\lim_{i \to \infty} V_i(x_k) = J^*(x_k)$$
where $V_i(x_k)$ is defined in (8).

Proof: According to (3) and (10), we have
$$J^*(x_k) \le \min_{u_k^{k+i-1}} \{ J(x_k, u_k^{k+i-1}) : u_k^{k+i-1} \in A_{x_k}^{(i)} \} = V_i(x_k).$$
Then, letting $i \to \infty$, we obtain
$$J^*(x_k) \le V_\infty(x_k). \qquad (16)$$
Next, we show that
$$V_\infty(x_k) \le J^*(x_k). \qquad (17)$$
For any $\omega > 0$, by the definition of $J^*(x_k)$ in (3), there exists $\eta_k \in A_{x_k}$ such that
$$J(x_k, \eta_k) \le J^*(x_k) + \omega. \qquad (18)$$
Suppose that $|\eta_k| = p$. Then, $\eta_k \in A_{x_k}^{(p)}$. So, by Theorem 3.1 and (10), we have
$$V_\infty(x_k) \le V_p(x_k) = \min_{u_k^{k+p-1}} \{ J(x_k, u_k^{k+p-1}) : u_k^{k+p-1} \in A_{x_k}^{(p)} \} \le J(x_k, \eta_k) \le J^*(x_k) + \omega.$$
Since $\omega$ is chosen arbitrarily, we know that (17) is true. Therefore, from (16) and (17), we prove the theorem.

We can now present the following corollary.

Corollary 3.1: Let the performance index function $V_i(x_k)$ be defined by (8). If the system state $x_k$ is controllable, then the iterative control law $v_i(x_k)$ converges to the optimal control law $u^*(x_k)$, i.e., $\lim_{i \to \infty} v_i(x_k) = u^*(x_k)$.

Remark 3.3: Generally speaking, for finite-horizon optimal control problems, the optimal performance index function depends not only on the state $x_k$ but also on the time left (see [60], [61]). For the finite-horizon optimal control problems with unspecified terminal time considered here, we have proved that the iterative performance index functions converge to the optimum as the iteration index $i$ reaches infinity. Then, the time left is negligible, and we say that the optimal performance index function $V_\infty(x_k)$ is only a function of the state $x_k$, as in the case of infinite-horizon optimal control problems.

By Theorem 3.3 and Corollary 3.1, we know that if $x_k$ is controllable, then, as $i \to \infty$, the iterative performance index function $V_i(x_k)$ converges to the optimal performance index function $J^*(x_k)$, and the iterative control law $v_i(x_k)$ also converges to the optimal control law $u^*(x_k)$. So, it is important to note that for a controllable state $x_k$, the iterative performance index functions $V_i(x_k)$ are well defined for all $i$ under the iterative control law $v_i(x_k)$.

Let $T_0 = \{0\}$. For $i = 1, 2, \ldots$, define
$$T_i = \{ x_k \in \mathbb{R}^n \mid \exists u_k \in \mathbb{R}^m \ \text{s.t.} \ F(x_k, u_k) \in T_{i-1} \}. \qquad (19)$$
Next, we prove the following theorem.

Theorem 3.4: Let $T_0 = \{0\}$ and let $T_i$ be defined in (19). Then, for $i = 0, 1, \ldots$, we have $T_i \subseteq T_{i+1}$.

Proof: We prove the theorem by mathematical induction. First, let $i = 0$. Since $T_0 = \{0\}$ and $F(0, 0) = 0$, we know that $0 \in T_1$. Hence, $T_0 \subseteq T_1$. Next, assume that $T_{i-1} \subseteq T_i$ holds. Now, if $x_k \in T_i$, we have $F(x_k, \eta_{i-1}(x_k)) \in T_{i-1}$ for some $\eta_{i-1}(x_k)$. Hence, $F(x_k, \eta_{i-1}(x_k)) \in T_i$ by the assumption that $T_{i-1} \subseteq T_i$. So, $x_k \in T_{i+1}$ by (19). Thus, $T_i \subseteq T_{i+1}$, which proves the theorem.

According to Theorem 3.4, we have $\{0\} = T_0 \subseteq T_1 \subseteq \cdots \subseteq T_{i-1} \subseteq T_i \subseteq \cdots$. We can see that by introducing the sets $T_i$, $i = 0, 1, \ldots$, the states $x_k$ can be classified correspondingly. According to Theorem 3.4, the properties of the ADP algorithm can be derived in the following theorem.

Theorem 3.5:
1) For any $i$, $x_k \in T_i \Leftrightarrow A_{x_k}^{(i)} \neq \emptyset \Leftrightarrow V_i(x_k)$ is defined at $x_k$.
2) Let $T_\infty = \cup_{i=1}^{\infty} T_i$. Then, $x_k \in T_\infty \Leftrightarrow A_{x_k} \neq \emptyset \Leftrightarrow J^*(x_k)$ is defined at $x_k$ $\Leftrightarrow$ $x_k$ is controllable.
3) If $V_i(x_k)$ is defined at $x_k$, then $V_j(x_k)$ is defined at $x_k$ for every $j \ge i$.
4) $J^*(x_k)$ is defined at $x_k$ if and only if there exists an $i$ such that $V_i(x_k)$ is defined at $x_k$.

IV. ε-OPTIMAL CONTROL ALGORITHM
In the previous section, we proved that the iterative performance index function $V_i(x_k)$ converges to the optimal performance index function $J^*(x_k)$, and that $J^*(x_k) = \inf_{\underline{u}_k} \{ J(x_k, \underline{u}_k) : \underline{u}_k \in A_{x_k} \}$ satisfies Bellman's equation (4) for any controllable state $x_k \in T_\infty$. To obtain the optimal performance index function $J^*(x_k)$, a natural strategy is to run the iterative ADP algorithm (6)–(9) until $i \to \infty$. Unfortunately, it is not practical to do so. In many cases, the equality $J^*(x_k) = V_i(x_k)$ does not hold for any finite $i$. That is, for any admissible control sequence $\underline{u}_k$ with finite length, the performance index starting from $x_k$ under the control of $\underline{u}_k$ will be larger than, not equal to, $J^*(x_k)$. On the other hand, by running the iterative ADP algorithm (6)–(9), we can obtain a control vector $v_\infty(x_k)$ and then construct a control sequence $\underline{u}_\infty(x_k) = (v_\infty(x_k), v_\infty(x_{k+1}), \ldots, v_\infty(x_{k+i}), \ldots)$, where $x_{k+1} = F(x_k, v_\infty(x_k)), \ldots, x_{k+i} = F(x_{k+i-1}, v_\infty(x_{k+i-1})), \ldots$. In general, $\underline{u}_\infty(x_k)$ has infinite length; that is, the controller $v_\infty(x_k)$ cannot control the state to reach the target in a finite number of steps. To overcome this difficulty, a new ε-optimal control method using the iterative ADP algorithm will be developed in this section.

A. ε-Optimal Control Method

In this section, we will introduce our method of iterative ADP with the consideration of the length of control sequences. For different $x_k$, we will consider different lengths $i$ for the
optimal control sequence. For a given error bound $\varepsilon > 0$, the number $i$ will be chosen so that the error between $J^*(x_k)$ and $V_i(x_k)$ is within the bound.

Let $\varepsilon > 0$ be any small number and let $x_k \in T_\infty$ be any controllable state. Let the performance index function $V_i(x_k)$ be defined by (8) and let $J^*(x_k)$ be the optimal performance index function. According to Theorem 3.3, given $\varepsilon > 0$, there exists a finite $i$ such that
$$|V_i(x_k) - J^*(x_k)| \le \varepsilon. \qquad (20)$$
We can now give the following definition.

Definition 4.1: Let $x_k \in T_\infty$ be a controllable state vector. Let $\varepsilon > 0$ be a small positive number. The approximate length of the optimal control sequence with respect to $\varepsilon$ is defined as
$$K_\varepsilon(x_k) = \min \{ i : |V_i(x_k) - J^*(x_k)| \le \varepsilon \}. \qquad (21)$$
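For intuition, $K_\varepsilon(x_k)$ can be computed directly in the toy scalar example used earlier (an assumed illustration, not a system from this paper): $x_{k+1} = x_k + u_k$ with $U(x,u) = x^2 + u^2$, where $V_i(x) = p_i x^2$ with $p_1 = 2$ and $p_i = 1 + p_{i-1}/(1 + p_{i-1})$, and $J^*(x) = p_\infty x^2$ with $p_\infty = (1+\sqrt{5})/2$.

```python
# Hypothetical computation of K_eps(x_k) from Definition 4.1, reusing the
# assumed scalar example x_{k+1} = x_k + u_k, U = x^2 + u^2, for which
# V_i(x) = p_i * x^2 and J*(x) = p_inf * x^2 with p_inf = (1+sqrt(5))/2.

P_INF = (1.0 + 5 ** 0.5) / 2.0

def K_eps(x, eps, max_iter=200):
    """Smallest i with |V_i(x) - J*(x)| <= eps, per (21)."""
    p = 2.0                                   # p_1
    for i in range(1, max_iter + 1):
        if abs(p - P_INF) * x * x <= eps:
            return i
        p = 1.0 + p / (1.0 + p)               # next iterate from (8)
    raise RuntimeError("eps not reached within max_iter")

# States farther from the origin need more iterations for the same eps.
print(K_eps(0.5, 1e-6), K_eps(2.0, 1e-6))
```

This matches the qualitative claim made below Theorem 4.1: the farther $x_k$ is from the origin, the longer the control sequence needed for the same error bound.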
Given a small positive number $\varepsilon$, for any state vector $x_k$, the number $K_\varepsilon(x_k)$ gives a suitable length of control sequence for optimal control starting from $x_k$. For $x_k \in T_\infty$, since $\lim_{i \to \infty} V_i(x_k) = J^*(x_k)$, we can always find $i$ such that (20) is satisfied. Therefore, $\{ i : |V_i(x_k) - J^*(x_k)| \le \varepsilon \} \neq \emptyset$ and $K_\varepsilon(x_k)$ is well defined. We can see that an error $\varepsilon$ between $V_i(x_k)$ and $J^*(x_k)$ is introduced into the iterative ADP algorithm, which makes the performance index function $V_i(x_k)$ converge within a finite number of iteration steps. In this part, we will show that the corresponding control is also an effective control that drives the performance index function to within the error bound $\varepsilon$ of its optimum.

From Definition 4.1, we can see that all the states $x_k$ that satisfy (21) can be classified into one set. Motivated by the definition in (19), we can further classify this set using the following definition.

Definition 4.2: Let $\varepsilon$ be a positive number. Define $T_0^{(\varepsilon)} = \{0\}$ and, for $i = 1, 2, \ldots$, define
$$T_i^{(\varepsilon)} = \{ x_k \in T_\infty : K_\varepsilon(x_k) \le i \}.$$

Accordingly, when $x_k \in T_i^{(\varepsilon)}$, to find an optimal control sequence whose performance index is less than or equal to $J^*(x_k) + \varepsilon$, one only needs to consider control sequences $\underline{u}_k$ with length $|\underline{u}_k| \le i$. The sets $T_i^{(\varepsilon)}$ have the following properties.

Theorem 4.1: Let $\varepsilon > 0$ and $i = 0, 1, \ldots$. Then:
1) $x_k \in T_i^{(\varepsilon)}$ if and only if $V_i(x_k) \le J^*(x_k) + \varepsilon$;
2) $T_i^{(\varepsilon)} \subseteq T_i$;
3) $T_i^{(\varepsilon)} \subseteq T_{i+1}^{(\varepsilon)}$;
4) $\cup_i T_i^{(\varepsilon)} = T_\infty$;
5) if $\varepsilon > \delta > 0$, then $T_i^{(\varepsilon)} \supseteq T_i^{(\delta)}$.

Proof:
1) Let $x_k \in T_i^{(\varepsilon)}$. By Definition 4.2, $K_\varepsilon(x_k) \le i$. Let $j = K_\varepsilon(x_k)$. Then, $j \le i$ and, by Definition 4.1, $|V_j(x_k) - J^*(x_k)| \le \varepsilon$. So, $V_j(x_k) \le J^*(x_k) + \varepsilon$. By Theorem 3.1, $V_i(x_k) \le V_j(x_k) \le J^*(x_k) + \varepsilon$. On the other hand, if $V_i(x_k) \le J^*(x_k) + \varepsilon$, then $|V_i(x_k) - J^*(x_k)| \le \varepsilon$. So, $K_\varepsilon(x_k) = \min\{ j : |V_j(x_k) - J^*(x_k)| \le \varepsilon \} \le i$, which implies that $x_k \in T_i^{(\varepsilon)}$.
2) If $x_k \in T_i^{(\varepsilon)}$, then $K_\varepsilon(x_k) \le i$ and $|V_i(x_k) - J^*(x_k)| \le \varepsilon$. So, $V_i(x_k)$ is defined at $x_k$. According to Theorem 3.5 1), we have $x_k \in T_i$. Hence, $T_i^{(\varepsilon)} \subseteq T_i$.
3) If $x_k \in T_i^{(\varepsilon)}$, then $K_\varepsilon(x_k) \le i < i + 1$. So, $x_k \in T_{i+1}^{(\varepsilon)}$. Thus, $T_i^{(\varepsilon)} \subseteq T_{i+1}^{(\varepsilon)}$.
4) Obviously, $\cup_i T_i^{(\varepsilon)} \subseteq T_\infty$ since the $T_i^{(\varepsilon)}$ are subsets of $T_\infty$. For any $x_k \in T_\infty$, let $p = K_\varepsilon(x_k)$. Then, $x_k \in T_p^{(\varepsilon)}$. So, $x_k \in \cup_i T_i^{(\varepsilon)}$. Hence, $T_\infty \subseteq \cup_i T_i^{(\varepsilon)} \subseteq T_\infty$, and we obtain $\cup_i T_i^{(\varepsilon)} = T_\infty$.
5) If $x_k \in T_i^{(\delta)}$, then $V_i(x_k) \le J^*(x_k) + \delta$ by part 1) of this theorem. Clearly, $V_i(x_k) \le J^*(x_k) + \varepsilon$ since $\delta < \varepsilon$. This implies that $x_k \in T_i^{(\varepsilon)}$. Therefore, $T_i^{(\varepsilon)} \supseteq T_i^{(\delta)}$.

According to Theorem 4.1 1), $T_i^{(\varepsilon)}$ is just the region where $V_i(x_k)$ is close to $J^*(x_k)$ with error less than $\varepsilon$. This region is a subset of $T_i$ according to Theorem 4.1 2). As stated in Theorem 4.1 3), when $i$ is large, the set $T_i^{(\varepsilon)}$ is also large; that means that when $i$ is large, we have a large region where we can use $V_i(x_k)$ as an approximation of $J^*(x_k)$ within a certain error. On the other hand, we claim that if $x_k$ is far away from the origin, we have to choose a long control sequence to approximate the optimal control sequence. Theorem 4.1 4) means that for every controllable state $x_k \in T_\infty$, we can always find a suitable control sequence with length $i$ to approximate the optimal control. The size of the set $T_i^{(\varepsilon)}$ depends on the value of $\varepsilon$: a smaller value of $\varepsilon$ gives a smaller set $T_i^{(\varepsilon)}$, as indicated by Theorem 4.1 5).

Let $x_k \in T_\infty$ be an arbitrary controllable state. If $x_k \in T_i^{(\varepsilon)}$, the iterative performance index function satisfies (20) under the control $v_i(x_k)$. We call this control the ε-optimal control and denote it as $\mu_\varepsilon^*(x_k)$
$$\mu_\varepsilon^*(x_k) = v_i(x_k) = \arg \min_{u_k} \{ U(x_k, u_k) + V_{i-1}(F(x_k, u_k)) \}. \qquad (22)$$
We have the following corollary.

Corollary 4.1: Let $\mu_\varepsilon^*(x_k)$ be expressed in (22), which makes the performance index function satisfy (20) for $x_k \in T_i^{(\varepsilon)}$. Then, for any $x_k' \in T_i^{(\varepsilon)}$, $\mu_\varepsilon^*(x_k')$ guarantees
$$|V_i(x_k') - J^*(x_k')| \le \varepsilon. \qquad (23)$$

Proof: The corollary can be proved by contradiction. Assume that the conclusion is not true. Then, the inequality (23) is false under the control $\mu_\varepsilon^*(\cdot)$ for some $x_k'' \in T_i^{(\varepsilon)}$. As $\mu_\varepsilon^*(x_k)$ makes the performance index function satisfy (20) for $x_k \in T_i^{(\varepsilon)}$, we have $K_\varepsilon(x_k) \le i$. Using the ε-optimal control law $\mu_\varepsilon^*(\cdot)$ at the state $x_k''$, according to the assumption, we have $|V_i(x_k'') - J^*(x_k'')| > \varepsilon$. Then, $K_\varepsilon(x_k'') > i$ and $x_k'' \notin T_i^{(\varepsilon)}$. This contradicts the assumption $x_k'' \in T_i^{(\varepsilon)}$. Therefore, the assumption is false and (23) holds for any $x_k' \in T_i^{(\varepsilon)}$.

Remark 4.1: Corollary 4.1 is very important for the neural network implementation of the iterative ADP algorithm. It shows that we do not need to obtain the optimal control law by searching the entire subset $T_i^{(\varepsilon)}$. Instead, we can just find one point of $T_i^{(\varepsilon)}$, i.e., $x_k \in T_i^{(\varepsilon)}$, to obtain the ε-optimal control
$\mu_\varepsilon^*(x_k)$, which will be effective for any other state $x_k' \in T_i^{(\varepsilon)}$. This property not only greatly reduces the computational complexity but also makes the optimal control law easy to obtain using neural networks.

Theorem 4.2: Let $x_k \in T_i^{(\varepsilon)}$ and let $\mu_\varepsilon^*(x_k)$ be expressed in (22). Then, $F(x_k, \mu_\varepsilon^*(x_k)) \in T_{i-1}^{(\varepsilon)}$. In other words, if $K_\varepsilon(x_k) = i$, then $K_\varepsilon(F(x_k, \mu_\varepsilon^*(x_k))) \le i - 1$.

Proof: Since $x_k \in T_i^{(\varepsilon)}$, by Theorem 4.1 1), we know that
$$V_i(x_k) \le J^*(x_k) + \varepsilon. \qquad (24)$$
According to (8) and (22), we have
$$V_i(x_k) = U(x_k, \mu_\varepsilon^*(x_k)) + V_{i-1}(F(x_k, \mu_\varepsilon^*(x_k))). \qquad (25)$$
Combining (24) and (25), we have
$$V_{i-1}(F(x_k, \mu_\varepsilon^*(x_k))) = V_i(x_k) - U(x_k, \mu_\varepsilon^*(x_k)) \le J^*(x_k) + \varepsilon - U(x_k, \mu_\varepsilon^*(x_k)). \qquad (26)$$
On the other hand, we have
$$J^*(x_k) \le U(x_k, \mu_\varepsilon^*(x_k)) + J^*(F(x_k, \mu_\varepsilon^*(x_k))). \qquad (27)$$
Putting (27) into (26), we obtain
$$V_{i-1}(F(x_k, \mu_\varepsilon^*(x_k))) \le J^*(F(x_k, \mu_\varepsilon^*(x_k))) + \varepsilon.$$
By Theorem 4.1 1), we have
$$F(x_k, \mu_\varepsilon^*(x_k)) \in T_{i-1}^{(\varepsilon)}. \qquad (28)$$
So, if $K_\varepsilon(x_k) = i$, we know that $x_k \in T_i^{(\varepsilon)}$ and $F(x_k, \mu_\varepsilon^*(x_k)) \in T_{i-1}^{(\varepsilon)}$ according to (28). Therefore, we have $K_\varepsilon(F(x_k, \mu_\varepsilon^*(x_k))) \le i - 1$, which proves the theorem.

Fig. 1. Control process of the controllable state $x_k \in T_i^{(\varepsilon)}$ using the iterative ADP algorithm. (Figure: the nested sets $T_0, T_1^{(\varepsilon)}, T_2^{(\varepsilon)}, \ldots, T_{i-1}^{(\varepsilon)}, T_i^{(\varepsilon)}$.)

Remark 4.2: From Theorem 4.2, we can see that the parameter $K_\varepsilon(x_k)$ gives an important property of the finite-horizon ADP algorithm. It not only gives an optimality condition for the iteration process, but also gives an optimal number of control steps for the finite-horizon ADP algorithm. For example, if $|V_i(x_k) - J^*(x_k)| \le \varepsilon$ for small $\varepsilon$, then we have $V_i(x_k) \approx J^*(x_k)$. According to Theorem 4.2, we can take $N = k + i$, where $N$ is the number of control steps to drive the system to zero. The whole control sequence $u_0^{N-1}$ may not be ε-optimal, but the control sequence $u_k^{N-1}$ is an ε-optimal control sequence. If $k = 0$, we have $N = K_\varepsilon(x_0) = i$. Under this condition, we say that the iteration index $K_\varepsilon(x_0)$ denotes the number of ε-optimal control steps.

Corollary 4.2: Let $\mu_\varepsilon^*(x_k)$ be expressed in (22), which makes the performance index function satisfy (20) for $x_k \in T_i^{(\varepsilon)}$. Then, for any $x_k' \in T_j^{(\varepsilon)}$, where $0 \le j \le i$, $\mu_\varepsilon^*(x_k')$ guarantees
$$|V_i(x_k') - J^*(x_k')| \le \varepsilon. \qquad (29)$$
Proof: The proof is similar to that of Corollary 4.1 and is omitted here.

Remark 4.3: Corollary 4.2 shows that the ε-optimal control $\mu_\varepsilon^*(x_k)$ obtained for $\forall x_k \in T_i^{(\varepsilon)}$ is effective for any state $x_k' \in T_j^{(\varepsilon)}$, where $0 \le j \le i$. This means that for $\forall x_k' \in T_j^{(\varepsilon)}$, $0 \le j \le i$, we can use the same ε-optimal control $\mu_\varepsilon^*(x_k')$ to control the system.

B. ε-Optimal Control Algorithm

According to Theorem 4.1 3) and Corollary 4.1, the ε-optimal control $\mu_\varepsilon^*(x_k)$ obtained for an $x_k \in T_i^{(\varepsilon)}$ is effective for any state $x_k' \in T_{i-1}^{(\varepsilon)}$ (which is also stated in Corollary 4.2). That is to say, in order to obtain an effective ε-optimal control, the iterative ADP algorithm only needs to be run at some state $x_k \in T_\infty$. In order to obtain an effective ε-optimal control law $\mu_\varepsilon^*(x_k)$, we should choose the state $x_k \in T_i^{(\varepsilon)} \setminus T_{i-1}^{(\varepsilon)}$ for each $i$ to run the iterative ADP algorithm. The control process using the iterative ADP algorithm is illustrated in Fig. 1.

The iterative ADP algorithm (6)–(9) requires that for any state $x_k \in \mathbb{R}^n$, there exists a control $u_k \in \mathbb{R}^m$ that drives the system to zero in one step, i.e., for $\forall x_k \in \mathbb{R}^n$, there exists a control $u_k \in \mathbb{R}^m$ such that $x_{k+1} = F(x_k, u_k) = 0$ holds. A large class of systems possesses this property, for example, all linear systems of the type $x_{k+1} = A x_k + B u_k$ when $B$ is invertible, and affine nonlinear systems of the type $x_{k+1} = f(x_k) + g(x_k) u_k$ when the inverse of $g(x_k)$ exists. But there are also classes of systems for which there does not exist any control $u_k \in \mathbb{R}^m$ that drives the state to zero in one step for some $x_k \in \mathbb{R}^n$, i.e., $\exists x_k \in \mathbb{R}^n$ such that $F(x_k, u_k) = 0$ is not possible for $\forall u_k \in \mathbb{R}^m$. In the following part, we will discuss the situation where $F(x_k, u_k) = 0$ cannot be achieved in one step for some $x_k \in \mathbb{R}^n$.

Since $x_k$ is controllable, there exists a finite-horizon admissible control sequence $u_k^{k+i-1} = (u_k, u_{k+1}, \ldots, u_{k+i-1}) \in A_{x_k}^{(i)}$ that makes $x^{(f)}(x_k, u_k^{k+i-1}) = x_{k+i} = 0$. Let $N = k + i$ be the terminal time. Assume that for $k + 1, k + 2, \ldots, N - 1$, the optimal control sequence $u_{k+1}^{(N-1)*} = (u_{k+1}^*, u_{k+2}^*, \ldots, u_{N-1}^*) \in A_{x_{k+1}}^{(N-k-1)}$ has been determined. Denote the performance index function for $x_{k+1}$ as $J(x_{k+1}, u_{k+1}^{(N-1)*}) = V_0(x_{k+1})$. Now, we use the iterative ADP algorithm to determine the optimal control sequence for the state $x_k$.
The performance index function for $i = 1$ is computed as
$$V_1(x_k) = U(x_k, v_1(x_k)) + V_0(F(x_k, v_1(x_k))) \qquad (30)$$
where
$$v_1(x_k) = \arg \min_{u_k} \{ U(x_k, u_k) + V_0(F(x_k, u_k)) \}. \qquad (31)$$
Note that the initial condition used in the above expression is the performance index function $V_0$, which was obtained previously for $x_{k+1}$ and is now applied at $F(x_k, u_k)$. For $i = 2, 3, 4, \ldots$, the iterative ADP algorithm is implemented as
$$V_i(x_k) = U(x_k, v_i(x_k)) + V_{i-1}(F(x_k, v_i(x_k))) \qquad (32)$$
where
$$v_i(x_k) = \arg \min_{u_k} \{ U(x_k, u_k) + V_{i-1}(F(x_k, u_k)) \}. \qquad (33)$$
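As a sanity check on (30)–(33), the assumed scalar example used in the earlier sketches ($x_{k+1} = x_k + u_k$, $U = x^2 + u^2$, not a system from this paper) can be restarted from a nonzero inherited $V_0$. As the theorems that follow assert, the sequence is then nonincreasing from $i = 0$ onward and still converges to the same limit.

```python
# Sketch of the variant (30)-(33): the same assumed scalar system with
# V_i(x) = p_i * x^2, but the iteration starts from a nonzero inherited
# coefficient p0 (playing the role of V_0 from x_{k+1}) instead of 0.

def iterate_from(p0, n=40):
    """Coefficients p_0, ..., p_n starting from the inherited p0."""
    p = [p0]                                   # inherited V_0(x) = p0 * x^2
    for _ in range(n):
        p.append(1.0 + p[-1] / (1.0 + p[-1]))  # update (32)-(33), closed form
    return p

# p0 must be at least the optimal coefficient for monotonicity to hold,
# mirroring the fact that the inherited V_0 is an admissible (hence
# suboptimal-or-optimal) cost.
p = iterate_from(5.0)
assert all(p[i + 1] <= p[i] for i in range(len(p) - 1))
print(p[0], round(p[-1], 6))
```

Here $V_0$ is the maximum of the sequence, matching the discussion in Remark 4.4 below, while the limit is the same fixed point $(1+\sqrt{5})/2$ reached by the original algorithm (6)–(9).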
Theorem 4.3: Let $x_k$ be an arbitrary controllable state vector. Then, the performance index function $V_i(x_k)$ obtained by (30)–(33) is a monotonically nonincreasing sequence for $\forall i \ge 0$, i.e., $V_{i+1}(x_k) \le V_i(x_k)$ for $\forall i \ge 0$.

Proof: It can easily be proved following the proof of Theorem 3.1, and the proof is omitted here.

Theorem 4.4: Let the performance index function $V_i(x_k)$ be defined by (32). If the system state $x_k$ is controllable, then the performance index function $V_i(x_k)$ obtained by (30)–(33) converges to the optimal performance index function $J^*(x_k)$ as $i \to \infty$
$$\lim_{i \to \infty} V_i(x_k) = J^*(x_k).$$

Proof: This theorem can be proved following steps similar to those in the proof of Theorem 3.3, and the proof is omitted here.

Remark 4.4: We can see that the iterative ADP algorithm (30)–(33) is an expansion of the previous one, (6)–(9), so the properties of the iterative ADP algorithm (6)–(9) also hold for the current one (30)–(33). But there are also differences. From Theorem 3.1, we can see that $V_{i+1}(x_k) \le V_i(x_k)$ for all $i \ge 1$, which means that $V_1(x_k) = \max\{V_i(x_k) : i = 0, 1, \ldots\}$, while Theorem 4.3 shows that $V_{i+1}(x_k) \le V_i(x_k)$ for all $i \ge 0$, which means that $V_0(x_k) = \max\{V_i(x_k) : i = 0, 1, \ldots\}$. This difference is caused by the difference in the initial conditions of the two iterative ADP algorithms. The previous iterative ADP algorithm (6)–(9) begins with the initial performance index function $V_0(x_k) = 0$, since $F(x_k, u_k) = 0$ can be solved, while the current iterative ADP algorithm (30)–(33) begins with the performance index function $V_0$ for the state $x_{k+1}$, which was determined previously. This also causes the difference between the proofs of Theorems 3.1 and 3.3 and the corresponding results in Theorems 4.3 and 4.4. But the difference in the initial conditions of the iterative performance index function does not affect the convergence property of the two iterative ADP algorithms.

For the iterative ADP algorithm, the optimality criterion (20) is very difficult to verify because the optimal performance index function $J^*(x_k)$ is unknown in general. So, an equivalent criterion is established to replace (20). If $|V_i(x_k) - J^*(x_k)| \le \varepsilon$ holds, we have $V_i(x_k) \le J^*(x_k) + \varepsilon$ and $J^*(x_k) \le V_{i+1}(x_k) \le V_i(x_k)$. These imply that
$$0 \le V_i(x_k) - V_{i+1}(x_k) \le \varepsilon \qquad (34)$$
or $|V_i(x_k) - V_{i+1}(x_k)| \le \varepsilon$. On the other hand, according to Theorem 4.4, $|V_i(x_k) - V_{i+1}(x_k)| \to 0$ implies that $V_i(x_k) \to J^*(x_k)$. Therefore, for any given small $\varepsilon$, if $|V_i(x_k) - V_{i+1}(x_k)| \le \varepsilon$ holds, then $|V_i(x_k) - J^*(x_k)| \le \varepsilon$ holds if $i$ is sufficiently large. We will use (34) as the optimality criterion instead of the criterion (20).

Let $u_0^{K-1} = (u_0, u_1, \ldots, u_{K-1})$ be an arbitrary finite-horizon admissible control sequence and let the corresponding state sequence be $x_0^K = (x_0, x_1, \ldots, x_K)$, where $x_K = 0$. The initial control sequence $u_0^{K-1}$ may not be optimal, which means that the initial number of control steps $K$ may not be optimal. So, the iterative ADP algorithm must complete two kinds of optimization: one is to optimize the number of control steps, and the other is to optimize the control law. In the following, we show how the number of control steps and the control law are optimized simultaneously in the iterative ADP algorithm.

For the state $x_{K-1}$, we have $F(x_{K-1}, u_{K-1}) = 0$. Then, we run the iterative ADP algorithm (6)–(9) at $x_{K-1}$ as follows. The performance index function for $i = 1$ is computed as
$$V_1^1(x_{K-1}) = \min_{u_{K-1}} \{ U(x_{K-1}, u_{K-1}) + V_0(F(x_{K-1}, u_{K-1})) \} \quad \text{s.t. } F(x_{K-1}, u_{K-1}) = 0 = U(x_{K-1}, v_1^1(x_{K-1})) \qquad (35)$$
where
$$v_1^1(x_{K-1}) = \arg \min_{u_{K-1}} U(x_{K-1}, u_{K-1}) \quad \text{s.t. } F(x_{K-1}, u_{K-1}) = 0 \qquad (36)$$
and $V_0(F(x_{K-1}, u_{K-1})) = 0$. The iterative ADP algorithm is implemented as follows for $i = 2, 3, 4, \ldots$:
$$V_i^1(x_{K-1}) = U(x_{K-1}, v_i^1(x_{K-1})) + V_{i-1}^1(F(x_{K-1}, v_i^1(x_{K-1}))) \qquad (37)$$
where
$$v_i^1(x_{K-1}) = \arg \min_{u_{K-1}} \{ U(x_{K-1}, u_{K-1}) + V_{i-1}^1(F(x_{K-1}, u_{K-1})) \} \qquad (38)$$
until the inequality
$$|V_{l_1}^1(x_{K-1}) - V_{l_1+1}^1(x_{K-1})| \le \varepsilon \qquad (39)$$
is satisfied for $l_1 > 0$. This means that $x_{K-1} \in T_{l_1}^{(\varepsilon)}$ and the optimal number of control steps is $K_\varepsilon(x_{K-1}) = l_1$.

Considering $x_{K-2}$, we have $F(x_{K-2}, u_{K-2}) = x_{K-1}$. Put $x_{K-2}$ into (39). If $|V_{l_1}^1(x_{K-2}) - V_{l_1+1}^1(x_{K-2})| \le \varepsilon$ holds, then
(ε)
according to Theorem 4.1 1), we know that x_{K−2} ∈ T_{l_1}^{(ε)}. Otherwise, if x_{K−2} ∉ T_{l_1}^{(ε)}, we run the iterative ADP algorithm as follows. Using the performance index function V_{l_1}^1 as the initial condition, we compute for i = 1

V_1^2(x_{K−2}) = U(x_{K−2}, v_1^2(x_{K−2})) + V_{l_1}^1(F(x_{K−2}, v_1^2(x_{K−2})))    (40)

where

v_1^2(x_{K−2}) = arg min_{u_{K−2}} { U(x_{K−2}, u_{K−2}) + V_{l_1}^1(F(x_{K−2}, u_{K−2})) }.    (41)

The iterative ADP algorithm is then implemented as follows for i = 2, 3, 4, ...

V_i^2(x_{K−2}) = U(x_{K−2}, v_i^2(x_{K−2})) + V_{i−1}^2(F(x_{K−2}, v_i^2(x_{K−2})))    (42)

where

v_i^2(x_{K−2}) = arg min_{u_{K−2}} { U(x_{K−2}, u_{K−2}) + V_{i−1}^2(F(x_{K−2}, u_{K−2})) }    (43)

until the inequality

|V_{l_2}^2(x_{K−2}) − V_{l_2+1}^2(x_{K−2})| ≤ ε    (44)

is satisfied for some l_2 > 0. We can then conclude that x_{K−2} ∈ T_{l_2}^{(ε)} and the optimal number of control steps is K_ε(x_{K−2}) = l_2.

Next, assume that j ≥ 2, that x_{K−j+1} ∈ T_{l_{j−1}}^{(ε)}, and that

|V_{l_{j−1}}^{j−1}(x_{K−j+1}) − V_{l_{j−1}+1}^{j−1}(x_{K−j+1})| ≤ ε    (45)

holds. Considering x_{K−j}, we have F(x_{K−j}, u_{K−j}) = x_{K−j+1}. Substituting x_{K−j} into (45), if

|V_{l_{j−1}}^{j−1}(x_{K−j}) − V_{l_{j−1}+1}^{j−1}(x_{K−j})| ≤ ε    (46)

holds, then we know that x_{K−j} ∈ T_{l_{j−1}}^{(ε)}. Otherwise, if x_{K−j} ∉ T_{l_{j−1}}^{(ε)}, then we run the iterative ADP algorithm as follows. Using the performance index function V_{l_{j−1}}^{j−1} as the initial condition, we compute for i = 1

V_1^j(x_{K−j}) = U(x_{K−j}, v_1^j(x_{K−j})) + V_{l_{j−1}}^{j−1}(F(x_{K−j}, v_1^j(x_{K−j})))    (47)

where

v_1^j(x_{K−j}) = arg min_{u_{K−j}} { U(x_{K−j}, u_{K−j}) + V_{l_{j−1}}^{j−1}(F(x_{K−j}, u_{K−j})) }.    (48)

The iterative ADP algorithm is then implemented as follows for i = 2, 3, 4, ...

V_i^j(x_{K−j}) = U(x_{K−j}, v_i^j(x_{K−j})) + V_{i−1}^j(F(x_{K−j}, v_i^j(x_{K−j})))    (49)

where

v_i^j(x_{K−j}) = arg min_{u_{K−j}} { U(x_{K−j}, u_{K−j}) + V_{i−1}^j(F(x_{K−j}, u_{K−j})) }    (50)

until the inequality

|V_{l_j}^j(x_{K−j}) − V_{l_j+1}^j(x_{K−j})| ≤ ε    (51)

is satisfied for some l_j > 0. We can then conclude that x_{K−j} ∈ T_{l_j}^{(ε)} and the optimal number of control steps is K_ε(x_{K−j}) = l_j.

Finally, considering x_0, we have F(x_0, u_0) = x_1. If

|V_{l_{K−1}}^{K−1}(x_0) − V_{l_{K−1}+1}^{K−1}(x_0)| ≤ ε

holds, then we know that x_0 ∈ T_{l_{K−1}}^{(ε)}. Otherwise, if x_0 ∉ T_{l_{K−1}}^{(ε)}, then we run the iterative ADP algorithm as follows. Using the performance index function V_{l_{K−1}}^{K−1} as the initial condition, we compute for i = 1

V_1^K(x_0) = U(x_0, v_1^K(x_0)) + V_{l_{K−1}}^{K−1}(F(x_0, v_1^K(x_0)))    (52)

where

v_1^K(x_0) = arg min_{u_0} { U(x_0, u_0) + V_{l_{K−1}}^{K−1}(F(x_0, u_0)) }.    (53)

The iterative ADP algorithm is then implemented as follows for i = 2, 3, 4, ...

V_i^K(x_0) = U(x_0, v_i^K(x_0)) + V_{i−1}^K(F(x_0, v_i^K(x_0)))    (54)

where

v_i^K(x_0) = arg min_{u_0} { U(x_0, u_0) + V_{i−1}^K(F(x_0, u_0)) }    (55)

until the inequality

|V_{l_K}^K(x_0) − V_{l_K+1}^K(x_0)| ≤ ε    (56)

is satisfied for some l_K > 0. Therefore, we conclude that x_0 ∈ T_{l_K}^{(ε)} and the optimal number of control steps is K_ε(x_0) = l_K. That is, starting from the initial state x_0, the optimal number of control steps obtained by our ADP algorithm is l_K.

Remark 4.5: For the case where there exist some x_k ∈ R^n for which no control u_k ∈ R^m drives the system to zero in one step, the computational complexity of the iterative ADP algorithm is closely related to the original finite-horizon admissible control sequence u_0^{K−1}. First, the iterative ADP algorithm is run at x_{K−1}, x_{K−2}, ..., x_1, x_0, respectively, so the cost depends on the number of control steps K of u_0^{K−1}. If K is large, i.e., u_0^{K−1} takes a large number of control steps to drive the initial state x_0 to zero, then the iterative ADP algorithm has to be repeated many times. Second, the computational complexity is also related to the quality of the control results of u_0^{K−1}. If u_0^{K−1} is close to the optimal control sequence u_0^{(N−1)*}, then less computation is needed to make (51) hold for each j.
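The inner iterations (42)–(44) can be sketched numerically. The grid-based value iteration below is an illustrative assumption (a scalar linear system, quadratic utility, and a tabulated value function with linear interpolation), not the paper's neural-network implementation:

```python
def interp(V, grid, x):
    """Clamped piecewise-linear interpolation of a tabulated function."""
    if x <= grid[0]:
        return V[0]
    if x >= grid[-1]:
        return V[-1]
    h = grid[1] - grid[0]
    j = min(int((x - grid[0]) / h), len(grid) - 2)
    t = (x - grid[j]) / h
    return (1 - t) * V[j] + t * V[j + 1]

def iterate_until_eps(F, U, x_grid, u_grid, eps=1e-4, max_iter=200):
    """Run V_i(x) = min_u [U(x, u) + V_{i-1}(F(x, u))] from V_0 = 0 until
    max_x |V_i(x) - V_{i-1}(x)| <= eps, mimicking the stopping rule (44)."""
    V = [0.0] * len(x_grid)
    for i in range(1, max_iter + 1):
        V_new = [min(U(x, u) + interp(V, x_grid, F(x, u)) for u in u_grid)
                 for x in x_grid]
        if max(abs(a - b) for a, b in zip(V_new, V)) <= eps:
            return V_new, i          # i - 1 plays the role of l in (44)
        V = V_new
    return V, max_iter

# Toy problem (an assumption): x_{k+1} = 0.5 x_k + u_k with U(x, u) = x^2 + u^2.
x_grid = [-2.0 + 0.05 * k for k in range(81)]
u_grid = [-2.0 + 0.05 * k for k in range(81)]
V, n_iter = iterate_until_eps(lambda x, u: 0.5 * x + u,
                              lambda x, u: x * x + u * u, x_grid, u_grid)
print(n_iter, V[60])   # V at x = 1; the Riccati value for this system is about 1.133
```

On this toy problem the iterates converge geometrically, so the ε test is met after a handful of backups, and the converged value agrees with the algebraic Riccati solution up to discretization error.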
V. SIMULATION STUDY

To evaluate the performance of our iterative ADP algorithm, we choose two examples with quadratic utility functions for numerical experiments.

Example 5.1: Our first example is chosen from [57]. We consider the following nonlinear system:

x_{k+1} = f(x_k) + g(x_k) u_k

where x_k = [x_{1k} x_{2k}]^T and u_k = [u_{1k} u_{2k}]^T are the state and control variables, respectively. The system functions are given as

f(x_k) = [ 0.2 x_{1k} exp(x_{2k}^2) ; 0.3 x_{2k}^3 ],   g(x_k) = [ −0.2 0 ; 0 −0.2 ].

The initial state is x_0 = [1 −1]^T. The performance index function is in quadratic form with finite time horizon, expressed as

J(x_0, u_0^{N−1}) = Σ_{k=0}^{N−1} ( x_k^T Q x_k + u_k^T R u_k )
where Q = R = I and I denotes the identity matrix of appropriate dimensions. The error bound of the iterative ADP algorithm is chosen as ε = 10^{−5}. Neural networks are used to implement the iterative ADP algorithm; the neural network structure can be seen in [32] and [57]. The critic network and the action network are chosen as three-layer backpropagation (BP) neural networks with the structures 2–8–1 and 2–8–2, respectively. The model network is also chosen as a three-layer BP neural network, with the structure 4–8–2. The critic network is used to approximate the iterative performance index functions, which are expressed by (35), (37), (40), (42), (47), (49), (52), and (54). The action network is used to approximate
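As a rough check of this setup, the Example 5.1 plant and its quadratic cost (Q = R = I) can be simulated directly. The zero-control policy below is only a placeholder to exercise the model; the paper's ε-optimal controller is produced by the neural-network ADP procedure:

```python
import math

def step(x, u):
    """One step of the Example 5.1 plant: x' = f(x) + g(x) u."""
    x1, x2 = x
    f = (0.2 * x1 * math.exp(x2 ** 2), 0.3 * x2 ** 3)
    # g(x) = diag(-0.2, -0.2)
    return (f[0] - 0.2 * u[0], f[1] - 0.2 * u[1])

def cost(x0, policy, N):
    """Accumulate x'Qx + u'Ru with Q = R = I over N steps."""
    J, x = 0.0, x0
    for _ in range(N):
        u = policy(x)
        J += x[0] ** 2 + x[1] ** 2 + u[0] ** 2 + u[1] ** 2
        x = step(x, u)
    return J, x

J, xN = cost((1.0, -1.0), lambda x: (0.0, 0.0), 10)
print(J, xN)   # the open-loop state decays toward the origin from this x0
```

Even without control the state decays quickly from x_0 = [1 −1]^T, which is consistent with the small terminal state values reported below for the controlled system.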
C. Summary of the ε-Optimal Control Algorithm

Now, we summarize the iterative ADP algorithm as follows.
Step 1: Choose an error bound ε and choose randomly an array of initial states x_0.
Step 2: Obtain an initial finite-horizon admissible control sequence u_0^{K−1} = (u_0, u_1, ..., u_{K−1}) and the corresponding state sequence x_0^K = (x_0, x_1, ..., x_K), where x_K = 0.
Step 3: For the state x_{K−1} with F(x_{K−1}, u_{K−1}) = 0, run the iterative ADP algorithm (35)–(38) at x_{K−1} until (39) holds.
Step 4: Record V_{l_1}^1(x_{K−1}), v_{l_1}^1(x_{K−1}), and K_ε(x_{K−1}) = l_1.
Step 5: For j = 2, 3, ..., K, if the inequality (46) holds for x_{K−j}, go to Step 7; otherwise, go to Step 6.
Step 6: Using the performance index function V_{l_{j−1}}^{j−1} as the initial condition, run the iterative ADP algorithm (47)–(50) until (51) is satisfied.
Step 7: If j = K, then we have obtained the optimal performance index function V*(x_0) = V_{l_K}^K(x_0), the law of the optimal control sequence u*(x_0) = v_{l_K}^K(x_0), and the number of optimal control steps K_ε(x_0) = l_K; otherwise, set j = j + 1 and go to Step 5.
Step 8: Stop.
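The backward sweep in Steps 3–7 can be sketched numerically. Everything below is an illustrative assumption (a scalar linear system, quadratic utility, grid-tabulated value functions, and a pointwise ε test at each state of the admissible sequence); note also that here l merely counts backups, whereas in the paper l equals the ε-optimal number of control steps because V_l is the l-step optimal cost:

```python
EPS = 1e-4
X_GRID = [-2.0 + 0.05 * k for k in range(81)]
U_GRID = [-2.0 + 0.05 * k for k in range(81)]

def interp(V, x):
    """Clamped piecewise-linear interpolation of V on X_GRID."""
    if x <= X_GRID[0]:
        return V[0]
    if x >= X_GRID[-1]:
        return V[-1]
    j = min(int((x - X_GRID[0]) / 0.05), len(X_GRID) - 2)
    t = (x - X_GRID[j]) / 0.05
    return (1 - t) * V[j] + t * V[j + 1]

def backup(V):
    """One iterative-ADP sweep for x' = 0.5 x + u, U = x^2 + u^2."""
    return [min(x * x + u * u + interp(V, 0.5 * x + u) for u in U_GRID)
            for x in X_GRID]

def backward_sweep(states, eps=EPS):
    """Walk x_{K-1}, ..., x_0, warm-starting each run from the previous
    performance index function (Step 6) and recording a per-state count."""
    V_prev = [0.0] * len(X_GRID)
    V_curr, l = backup(V_prev), 1
    K_eps = {}
    for x in states:
        while abs(interp(V_curr, x) - interp(V_prev, x)) > eps:  # test (46)/(51)
            V_prev, V_curr, l = V_curr, backup(V_curr), l + 1
        K_eps[x] = l
    return K_eps

# Admissible sequence for x_0 = 1: controls (-0.1, -0.2) give states (1, 0.4, 0).
K = backward_sweep([0.4, 1.0])
print(K)   # states farther from the origin need at least as many backups
```

The warm start makes the check at the next state cheap: iterations already performed for x_{K−1} are never repeated for x_{K−2}, ..., x_0.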
WANG et al.: ADAPTIVE DYNAMIC PROGRAMMING FOR DISCRETE-TIME NONLINEAR SYSTEMS
Fig. 2. Simulation results for Example 1. (a) Convergence of performance index function. (b) ε-optimal control vectors. (c) and (d) Corresponding state trajectories.
the optimal control laws, which are expressed by (36), (38), (41), (43), (48), (50), (53), and (55). The training rules of the neural networks can be seen in [50]. For each iterative step, the critic network and the action network are trained for 1000 iteration steps with learning rate α = 0.05, so that the neural network training error becomes less than 10^{−8}. Enough iteration steps should be implemented to guarantee that the iterative performance index functions and the control law converge sufficiently. We let the algorithm run for 15 iterative steps to obtain the optimal performance index function and optimal control law. The convergence curve of the performance index function is shown in Fig. 2(a). Then, we apply the optimal control law to the system for T_f = 10 time steps and obtain the following results. The ε-optimal control trajectories are shown in Fig. 2(b), and the corresponding state curves are shown in Fig. 2(c) and (d). After seven steps of iteration, we have |V_6(x_0) − V_7(x_0)| ≤ 10^{−5} = ε. Then, we obtain the optimal number of control steps K_ε(x_0) = 6. We can see that after six time steps, the state variable becomes x_6 = [0.912 × 10^{−6}, 0.903 × 10^{−7}]^T. The entire computation process takes about 10 s before satisfactory results are obtained.

Example 5.2: The second example is chosen from [62] with some modifications. We consider the following system:
x_{k+1} = F(x_k, u_k) = x_k + sin(0.1 x_k^2 + u_k)    (57)

where x_k, u_k ∈ R and k = 0, 1, 2, .... The performance index function is defined as in Example 5.1 with Q = R = 1. The initial state is x_0 = 1.5. Since F(0, 0) = 0, x_k = 0 is an equilibrium state of system (57). But since ∂F(x_k, u_k)/∂x_k evaluated at (0, 0) equals 1, system (57) is marginally stable at x_k = 0 and the equilibrium x_k = 0 is not attractive. We can see that for the fixed initial state x_0, there does not exist a control u_0 ∈ R that makes x_1 = F(x_0, u_0) = 0. The error bound of the iterative ADP algorithm is chosen as ε = 10^{−4}. The critic network, the action network, and the model network are chosen as three-layer BP neural networks with the structures 1–3–1, 1–3–1, and 2–4–1, respectively.
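Two of these claims about system (57) can be verified numerically: the origin is an equilibrium, and no control reaches it from x_0 = 1.5 in one step, since x_1 = 1.5 + sin(0.1·1.5² + u) ≥ 0.5 for every u. The control grid below is an arbitrary assumption covering more than one full period of the sine:

```python
import math

def F(x, u):
    """System (57): x_{k+1} = x_k + sin(0.1 x_k^2 + u_k)."""
    return x + math.sin(0.1 * x * x + u)

# The origin is an equilibrium of the uncontrolled system.
assert F(0.0, 0.0) == 0.0

# Best one-step distance to the origin from x_0 = 1.5 over u in [-7, 7].
closest = min(abs(F(1.5, k * 0.001)) for k in range(-7000, 7001))
print(closest)   # approaches 0.5, attained where the sine equals -1
```

This is exactly the situation Remark 4.5 addresses: the iterative ADP algorithm must be run at every state of the admissible sequence because no single-step control drives x_0 to zero.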
Fig. 3. Simulation results for Case 1 of Example 2. (a) Convergence of the performance index function at x_k = 0.8. (b) Convergence of the performance index function at x_k = 1.5. (c) ε-optimal control trajectory. (d) Corresponding state trajectory.
According to (57), the control can be expressed as

u_k = −0.1 x_k^2 + sin^{−1}(x_{k+1} − x_k) + 2λπ    (58)
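A quick sanity check of the inversion (58): for any reachable target (|x_{k+1} − x_k| ≤ 1), the recovered control reproduces the desired transition exactly. The transitions 1.5 → 0.8 → 0 below are illustrative:

```python
import math

def F(x, u):
    """System (57): x_{k+1} = x_k + sin(0.1 x_k^2 + u_k)."""
    return x + math.sin(0.1 * x * x + u)

def u_from_target(x, x_next, lam=0):
    # (58): u_k = -0.1 x_k^2 + asin(x_{k+1} - x_k) + 2*lam*pi
    return -0.1 * x * x + math.asin(x_next - x) + 2 * lam * math.pi

u0 = u_from_target(1.5, 0.8)   # -0.225 - asin(0.7), about -1.0004
u1 = u_from_target(0.8, 0.0)   # -0.064 - asin(0.8)
print(F(1.5, u0), F(0.8, u1))  # 0.8 and 0.0 (up to rounding)
```

Substituting (58) back into (57) cancels the 0.1 x_k^2 term, leaving sin(sin^{−1}(x_{k+1} − x_k)) = x_{k+1} − x_k, which is why the recovery is exact.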
where λ = 0, ±1, ±2, .... To show the effectiveness of our algorithm, we choose two initial finite-horizon admissible control sequences.

Case 1: The control sequence is û_0^1 = (−0.225 − sin^{−1}(0.7), −0.064 − sin^{−1}(0.8)) and the corresponding state sequence is x̂_0^2 = (1.5, 0.8, 0). For the initial finite-horizon admissible control sequence in this case, we run the iterative ADP algorithm at the states 0.8 and 1.5, respectively. For each iterative step, the critic network and the action network are trained for 1000 iteration steps with learning rate α = 0.05, so that the neural network training accuracy of 10^{−8} is reached. After the algorithm runs for 15 iterative steps, we obtain the performance index function trajectories shown in Fig. 3(a) and (b), respectively. The ε-optimal control and state trajectories are shown in Fig. 3(c) and (d), respectively, for 10 time steps. We obtain K_ε(0.8) = 5 and K_ε(1.5) = 8.

Case 2: The control sequence is û_0^3 = (−0.225 − sin^{−1}(0.01), 2π − 0.22201 − sin^{−1}(0.29), −0.144 − sin^{−1}(0.5), −0.049 − sin^{−1}(0.7)) and the corresponding state sequence is x̂_0^4 = (1.5, 1.49, 1.2, 0.7, 0). For the initial finite-horizon admissible control sequence in this case, we run the iterative ADP algorithm at the states 0.7, 1.2, and 1.49, respectively. For each iterative step, the critic network and the action network are again trained for 1000 iteration steps with learning rate α = 0.05, so that the neural network training accuracy of 10^{−8} is reached. We then obtain the performance index function trajectories shown in Fig. 4(a)–(c), respectively. We have K_ε(0.7) = 4, K_ε(1.2) = 6, and K_ε(1.49) = 8. After 25 steps of iteration, the performance index function V_i(x_k) has converged sufficiently at x_k = 1.49, with V_8^3(1.49) as the performance index function. For the state x_k = 1.5, we have |V_8^3(1.5) − V_9^3(1.5)| = 0.52424 × 10^{−7} < ε. Therefore,
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
Fig. 4. Simulation results for Case 2 of Example 2. (a) Convergence of the performance index function at x_k = 0.7. (b) Convergence of the performance index function at x_k = 1.2. (c) Convergence of the performance index function at x_k = 1.49. (d) ε-optimal control trajectory and the corresponding state trajectory.
the optimal performance index function at x_k = 1.5 is V_8^3(1.5), and thus we have x_k = 1.5 ∈ T_8^{(ε)} and K_ε(1.5) = 8. The whole computation process takes about 20 s, after which satisfactory results are obtained. We then apply the optimal control law to the system for T_f = 10 time steps. The ε-optimal control and state trajectories are shown in Fig. 4(d). We can see that the ε-optimal control trajectory in Fig. 4(d) is the same as the one in Fig. 3(c), and the corresponding state trajectory in Fig. 4(d) is the same as the one in Fig. 3(d). Therefore, the optimal control law does not depend on the initial control law. The initial control sequence û_0^{K−1} can be chosen arbitrarily as long as it is finite-horizon admissible.

Remark 5.1: If the number of control steps of the initial admissible control sequence is larger than the number of control steps of the optimal control sequence, then some of the states in the initial sequence will possess the same number of optimal control steps. For example, in Case 2 of Example 2, the two states x = 1.49 and x = 1.5 possess the same number of optimal control steps, i.e., K_ε(1.49) = K_ε(1.5) = 8. Thus, the control u = −0.225 − sin^{−1}(0.01) that takes x = 1.5 to x = 1.49 is an unnecessary control step. After the unnecessary control steps are identified and removed, the number of control steps reduces to the optimal number of control steps, and thus the initial admissible control sequence does not affect the final optimal control results.

VI. CONCLUSION

In this paper, we developed an effective iterative ADP algorithm for finite-horizon ε-optimal control of discrete-time nonlinear systems. Convergence of the performance index function for the iterative ADP algorithm was proved, and the ε-optimal number of control steps could also be obtained. Neural networks were used to implement the iterative ADP algorithm.
Finally, two simulation examples were given to illustrate the performance of the proposed algorithm.
REFERENCES

[1] A. E. Bryson and Y.-C. Ho, Applied Optimal Control: Optimization, Estimation, and Control. New York: Wiley, 1975. [2] T. Cimen and S. P. Banks, “Nonlinear optimal tracking control with application to super-tankers for autopilot design,” Automatica, vol. 40, no. 11, pp. 1845–1863, Nov. 2004. [3] N. Fukushima, M. S. Arslan, and I. Hagiwara, “An optimal control method based on the energy flow equation,” IEEE Trans. Control Syst. Technol., vol. 17, no. 4, pp. 866–875, Jul. 2009. [4] H. Ichihara, “Optimal control for polynomial systems using matrix sum of squares relaxations,” IEEE Trans. Autom. Control, vol. 54, no. 5, pp. 1048–1053, May 2009. [5] S. Keerthi and E. Gilbert, “Optimal infinite-horizon control and the stabilization of linear discrete-time systems: State-control constraints and nonquadratic cost functions,” IEEE Trans. Autom. Control, vol. 31, no. 3, pp. 264–266, Mar. 1986. [6] I. Kioskeridis and C. Mademlis, “A unified approach for four-quadrant optimal controlled switched reluctance machine drives with smooth transition between control operations,” IEEE Trans. Autom. Control, vol. 24, no. 1, pp. 301–306, Jan. 2009. [7] J. Mao and C. G. Cassandras, “Optimal control of multi-stage discrete event systems with real-time constraints,” IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 108–123, Jan. 2009. [8] I. Necoara, E. C. Kerrigan, B. D. Schutter, and T. Boom, “Finite-horizon min-max control of max-plus-linear systems,” IEEE Trans. Autom. Control, vol. 52, no. 6, pp. 1088–1093, Jun. 2007. [9] T. Parisini and R. Zoppoli, “Neural approximations for infinite-horizon optimal control of nonlinear stochastic systems,” IEEE Trans. Neural Netw., vol. 9, no. 6, pp. 1388–1408, Nov. 1998. [10] G. N. Saridis and F. Y. Wang, “Suboptimal control of nonlinear stochastic systems,” Control-Theory Adv. Technol., vol. 10, no. 4, pp. 847–871, Dec. 1994. [11] C. Seatzu, D. Corona, A. Giua, and A.
Bemporad, “Optimal control of continuous-time switched affine systems,” IEEE Trans. Autom. Control, vol. 51, no. 5, pp. 726–741, May 2006. [12] K. Uchida and M. Fujita, “Finite horizon H∞ control problems with terminal penalties,” IEEE Trans. Autom. Control, vol. 37, no. 11, pp. 1762–1767, Nov. 1992. [13] E. Yaz, “Infinite horizon quadratic optimal control of a class of nonlinear stochastic systems,” IEEE Trans. Autom. Control, vol. 34, no. 11, pp. 1176–1180, Nov. 1989. [14] F. Yang, Z. Wang, G. Feng, and X. Liu, “Robust filtering with randomly varying sensor delay: The finite-horizon case,” IEEE Trans. Circuits Syst. I, vol. 56, no. 3, pp. 664–672, Mar. 2009. [15] E. Zattoni, “Structural invariant subspaces of singular Hamiltonian systems and nonrecursive solutions of finite-horizon optimal control problems,” IEEE Trans. Autom. Control, vol. 53, no. 5, pp. 1279–1284, Jun. 2008. [16] D. P. Bertsekas, A. Nedic, and A. E. Ozdaglar, Convex Analysis and Optimization. Boston, MA: Athena Scientific, 2003. [17] J. Doyle, K. Zhou, K. Glover, and B. Bodenheimer, “Mixed H2 and H∞ performance objectives II: Optimal control,” IEEE Trans. Autom. Control, vol. 39, no. 8, pp. 1575–1587, Aug. 1994. [18] L. Blackmore, S. Rajamanoharan, and B. C. Williams, “Active estimation for jump Markov linear systems,” IEEE Trans. Autom. Control, vol. 53, no. 10, pp. 2223–2236, Nov. 2008. [19] O. L. V. Costa and E. F. Tuesta, “Finite horizon quadratic optimal control and a separation principle for Markovian jump linear systems,” IEEE Trans. Autom. Control, vol. 48, no. 10, pp. 1836–1842, Oct. 2003. [20] P. J. Goulart, E. C. Kerrigan, and T. Alamo, “Control of constrained discrete-time systems with bounded ℓ2 gain,” IEEE Trans. Autom. Control, vol. 54, no. 5, pp. 1105–1111, May 2009. [21] J. H. Park, H. W. Yoo, S. Han, and W. H. Kwon, “Receding horizon controls for input-delayed systems,” IEEE Trans. Autom. Control, vol. 53, no. 7, pp. 1746–1752, Aug. 2008. [22] A. Zadorojniy and A. 
Shwartz, “Robustness of policies in constrained Markov decision processes,” IEEE Trans. Autom. Control, vol. 51, no. 4, pp. 635–638, Apr. 2006. [23] H. Zhang, L. Xie, and G. Duan, “H∞ control of discrete-time systems with multiple input delays,” IEEE Trans. Autom. Control, vol. 52, no. 2, pp. 271–283, Feb. 2007. [24] R. E. Bellman, Dynamic Programming. Princeton, NJ: Princeton Univ. Press, 1957. [25] P. J. Werbos, “A menu of designs for reinforcement learning over time,” in Neural Networks for Control, W. T. Miller, R. S. Sutton, and P. J. Werbos, Eds. Cambridge, MA: MIT Press, 1991, pp. 67–95.
[26] P. J. Werbos, “Approximate dynamic programming for real-time control and neural modeling,” in Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, D. A. White and D. A. Sofge, Eds. New York: Reinhold, 1992, ch. 13. [27] A. Al-Tamimi, M. Abu-Khalaf, and F. L. Lewis, “Adaptive critic designs for discrete-time zero-sum games with application to H∞ control,” IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 37, no. 1, pp. 240–247, Feb. 2007. [28] S. N. Balakrishnan and V. Biega, “Adaptive-critic-based neural networks for aircraft optimal control,” J. Guidance, Control, Dynamics, vol. 19, no. 4, pp. 893–898, Jul.-Aug. 1996. [29] D. V. Prokhorov and D. C. Wunsch, “Adaptive critic designs,” IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 997–1007, Sep. 1997. [30] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Jun. 2009. [31] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, “Adaptive dynamic programming,” IEEE Trans. Syst., Man, Cybern., Part C: Appl. Rev., vol. 32, no. 2, pp. 140–153, May 2002. [32] F. Y. Wang, H. Zhang, and D. Liu, “Adaptive dynamic programming: An introduction,” IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39–47, May 2009. [33] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof,” IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 38, no. 4, pp. 943–949, Aug. 2008. [34] S. Ferrari, J. E. Steck, and R. Chandramohan, “Adaptive feedback control by constrained approximate dynamic programming,” , IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 38, no. 4, pp. 982–987, Aug. 2008. [35] J. Seiffertt, S. Sanyal, and D. C. Wunsch, “Hamilton-Jacobi-Bellman equations and approximate dynamic programming on time scales,” IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 38, no. 4, pp. 918–923, Aug. 2008. [36] R. 
Enns and J. Si, “Helicopter trimming and tracking control using direct neural dynamic programming,” IEEE Trans. Neural Netw., vol. 14, no. 4, pp. 929–939, Jul. 2003. [37] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996. [38] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998. [39] M. Abu-Khalaf and F. L. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, no. 5, pp. 779–791, May 2005. [40] Z. Chen and S. Jagannathan, “Generalized Hamilton-Jacobi-Bellman formulation-based neural network control of affine nonlinear discretetime systems,” IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 90–106, Jan. 2008. [41] T. Hanselmann, L. Noakes, and A. Zaknich, “Continuous-time adaptive critics,” IEEE Trans. Neural Netw., vol. 18, no. 3, pp. 631–647, May 2007. [42] G. G. Lendaris, “A retrospective on adaptive dynamic programming for control,” in Proc. Int. Joint Conf. Neural Netw., Atlanta, GA, Jun. 2009, pp. 14–19. [43] B. Li and J. Si, “Robust dynamic programming for discounted infinitehorizon Markov decision processes with uncertain stationary transition matrices,” in Proc. IEEE Symp. Approx. Dyn. Program. Reinforcement Learn., Honolulu, HI, Apr. 2007, pp. 96–102. [44] D. Liu, X. Xiong, and Y. Zhang, “Action-dependent adaptive critic designs,” in Proc. IEEE Int. Joint Conf. Neural Netw., vol. 2. Washington D.C., Jul. 2001, pp. 990–995. [45] D. Liu and H. Zhang, “A neural dynamic programming approach for learning control of failure avoidance problems,” Int. J. Intell. Control Syst., vol. 10, no. 1, pp. 21–32, Mar. 2005. [46] D. Liu, Y. Zhang, and H. Zhang, “A self-learning call admission control scheme for CDMA cellular networks,” IEEE Trans. Neural Netw., vol. 16, no. 5, pp. 1219–1228, Sep. 2005. [47] C. Lu, J. Si, and X. 
Xie, “Direct heuristic dynamic programming for damping oscillations in a large power system,” IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 38, no. 4, pp. 1008–1013, Aug. 2008. [48] S. Shervais, T. T. Shannon, and G. G. Lendaris, “Intelligent supply chain management using adaptive critic learning,” IEEE Trans. Syst., Man Cybern., Part A: Syst. Humans, vol. 33, no. 2, pp. 235–244, Mar. 2003. [49] P. Shih, B. C. Kaul, S. Jagannathan, and J. A. Drallmeier, “Reinforcement-learning-based dual-control methodology for complex nonlinear discrete-time systems with application to spark engine EGR operation,” IEEE Trans. Neural Netw., vol. 19, no. 8, pp. 1369–1388, Aug. 2008.
[50] J. Si and Y.-T. Wang, “On-line learning control by association and reinforcement,” IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 264–276, Mar. 2001. [51] A. H. Tan, N. Lu, and D. Xiao, “Integrating temporal difference methods and self-organizing neural networks for reinforcement learning with delayed evaluative feedback,” IEEE Trans. Neural Netw., vol. 19, no. 2, pp. 230–244, Feb. 2008. [52] F. Y. Wang and G. N. Saridis, “Suboptimal control for nonlinear stochastic systems,” in Proc. 31st IEEE Conf. Decis. Control, Tucson, AZ, Dec. 1992, pp. 1856–1861. [53] Q. L. Wei, H. G. Zhang, J. Dai, “Model-free multiobjective approximate dynamic programming for discrete-time nonlinear systems with general performance index functions,” Neurocomputing, vol. 72, nos. 7–9, pp. 1839–1848, Mar. 2009. [54] P. J. Werbos, “Using ADP to understand and replicate brain intelligence: The next level design,” in Proc. IEEE Symp. Approx. Dyn. Program. Reinforcement Learn., Honolulu, HI, Apr. 2007, pp. 209–216. [55] P. J. Werbos, “Intelligence in the brain: A theory of how it works and how to build it,” Neural Netw., vol. 22, no. 3, pp. 200–212, Apr. 2009. [56] H. G. Zhang, Y. H. Luo, and D. Liu, “Neural network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraint,” IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1490–1503, Sep. 2009. [57] H. G. Zhang, Q. L. Wei, and Y. H. Luo, “A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm,” IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 38, no. 4, pp. 937–942, Aug. 2008. [58] C. Watkins, “Learning from delayed rewards,” Ph.D. thesis, Dept. Comput. Sci., Cambridge Univ., Cambridge, U.K., 1989. [59] F. Y. Wang and G. N. Saridis, “On successive approximation of optimal control of stochastic dynamic systems,” in Modeling Uncertainty: An Examination of Stochastic Theory, Methods, and Applications, M. 
Dror, P. Lécuyer, and F. Szidarovszky, Eds. Boston, MA: Kluwer, 2002, pp. 333–386. [60] D. Han and S. N. Balakrishnan, “State-constrained agile missile control with adaptive-critic-based neural networks,” IEEE Trans. Control Syst. Technol., vol. 10, no. 4, pp. 481–489, Jul. 2002. [61] E. S. Plumer, “Optimal control of terminal processes using neural networks,” IEEE Trans. Neural Netw., vol. 7, no. 2, pp. 408–418, Mar. 1996. [62] N. Jin, D. Liu, T. Huang, and Z. Pang, “Discrete-time adaptive dynamic programming using wavelet basis function neural networks,” in Proc. IEEE Symp. Approx. Dyn. Program. Reinforcement Learn., Honolulu, HI, Apr. 2007, pp. 135–142.
Fei-Yue Wang (S’87–M’89–SM’94–F’03) received the Ph.D. degree in computer and systems engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1990. He joined the University of Arizona, Tucson, in 1990, and became a Professor and Director of the Robotics and Automation Laboratory and the Program for Advanced Research in Complex Systems. In 1999, he founded the Intelligent Control and Systems Engineering Center at the Chinese Academy of Sciences (CAS), Beijing, China, with the support of the Outstanding Overseas Chinese Talents Program. Since 2002, he has been the Director of the Key Laboratory of Complex Systems and Intelligence Science at CAS. Currently, he is a Vice-President of the Institute of Automation, CAS. His current research interests include social computing, web science, complex systems, and intelligent control. Dr. Wang is a member of Sigma Xi and an elected Fellow of the International Council on Systems Engineering, the International Federation of Automatic Control, the American Society of Mechanical Engineers (ASME), and the American Association for the Advancement of Science. He was the Editor-in-Chief of the International Journal of Intelligent Control and Systems and the World Scientific Series in Intelligent Control and Intelligent Automation
from 1995 to 2000. Currently, he is the Editor-in-Chief of IEEE Intelligent Systems and the IEEE Transactions on Intelligent Transportation Systems. He has served as Chair of more than 20 IEEE, Association for Computing Machinery (ACM), Institute for Operations Research and the Management Sciences, and ASME conferences. He was the President of the IEEE Intelligent Transportation Systems Society from 2005 to 2007, the Chinese Association for Science and Technology, Pittsburgh, PA, in 2005, and the American Zhu Kezhen Education Foundation from 2007 to 2008. Currently, he is the Vice-President of the ACM China Council and Vice-President/Secretary-General of the Chinese Association of Automation. In 2007, he received the National Prize in Natural Sciences of China and was awarded the Outstanding Scientist Award by ACM for his work in intelligent control and social computing.
Ning Jin (S’06) received the Ph.D. degree in electrical and computer engineering from the University of Illinois, Chicago, in 2005. He was an Associate Professor in the Department of Mathematics at Nanjing Normal University, Nanjing, China. From 2002 to 2005, he was a Visiting Scholar in the Department of Mathematics, Statistics, and Computer Science, University of Illinois. His current research interests include optimal control and dynamic programming, artificial intelligence, pattern recognition, neural networks, and wavelet analysis.
Derong Liu (S’91–M’94–SM’96–F’05) received the Ph.D. degree in electrical engineering from the University of Notre Dame, Notre Dame, IN, in 1994. He was a Staff Fellow with the General Motors Research and Development Center, Warren, MI, from 1993 to 1995. He was an Assistant Professor in the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, from 1995 to 1999. He joined the University of Illinois, Chicago, in 1999, and became a Full Professor of electrical and computer engineering and of computer science in 2006. He was selected for the “100 Talents Program” by the Chinese Academy of Sciences in 2008. Dr. Liu has been an Associate Editor of several IEEE publications. Currently, he is the Editor-in-Chief of the IEEE Transactions on Neural Networks and an Associate Editor of the IEEE Transactions on Control Systems Technology. He received the Michael J. Birck Fellowship from the University of Notre Dame in 1990, the Harvey N. Davis Distinguished Teaching Award from the Stevens Institute of Technology in 1997, the Faculty Early Career Development Award from the National Science Foundation in 1999, the University Scholar Award from the University of Illinois in 2006, and the Overseas Outstanding Young Scholar Award from the National Natural Science Foundation of China in 2008.
Qinglai Wei received the B.S. degree in automation, the M.S. degree in control theory and control engineering, and the Ph.D. degree in control theory and control engineering from Northeastern University, Shenyang, China, in 2002, 2005, and 2008, respectively. He is currently a Post-Doctoral Fellow with the Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His current research interests include neural-networks-based control, nonlinear control, adaptive dynamic programming, and their industrial applications.
Solving Nonstationary Classification Problems with Coupled Support Vector Machines Guillermo L. Grinblat, Lucas C. Uzal, H. Alejandro Ceccatto, and Pablo M. Granitto
Abstract— Many learning problems may vary slowly over time: in particular, some critical real-world applications. When facing this problem, it is desirable that the learning method could find the correct input–output function and also detect the change in the concept and adapt to it. We introduce the time-adaptive support vector machine (TA-SVM), which is a new method for generating adaptive classifiers, capable of learning concepts that change with time. The basic idea of TA-SVM is to use a sequence of classifiers, each one appropriate for a small time window but, in contrast to other proposals, learning all the hyperplanes in a global way. We show that the addition of a new term in the cost function of the set of SVMs (that penalizes the diversity between consecutive classifiers) produces a coupling of the sequence that allows TA-SVM to learn as a single adaptive classifier. We evaluate different aspects of the method using appropriate drifting problems. In particular, we analyze the regularizing effect of changing the number of classifiers in the sequence or adapting the strength of the coupling. A comparison with other methods in several problems, including the well-known STAGGER dataset and the real-world electricity pricing domain, shows the good performance of TA-SVM in all tested situations. Index Terms— Adaptive methods, drifting concepts, support vector machine.
I. INTRODUCTION

IN MANY real-world applications, pattern recognition problems may vary slowly over time. For example, weather conditions under which meteorological alerts should be raised are seasonal, or the state of a critical mechanical system that should trigger an alarm could change with the wear of the machine. In most cases, the underlying causes and characteristics of these slow changes are not evident from the data under analysis. Under such circumstances, it is desirable for the pattern recognition method to be able to learn related but distinct input–output functions at different epochs and, in particular, to have the flexibility to do it in a continuous way, profiting from the slow-drift property and thereby harnessing information from the entire historical database. In the next section, we review some previous works on this topic, which is sometimes called “drifting concepts”
Manuscript received February 16, 2010; revised July 18, 2010; accepted September 22, 2010. Date of publication November 9, 2010; date of current version January 4, 2011. This work was supported in part by the ANPCyT under Grant PICT-2006 643 and Grant 2226. The authors are with CIFASIS, the French Argentine International Center for Information and Systems Sciences, UPCAM (France)/UNR-CONICET (Argentina), Rosario S2000EZP, Argentina (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2010.2083684
[1]–[4]. In this context, some authors distinguish sudden or instantaneous drift from gradual change [5], [6]. As Stanley [5] points out, the two problems present very different challenges. Algorithms appropriate for sudden concept changes [7]–[13] should be fast in detecting the change and react to it in an appropriate way. In gradual drift [4], [14]–[18], on the other hand, there is no need for a rapid reaction, and the interesting problem is how to use the information from the full dataset efficiently. Our method focuses on this latter kind of problem, and in particular on situations with scarce data, but it also works efficiently for problems with a sudden change, as we will show later. Most previous approaches to handling concept drift rely on the use of "local" classifiers, each one fitted or adapted to a particular temporal window of a given length [2], [7]–[9], [19], [20]. As we discuss in Section II, the methods differ in how they select the length of the window, in how they weigh the selected samples, or even in how they use the set of classifiers (some methods keep several classifiers in an ensemble, others use only the classifier corresponding to the current window). Here we present a new approach to this problem: the use of a sequence of classifiers that vary following the concept change, but which are all fitted in a global way. To build the sequence of classifiers, we selected one of the most powerful methods available, the support vector machine (SVM) [21], [22], which we adapted accordingly. As in most previous methods, each SVM in the sequence is trained using data points from only one of a set of consecutive nonoverlapping time windows. The novelty of our method is that the classifiers in the sequence are not independent: we solve all the SVMs at the same time, using a coupling term that forces time neighbors to be similar to each other.
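The partition into consecutive nonoverlapping windows can be made concrete with a small sketch. The function name and the roughly equal-size split below are our own illustration (the paper defers the exact window assignment to its formulation in Section III):

```python
# Minimal sketch (our own illustration): map each time-ordered sample
# index i to the window whose local SVM will be trained on it.
def window_of(i: int, n: int, m: int) -> int:
    """0-based index of the window owning sample i, for n time-ordered
    samples split into m consecutive nonoverlapping windows."""
    assert 0 <= i < n and 1 <= m <= n
    return min(i * m // n, m - 1)

# Example: 10 samples, 3 windows -> roughly equal windows of sizes 4, 3, 3.
n, m = 10, 3
assignment = [window_of(i, n, m) for i in range(n)]
```

Because consecutive samples land in the same or the next window, the coupling term of the method only ever has to compare classifiers of adjacent windows.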
In our method, the interval of validity of each classifier can be as small as needed to follow the change in the concept, but with reduced overfitting, because the classifiers are trained to minimize a global measure of the error instead of being adjusted locally. In a previous work [23], we introduced a limited version of this method and showed its potential using an artificial drifting problem. In this paper, we describe an extended version of our algorithm that can use fewer classifiers than points in the dataset,¹ producing more robust and efficient solutions. Based on the ideas of [4], we evaluate the new method in three different settings for drifting concepts: estimation, prediction, and extrapolation.

¹The previous version was limited to using one SVM for each point in the training sequence.
1045–9227/$26.00 © 2010 IEEE
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
In the estimation task, we train a sequence of classifiers using a given dataset, and then we test the sequence of classifiers on a new dataset equivalent to the training one (involving the same time span as the training set). This estimation task is appropriate, for example, for the analysis of a slow drifting problem with a cyclic behavior [24]. In this case, it is important to model accurately the entire time span of the dataset, not only the last section. Our method is particularly aimed at this task, in which one can use not only information from past records but also from records corresponding to a future time. For the prediction and extrapolation tasks, we train a sequence of classifiers on a section of a dataset and then test it on the following section. In the prediction task, we evaluate each sequence using only the next point in the dataset. In the more difficult extrapolation task, we use a subset of data points extending several steps ahead into the future. In this case, we need to extrapolate the position of the decision boundary, for which we use a simple linear technique. The objective in these two tasks is to make short-term predictions on a system that is evolving in a completely unknown way. In both cases, the evaluation puts emphasis on the performance of the last classifier(s) in the sequence. We discuss the three settings again in Section IV.

This paper is organized as follows. In Section II, we discuss previous works on concept drift. In Section III, we introduce our solution to the problem and show illustrative examples, leaving mathematical details to Appendix A. We also discuss the relation with similar solutions in other areas. Then, in Section IV, we present empirical results and comparisons with similar methods using artificial and real-world datasets. Finally, Section V closes the work with some remarks and conclusions.

II. PREVIOUS WORK

Drifting concepts are specific classification problems in which the labels of examples or the shape of the decision boundaries change over time. In particular, we are interested in problems that change slowly over time. In a recent work, Kolter et al. [25] include a lengthy review of the state of the art in the field, starting from the early work of Schlimmer and Granger Jr. [26]. Accordingly, in this section we limit ourselves to briefly describing most of the previous methods and refer the interested reader to [25] for details and further references. There are three main approaches to concept drift in the literature: sample selection, sample weighting, and ensemble methods [5], [6]. As we stated before, the most common solution to the drifting concepts problem is to use a temporal window of a given length, also called a sliding window (SW), and to build a different classifier (or adapt a previous one) for each window [2], [7]–[9], [19], [20]. Some authors prefer to use the equivalent idea of uniform or stationary "batches" [20], [27], [28]. If the window is too big, the response time needed by the algorithm to follow the changes is excessive. On the contrary, when the window is too small, the algorithm adapts
quickly to any drift in the data, but it is also more sensitive to noise and loses accuracy because it must learn the input–output relationship from only a few examples. As a potential solution, many algorithms include an adaptive window size. One of the first to do so was FLORA2 [8]. Klinkenberg and Renz [19] presented an algorithm that modifies the number of stationary batches in the dataset by monitoring the accuracy, recall, and precision of the method. They applied it to the problem of detecting relevant documents in a series. Klinkenberg and Joachims [2] used SVMs to find the optimal time interval. The method adjusts SVMs with various window sizes, calculates the corresponding ξα-estimator [29] using the last batch, and keeps the window size that minimizes that quantity. Castillo et al. [27] and also Lanquillon [20] use statistical quality control to determine whether there is a concept change in a given batch. When this happens, a new classifier is constructed from scratch using only the data points considered to belong to the new context. Koychev et al. [30] also used a (different) statistical test to determine whether there was a change in concept in the last batch. In an interesting series of papers, Alippi and Roveri [12], [13], [31] developed another test for concept change and an adaptive classifier based on nearest neighbors. In a work focused on recurrent systems, Koychev [32] proposed to use a relatively small time window to learn the current context, then to select the past episodes that show a high predictive accuracy, and finally to retrain the classifier using the original and the newly selected data points. In a similar way, Maloof et al. [33] introduced a method for the selection of examples in a partial-memory learning system. They select some extreme examples and add them to the current ones to model the actual concept description. FLORA3 [8] also takes recurrence into account.
In general, all these methods select an appropriate subset of the original dataset to train independent classifiers that are, each one, accurate at the corresponding time. In most cases, the selected subset is taken from the most recent examples. As a representative of sample selection methods in the comparisons we present in Section IV, we chose a simple SW, but with the length of the window optimized using independent validation sets. In an early work, Koychev [34] proposed to decrease the importance of old examples in the classifier simply by giving each data point a relative weight that decreases with time. The method, called gradual forgetting (GF), can be viewed as a softening of the sliding-window (SW) strategy, which gives "hard" (0/1) weights to (older/newer) examples. The author suggests using a simple linearly decreasing function for the relative weight. Klinkenberg [28] used an exponentially decaying function to weight older samples. The GF method is simple and easy to implement, and usually gives better results than SW. We also included (linear) GF in the comparisons in Section IV as a representative of sample weighting methods. Several authors have discussed the use of ensemble methods for drifting concepts. The streaming ensemble algorithm [10] fits an independent classifier to each batch of data; the classifiers are combined into a fixed-size ensemble using a heuristic replacement strategy. Wang et al. [35] used a similar strategy,
GRINBLAT et al.: SOLVING NONSTATIONARY CLASSIFICATION PROBLEMS WITH COUPLED SVMs
weighting each member of the ensemble according to its accuracy on the current batch. Gao et al. [36] applied ensembles to the task of classifying data streams, and Hashemi et al. [37] used evolving one-versus-all multiclass classifiers for the same problem. Polikar and co-workers [24], [38], [39] developed an ensemble-based framework for different nonstationary problems. Recently, Kolter and Maloof [25], [40] introduced an improved ensemble method based on a previous work of Littlestone et al. [41]. Their dynamic weighted majority algorithm (DWM) dynamically creates and removes weighted experts in response to changes in performance. DWM uses four mechanisms to cope with concept drift: it trains, weights, or removes learners based on their individual performance, and it also adds new experts based on the global performance of the ensemble. The authors produced an extensive evaluation and concluded that DWM outperformed most of the other learners considered in their work. DWM is also included in the evaluation in Section IV as the representative of ensemble methods. Finally, out of the scope of our work, we mention that the drifting problem has also been addressed from a computational learning theory point of view [4], [14], [16], [17], where some guarantees and theoretical bounds regarding the learning of sequences of functions were established.

III. TA-SVM

Let us assume that we have a dataset [(x_1, y_1), ..., (x_n, y_n)], where each pair (x_i, y_i) was obtained at time i (that is, the pairs are time ordered), x_i is a vector in a given vector space, y_i = ±1, and the relation between x and y changes slowly in time. Our strategy to cope with this problem is to divide the dataset into m consecutive nonoverlapping time windows tw_ν (with ν = 1, ..., m and m ≤ n), and to create a coupled sequence of m (static) classifiers, each one being optimal in the corresponding time window.
As we are assuming that the concept evolves slowly, we expect the classifiers to have the same property. Accordingly, we seek a sequence of good classifiers in which time neighbors are similar to each other. The best solution to our problem should be a compromise between (individual) optimality and (neighbor) similarity. If we can define a simple distance measure d(c_ν, c_µ) to quantify the diversity between two neighboring classifiers c_ν and c_µ, the basic idea of our method is to minimize a two-term cost function

\min \; \frac{1}{m}\sum_{\mu=1}^{m} \mathrm{Err}_\mu^2 \;+\; \gamma \sum_{\mu=1}^{m-1} d(c_\mu, c_{\mu+1}) \qquad (1)
where the first term is the average of the usual cost function for each of the m classifiers and the second evaluates the total difference along the sequence of discriminant functions. The free parameter γ regulates the compromise between both terms, as in any regularized fitting. In principle, this method can be used with any classifier, provided an appropriate distance measure can be defined. In this formulation, we use linear SVMs as classifiers (as usual, kernels can be used to produce nonlinear predictors if needed). Therefore, we look for a sequence of m pairs (w, b), each one defining a high-margin
hyperplane h_ν given by w_ν · x + b_ν = 0, where x belongs to the dataset's vector space. We use a simple quadratic distance measure to quantify the diversity between hyperplanes, d(h_ν, h_µ) = ||w_ν − w_µ||² + (b_ν − b_µ)². Applying this measure to (1), we can introduce a new cost function for the full sequence of SVMs

\min \; \frac{1}{m}\sum_{\mu=1}^{m} \|w_\mu\|^2 \;+\; C \sum_{i=1}^{n} \xi_i \;+\; \frac{\gamma}{m-1} \sum_{\mu=1}^{m-1} d(h_\mu, h_{\mu+1}) \qquad (2)

subject to

\xi_i \geq 0, \qquad y_i \left( w_{\mu(i)} \cdot x_i + b_{\mu(i)} \right) - 1 + \xi_i \geq 0

where i = 1, ..., n and µ(i) indicates the time window including point x_i. The first two terms in (2) correspond to the usual margin and error penalization terms in SVM [42], but for a complete set of classifiers, each one trained on a different time window. It is easy to see that the solution of this two-term problem gives the same sequence of SVMs that can be obtained by solving each SVM individually (if we use the same C for all SVMs). The last term in (2) corresponds to the new diversity penalization. The inclusion of this term couples the sequence, making each SVM dependent on all the others. The free parameter γ regulates the relative cost of the new term. Low γ values almost decouple the sequence of classifiers, allowing for increased flexibility. High γ values, on the other hand, produce a sequence of nearly identical SVMs. In this formulation, we have only considered the case in which data points arrive at regular time intervals. The more general case of nonconstant intervals (including missing data or data coming in bursts) can be addressed with simple extensions, for example, by giving different relative weights to the distances considered in the second term of (1), or by assigning different numbers of points to each hyperplane (see Appendix A). It is interesting to see that this formulation is valid even for time windows including only one point (m = n), because the coupling introduced by the new penalization term removes the indeterminacy of having only one point to define a hyperplane. As we show in detail in Appendix A, by deriving the corresponding dual (as usual in SVM methods) we can rephrase the problem in (2) as

\max_{\alpha} \; -\frac{1}{2} \alpha^{T} R \alpha + \sum_{i} \alpha_i

subject to

0 \leq \alpha_i \leq C, \qquad \sum_{i} \alpha_i y_i = 0

where α_i are the Lagrange multipliers and R is a matrix with kernel properties. The solution to this maximization problem is a coupled set of SVMs that evolve in time, which we call time-adaptive SVMs (TA-SVMs).
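The effect of the coupled cost (2) can be illustrated with a small primal sketch. This is not the paper's dual solver (which works through the kernel matrix R of Appendix A); it is a plain subgradient-descent approximation of (2), with the function name, learning rate, and stopping rule being our own assumptions:

```python
import numpy as np

def ta_svm_subgradient(X, y, m, C=1.0, gamma=1.0, lr=1e-3, epochs=500, seed=0):
    """Sketch only: minimize the primal coupled cost
    (1/m) sum_mu ||w_mu||^2 + C sum_i hinge_i
        + gamma/(m-1) sum_mu (||w_mu - w_mu+1||^2 + (b_mu - b_mu+1)^2)
    by subgradient descent over a sequence of m linear classifiers."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(m, d))
    b = np.zeros(m)
    mu = np.minimum(np.arange(n) * m // n, m - 1)   # window of each sample
    for _ in range(epochs):
        gW = (2.0 / m) * W
        gb = np.zeros(m)
        # hinge-loss subgradient: each point pulls only its own window's SVM
        margins = y * (np.einsum('ij,ij->i', W[mu], X) + b[mu])
        for i in np.nonzero(margins < 1)[0]:
            gW[mu[i]] -= C * y[i] * X[i]
            gb[mu[i]] -= C * y[i]
        # coupling term: consecutive classifiers attract each other
        if m > 1:
            coef = 2.0 * gamma / (m - 1)
            diffW = W[1:] - W[:-1]
            diffb = b[1:] - b[:-1]
            gW[:-1] -= coef * diffW
            gW[1:] += coef * diffW
            gb[:-1] -= coef * diffb
            gb[1:] += coef * diffb
        W -= lr * gW
        b -= lr * gb
    return W, b
```

Setting gamma near zero decouples the m problems into independent SVM-like fits, while a large gamma drags all (w_µ, b_µ) toward a single static hyperplane, mirroring the behavior described for the dual formulation.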
The time complexity of TA-SVM is similar to that of the plain SVM. It can be analyzed in two stages: the kernel computation and the solution of the optimization problem. The new kernel can be computed in O(n²), as we show in Appendix B. What is left after this is solving a conventional SVM optimization problem, which is also O(n²) using, for example, sequential minimal optimization (SMO) [43]. Overall, TA-SVM, in its basic form, has the same scaling problems as the plain SVM. For the estimation task this is not critical, as one usually needs to solve the problem only once. But basic TA-SVM does not work by updating previous solutions; it always looks for a new global solution. Therefore, for the prediction and extrapolation tasks our method requires solving an O(n_t²) problem each time a new batch arrives, where n_t is the total number of instances at time t.

A. Connections with Other Areas

There are connections between TA-SVM and methods from other areas, so it is interesting to discuss them at this point. In online learning [41], [44], [45], the objective is to learn the correct input–output relationship as fast as possible from data that arrive sequentially, one instance at a time. Algorithms for online learning mostly work by making updates from the solution at the previous step. Many concept drift methods, including for example the ensemble methods or the FLORA series described before, also update their classifiers as a function of the last batch. Some of these methods can use very short batches or even learn from one point at a time, making them closer to online learning strategies. The main difference between the two settings is that online learning does not necessarily track concept changes, and therefore in most cases no effort is made to forget out-of-date information. In particular, algorithms for sudden drift are more related to online learning, given their common objective of fast convergence to the optimal solution.
In a recent work, [46] introduced an online learning algorithm for kernel methods, the passive-aggressive (PA) algorithm. At each step, PA modifies its solution trying to give the right label to the newly arrived instance while keeping the solution close to that of the previous step by using a coupling term, much as in our formulation (2). The general idea of keeping a tradeoff between local accuracy and smoothness of the time evolution of the solution is similar to ours, but with two main differences. The first is determined by the different objectives of the methods. TA-SVM is able to use information from both past and future times (when available), because it fits (offline) the full sequence of SVMs at the same time, while PA looks for the best current solution using only past information. This becomes particularly relevant for the estimation task, as defined before. The second difference is that TA-SVM actively looks for a maximum-margin solution, while PA, because of its passive nature, does not update its solution if the margin of new points is greater than a given value. Other authors have also used the idea of constraining the possible updates of the current solution [47], [48]. For example, [49] projects the update onto a convex set, which allows establishing shifting bounds on the total loss for some general additive regression algorithms.
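For reference, the PA update mentioned above can be sketched in a few lines. This follows the standard PA-I rule with aggressiveness cap C from the online-learning literature; it is the algorithm the text compares against, not part of TA-SVM, and the function name is ours:

```python
import numpy as np

def pa1_update(w, x, y, C=1.0):
    """One passive-aggressive (PA-I) step: stay passive when the margin
    is already >= 1, otherwise take the smallest corrective step (capped
    by C) that moves the margin back toward 1 -- the stay-close-to-the-
    previous-solution idea discussed in the text."""
    loss = max(0.0, 1.0 - y * float(np.dot(w, x)))
    if loss == 0.0:
        return w                        # passive: point already well classified
    tau = min(C, loss / float(np.dot(x, x)))
    return w + tau * y * x              # aggressive: minimal update
```

Note how the update uses only the newly arrived point and the previous weight vector, in contrast to TA-SVM's global, offline fit of the whole sequence.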
TA-SVM is in fact more closely related to the area of multiple task learning (MTL) [50], [51]. MTL algorithms are designed to learn several related tasks at the same time, profiting from the relevant information available in the different tasks. Our method makes use of the same idea. In TA-SVM, each of the related tasks is the classification problem corresponding to a given time window. As we are assuming a gradual drift, time-neighbor tasks should be similar to each other. Recently, [52] introduced a framework for MTL with kernel methods. In their formulation, the relation between the different tasks can be regulated using an appropriate homogeneous quadratic function. TA-SVM can be viewed as a particular case of this formulation, using the first and third terms of (2) as the regularizer. As we mentioned in the introduction, we use three different tasks to evaluate TA-SVM, namely estimation, prediction, and extrapolation. The estimation task is more related to the MTL problem, giving the same importance to all subproblems, while the prediction and extrapolation tasks are closer to the online learning problem and its emphasis on the current hypothesis.

B. Illustrative Example

As a first example of the potential of TA-SVM, we apply it to the artificial sliding Gaussians dataset. This is a two-class problem, in which each class is sampled from a Gaussian distribution. Both classes drift together, following a sinusoidal trajectory in a 2-D input space. We generated n = 500 points according to

x_i = \left( \frac{2 i \pi}{500} - \pi + 0.2 y_i + \varepsilon_1, \;\; \sin\!\left( \frac{2 i \pi}{500} - \pi + 0.2 y_i \right) + \varepsilon_2 \right)
where i = 1, ..., 500, ε_1 and ε_2 are sampled from a normal distribution with zero mean and σ = 0.1, and y_i is a balanced random sequence of ±1. Fig. 1 shows a realization of the dataset at three different times. We used the first 450 points as the training set and generated in each case a second realization of 450 points to use as a validation set, in order to select the optimal values of γ, C, and the length l of the window used by SW.

[Fig. 1. Sliding Gaussians dataset at (a) t = 25, (b) t = 175, and (c) t = 475 time units. In each figure, the last 25 generated points are filled.]

Fig. 2 shows the sequence of hyperplanes obtained with TA-SVM (m = n) and SW-SVM. In this latter case, for each point x_i we trained an SVM using a specific time window of length 2l + 1 centered on that point (when this is not possible, here and in all other experiments we used l points from one side and all the available points from the other). To improve the readability of the figure, we show only one in every ten consecutive SVMs. It is evident that the coupled solution of TA-SVM produces a more regular, less noisy sequence of classifiers than the use of independent optimal SWs.

[Fig. 2. Sequence of hyperplanes obtained as a solution of the sliding Gaussians dataset with (a) TA-SVM and (b) SW-SVM.]

C. Dependence on γ

As a second demonstrative example, we evaluated the dependence of TA-SVM solutions on γ. In this case, we used the rotating hyperplane dataset, which is a set of 500 points sampled from a uniform distribution in a d-dimensional hypercube [−1, 1]^d [35], [53], [54]. The decision boundary for the two classes is a slowly rotating hyperplane (passing through the origin). The direction of the hyperplane is defined by its normal vector v, which in this experiment follows the law

v_1(i) = \frac{\cos(2 \pi i / 500)}{10}, \qquad v_2(i) = \sin(2 \pi i / 500), \qquad v_{3,\ldots,d}(i) = 0.

Point x_i has class y_i = sign(x_i · v(i)). In this first experiment with this dataset, we used d = 2, m = n, and three different γ values that show the typical responses of our method. The C parameter was set to 1 in all cases, because we checked that the solutions are almost independent of C in this problem. In Fig. 3, we plot the real angle between v and the first axis as a function of time. We also show the solutions found by TA-SVM for three values of γ. Using a low γ (∼10², dashed line), the TA-SVM solution is too flexible, following particularities of the training set. For an adequate mid-γ value (∼10⁵, dash-dotted line), there is an optimal solution, with a balance between local accuracy and global flexibility. Last, for a high γ (∼10⁸, dotted line), the change over time of the hyperplane is highly penalized; therefore, it remains almost constant and similar to the solution that can be found by a classical SVM. For this particular dataset, as v does a complete turn, both classes are almost uniformly distributed and the optimal static solution is a null vector. The soft and erratic trajectory of TA-SVM corresponds to the angle of a nearly null vector in this case.

[Fig. 3. TA-SVM solutions for the rotating hyperplane problem using different γ values: real angle vs. time i (0–500), with the low-γ, mid-γ, and high-γ solutions.]

D. Dependence on m

In the previous example, we used the maximum flexibility of TA-SVM, m = n, and regularized it by optimizing the value of γ. TA-SVM has another simple way to control its complexity, namely, using a shorter sequence of SVMs (m < n), with the added advantage of a reduced computational burden. In Fig. 4, we show the evaluation of this possibility. Again, we used 100 realizations of the rotating hyperplane dataset, with d = 2. We considered three settings: one classifier for each training data point (m = n, full line), one classifier for every two data points (m = n/2, dotted line), and one for every eight points (m = n/8, dashed line). In (a), we show the corresponding results as a function of γ. The first observable result is that the optimal values of γ decrease when using fewer classifiers. This is easy to explain considering that shorter sequences are naturally less flexible (simply because there are fewer hyperplanes and hence fewer
[Fig. 4. Test errors for the rotating hyperplane problem as a function of γ. (a) Results using d = 2 and different values of m. (b) Same as before, but for a noisy dataset with 10% flipped labels. Horizontal lines mark the SW-SVM errors for l = 8, 16, 32, and 64.]
adjustable parameters) and thus require a lower diversity penalization. A second observation is that the best result is obtained with m = n and that, in this case, the use of fewer SVMs produces a small decrease in performance. This result is a consequence of using a noiseless dataset, as will become clear when analyzing panel (b) of the same figure. Horizontal lines represent the error rates produced by SW-SVM with different lengths (l). It is interesting to note that in all cases there is a wide range of γ values for which TA-SVM outperforms the results of the optimal l. We repeated the full experiment using a noisy dataset, in which 10% of the labels were randomly switched. We show the corresponding results in Fig. 4(b). Qualitatively, the results are similar to those for the noiseless dataset: TA-SVM results are better than SW-SVM over a wide range of γ values. The only difference is that in this noisy case the best performance is obtained using eight points per hyperplane (m = n/8, dashed line). Here, the higher flexibility of the m = n models allows them to learn some noisy characteristics of the datasets, and the problem cannot be avoided using a stronger coupling. Using more points per hyperplane allows TA-SVM to filter some noise locally, at each SVM, thereby improving the performance of the sequence.

E. STAGGER

As a last example, we applied TA-SVM to the most widely used benchmark for drifting concepts methods [8], [30], [32]–[34],
i.e., the STAGGER dataset [26]. The dataset has three categorical inputs, each one taking three possible values. The dataset has 120 training instances, and the concept changes abruptly two times, once every 40 instances. This is a particularly challenging problem for TA-SVM, because in this case there are only sudden drifts of the concept. With this dataset, we demonstrate the capabilities of our method in the most unfavorable situation. We first generated the training sequence of 120 data points, each time sampling with replacement from the set of 27 possible instances and labeling each point with the right concept for each time step. As in [8], we generated a similar test sequence but with 100 points at each of the 120 time steps. We also generated a third sequence with 100 points at each time step to use as a validation sequence.² At each time step i, we trained a sequence of classifiers using x_{1,...,i} and used the last classifier in the sequence to predict each of the 100 test points at time i (i.e., the prediction setting). For both methods, we used one SVM per training instance (m = n) and the independent validation set to select the optimal γ and l values for the full sequence, as in the previous datasets. Again, we used a fixed C value in this noiseless dataset, because we verified that both methods are nearly independent of C in this case as well. The full experiment was repeated 100 times. In Fig. 5, we show the average accuracy of both methods as a function of time. In both (a) and (b), the two vertical dashed lines correspond to the concept changes. In (a), we show the performance of independent SWs for three different lengths. For a short window, there is a quick response to changes, but there is also a lack of information about the concept. On the contrary, for the biggest window the adaptation times are longer, but the final performance is better. The optimal length is a compromise between both situations.
In (b), we compare the optimal settings for both methods. TA-SVM is equivalent to or better than independent SW-SVM even on this (most unfavorable) dataset.

IV. EMPIRICAL EVALUATION

In this section, we compare the new method with other state-of-the-art strategies for drifting problems under the three settings discussed in the introduction: prediction, estimation, and extrapolation. We use the same artificial datasets and the STAGGER concepts introduced in the previous section, and the real-world electricity pricing dataset [55]. Unless stated otherwise, we report the mean classification error, with its standard deviation, over 100 independent realizations of the training sets. In all cases, we use independent validation sets to optimize the internal parameters of the methods under evaluation. We always use the same realizations of the training, validation, and test sets for all the methods.

²The use of a validation sequence is not typical in the previous literature on the STAGGER dataset. The setting we use in this example is therefore not comparable with previous works. We use it as a demonstration of the capabilities of TA-SVM in its optimal setting. In Section IV, we use this dataset to evaluate TA-SVM in the standard setting.
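The STAGGER stream described above can be reproduced with a short generator. The attribute values and the three target concepts below are the standard ones from the STAGGER literature [26] (the paper itself does not restate them here), and the function name is ours:

```python
import itertools
import random

SIZES = ("small", "medium", "large")
COLORS = ("red", "green", "blue")
SHAPES = ("square", "circular", "triangular")

# The three STAGGER target concepts, each active for 40 time steps.
CONCEPTS = (
    lambda s, c, h: s == "small" and c == "red",
    lambda s, c, h: c == "green" or h == "circular",
    lambda s, c, h: s in ("medium", "large"),
)

def stagger_sequence(n_steps=120, seed=0):
    """Sketch of the STAGGER training stream: at step i, sample (with
    replacement) one of the 27 possible instances and label it with the
    concept active at that step (sudden drift every 40 steps)."""
    rng = random.Random(seed)
    instances = list(itertools.product(SIZES, COLORS, SHAPES))
    stream = []
    for i in range(n_steps):
        s, c, h = rng.choice(instances)
        concept = CONCEPTS[min(i // 40, 2)]
        stream.append(((s, c, h), 1 if concept(s, c, h) else -1))
    return stream
```

Because the concept switches abruptly at steps 40 and 80, this stream exercises exactly the sudden-drift behavior that the text describes as the most unfavorable case for TA-SVM.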
[Fig. 5. Results for the STAGGER dataset. (a) Behavior of independent SVMs using overlapping sliding time windows (optimal, small, and big windows). (b) Direct comparison of the optimal settings of both methods (TA-SVM vs. sliding windows).]

[Fig. 6. Prediction test errors as a function of the dataset dimension for the rotating hyperplane problem (TA-SVM, GF-SVM, SW-SVM, and DWM-SVM).]
A. Prediction

In the prediction task, we are given a subset of data points spanning some period of time, and our goal is to predict the next arriving data point (which should have no drift from the last one). To evaluate the different methods in this case, we use the same settings as for the STAGGER dataset in the previous section. We compare TA-SVM with the three other methods described in Section II (SW, GF, and DWM), in all cases using linear SVMs as classifiers. In a first evaluation we use the rotating hyperplane dataset, but with a uniform rotation in this case: v_i = (cos(2πi/500), sin(2πi/500)). In all cases, we generated 100 training sets with 500 points, and for each time i and each training set we generated an independent validation and test sequence with 100 points. We use different values of d in order to evaluate the performance of the methods in high-dimensional spaces. For each training set (and all the corresponding validation and test sets), we applied a random rotation of the original space, to produce a dataset in which all the variables are relevant to the concept. We use one classifier per data point (the most flexible setup) for the four methods. For TA-SVM, this means that we set m = n (which may not be optimal, as we showed in Fig. 4). In the case of SW-SVM and GF-SVM, for each point x_i we fit an SVM using a specific time window of length 2l + 1 centered on that point. It is worth mentioning that in this case consecutive time windows are almost completely overlapping. For DWM-SVM,
this is the default setting, in which the ensemble is updated at each time step. In Fig. 6, we compare the performance of all methods as a function of the number of dimensions in the dataset. SW and GF are the best methods for d = 2, probably because of some small overfitting due to the nonoptimal setting of TA-SVM. On the other hand, they are clearly more affected by the increase in the number of dimensions. This can be explained considering that, even if SVMs are known to work well in high-dimensional spaces, there are always more chances of producing solutions with bad generalization in this case. SW and GF can use bigger time windows (that is, more training points) to increase their performance, but this has the added cost of allowing a bigger concept change within the considered window. TA-SVM can deal with this problem more efficiently, because it searches for a global solution, sharing information among all classifiers as a consequence of the coupling. DWM starts with a relatively low performance, but it is also less affected by the "curse of dimensionality" than SW and GF, probably because it also shares information among the sequence of classifiers, as we discussed before. We repeated the evaluation using the rotating Gaussians dataset. This dataset is very similar to the previous one. The main difference is that the classes are sampled not from uniform distributions at each side of the hyperplane but from normal distributions centered at +v_i/2 and −v_i/2, both with σ = 0.3. The optimal solution of this problem is more difficult for SVMs because it has some class overlap and fewer points on the decision boundary. We consider two scenarios here. In the first one, all other settings of the problem are equal to the previous case. The corresponding results are shown in Fig. 7(a). Qualitatively, the behavior of all four methods is the same as on the previous dataset.
There is a bigger gap for low-dimensional problems and also a bigger decay of the performance of SW and GF for high-dimensional datasets. In the second scenario, we considered a faster drift of the classes, including two full turns of the Gaussians around the origin at the same time. In this case, we generated sequences of 500 points using vi = (cos(2πi /250), sin(2πi /250)).
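The drifting boundaries above are easy to reproduce. The following sketch generates a 2-D rotating-hyperplane sequence with the schedule vi = (cos(2πi/period), sin(2πi/period)); the uniform sampling region [−1, 1]² and all function names are our assumptions, since the paper does not publish code:

```python
import math
import random

def rotating_hyperplane(n=500, period=500, seed=0):
    """Generate a rotating-hyperplane drift sequence (illustrative sketch).

    At time step i the class boundary is the hyperplane through the origin
    with normal v_i = (cos(2*pi*i/period), sin(2*pi*i/period)); each point
    is drawn uniformly in [-1, 1]^2 and labeled by the sign of its
    projection onto v_i. Returns a list of (time, point, label) triples.
    """
    rng = random.Random(seed)
    data = []
    for i in range(n):
        # normal vector of the drifting decision boundary at time i
        vx = math.cos(2 * math.pi * i / period)
        vy = math.sin(2 * math.pi * i / period)
        x = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        y = 1 if vx * x[0] + vy * x[1] >= 0 else -1
        data.append((i, x, y))
    return data
```

With period = 500 the boundary completes one full turn over the 500 points; period = 250 gives the faster two-turn drift described above. The rotating Gaussians variant would only change the sampling of x (normal distributions centered at ±v_i/2 instead of a uniform box).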
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011

Fig. 7. Prediction test errors as a function of the dataset dimension for the rotating Gaussians problem (a) for a dataset including a full turn of the Gaussians and (b) for the same problem, but including two full turns of the classes, i.e., a faster drift.

Fig. 8. Average prediction accuracy of the methods tested in this paper on the STAGGER dataset, as a function of time. Error bars show the 95% confidence interval.

In Fig. 7(b), we show the corresponding results. It is easy to see that there is a general increase in error levels, associated with the faster drift of the classes. Again, SW and GF are better than the other methods for low-dimensional datasets. TA-SVM still outperforms the other methods for high-dimensional datasets, but DWM-SVM does not work well in any situation in this case.

We also evaluated the STAGGER dataset in the prediction task. In this case, we followed as closely as possible the settings used in previous works with this dataset. Accordingly, we do not use an external validation sequence; we only use the training and test sequences described in the previous section. We replace the external validation with an internal fourfold cross-validation on the training sequence available at each time step in order to set the optimal values of the free parameters of all the methods.

In Fig. 8, we show the average prediction accuracy as a function of time. TA-SVM shows the fastest response to concept drift, but after that DWM-SVM shows a better convergence to the optimal decision. The bias toward a continuous drift in TA-SVM is reflected in a slower convergence rate in this case. In Table I, we show the same results averaged over time. Overall, TA-SVM shows a very good performance in this problem involving sudden concept drift. DWM-SVM slightly outperformed the new method, but most of the difference arises in the first concept, before any concept drift.

TABLE I
PREDICTION ACCURACY ON THE STAGGER DATASET, AVERAGED OVER REALIZATIONS AND TIME, FOR THE METHODS TESTED IN THIS PAPER. IN PARENTHESES WE SHOW THE STANDARD DEVIATION OF THE MEAN ACCURACIES

Method    Accuracy (%)
SW-SVM    87.14 (0.12)
GF-SVM    87.67 (0.12)
TA-SVM    90.00 (0.10)
DWM-SVM   90.83 (0.12)
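The per-point time windows used by SW-SVM in this section, and the time-decaying example weights of a GF-style scheme, can be sketched as follows. The function names are ours, and the exponential decay is one common forgetting profile, not necessarily the exact weighting used in the experiments:

```python
def window_indices(i, l, n):
    """Indices of the symmetric time window of length 2l + 1 centered on
    point i, clipped to the valid range [0, n). SW-SVM fits one SVM per
    point on exactly these examples."""
    return list(range(max(0, i - l), min(n, i + l + 1)))

def forgetting_weights(indices, i, lam=0.9):
    """Per-example weights for a gradual-forgetting (GF) style scheme:
    the weight of example j decays with its time distance to the
    reference point i. The decay rate lam is a free parameter, and the
    exponential form is an illustrative assumption."""
    return [lam ** abs(j - i) for j in indices]
```

Note that the windows of consecutive points share 2l of their 2l + 1 indices, which is the near-complete overlap mentioned above.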
B. Estimation

As we stated in the introduction, in the estimation task we train a sequence of classifiers on a given dataset and then test the complete sequence of classifiers on an independent test set spanning the same time period as the training set. The objective is to evaluate all the classifiers at the same time, not only the last one as in the previous task. In this case, for each training set of 500 points we generated 100 equivalent validation and test sets. We optimized all internal parameters using the validation sets and then evaluated the resulting classifiers on the test sets. We assume that we know the time step i at which each point in the test set was measured, so we can use the corresponding classifier from the trained sequence. For the SW and GF methods we use symmetric time windows of length 2l + 1, centered at time i. We do not use DWM in this case because it was not designed for the estimation task: in DWM, the classifier corresponding to time i can only use information from the points in its past, while the other three methods can see the full training set, which would make the comparison unfair.

In this evaluation, we fixed the number of dimensions at d = 2 (for all the datasets) and varied the number m of classifiers in the sequence, in order to evaluate the dependence of the methods on the n/m ratio. First, we used the same rotating hyperplane dataset as in the prediction task. In Fig. 9(a), we show the results for the three methods included this time. We also evaluated a noisy version of this dataset, (b), in which a random 10% of the labels were switched. In (c) of the same figure we show the results corresponding to
GRINBLAT et al.: SOLVING NONSTATIONARY CLASSIFICATION PROBLEMS WITH COUPLED SVMs

Fig. 9. Estimation test errors as a function of the number m of classifiers in the sequence (a) for the rotating hyperplane problem, (b) for the same problem with 10% noise, (c) for the sliding Gaussians problem, and (d) for the same problem using normal distributions with bigger σ.
the sliding Gaussians dataset, using the same settings as in Section III-B. Finally, in (d) we use a second version of the sliding Gaussians generated with σ = 0.3, i.e., with more overlap between the classes, and the other settings equal to (c).

In the four situations (two datasets times two noise levels), the qualitative results are similar. The overall performance of SW and GF deteriorates when fewer classifiers are used in the sequence (higher n/m values). Clearly, when using fewer SVMs the concept drift becomes more relevant to the problem. The effect is clearer in the low-noise situations, (a) and (c). For noisy situations, as we discussed before, there is always a certain trade-off between noise and drift. TA-SVM clearly outperforms SW and GF in the estimation task. This was expected, as TA-SVM was designed to learn the whole sequence accurately at once. We already showed in Section III-D that TA-SVM usually works better when using n/m > 1, in particular in noisy situations. This result is evident here in all cases except for (c), which is a problem with low noise and a relatively fast drift of the decision boundary.

C. Extrapolation

The extrapolation task is an extension of the prediction task in which we are interested in predicting several steps ahead into the future. In this case, we need to extrapolate
the position of the decision boundary some steps into the future, starting from the last classifier in the sequence.3 Our method does not assume any functional form for the time evolution of the sequence of classifiers; the only constraint is that neighboring hyperplanes should be close to each other, so we do not have a principled way to determine the position of each future classifier. Consequently, we must choose an appropriate external method to extrapolate based on the position of each classifier in the sequence. In this paper, we use a simple linear extrapolation, but a more complex model could be applied if required (and if enough data is available).

For the experiments in this task, we generated training sequences with 450 points, and for each of these training sets we generated 100 test sequences with 500 points each. At each run, we optimized all internal parameters using the first 450 points of the test sequences as validation sets and then evaluated the resulting classifiers on the last 50 points of

3 A simple way to do this would be to use an extended sequence of classifiers with hyperplanes located in the future period that we want to predict, and let TA-SVM choose the position of each one. Unfortunately, this procedure does not have the effect we are looking for. As we do not have training points for the future period, the solution will be a compromise between only two penalties: the last term in (2), which will make the solution stay at the position of the last hyperplane before the extrapolation period, and the first term in (2), which will move the solution toward the null vector in the extrapolation period.
Fig. 10. Extrapolation test errors as a function of the number m of classifiers in the sequence (a) for the sliding Gaussians problem and (b) for the same problem using distributions with bigger σ.

Fig. 11. Extrapolation test errors as a function of the number of predicted steps into the future (a) for the sliding Gaussians problem and (b) for the same problem using distributions with bigger σ.
these test sets. As in the previous case, we fixed the number of dimensions at d = 2 and varied the number m of classifiers in the sequence, from n/m = 1 to n/m = 20. To extrapolate the position of the decision boundary, we used a simple linear extrapolation of the values of each component of w (wi(t) = αi t + βi) and of b (b(t) = αb t + βb), fitting the coefficients of the linear models (αi,b and βi,b) using all the SVMs in the last 50 training points. The number of SVMs included in the extrapolation goes from 50 for n/m = 1 to only 2 for n/m = 20. We do not use DWM in this case because it produces an ensemble of SVMs and there is no simple way to extrapolate the position of the decision boundary for this method.

For this evaluation, we used the sliding Gaussians dataset in the same two settings that we described in the estimation task. In Fig. 10, we show the corresponding results. All methods become more unstable in this task, as indicated by the bigger error bars, because we are superposing two error sources: the fitting of the classifiers and the extrapolation of their position. In Fig. 10(a), we can see again a typical behavior of TA-SVM, with the best performance at n/m > 1. For n/m = 20, all methods show similarly poor results, mainly associated with bad extrapolations of the decision boundaries. For the noisy situation in Fig. 10(b), the difference between TA-SVM and the other methods is clearer. In Fig. 11, we show, for the same two datasets, the evolution of the classification error as a function of the number of time steps into the future predicted by all methods. The results correspond to n/m = 10. In the low-noise situation, (a), TA-SVM shows the best performance for all but the maximum number of time steps, where all methods are equivalent. With more noise present, (b), TA-SVM is clearly superior in all situations.

D. Real-World Case: Electricity Pricing

As a last evaluation of TA-SVM, we considered a real-world problem: the electricity pricing dataset [55]. The dataset contains 45 312 instances collected at regular 30-min intervals over 30 months, from May 1996 to December 1998. The data was obtained directly from the electricity supplier in New South Wales, Australia. There are five attributes in total. The first two date the record by day of week (1 to 7) and half-hour period (1 to 48). The last three attributes measure the current demand in New South Wales, the current demand in Victoria, and the amount of electricity scheduled for transfer between the two states. The target is a binary value indicating whether the price of electricity will go up or down.

Following Harries [55], we considered batches of 1-week length. At each week, we train the classifiers with all previous batches and predict the next batch (i.e., the 336 instances in the current week). We considered this setting as a prediction task, and correspondingly for our method we use the last SVM in the sequence to make predictions. In order to select the free parameters of the methods, we used a simple validation
TABLE II
PREDICTION ACCURACY ON THE ELECTRICITY PRICING DATASET, AVERAGED OVER TIME, FOR ALL METHODS TESTED IN THIS PAPER. THE ROWS LABELED SVM SHOW THE RESULTS OBTAINED WITH A STATIONARY SVM

Method    Kernel    Accuracy (%)
SVM       Linear    63.3
GF-SVM    Linear    65.1
SW-SVM    Linear    65.3
DWM-SVM   Linear    63.3
TA-SVM    Linear    65.6
SVM       Gaussian  66.1
GF-SVM    Gaussian  67.2
SW-SVM    Gaussian  67.8
DWM-SVM   Gaussian  66.9
TA-SVM    Gaussian  68.9
scheme. At each step, we set aside the last available week in the training set as a validation set (not the current week, which we want to predict, but the previous one). Once we selected the parameters that are optimal for this validation set, we train all methods again using the complete training set.

As this is a real-world dataset, we do not know in advance the actual amount of drift in the data. In order to estimate the benefits of using concept-drift methods in this case, we also applied a standard SVM using the same procedure as described before (i.e., for each week we determined optimal parameters, trained the SVM, and predicted the current week).

In Table II, we show the corresponding results. In the first rows we show the results obtained using a linear kernel, as in all previous datasets. All adaptive methods outperformed the standard SVM in this case, suggesting the actual presence of some concept drift in the dataset. TA-SVM shows the best performance in this case. For reference, Harries [55] used a decision tree with a sliding window on the same problem, reporting 1-week prediction accuracies between 66% and 67.7% for various window sizes. Looking for a better solution, we repeated the experiment using a Gaussian kernel. All methods improved with the use of nonlinear classifiers. Again, TA-SVM shows the best performance on this dataset.4

V. CONCLUSION

In this paper, we presented TA-SVM, a new method for generating adaptive classifiers capable of learning concepts that change with time. The basic idea of TA-SVM is to use a sequence of classifiers, each one appropriate for a small time window, but, in contrast to other proposals, to learn all the hyperplanes in a global way. Starting from the solution of independent SVMs, we showed that the addition of a new term in the cost function (which penalizes the diversity between consecutive classifiers) in fact produces a coupling of the sequence. Once coupled, the set of SVMs acts as a single adaptive classifier.

4 DWM shows a low performance in this setup. Kolter and Maloof [25] applied DWM to this dataset with better results, but using an online learning setting, learning and predicting one instance at a time (an easier task for this problem), which differs from Harries's methodology.
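The coupled cost described in the paragraph above can be illustrated with a small sketch that evaluates, for a sequence of linear classifiers, the individual SVM norm and hinge-loss terms plus the penalty on the diversity between consecutive hyperplanes. The normalization constants and names here are schematic and do not reproduce the paper's (2) exactly:

```python
def ta_svm_cost(ws, bs, batches, gamma, C):
    """Schematic TA-SVM-style objective for a sequence of linear
    classifiers (w_mu, b_mu), one per batch of labeled points.

    Sums the usual SVM terms of every classifier plus a coupling term
    that penalizes differences between consecutive hyperplanes (weight
    vectors and biases). Constants are illustrative, not the paper's.
    """
    m = len(ws)
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    # norm regularization, averaged over the m classifiers
    norm_term = sum(dot(w, w) for w in ws) / (2.0 * m)
    # coupling: only first time neighbors interact
    coupling = 0.0
    for mu in range(m - 1):
        dw = [p - q for p, q in zip(ws[mu], ws[mu + 1])]
        coupling += dot(dw, dw) + (bs[mu] - bs[mu + 1]) ** 2
    # hinge losses: each point is judged by its own classifier
    hinge = sum(
        max(0.0, 1.0 - y * (dot(ws[mu], x) + bs[mu]))
        for mu, batch in enumerate(batches)
        for x, y in batch
    )
    return norm_term + gamma * coupling + C * hinge
```

When all hyperplanes in the sequence coincide, the coupling term vanishes and the cost reduces to that of independent SVMs; increasing γ pushes the minimizer toward such smoothly varying sequences, which is the regularizing effect discussed above.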
We evaluated different aspects of TA-SVM using artificial drifting problems. In particular, we showed that by changing the number of classifiers (the n/m ratio) and the coupling constant γ, we can effectively regularize the sequence of classifiers. We compared TA-SVM with other state-of-the-art methods in three different settings: estimation, prediction, and extrapolation, including problems with small datasets, high-dimensional input spaces, and noise. In all cases, TA-SVM proved to be equivalent to or better than the other methods. Even in the most unfavorable situation for TA-SVM, i.e., the sudden changes of the STAGGER dataset, our new method showed a very good performance. We also applied TA-SVM to a real-world dataset, i.e., Harries's electricity pricing, with very good results.

TA-SVM has two free parameters, m and γ. In our experience, the most efficient way to use them is to fix the n/m ratio in a range of 5 to 10, and then tune γ using an internal cross-validation. If the dataset is small or there are indications of high drift levels, one can use n = m to increase the flexibility of the model. The C parameter follows the same rules as in standard SVMs. If there is previous knowledge about noise levels, the C value can be set accordingly. If not, we recommend beginning with a low C value and leaving the regularization to the coupling term.

There is nothing in our formulation or in the derivation of the dual problem that prevents the use of arbitrary kernel functions to evaluate distances and create nonlinear adaptive classifiers. We already used this possibility in the modeling of the electricity pricing domain. The only potential difficulty can arise in the extrapolation setting. For kernel functions corresponding to finite-dimensional feature spaces, it is always possible, in principle, to use our simple extrapolation. However, this cannot be done if the kernel is associated with an infinite-dimensional feature space.
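The simple extrapolation mentioned above (fit each component of w, and b, as a linear function of time on the trailing classifiers, then evaluate the fitted lines at future time steps) can be sketched as follows. The function names are ours, and the sketch applies only when w is available explicitly, matching the finite-dimensional caveat just stated:

```python
def fit_line(ts, vs):
    """Ordinary least-squares fit of v = alpha * t + beta.

    Returns (alpha, beta). Assumes at least two distinct time values."""
    n = float(len(ts))
    mt = sum(ts) / n
    mv = sum(vs) / n
    denom = sum((t - mt) ** 2 for t in ts)
    alpha = sum((t - mt) * (v - mv) for t, v in zip(ts, vs)) / denom
    return alpha, mv - alpha * mt

def extrapolate_hyperplanes(ws, bs, ts, future_ts):
    """Linearly extrapolate a sequence of hyperplanes (w, b) in time.

    Fits w_k(t) = alpha_k * t + beta_k for each component k, and
    b(t) = alpha_b * t + beta_b, on the trailing classifiers given in
    ws/bs at times ts; returns the predicted (w, b) at each future time.
    """
    d = len(ws[0])
    comp_fits = [fit_line(ts, [w[k] for w in ws]) for k in range(d)]
    b_alpha, b_beta = fit_line(ts, bs)
    future = []
    for t in future_ts:
        w = [alpha * t + beta for alpha, beta in comp_fits]
        future.append((w, b_alpha * t + b_beta))
    return future
```

In the experiments of Section IV-C this fit would use the SVMs covering the last 50 training points; with n/m = 20 only two hyperplanes remain in that window, which is consistent with the unstable extrapolations reported there.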
If needed, there are some simple ways to make TA-SVM scale efficiently to larger problems. For the prediction and extrapolation tasks, the focus is on the performance of the last classifiers in the sequence. In the prediction case, we only use the last classifier to predict the labels of the test samples. For the extrapolation task, we use only a few SVMs from the end of the sequence in order to extrapolate the solution. As the coupling term in (2) involves only interactions between first time neighbors, the influence of old examples on the current TA-SVM prediction decays exponentially with time. According to this analysis, very old examples quickly become useless to TA-SVM and can be eliminated from the dataset. In practice, for prediction and extrapolation we pay only a reduced cost by limiting TA-SVM to a fixed number of the most recent examples. In addition, we can easily force the first hyperplane in this reduced sequence to keep the optimal position found in a previous step, which reduces the loss in performance even further. Going further in the same direction, we can even arrive at a quasi-online version of TA-SVM, where only a few hyperplanes are adjusted at each step.

We are currently studying the application of TA-SVM to real problems in slowly drifting systems, in particular to fault prediction in critical mechanical systems. Also, we are
evaluating the extension of TA-SVM to one-class classification and regression problems. Finally, we are considering the use of partially overlapping windows for TA-SVM.

ACKNOWLEDGMENT

The authors would like to thank P. F. Verdes and three anonymous reviewers for useful suggestions that considerably improved this manuscript.

APPENDIX A
DERIVING THE DUAL PROBLEM

First, we introduce the notation used in this section. We consider the case in which we want to adjust a sequence of $m$ hyperplanes to a dataset with $n$ points. We use Greek letters for indices that run over the hyperplanes and Latin letters for indices that run over the points. As we explained in the main text, the hyperplanes are defined by a vector $w_\mu$ and a scalar $b_\mu$, for $\mu \in \{1, \ldots, m\}$. The hyperplane corresponding to the $i$-th point is $(w_{\mu_i}, b_{\mu_i})$, and $p_\mu$ is the set of points $\{i : \mu_i = \mu\}$. $P$ is an $m \times n$ matrix defined as

$$P_{\mu j} = \begin{cases} 1, & \text{if } j \in p_\mu \\ 0, & \text{otherwise.} \end{cases}$$

Also, we use the kernel matrix $K$, defined by $K_{ij} = y_i y_j\, x_i \cdot x_j$. As in conventional SVMs, we can always replace this definition with any other involving a useful inner product. Another required matrix is $Q$, given by

$$Q_{\mu\nu} = \begin{cases} 1, & \text{if hyperplanes } \nu \text{ and } \mu \text{ are neighbors} \\ 0, & \text{otherwise} \end{cases}$$

which we symmetrize as $Q^S = (Q + Q^T)/2$. We also use the notation $P \odot Q$ for the entrywise (or Hadamard) matrix product of $P$ and $Q$: $(P \odot Q)_{ij} = P_{ij} Q_{ij}$.

We start from the problem

$$\min_{w_\mu, b_\mu}\; \frac{1}{2m} \sum_{\mu=1}^m \|w_\mu\|^2 + \frac{\gamma}{4m} \sum_{\mu=1}^m \sum_{\nu=1}^m Q_{\mu\nu} \left[ \|w_\mu - w_\nu\|^2 + (b_\mu - b_\nu)^2 \right] + C \sum_{i=1}^n \xi_i$$

subject to

$$\xi_i \ge 0, \qquad y_i \left( w_{\mu_i} \cdot x_i + b_{\mu_i} \right) - 1 + \xi_i \ge 0$$

where $\|w\|^2 = w \cdot w$. This is the same problem we introduced in the main text, with small differences that help in the search for the solution. Given the symmetry of the term including $Q$, it is easy to rewrite the problem using $Q^S$

$$\frac{1}{2m} \sum_{\mu=1}^m \|w_\mu\|^2 + \frac{\gamma}{4m} \sum_{\mu=1}^m \sum_{\nu=1}^m Q^S_{\mu\nu} \left[ \|w_\mu - w_\nu\|^2 + (b_\mu - b_\nu)^2 \right] + C \sum_{i=1}^n \xi_i.$$

Then, the corresponding Lagrangian is

$$L = \frac{1}{2m} \sum_{\mu=1}^m \|w_\mu\|^2 + \frac{\gamma}{4m} \sum_{\mu,\nu} Q^S_{\mu\nu} \left[ \|w_\mu - w_\nu\|^2 + (b_\mu - b_\nu)^2 \right] + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[ y_i \left( w_{\mu_i} \cdot x_i + b_{\mu_i} \right) - 1 + \xi_i \right] - \sum_{i=1}^n \beta_i \xi_i \tag{3}$$

where $\alpha_i \ge 0$ and $\beta_i \ge 0$. We have to maximize $L$ with respect to $\alpha_i$ and $\beta_i$ and minimize it with respect to $w_\mu$, $b_\mu$, and $\xi_i$. At this point, the derivatives with respect to the primal variables should be zero

$$\frac{\partial L}{\partial \xi_i} = 0, \qquad \frac{\partial L}{\partial w_\mu} = 0, \qquad \frac{\partial L}{\partial b_\mu} = 0.$$

From these equations, we can eliminate the variables $\xi_i$, $w_\mu$, and $b_\mu$ from $L$ and obtain the dual problem. We start with the derivative with respect to $\xi_i$

$$\frac{\partial L}{\partial \xi_i} = 0 = C - \alpha_i - \beta_i$$

which implies that $0 \le \alpha_i \le C$. On the other hand, taking into account that each $\xi_i$ is multiplied by $(C - \alpha_i - \beta_i)$, (3) becomes

$$L = \frac{1}{2m} \sum_{\mu=1}^m \|w_\mu\|^2 + \frac{\gamma}{4m} \sum_{\mu,\nu} Q^S_{\mu\nu} \left[ \|w_\mu - w_\nu\|^2 + (b_\mu - b_\nu)^2 \right] - \sum_{i=1}^n \alpha_i \left[ y_i \left( w_{\mu_i} \cdot x_i + b_{\mu_i} \right) - 1 \right]. \tag{4}$$

In the case of $w_\mu$ we have

$$\frac{\partial L}{\partial w_\mu} = 0 = \frac{1}{m} \left[ w_\mu + \gamma \sum_\nu Q^S_{\mu\nu} \left( w_\mu - w_\nu \right) \right] - \sum_{j \in p_\mu} \alpha_j y_j x_j$$

which results in

$$\frac{1}{m} \left[ w_\mu + \gamma \sum_\nu Q^S_{\mu\nu} \left( w_\mu - w_\nu \right) \right] = \sum_{j \in p_\mu} \alpha_j y_j x_j.$$

Defining the matrix $M$ as

$$M_{\mu\nu} = \begin{cases} \left( 1 + \gamma \sum_\kappa Q^S_{\mu\kappa} \right) / m, & \text{if } \mu = \nu \\ -\gamma Q^S_{\mu\nu} / m, & \text{otherwise} \end{cases}$$

we can write $w_\mu$ as

$$w_\mu = \sum_j M^{-1}_{\mu\mu_j} \alpha_j y_j x_j. \tag{5}$$

Using this, we can rewrite the term $\sum_\mu \|w_\mu\|^2$ in (4)

$$\sum_\mu \|w_\mu\|^2 = \sum_\mu \sum_{ij} M^{-1}_{\mu\mu_i} M^{-1}_{\mu\mu_j}\, \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j.$$

On the other hand, since $M$ is symmetric,

$$\left( P^T M^{-2} P \right)_{ij} = \sum_\mu M^{-1}_{\mu\mu_i} M^{-1}_{\mu\mu_j}.$$

With this and the definition of $K$, we have

$$\sum_\mu \|w_\mu\|^2 = \alpha^T \left( P^T M^{-2} P \odot K \right) \alpha. \tag{6}$$

The $\sum_{\mu\nu} Q^S_{\mu\nu} \|w_\mu - w_\nu\|^2$ term can be rewritten as

$$\sum_{\mu\nu} Q^S_{\mu\nu} \|w_\mu - w_\nu\|^2 = \sum_{ij} \sum_{\mu\nu} Q^S_{\mu\nu} \left( M^{-1}_{\mu\mu_i} - M^{-1}_{\nu\mu_i} \right) \left( M^{-1}_{\mu\mu_j} - M^{-1}_{\nu\mu_j} \right) \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j = 2 \sum_{ij} \Bigl[ \sum_\mu D_{\mu\mu} M^{-1}_{\mu\mu_i} M^{-1}_{\mu\mu_j} - \sum_{\mu\nu} M^{-1}_{\mu\mu_i} Q^S_{\mu\nu} M^{-1}_{\nu\mu_j} \Bigr] \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j$$

as $Q^S$ is symmetric, and using the $m \times m$ diagonal matrix defined by

$$D_{\mu\nu} = \begin{cases} \sum_\kappa Q^S_{\mu\kappa}, & \text{if } \mu = \nu \\ 0, & \text{otherwise.} \end{cases}$$

Given that

$$\left( P^T M^{-1} (D - Q^S) M^{-1} P \right)_{ij} = \sum_\mu D_{\mu\mu} M^{-1}_{\mu\mu_i} M^{-1}_{\mu\mu_j} - \sum_{\mu\nu} M^{-1}_{\mu\mu_i} Q^S_{\mu\nu} M^{-1}_{\nu\mu_j}$$

we can write

$$\sum_{\mu\nu} Q^S_{\mu\nu} \|w_\mu - w_\nu\|^2 = 2\, \alpha^T \left( P^T M^{-1} (D - Q^S) M^{-1} P \odot K \right) \alpha. \tag{7}$$

Using (6) and (7) we can write (4) as

$$L = \frac{1}{2m} \alpha^T \left( P^T M^{-2} P \odot K \right) \alpha + \frac{\gamma}{2m} \alpha^T \left( P^T M^{-1} (D - Q^S) M^{-1} P \odot K \right) \alpha - \alpha^T \left( P^T M^{-1} P \odot K \right) \alpha + \frac{\gamma}{4m} \sum_{\mu,\nu} Q^S_{\mu\nu} \left( b_\mu - b_\nu \right)^2 + \sum_i \alpha_i - \sum_i \alpha_i y_i b_{\mu_i}.$$

The matrices $M$, $D$, and $Q^S$ are related by the equation

$$M = \frac{I + \gamma \left( D - Q^S \right)}{m}$$

which gives

$$L = -\frac{1}{2} \alpha^T \left( P^T M^{-1} P \odot K \right) \alpha + \frac{\gamma}{4m} \sum_{\mu,\nu} Q^S_{\mu\nu} \left( b_\mu - b_\nu \right)^2 + \sum_i \alpha_i - \sum_i \alpha_i y_i b_{\mu_i}. \tag{8}$$

Now we use the derivatives with respect to $b_\mu$. In this case

$$\frac{\partial L}{\partial b_\mu} = 0 = \frac{\gamma}{m} \sum_{\nu=1}^m Q^S_{\mu\nu} \left( b_\mu - b_\nu \right) - \sum_{i \in p_\mu} \alpha_i y_i$$

which gives

$$\frac{\gamma}{m} \Bigl( b_\mu \sum_{\nu=1}^m Q^S_{\mu\nu} - \sum_{\nu=1}^m Q^S_{\mu\nu} b_\nu \Bigr) = \sum_{i \in p_\mu} \alpha_i y_i \tag{9}$$

which, defining $h_i = \alpha_i y_i$, we can write as

$$\frac{\gamma}{m} \left( D - Q^S \right) b = P h. \tag{10}$$

Since $(D - Q^S)$ is singular, given that $(D - Q^S)\mathbf{1} = 0$, we can write

$$0 = \frac{\gamma}{m}\, \mathbf{1}^T \left( D - Q^S \right) b = \mathbf{1}^T P h = \sum_{i=1}^n \alpha_i y_i.$$

In this case, the solution to the system (10) is

$$b = \frac{m}{\gamma} \left( D - Q^S \right)^+ P h \tag{11}$$

where $(D - Q^S)^+$ is the pseudoinverse of $(D - Q^S)$. We still need to eliminate the $b_\mu$ from $L$. The part that depends on $b$ is

$$\frac{\gamma}{4m} \sum_{\mu,\nu} Q^S_{\mu\nu} \left( b_\mu - b_\nu \right)^2 - \sum_i \alpha_i y_i b_{\mu_i} = \frac{\gamma}{4m} \sum_{\mu,\nu} Q^S_{\mu\nu} \left( b_\mu - b_\nu \right)^2 - b^T P h. \tag{12}$$

We can rewrite this as

$$\frac{\gamma}{4m} \sum_{\mu,\nu} Q^S_{\mu\nu} \left( b_\mu - b_\nu \right)^2 - b^T P h = \frac{\gamma}{4m} \sum_{\mu,\nu} Q^S_{\mu\nu} \left( b_\mu^2 - 2 b_\mu b_\nu + b_\nu^2 \right) - b^T P h = \frac{\gamma}{2m} \Bigl( \sum_{\mu,\nu} b_\mu^2 Q^S_{\mu\nu} - b^T Q^S b \Bigr) - b^T P h = \frac{\gamma}{2m}\, b^T \left( D - Q^S \right) b - b^T P h = \frac{b^T P h}{2} - b^T P h = -\frac{b^T P h}{2} = -\frac{m}{2\gamma}\, h^T P^T \left( D - Q^S \right)^+ P h = -\frac{m}{2\gamma}\, \alpha^T \left( P^T \left( D - Q^S \right)^+ P \odot Y \right) \alpha$$

where $Y = y y^T$, and where we used (10) and (11) in the fourth and sixth equalities. Taking into account the last equality, and noting that the relation between $M$, $D$, and $Q^S$ implies $(m/\gamma)(D - Q^S)^+ = (M - I/m)^+$, $L$ becomes

$$L = -\frac{1}{2} \alpha^T \Bigl[ P^T M^{-1} P \odot K + P^T \left( M - I/m \right)^+ P \odot Y \Bigr] \alpha + \sum_i \alpha_i$$

which has the form

$$L = -\frac{1}{2} \alpha^T R\, \alpha + \sum_i \alpha_i$$

with the matrix $R$ defined accordingly. Finally, the dual problem is

$$\max_\alpha\; -\frac{1}{2} \alpha^T R\, \alpha + \sum_i \alpha_i \tag{13}$$

subject to

$$0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0$$
which is the same constrained quadratic optimization problem solved in conventional SVMs (with a different matrix $R$). In consequence, any technique employed to solve the conventional SVM problem can be used here, for example, SMO [43].

APPENDIX B
COMPLEXITY EVALUATION

As follows from (13), the complexity of the whole problem is given by the computation of the matrix $R$ and the solution of the optimization problem. As we mentioned before, this last step is equivalent to a conventional SVM optimization problem, which is $O(n^2)$. The computation of $R$ involves the inversion of $M$ and the computation of the pseudoinverse of $(M - I/m)$. The general solutions of these problems are costly but, in our case, given that we consider only interactions between first time neighbors, both problems can be solved analytically. After this, the computation of $P^T M^{-1} P$ and $P^T (M - I/m)^+ P$ is trivial, given that

$$\left( P^T M^{-1} P \right)_{ij} = M^{-1}_{\mu_i \mu_j}, \qquad \left( P^T \left( M - I/m \right)^+ P \right)_{ij} = \left( M - I/m \right)^+_{\mu_i \mu_j}.$$

Hence, the computation of each element of the Hadamard product is $O(1)$, which means that the computation of $R$ is $O(n^2)$, i.e., no greater than the cost of the optimization step.

REFERENCES

[1] J. C. Schlimmer and R. H. Granger, "Beyond incremental processing: Tracking concept drift," in Proc. 5th Nat. Conf. Artif. Intell., Irvine, CA, 1986, pp. 502–507.
[2] R. Klinkenberg and T. Joachims, "Detecting concept drift with support vector machines," in Proc. 17th Int. Conf. Mach. Learn., San Mateo, CA, 2000, pp. 487–494.
[3] R. Vicente, O. Kinouchi, and N. Caticha, "Statistical mechanics of online learning of drifting concepts: A variational approach," Mach. Learn., vol. 32, no. 2, pp. 179–201, Aug. 1998.
[4] P. L. Bartlett, S. Ben-David, and S. R. Kulkarni, "Learning changing concepts by exploiting the structure of change," Mach. Learn., vol. 41, no. 2, pp. 153–174, Nov. 2000.
[5] K. Stanley, "Learning concept drift with a committee of decision trees," Dept. Comput. Sci., Univ. Texas, Austin, Tech. Rep. UTAI-TR-03-302, 2003.
[6] A. Tsymbal, "The problem of concept drift: Definitions and related work," Dept. Comput. Sci., Trinity College Dublin, Dublin, Ireland, Tech. Rep. TCD-CS-2004-15, Apr. 2004.
[7] T. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski, "Experience with a learning personal assistant," Commun. ACM, vol. 37, no. 7, pp. 81–91, 1994.
[8] G. Widmer and M. Kubat, "Learning in the presence of concept drift and hidden contexts," Mach. Learn., vol. 23, no. 1, pp. 69–101, Apr. 1996.
[9] M. Salganicoff, "Tolerating concept and sampling shift in lazy learning using prediction error context switching," Artif. Intell. Rev., vol. 11, nos. 1–5, pp. 133–155, Feb. 1997.
[10] W. N. Street and Y. Kim, "A streaming ensemble algorithm (SEA) for large-scale classification," in Proc. 7th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, San Francisco, CA, 2001, pp. 377–382.
[11] S. H. Bach and M. A. Maloof, "Paired learners for concept drift," in Proc. IEEE Int. Conf. Data Mining, Los Alamitos, CA, 2008, pp. 23–32.
[12] C. Alippi and M. Roveri, "Just-in-time adaptive classifiers—Part I: Detecting nonstationary changes," IEEE Trans. Neural Netw., vol. 19, no. 7, pp. 1145–1153, Jul. 2008.
[13] C. Alippi and M. Roveri, "Just-in-time adaptive classifiers—Part II: Designing the classifier," IEEE Trans. Neural Netw., vol. 19, no. 12, pp. 2053–2064, Dec. 2008.
[14] D. P. Helmbold and P. M. Long, "Tracking drifting concepts by minimizing disagreements," Mach. Learn., vol. 14, no. 1, pp. 27–45, Jan. 1994.
[15] P. L. Bartlett, "Learning with a slowly changing distribution," in Proc. 5th Annu. Workshop Comput. Learn. Theory, Pittsburgh, PA, 1992, pp. 243–252.
[16] R. D. Barve and P. M. Long, "On the complexity of learning from drifting distributions," in Proc. 9th Annu. Workshop Comput. Learn. Theory, San Mateo, CA, 1996, pp. 170–193.
[17] Y. Freund and Y. Mansour, "Learning under persistent drift," in Proc. 3rd Eur. Conf. Comput. Learn. Theory, London, U.K., 1997, pp. 109–118.
[18] C. Alippi, G. Boracchi, and M. Roveri, "Just in time classifiers: Managing the slow drift case," in Proc. Int. Joint Conf. Neural Netw., Atlanta, GA, 2009, pp. 114–120.
[19] R. Klinkenberg and I. Renz, "Adaptive information filtering: Learning in the presence of concept drifts," in Workshop Notes of the ICML/AAAI Workshop Learning for Text Categorization. Menlo Park, CA: AAAI Press, 1998, pp. 33–40.
[20] C. Lanquillon, "Enhancing text classification to improve information filtering," Ph.D. thesis, Faculty Comput. Sci., Univ. Magdeburg, Magdeburg, Germany, 2001.
[21] K. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An introduction to kernel-based learning algorithms," IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 181–201, Mar. 2001.
[22] V. Vapnik, "An overview of statistical learning theory," IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 988–999, Sep. 1999.
[23] G. L. Grinblat, P. M. Granitto, and H. A. Ceccatto, "Time-adaptive support vector machines," Inteligencia Artif., vol. 12, no. 40, pp. 39–50, 2008.
[24] M. Karnick, M. Ahiskali, M. D. Muhlbaier, and R. Polikar, "Learning concept drift in nonstationary environments using an ensemble of classifiers based approach," in Proc. IEEE Int. Joint Conf. Neural Netw., Hong Kong, China, Jun. 2008, pp. 3455–3462.
[25] J. Z. Kolter and M. A. Maloof, "Dynamic weighted majority: An ensemble method for drifting concepts," J. Mach. Learn. Res., vol. 8, pp. 2755–2790, Dec. 2007.
[26] J. C. Schlimmer and R. H. Granger, Jr., "Incremental learning from noisy data," Mach. Learn., vol. 1, no. 3, pp. 317–354, 1986.
[27] G. Castillo, J. Gama, and P. Medas, "Adaptation to drifting concepts," in Proc. Progress Artif. Intell., 11th Portuguese Conf. Artif. Intell. (EPIA), LNCS 2902. Beja, Portugal, 2003, pp. 279–293.
[28] R. Klinkenberg, "Learning drifting concepts: Example selection versus example weighting," Intell. Data Anal., vol. 8, no. 3, pp. 281–300, Aug. 2004.
[29] T. Joachims, "Estimating the generalization performance of a SVM efficiently," in Proc. 17th Int. Conf. Mach. Learn., San Francisco, CA, 2000, pp. 431–438.
[30] I. Koychev and R. Lothian, "Tracking drifting concepts by time window optimization," in Proc. 25th SGAI Int. Conf. Innov. Tech. Appl. Artif. Intell., New York, 2005, pp. 46–59.
[31] C. Alippi and M. Roveri, "Just-in-time adaptive classifiers in nonstationary conditions," in Proc. Int. Joint Conf. Neural Netw., Orlando, FL, 2007, pp. 1014–1019.
[32] I. Koychev, "Tracking changing user interests through prior-learning of context," in Adaptive Hypermedia, LNCS 2347. New York: Springer-Verlag, 2002, pp. 223–232.
[33] M. Maloof and R. Michalski, "Selecting examples for partial memory learning," Mach. Learn., vol. 41, no. 1, pp. 27–52, Oct. 2000.
[34] I. Koychev, "Gradual forgetting for adaptation to concept drift," in Proc. ECAI Workshop Current Issues Spatio-Temporal Reason., Berlin, Germany, 2000, pp. 101–106.
[35] H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining concept-drifting data streams using ensemble classifiers," in Proc. 9th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Washington, D.C., 2003, pp. 226–235.
[36] J. Gao, B. Ding, J. Han, W. Fan, and P. Yu, "Classifying data streams with skewed class distributions and concept drifts," IEEE Internet Comput., vol. 12, no. 6, pp. 37–49, Nov.–Dec. 2008.
[37] S. Hashemi, Y. Yang, Z. Mirzamomen, and M. Kangavari, "Adapted one-versus-all decision trees for data stream classification," IEEE Trans. Knowl. Data Eng., vol. 21, no. 5, pp. 624–637, May 2009.
[38] R. Elwell and R. Polikar, "Incremental learning in nonstationary environments with controlled forgetting," in Proc. Int. Joint Conf. Neural Netw., Atlanta, GA, 2009, pp. 771–778.
[39] M. D. Muhlbaier, A. Topalis, and R. Polikar, "Learn++.NC: Combining ensemble of classifiers with dynamically weighted consult-and-vote for efficient incremental learning of new classes," IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 152–168, Jan. 2009.
[40] Z. Kolter and M. Maloof, "Dynamic weighted majority: A new ensemble method for tracking concept drift," in Proc. 3rd IEEE Int. Conf. Data Mining, Nov. 2003, pp. 123–130.
[41] N. Littlestone and M. K. Warmuth, "The weighted majority algorithm," Inform. Comput., vol. 108, no. 2, pp. 212–261, Feb. 1994.
[42] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, U.K.: Cambridge Univ. Press, 2000.
[43] J. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods—Support Vector Learning. Cambridge, MA: MIT Press, 2000, pp. 185–208.
[44] N. Littlestone, "Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm," Mach. Learn., vol. 2, no. 4, pp. 285–318, Apr. 1987.
[45] S. Ferrari and M. Jensenius, "A constrained optimization approach to preserving prior knowledge during incremental training," IEEE Trans. Neural Netw., vol. 19, no. 6, pp. 996–1009, Jun. 2008.
[46] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, "Online passive-aggressive algorithms," J. Mach. Learn. Res., vol. 7, pp. 551–585, Dec. 2006.
[47] J. Kivinen and M. Warmuth, "Exponentiated gradient versus gradient descent for linear predictors," Inform. Comput., vol. 132, no. 1, pp. 1–63, Jan. 1997.
[48] J. Kivinen, A. J. Smola, and R. C. Williamson, "Online learning with kernels," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2165–2176, Aug. 2004.
[49] M. Herbster and M. K. Warmuth, "Tracking the best linear predictor," J. Mach. Learn. Res., vol. 1, pp. 281–309, Sep. 2001.
[50] R. Caruana, "Multi-task learning," Mach. Learn., vol. 28, no. 1, pp. 41–75, Jul. 1997.
[51] S. Thrun and L. Pratt, Learning to Learn. Norwell, MA: Kluwer, 1997.
[52] T. Evgeniou, C. M. Micchelli, and M. Pontil, "Learning multiple tasks with kernel methods," J. Mach. Learn. Res., vol. 6, pp. 615–637, Dec. 2005.
[53] W. Fan, "Systematic data selection to mine concept-drifting data streams," in Proc. 10th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Seattle, WA, 2004, pp. 128–137.
[54] A. Tsymbal, M. Pechenizkiy, P. Cunningham, and S. Puuronen, "Dynamic integration of classifiers for handling concept drift," Inform. Fusion, vol. 9, no. 1, pp. 56–68, Jan. 2008.
[55] M. Harries, "Splice-2 comparative evaluation: Electricity pricing," School Comput. Sci. Eng., Univ. New South Wales, Sydney, Australia, Tech. Rep. NSW-CSE-TR-9905, 1999.
Guillermo L. Grinblat was born in Miramar, Buenos Aires, Argentina, in 1976. He received the Licenciate degree in computer sciences from the National University of Rosario, Rosario, Argentina, in 2006. He currently holds a Fellowship at the French Argentine International Center for Information and Systems Sciences, Rosario. He is also a Teaching Assistant at the National University of Rosario. His current research interests include drifting problems, kernel methods, and deep architectures.
Lucas C. Uzal was born in Pergamino, Argentina, in 1982. He received the Licentiate and M.Sc. degrees in physics from Balseiro Institute, San Carlos de Bariloche, Argentina, in 2005 and 2006, respectively. He has been with the French Argentine International Center for Information and Systems Sciences, Rosario, Argentina, since 2007, on a research grant from Consejo Nacional de Investigaciones Científicas y Técnicas, Rosario. His current research interests include complex systems and time series analysis.
H. Alejandro Ceccatto was born in Argentina in 1953. He received the M.Sc. degree in physics from the Universidad Nacional de Rosario, Rosario, Argentina, in 1979, and the Ph.D. degree in physics from the Universidad Nacional de La Plata, La Plata, Argentina, in 1985. He was a Post-Doctoral Fellow at the Department of Applied Physics, Stanford University, Stanford, CA, in 1988, and at the Institut für Theoretische Physik, Universität zu Köln, Cologne, Germany, in 1989. Since 1995, he has been the Director of the Intelligent Systems Group, Instituto de Física Rosario, Rosario. He is currently a full Professor at the Universidad Nacional de Rosario, and Director of the French Argentine International Center for Information and Systems Sciences, Rosario. He has supervised 20 M.Sc. and 11 Ph.D. theses.
Pablo M. Granitto was born in Rosario, Argentina, in 1970. He received the Degree in physics, and the Ph.D. degree, also in physics, in 1997 and 2003, respectively, both from Universidad Nacional de Rosario (UNR), Rosario, Argentina. He was a Post-Doctoral Researcher at Istituto Agrario San Michele all’Adige, Trento, Italy. Since 2006, he has been a full-time Researcher at Consejo Nacional de Investigaciones Científicas y Técnicas, Rosario, and UNR. He leads the Machine Learning Group at the French Argentine International Center for Information and Systems Sciences, Rosario. His current research interests include the application of modern machine learning techniques to agroindustrial and biological problems, involving feature selection, clustering, and ensemble methods.
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
Optimum Spatio-Spectral Filtering Network for Brain–Computer Interface Haihong Zhang, Member, IEEE, Zheng Yang Chin, Member, IEEE, Kai Keng Ang, Member, IEEE, Cuntai Guan, Senior Member, IEEE, and Chuanchu Wang, Member, IEEE
Abstract—This paper proposes a feature extraction method for motor imagery brain–computer interface (BCI) using electroencephalogram. We consider the primary neurophysiologic phenomenon of motor imagery, termed event-related desynchronization, and formulate the learning task for feature extraction as maximizing the mutual information between the spatio-spectral filtering parameters and the class labels. After introducing a nonparametric estimate of mutual information, a gradient-based learning algorithm is devised to efficiently optimize the spatial filters in conjunction with a band-pass filter. The proposed method is compared with two existing methods on real data: a BCI Competition IV dataset as well as our data collected from seven human subjects. The results indicate the superior performance of the method for motor imagery classification, as it produced higher classification accuracy with statistical significance (≥95% confidence level) in most cases.

Index Terms—Brain–computer interface, motor imagery electroencephalography, spatio-spectral filtering.
Manuscript received April 3, 2010; revised August 1, 2010 and September 2, 2010; accepted September 28, 2010. Date of publication November 9, 2010; date of current version January 4, 2011. This work was supported by the Science and Engineering Research Council of the Agency for Science, Technology and Research, Singapore. The authors are with the Institute for Infocomm Research, Agency for Science, Technology and Research, 138632, Singapore (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2010.2084099

I. INTRODUCTION

THE necessity of developing high-performance brain–computer interfaces (BCIs) is rapidly growing alongside advances in neural devices and demands from rehabilitation, assistive technology, and beyond [1], [2]. Among the various useful signals for electroencephalogram (EEG) based BCI [3], motor imagery [4] is probably the most common one. It refers to the imagination or mental rehearsal of a motor action without any real motor output. The primary phenomenon of motor imagery EEG is event-related desynchronization (ERD) [4], [5], which is the attenuation of the rhythmic activity over the sensorimotor cortex in the µ (8–14 Hz) and β (14–30 Hz) rhythms. ERD can be induced both by imagined movements in healthy people and by intended movements in paralyzed patients [6]. Previous studies have demonstrated that, based on ERD analysis, it is feasible to classify imagined movements of the left hand, right hand, feet, and tongue [4], [7], [8]. A complementary phenomenon, the Bereitschaftspotential, is a nonoscillatory characteristic of motor imagery EEG and can also be used for BCI [9]. This paper will focus on the ERD.

For decoding different motor imaginations from EEG, the essential task is to distinguish the respective ERD signals. Neurologically, the spatial pattern of the ERD provides a clue. For instance, movements of the left/right hand are associated with activities in the contralateral (right/left) motor cortex areas [4]. However, localization of the ERD sources is impeded by the EEG’s poor spatial specificity caused by volume conduction and coherency [10], [11]. Furthermore, the ERD is sensitive to artifacts caused by muscle activities or by visual cortex activities, since their frequency ranges highly overlap while the ERD signal is rather weak [12]. Besides, both the spatial pattern and the particular rhythm vary among people, requiring subject-specific learning [5]. Therefore, from a signal processing point of view, it is important to design a feature extraction mechanism that can learn to capture effective spatial and spectral features associated with the ERD for each particular person.

As a recent survey [13] indicates, considerable efforts have been devoted to this topic by the signal processing, machine learning, and artificial neural network communities. In particular, spatial filtering techniques are widely used to extract discriminative spatial features of the ERD in multichannel EEG. Techniques such as independent component analysis [14] and beam-forming [15] were introduced, while the most commonly used technique thus far is the common spatial pattern (CSP) [4], [16], [17]. As [18] shows, CSP can yield significantly higher accuracy in motor imagery classification than various independent component analysis methods. CSP consists of a linear projection of the time samples of multichannel EEG onto a few vectors that correspond to individual spatial filters.
Mathematically, the projection matrix is constructed by maximizing the separability, in terms of the Rayleigh coefficient [17], between motor imagery EEG classes. The coefficient is determined by the intraclass covariance matrices of EEG time samples, while its maximization can be readily solved by generalized eigenvalue decomposition. Usually, CSP works together with a subject-specific band-pass filter to select the particular rhythm of the ERD. To learn the band-pass filter and the spatial filters in a unified framework, several extensions of CSP have been devised. In [19], the authors embedded a first-order finite impulse
ZHANG et al.: OPTIMUM SPATIO-SPECTRAL FILTERING NETWORK FOR BRAIN–COMPUTER INTERFACE
response filter into CSP. In view of the limited capability of first-order filters to choose frequency bands, a higher order finite impulse response (FIR) filter was proposed in [20], while a sophisticated regularization method was necessary to make the solution robust. More recently, Wu et al. [21] proposed an iterative learning method, in which an FIR filter and a classifier were simultaneously parameterized and optimized in the spectral domain, alternately with optimization of spatial filters using CSP. Subsequently, another method called filter bank common spatial pattern (FBCSP) [22] introduced a feature selection algorithm to combine a filter bank framework with CSP. It decomposed EEG data into an array of passbands, performed CSP in each band, and selected a reduced set of features from all the bands. An offline study [23] suggested its higher performance over the above-mentioned iterative learning method. Furthermore, its efficacy was demonstrated in the latest BCI Competition [24], where it served as the basis of all the winning algorithms in the EEG categories. FBCSP was further improved in [25] by employing a robust maximum mutual information criterion for feature selection. (Another method [8] used the maximum mutual information principle, but in a different formulation, to select spatial components from independent component analysis.) However, learning optimum spatio-spectral filters is still an open issue. Extensions of CSP often inherit its limitation in exploring spatial patterns. Specifically, as shown in the Appendix and in [26, Sec. 10.2], CSP is equivalent to minimizing a classification error bound for two unimodal multivariate Gaussian distributions only. As [13, p. R43] puts it, CSP can also be sensitive to artifacts in the training data, as a single trial contaminated with artifacts can unfortunately cause extreme changes to the filters.
In this paper, we present an information-theoretic approach to learning the spatio-spectral filters. In particular, the approach constructs an optimum spatio-spectral filtering network (OSSFN) that optimizes the filters by maximizing the mutual information between the feature vectors and the corresponding class labels. As mentioned earlier, the maximum mutual information criterion was employed in [25] for feature selection, where numerical optimization of spatial filters was not considered. By contrast, this paper addresses the more challenging and interesting issue of feature extraction, which involves numerical optimization of spatial filters together with selection of a band-pass filter. Therefore, one of the major contributions of this paper is the introduction of a nonparametric mutual information estimate to formulate the objective for spatio-spectral feature extraction. Importantly, based on this new formulation, we devise a gradient-based method for optimization of spatial filters jointly with a band-pass filter. We conduct an experimental study to assess the proposed method, comparing it with existing methods including CSP and FBCSP. The study collects motor imagery data from seven human subjects in our lab. The publicly available BCI Competition IV Dataset I is also used. The study performs randomized cross-validation to assess the classification accuracy with a linear support vector machine, and runs t-tests
TABLE I
LIST OF SYMBOLS

z(t): a block of raw n_c-channel EEG signal; t ∈ [0, L]
x(t): signal after spectral filtering using a band-pass filter h
y(t): signal after spatial filtering
W: spatial filtering matrix ∈ R^(n_c × n_l), with the spatial filter vectors as columns
w_l: the lth spatial filter vector in W, ∈ R^(n_c × 1)
a; A: a particular feature vector ∈ R^(n_l × 1) for z(t); the feature vector variable
ω; Ω: a particular class label; the class label variable
p; P: probability density function and probability function of a random variable
H: entropy of a random variable
I(A, Ω): mutual information between A and Ω
n_a: number of samples (z(t)) in the training data
n_ω: number of class-ω samples in the training data
to verify the statistical significance of the differences between methods.

The rest of this paper is organized as follows. Section II describes the proposed method and formulates the maximum-mutual-information-based learning problem. Section III derives a numerical solution. Section IV describes the experimental study and the results, followed by discussions in Section V. Section VI finally concludes this paper.

II. OSSFN

For the convenience of readers, Table I lists the essential mathematical symbols. The architecture of the proposed filtering network OSSFN is illustrated in Fig. 1. It learns and performs consecutive band-pass filtering, spatial filtering, and log power integration to extract discriminative features for motor imagery classification. The input of the network is a time window of n_c-channel EEG waveforms z(t) (without loss of generality, we assume t ∈ [0, L] in the time window), and the output is a feature vector a that represents the mean power of spatio-spectral components of z(t). The procedure of transforming the EEG block z(t) into the feature vector a comprises the following steps.

1) Spectral filtering: a band-pass filter extracts a specific rhythmic activity of the ERD, producing the band-pass-filtered signal x.

2) Spatial filtering: a linear projection transforms x into a lower dimensional signal y

y(t) = W^T x(t).   (1)

Here, the superscript T denotes the transpose operator. Each column in the transformation matrix W ∈ R^(n_c × n_l) determines one of the n_l spatial filters. Therefore, each element in y describes the activity of a particular spatial component.

3) Log power integral: the ERD features are computed as the log of the mean power of y in the time window

a = log[ (1/L) ∫_0^L y^2(t) dt ].   (2)

Each element in a represents the mean band power of a particular spatial component in W.
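Steps 1)–3) can be sketched numerically. The following is a minimal sketch (not from the paper) of (1) and (2); the channel count, window length, and random data are illustrative, and the band-pass filtering stage is assumed to have been applied already:

```python
import numpy as np

def extract_features(x, W):
    """Spatio-spectral feature extraction per (1)-(2).

    x : band-pass-filtered EEG block, shape (n_c, L) -- channels x samples.
    W : spatial filter matrix, shape (n_c, n_l).
    Returns the feature vector a, shape (n_l,).
    """
    y = W.T @ x                           # (1): spatial filtering
    a = np.log(np.mean(y ** 2, axis=1))   # (2): log of mean band power
    return a

# Illustrative example: 4-channel EEG block, 2 spatial filters.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1000))        # stand-in for a filtered EEG block
W = rng.standard_normal((4, 2))
a = extract_features(x, W)
print(a.shape)  # (2,)
```

Each entry of `a` is the log mean power of one spatial component, matching the description of the log power integral step.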
Fig. 1. Diagram of the proposed network for extracting motor imagery EEG features. A motor imagery EEG block, in the form of time-windowed multivariate waveforms x(t), is processed firstly by a spectral (band-pass) filter to pick up subject-specific responsive rhythm activity, and subsequently by a linear transformation (in the form of spatial filters) and log power integration. The output feature vector describes the mean power of particular spatio-spectral components associated with motor imagery. The network takes a maximum mutual information approach to optimizing the spectral filter and the spatial filters.
The logarithm operation has been widely used since the introduction of CSP in [16], which describes its purpose as “to approximate normal distribution of the data.” We would like to note that another positive effect of the logarithm operation is the reduced dynamic range, which facilitates the subsequent processing, e.g., by a classifier. In addition, extreme feature values (suspects of artifacts) in some EEG blocks can be largely reduced before the corrupted information (such as intraclass variance) is fed into the learning machine. Our BCI experience suggests that the logarithm operation can improve classification accuracy.
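As background for the learning objective introduced next, the identity I(A, Ω) = H(Ω) − H(Ω|A) = H(A) − H(A|Ω) can be checked on a small discrete joint distribution. The 2×2 probabilities below are arbitrary illustrative values, not data from this paper:

```python
import numpy as np

# Joint distribution P(a, w) over a binary feature value and a binary class label.
P = np.array([[0.30, 0.10],   # rows: feature value a
              [0.05, 0.55]])  # cols: class label w

def H(p):
    """Shannon entropy of a distribution (natural log), ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

Pa = P.sum(axis=1)            # marginal over the feature
Pw = P.sum(axis=0)            # marginal over the class label

# I = H(w) - H(w|a): uncertainty about the class, reduced by observing the feature.
I1 = H(Pw) - sum(Pa[i] * H(P[i] / Pa[i]) for i in range(2))
# I = H(a) - H(a|w): the symmetric decomposition.
I2 = H(Pa) - sum(Pw[j] * H(P[:, j] / Pw[j]) for j in range(2))
print(I1, I2)  # both decompositions give the same value
```

The two decompositions agree, which is the property the learning objective below relies on.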
This paper introduces mutual information [27] to formulate the objective function for the learning machine. Consider the mutual information between the feature vector variable A and the class label variable Ω:

I(A, Ω) = H(Ω) − H(Ω|A)
        = H(A) − H(A|Ω)
        = H(A) − Σ_{ω∈Ω} H(A|ω) P(ω)   (3)

where H(Ω) (or H(A)) is the entropy of the class label (or the feature vector), ω is a particular class label (e.g., ω = 1 or ω = 2 represents left- or right-hand motor imagination), H(A|Ω) is the conditional entropy of the obtained feature vector for a particular class, and H(Ω|A) is the conditional entropy of the class label given the obtained feature vector.

Now we define the objective function for learning. Since the feature vector a is determined by the band-pass filter h and the spatial filters W, the objective is to maximize I(A, Ω) with respect to h and W:

{h_opt, W_opt} = arg max_{h,W} I(A, Ω).   (4)

Let us discuss the relevance of mutual information to the objective function for discriminative learning. The mutual information I(A, Ω) is the reduction of uncertainty about the class label by the feature vector [27]: the entropy H(Ω) is the uncertainty about the class label, while after observing the feature vector, the uncertainty reduces to the conditional entropy H(Ω|A). An earlier paper [28] has connected the maximum mutual information criterion to minimum Bayes error via lower and upper bounds. A recent paper [29] further studied the
relationship between maximum mutual information and other criteria for feature extraction, though in the context of linear feature extraction rather than in the present nonlinear context (see the processing steps above). Importantly, that paper concludes that maximum mutual information is Bayesian optimum under more general conditions than other criteria. Coincidentally, recent years have seen attempts [30], [31] to address linear feature extraction problems using the maximum mutual information principle.

III. LEARNING ALGORITHM

The technical challenge in achieving the objective in (4) primarily lies in the fact that the objective function (mutual information) is a function of probability density functionals and generally cannot be expressed in explicit form. To address this problem, we propose a learning method that first introduces a mutual information estimation method and then derives a gradient-based optimization algorithm.

A. Mutual Information Estimate

Since the mutual information in (3) is dependent on the entropies, we approximate it by first estimating the entropies. The entropy of the feature vector variable and the conditional entropy are, respectively, given by

H(A) = −∫ p(a) log(p(a)) da   (5)

and

H(A|ω) = −∫ p(a|ω) log[p(a|ω)] da.   (6)

The entropy of A can be viewed as an expectation of the function log(p(a)) [32, Sec. 5]. Suppose a set of n_a empirical samples of the feature vector a is available: a_i, i = 1, ..., n_a. The entropy can then be estimated by

H(A) = −E[log(p(a))] ≈ −(1/n_a) Σ_{i=1}^{n_a} log(p(a_i)).   (7)

Similarly,

H(A|ω) = −E[log(p(a|ω))] ≈ −(1/n_ω) Σ_{a_i ∈ ω} log(p(a_i)).   (8)

The underlying probability density function can be estimated from the samples using kernel density estimation [33]:

p̂(a) = (1/n_a) Σ_{i=1}^{n_a} φ(a − a_i).   (9)

Using a Gaussian for the kernel function φ, the resulting Gaussian kernel density estimate is well known for its capability for general data analysis [33], [34]. A multivariate Gaussian function is given by

φ(r) = (2π)^(−n_l/2) |ψ|^(−1/2) exp(−(1/2) r^T ψ^(−1) r)   (10)

where r denotes the term a − a_i, and ψ usually takes a diagonal matrix form called the bandwidth matrix. The diagonal elements in the bandwidth matrix determine the smoothness of the kernel. We choose the following bandwidth for the kernel:

ψ_{k,k} = ζ (1/(n_a − 1)) Σ_{i=1}^{n_a} (a_{ik} − ā_k)^2   (11)

where ā_k is the empirical mean of {a_{ik}}, i.e., of the kth elements of the feature vector samples. We use the normal optimal smoothing strategy [34] to set the coefficient, i.e., ζ = (4/(3n_a))^0.1.

By introducing (9) into (7), the entropy H(A) is approximated by

H(A) ≈ Ĥ(A) = −(1/n_a) Σ_{j=1}^{n_a} log{ (1/n_a) Σ_{i=1}^{n_a} φ(a_i − a_j) }.   (12)

The conditional intraclass entropy Ĥ(A|ω) is estimated similarly. We replace the entropies in (3) by the estimates Ĥ(A) and Ĥ(A|ω). This results in a sample-based estimate of the mutual information; its full expression is omitted since it is straightforward from the above.

B. Subspace Gradient Descent Learning

In this section, we derive a numerical solution for maximizing the mutual information estimate with respect to the spatial filters in W in conjunction with a band-pass filter. For simultaneous optimization of all the spatial filter vectors in W, we consider a joint vector formed by concatenating all the spatial filters:

ŵ = [w_1^T ... w_l^T ... w_{n_l}^T]^T.   (13)

As described earlier, the mutual information I(A, Ω) is estimated from all the feature vector samples {a_i}. Since each of the samples is in turn a function of ŵ, we have

∂I(A, Ω)/∂ŵ = Σ_{i=1}^{n_a} (∂I(A, Ω)/∂a_i)(∂a_i/∂ŵ).   (14)

The partial derivative ∂I(A, Ω)/∂a_i can be computed by differentiating (3), which gives

∂I(A, Ω)/∂a_i = ∂H(A)/∂a_i − P(ω) ∂H(A|ω)/∂a_i   (15)

where ω is the class label of the sample a_i. To compute ∂I/∂a_i, the partial derivatives ∂H(A)/∂a_i and ∂H(A|ω)/∂a_i are required. To compute ∂H(A)/∂a_i, differentiate (12) with respect to a_i, which gives

∂H(A)/∂a_i = −(1/n_a) Σ_{j=1}^{n_a} β_j (1/n_a) Σ_{k=1}^{n_a} ∂φ[a_j − a_k]/∂a_i   (16)

where

β_j = [ (1/n_a) Σ_{k=1}^{n_a} φ[a_j − a_k] ]^(−1)   (17)
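A minimal numerical sketch of (9)–(12) and the resulting sample-based mutual information estimate, on synthetic two-class Gaussian features (the data, sizes, and class means are illustrative, not from the paper):

```python
import numpy as np

def gaussian_kernel(r, psi_diag):
    """Multivariate Gaussian kernel (10) with a diagonal bandwidth matrix psi."""
    d = len(psi_diag)
    norm = (2 * np.pi) ** (-d / 2) * np.prod(psi_diag) ** -0.5
    return norm * np.exp(-0.5 * np.sum(r ** 2 / psi_diag, axis=-1))

def entropy_kde(samples):
    """Entropy estimate (12) from feature vector samples, shape (n_a, n_l)."""
    n_a = len(samples)
    zeta = (4.0 / (3.0 * n_a)) ** 0.1
    psi_diag = zeta * samples.var(axis=0, ddof=1)          # bandwidth (11)
    diffs = samples[None, :, :] - samples[:, None, :]      # all pairwise a_i - a_j
    p_hat = gaussian_kernel(diffs, psi_diag).mean(axis=1)  # KDE (9) at each a_j
    return -np.mean(np.log(p_hat))

def mutual_information(samples, labels):
    """I(A, Omega) = H(A) - sum_w P(w) H(A|w), with KDE entropy estimates."""
    I = entropy_kde(samples)
    for w in np.unique(labels):
        mask = labels == w
        I -= mask.mean() * entropy_kde(samples[mask])
    return I

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(+1, 1, (100, 2))])
y = np.repeat([0, 1], 100)
print(mutual_information(X, y))  # clearly positive for separated classes
```

As expected, the estimate is substantially larger for well-separated classes than for randomly labeled data, which is the property the gradient ascent below exploits.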
The derivative of the Gaussian kernel is

∂φ(a_j − a_k)/∂a_i =
  −φ(a_i − a_k) ψ^(−1) (a_i − a_k), if i = j
  −φ(a_i − a_j) ψ^(−1) (a_i − a_j), if i = k
  0, otherwise.   (18)

The computation of the partial derivative ∂H(A|ω)/∂a_i is performed similarly.

To compute the partial derivative ∂a_i/∂ŵ, we first consider a particular element, say, the lth element a_{il} of a_i. From (2), the partial derivative of this element with respect to w_l is

∂a_{il}/∂w_l = ∂ log[(1/L) ∫_0^L (w_l^T x_i(t))^2 dt] / ∂w_l = (2/(L e^{a_{il}})) w_l^T R_{x_i}   (19)

where x_i(t) denotes the EEG sequence in the ith trial, and

R_{x_i} = ∫_0^L x_i(t) x_i^T(t) dt.   (20)

Since a_{ij} is dependent on w_j only

∂a_{ij}/∂w_l = 0 if j ≠ l   (21)

the partial derivative of a_i with respect to ŵ is thus the block diagonal matrix

∂a_i/∂ŵ = diag[ (2/(L e^{a_{i1}})) w_1^T R_{x_i}, (2/(L e^{a_{i2}})) w_2^T R_{x_i}, ..., (2/(L e^{a_{i n_l}})) w_{n_l}^T R_{x_i} ].   (22)

Now we can compute the gradient by introducing the above equation into (14). However, a practical issue arises for multichannel EEG and multiple spatial filters. Consider an example in which the EEG has n_c = 59 channels and W contains n_l = 2 filters. The number of free parameters would be 2 × 59 = 118. Gradient-based optimization in this high-dimensional space would be difficult. To address this issue, we propose a subspace optimization approach below.

Consider an n_u-dimensional (n_u ≪ n_c) subspace U, linearly spanned by the n_c-dimensional column vectors that serve as bases in a matrix U ∈ R^(n_c × n_u):

U = [u_1, u_2, ..., u_k, ..., u_{n_u}]   (23)

where u_k denotes the kth basis vector. A spatial filter vector w_l in the subspace can be expressed as

w_l = Σ_{k=1}^{n_u} b_{lk} u_k = U b_l   (24)

where b_l is a coefficient vector that determines w_l:

b_l = [b_{l1}, b_{l2}, ..., b_{l n_u}]^T.   (25)

Hence, b_l is the low-dimensional representation of the spatial filter w_l. In the subspace U, simultaneous optimization of the spatial filters is equivalent to simultaneous optimization of the concatenated coefficient vector

b̂ = [b_1^T b_2^T ... b_{n_l}^T]^T.   (26)

Now consider the partial derivative of I(A, Ω) with respect to b̂:

∂I(A, Ω)/∂b̂ = Σ_{i=1}^{n_a} (∂I/∂a_i)(∂a_i/∂b̂).   (27)

Substitution of (24) into (2) gives

a_{il} = log[ (1/L) ∫_0^L ((U b_l)^T x_i(t))^2 dt ].   (28)

Similar to (19), differentiating (28) gives

∂a_{il}/∂b_l = (2/(L e^{a_{il}})) (U b_l)^T R_{x_i} U.   (29)

Therefore,

∂a_i/∂b̂ = diag[ (2/(L e^{a_{i1}})) (U b_1)^T R_{x_i} U, ..., (2/(L e^{a_{i n_l}})) (U b_{n_l})^T R_{x_i} U ].   (30)
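The analytic gradient (29) can be checked against central finite differences on synthetic data, discretizing the time integral as a sum over samples (all sizes and random data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n_c, n_u, L = 6, 3, 200
X = rng.standard_normal((n_c, L))    # one EEG trial x_i(t), channels x samples
U = rng.standard_normal((n_c, n_u))  # subspace bases, cf. (23)
b = rng.standard_normal(n_u)         # subspace coordinates b_l of one filter

R = X @ X.T                          # discrete analog of R_{x_i} in (20)

def a_of(b):
    """Feature (28): log mean power of the spatially filtered trial."""
    w = U @ b                        # w_l = U b_l, cf. (24)
    return np.log((w @ R @ w) / L)

# Analytic gradient (29): (2 / (L e^a)) (U b)^T R U.
grad = 2.0 / (L * np.exp(a_of(b))) * (U @ b) @ R @ U

# Central finite-difference check.
eps = 1e-6
num = np.array([(a_of(b + eps * e) - a_of(b - eps * e)) / (2 * eps)
                for e in np.eye(n_u)])
print(np.max(np.abs(grad - num)))  # tiny: analytic and numerical gradients agree
```

Since a_{il} = log(b^T U^T R U b / L), the chain rule gives exactly the factor 2/(L e^{a_{il}}) in (29), which the finite-difference check confirms.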
Now, introducing (30) into the expression for ∂I(A, Ω)/∂b̂ (which takes a form similar to (14), with b̂ substituted for ŵ), we can compute the gradient of the mutual information estimate with respect to the low-dimensional b̂. This effectively reduces the number of free parameters for learning: in the earlier example, the number of free parameters reduces from 118 to 8 in an n_u = 4-D subspace. How to optimally construct the subspace U is, however, beyond the scope of this paper. Tentatively, we simply use the spatial filters produced by CSP (with the band-pass filter selected by FBCSP) as the subspace bases. We would like to stress that the proposed optimization procedure, as a general approach, is neither tailored nor dedicated to the CSP or FBCSP subspace. We expect that more effective subspace construction methods will be devised.

As mentioned earlier, the subject-specific sensorimotor rhythm of the ERD must be selected for effective extraction of the spatial patterns associated with the ERD. To this end, we need to maximize the mutual information estimate with respect to the spatial filters in conjunction with a band-pass filter. Inspired by previous works [22], [35] that choose the optimum band-pass
filter from an array of filter banks, we propose a joint spatio-spectral filter learning algorithm below (Fig. 2) in a filter bank framework. Briefly, the algorithm first decomposes the EEG data into an array of frequency bands that cover the range of possible ERD rhythms, performs spatial filter optimization in each band, and then selects the band with the maximum mutual information estimate.

IV. EXPERIMENTS AND RESULTS

This section reports an offline analysis of the proposed method for extracting the ERD features.

A. Materials: Motor Imagery EEG Datasets

1) BCI Competition IV Dataset I: The dataset [24] consists of both human and artificially generated motor imagery data. We consider the human EEG data only, which were collected from four healthy subjects using the BrainAmp MR plus EEG amplifier with 59 channels sampled at 1000 Hz. Each subject participated in two data collection sessions with different protocols, as described below.

In the calibration session, a visual cue was displayed on a computer screen to the subjects, who then started to perform motor imagery tasks according to the cue. The cue represented specific motor imagery tasks: each subject chose two classes of motor imagery tasks from left hand, right hand, or foot. Specifically, subject “a” chose {left, foot}, “b” chose {left, right}, “f” chose {left, foot}, and “g” chose {left, right}. Each subject performed a total of 200 motor imagery tasks (balanced between the two tasks), each in the [0, 4]-s window after the cue. Consecutive motor imagery tasks were interleaved with a 4-s break.

In the evaluation session, the subjects followed the soft voice commands of an instructor to perform motor imagery tasks of varying length between 1.5 and 8 s. Consecutive tasks were also interleaved with an interval of varying length from 1.5 to 8 s. This session was meant for offline validation of motor imagery classification algorithms for self-paced BCI (see [36]).
Our study uses the down-sampled data (provided by the organizer) at a 100-Hz sampling rate, with all 59 channels employed for spatio-spectral feature extraction. The 59 channels are AF3, AF4, F5, F3, F1, Fz, F2, F4, F6, FC5, FC3, FC1, FCz, FC2, FC4, FC6, CFC7, CFC5, CFC3, CFC1, CFC2, CFC4, CFC6, CFC8, T7, C5, C3, C1, Cz, C2, C4, C6, T8, CCP7, CCP5, CCP3, CCP1, CCP2, CCP4, CCP6, CCP8, CP5, CP3, CP1, CPz, CP2, CP4, CP6, P5, P3, P1, Pz, P2, P4, P6, PO1, PO2, O1, and O2.

2) Our Motor Imagery Data Set: The data were recorded in our laboratory from seven healthy male subjects. Each subject performed 160 tasks of motor imagery (including 80 left-hand and 80 right-hand tasks). Similar to the calibration session of the BCI Competition dataset, the data collection procedure used visual cues to prompt the subjects to perform motor imagery tasks for 4 s each.
Consecutive motor imagery tasks were interleaved with a 6-s break. The EEG data were recorded using a NuAmps amplifier with 25 channels sampled at 250 Hz. The 25 channels, including F7, F3, Fz, F4, F8, FT7, FC3, FCz, FC4, FT8, T7, C3, Cz, C4, T8, TP7, CP3, CPz, CP4, TP8, P7, P3, Pz, P4, and P8, cover the full scalp. The data collection and study were approved by the National University of Singapore Institutional Review Board with reference code 08-036.

Given the considerable difference in data collection setup in terms of EEG amplifiers and motor imagery task protocols, effective unification of the two datasets is difficult. Instead, this paper validates the proposed method on the two datasets separately. Furthermore, this allows validation of the proposed method in two different conditions (or effectively three conditions, since the calibration session and the evaluation session in the BCI Competition data used different protocols), which is an important consideration for studying generalization performance.

B. Selection of Hyperparameters

The following describes how we set the hyperparameters for feature extraction and classification. First, selection of a time interval in the motor imagery tasks is almost a common practice in learning motor imagery EEG. This paper selects the time interval [1, 4] s after the cue. The first 1-s period after the cue is excluded since it contains the spontaneous responses (evoked potentials) to the cue stimulus [37, Sec. V]. As the BCI Competition evaluation set has motor imagery tasks of varying duration, we consider the same time interval and remove those motor imagery tasks shorter than 4 s. Consequently, the number of remaining motor imagery tasks in the evaluation data ranges from 111 to 126 across the four subjects. Second, the filter banks (an array of band-pass filters) are constructed to continuously cover a wide frequency range.
Specifically, a total of eight Chebyshev Type II filters (though other types of filters can also be used instead) are built with center frequencies spanning from 8 to 32 Hz at a constant interval in the logarithm domain. Consequently, the center frequencies are, respectively, 8, 9.75, 11.89, 14.49, 17.67, 21.53, 26.25, and 32 Hz. All the filters have a uniform Q-factor (bandwidth-to-center-frequency ratio) of 0.33 as well as an order of 4. The filter banks process each of the EEG blocks separately after the blocks are extracted from the selected time interval mentioned above.

The number of spatial filters to be constructed is also an important hyperparameter. This paper considers the learning of two spatial filters only, corresponding to a transformation matrix W in (1) with two column vectors. Consequently, the feature vector is bivariate.

C. Mutual Information Surface and Selected Spatial Patterns

Here we use the calibration data from the BCI Competition dataset to investigate the surface of the mutual information versus the spatial filters. We visualize the mutual information estimate in a low-dimensional space U (see Section III-B), in which each point defines a particular spatial filter. As more
Fig. 2. Learning algorithm for the spatio-spectral filtering network.

Input: training EEG data comprising N sample blocks {z(t)}, each block with a specific class label.
Output: the filtering network depicted in Fig. 1, with optimum parameters for the spatial filters and the selected optimum band-pass filter.

Step 1: Construct an array of n_s band-pass filters that covers the EEG rhythms of motor imagery, then filter {z(t)} to yield {x_m(t)} for m = 1, ..., n_s.

Step 2: For each band-pass filter's output {x_m(t)}:
  1) Construct a discriminative spatial filter subspace:
     a) compute the empirical covariance matrices of the two classes, Σ_x0 and Σ_x1;
     b) compute the eigenvectors and eigenvalues of Σ_x0^(−1) Σ_x1 (refer to equation (37));
     c) select the n_u eigenvectors that correspond to the largest and smallest eigenvalues λ, sort the eigenvectors from large to small eigenvalues, and use them as the bases U of the low-dimensional subspace for parameterization of the spatial filters.
  2) Set the initial parameters of the spatial filters: b_1^0 = [1, 0, ..., 0], b_2^0 = [0, ..., 0, 1], b_3^0 = [0, 1, 0, ..., 0], and so on. This setting effectively chooses the top and bottom spatial filters generated by CSP or FBCSP.
  3) Set the iteration count k = 0 and repeat the following steps until convergence, the criterion being that the change of the mutual information estimate falls below a small threshold ζ:
     a) compute the spatial filters W from b̂_k as in (24): W = U b̂_k, where b̂_k = [b_1^k, b_2^k, ..., b_{n_l}^k];
     b) use W to update the feature vectors according to (1) and (2);
     c) compute the gradient ∇b̂ = ∂I/∂b̂ using (27);
     d) perform a linear search over a step factor s, selected from the range [−1, 1] at an interval of 0.01:
        i) set

           b̂(s) = b̂_k + s ∇b̂ / ||∇b̂||_2   (31)

           where ||·||_2 denotes the l2 norm of the gradient vector;
        ii) compute the mutual information estimate I(s) with the spatial filters defined by b̂(s);
     e) update the parameter vectors for the spatial filters using the optimum step s_opt = arg max_s I(s);
     f) update the mutual information I_k = I(s_opt) and set b̂_k = b̂(s_opt);
     g) compute the change in mutual information δ = I_k − I_{k−1} (with I_{k−1} = I(s = 0) if unassigned); if δ < ζ or the iteration count k exceeds a preset number, continue to the next step; otherwise go back to step a);
     h) set the optimum spatial filters for this frequency band as W_m = U b̂_k, and the corresponding mutual information as I_m = I_k.

Step 3: Select the optimum frequency band m_opt = arg max_m I_m, set the spectral filter to be the m_opt-th band-pass filter, and set the spatial filters to W_{m_opt}.
than 2-D surfaces would be difficult to visualize, we consider a 2-D subspace spanned by the first pair (i.e., the top and the bottom) of CSP filters from the frequency band selected by FBCSP [22]. Fig. 3 uses color images to illustrate the result for each of the four subjects in the BCI Competition data. Each pixel represents a spatial filter, and the value of the corresponding mutual information estimate is denoted by the color. In three of the subjects, namely “a,” “b,” and “g,” a peak of the mutual information estimate appears near the point [0 1], which represents the bottom spatial filter from FBCSP. In “f,” however, no such peak is found near [0 1]; instead, a peak is prominent near the point [1 0], which corresponds to the top filter from FBCSP. Hence, we use the top filter for the FBCSP mark in “f,” and the bottom one for the others. The result suggests favorable conditions for the proposed method. First, the surface is smooth, which facilitates gradient-based optimization. Second, the target peaks on the mutual information surface often have an FBCSP filter in the vicinity, which validates the use of FBCSP spatial filters for initializing the optimization.
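As an illustration, the core of the learning loop in Fig. 2 (gradient ascent on the mutual information estimate with a normalized-gradient line search) might be sketched as follows. This is a minimal sketch, not the paper's implementation: `mi_estimate` stands in for the sample-based mutual information estimator, and a central-difference gradient replaces the analytic gradient of (27); the helper names, `zeta`, and `max_iter` are our choices.

```python
import numpy as np

def numerical_gradient(f, b, eps=1e-5):
    """Central-difference gradient (stand-in for the analytic gradient of (27))."""
    g = np.zeros_like(b)
    for i in range(b.size):
        d = np.zeros_like(b)
        d[i] = eps
        g[i] = (f(b + d) - f(b - d)) / (2 * eps)
    return g

def learn_spatial_filters(mi_estimate, U, b0, zeta=1e-4, max_iter=100):
    """Gradient ascent with a normalized-gradient line search (Fig. 2, Step 2.3).

    mi_estimate(b): mutual information estimate for coefficient vector b;
    U: subspace bases; b0: initial coefficients. Returns W = U b and b.
    """
    b = np.asarray(b0, dtype=float)
    mi_prev = mi_estimate(b)                      # I(s = 0) before any update
    for _ in range(max_iter):
        grad = numerical_gradient(mi_estimate, b)
        step = grad / np.linalg.norm(grad)        # normalize by the l2 norm, as in (31)
        s_grid = np.arange(-1.0, 1.005, 0.01)     # step factors s in [-1, 1], interval 0.01
        mi_vals = [mi_estimate(b + s * step) for s in s_grid]
        s_opt = s_grid[int(np.argmax(mi_vals))]   # line search: s_opt = argmax_s I(s)
        b = b + s_opt * step
        mi = max(mi_vals)
        if mi - mi_prev < zeta:                   # convergence: change below threshold
            break
        mi_prev = mi
    return U @ b, b
```

Because s = 0 is in the search grid, the estimate is nondecreasing across iterations, so the stopping rule of step g) applies directly.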
Furthermore, Fig. 4 shows the selected spatial patterns (each with a particular frequency band) that together maximize the mutual information measure for each of the subjects in the competition dataset. The patterns are consistent with neurophysiological principles on motor imagery, except for subject “b.” For example, the spatial patterns for subject “g” show that the two most discriminative patterns correspond to EEG sources that originate from the motor cortex of the right and left hemispheres. Furthermore, the frequency bands of the selected spatial patterns are mostly from the Beta rhythm, except for the second spatial pattern of subject “a.”

D. Classification Results

This paper compares the proposed method with CSP and FBCSP, using five rounds of fivefold cross-validation. FBCSP shares the same band-pass filter array as described in Section IV-B; in effect, it is the proposed method before optimization. Thus, it also selects only one frequency band and two spatial filters. CSP is implemented following [16], using an [8 30]-Hz Chebyshev Type II band-pass filter. The top and the bottom filters from CSP are selected.
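For reference, the constant-Q filter-bank geometry of Section IV-B (shared by FBCSP and the proposed method) can be reproduced in a few lines of NumPy. The sampling rate `fs` and the symmetric band-edge convention are our assumptions; the actual order-4 Chebyshev Type II designs (e.g., via `scipy.signal.cheby2`) are omitted here.

```python
import numpy as np

fs = 250.0  # assumed EEG sampling rate (needed only for the digital filter design, omitted here)
n_filters, f_lo, f_hi, q = 8, 8.0, 32.0, 0.33

# Center frequencies at a constant interval in the logarithmic domain
centers = np.logspace(np.log10(f_lo), np.log10(f_hi), n_filters)
# rounded: 8, 9.75, 11.89, 14.49, 17.67, 21.53, 26.25, 32 Hz, matching the text

# Constant Q-factor of 0.33: bandwidth = q * center frequency,
# here taken symmetrically about each center (our convention)
bands = [(fc - 0.5 * q * fc, fc + 0.5 * q * fc) for fc in centers]
```

Each band then feeds one branch of the filtering network, yielding the per-band signals {x_m(t)} of Fig. 2.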
ZHANG et al.: OPTIMUM SPATIO-SPECTRAL FILTERING NETWORK FOR BRAIN–COMPUTER INTERFACE
Fig. 3. Surface of the mutual information estimate over a bivariate b (see (24)), i.e., the coefficient vector that defines a spatial filter. Notes: see Section IV-C for details; the four graphs correspond to the four subjects in BCI Competition IV Dataset I. The axes b1 and b2 denote the first and the second elements of the coefficient vector b in (24). The value of the mutual information estimate is indicated by the color according to the overhead color bar. See Section III-B for the description of the parameterization of spatial filters in the subspace of spatial filters; here the subspace is spanned by two FBCSP spatial filters, so that, e.g., the point [1 0] corresponds to the first FBCSP spatial filter. The previous FBCSP filter and the local optimum filter are annotated by a square and a circle, respectively, in each graph.
[Fig. 4 panel labels (motor imagery class and frequency band, as printed above the plots): foot [27.7 36.3] Hz; left [6.7 9.3] Hz; foot [6.7 9.3] Hz; left [14.8 20.6] Hz; left [6.7 9.3] Hz; right [27.7 36.3] Hz; left [18.0 25.1] Hz; right [18.0 25.1] Hz — for Subjects “a,” “b,” “f,” and “g.”]
Fig. 4. Spatial patterns of motor imagery EEG. In each column, the two spatial features selected according to maximal mutual information between classes are plotted in the form of spatial patterns (see [16]). The positions of the electrodes are superimposed as black dots. Each spatial pattern is viewed from the top, with the nose facing upward. The motor imagery class and the frequency band of each feature are given above the corresponding plot.
TABLE II
CLASSIFICATION ACCURACY (MEAN AND SD) IN BCI COMPETITION IV DATASET I

Sub | Train–Test   | CSP         | FBCSP       | OSSFN      | p (OSSFN=CSP) | p (OSSFN=FBCSP)
a   | Calib.       | 67.1 (11.4) | 66.5 (13.3) | 89.8 (9.0) | <0.01         | <0.01
a   | Calib.–Eval. | 65.9 (4.7)  | 79.2 (7.2)  | 93.6 (3.4) | <0.01         | <0.01
b   | Calib.       | 76.6 (10.6) | 87.3 (5.5)  | 86.8 (5.6) | <0.01         | 0.31
b   | Calib.–Eval. | 66.2 (8.3)  | 90.5 (0.9)  | 90.8 (0.9) | <0.01         | 0.07
f   | Calib.       | 65.3 (10.7) | 87.2 (14.4) | 93.0 (4.7) | <0.01         | 0.05
f   | Calib.–Eval. | 60.8 (5.3)  | 92.4 (7.6)  | 96.0 (0.8) | <0.01         | 0.03
g   | Calib.       | 74.6 (6.9)  | 82.5 (9.8)  | 92.4 (3.9) | <0.01         | <0.01
g   | Calib.–Eval. | 68.2 (1.3)  | 84.4 (7.0)  | 95.3 (0.9) | <0.01         | <0.01
Notes: the first column denotes the subjects; the second column denotes two types of cross-validation study (see Section IV-D): “Calib.” stands for five rounds of fivefold cross-validation in the calibration session of the data; “Calib.–Eval.” stands for the test that uses the evaluation session to assess the generalization performance of the models built in “Calib.” The two rightmost columns summarize the statistical significance tests (paired t-test); the p-values represent the probabilities of the null hypotheses that the proposed method produces the same mean accuracy as CSP (OSSFN = CSP) and the same mean as FBCSP (OSSFN = FBCSP). A p-value is set in bold if ≤ 0.05, meaning that the null hypothesis is rejected at the >95% confidence level.

TABLE III
CLASSIFICATION ACCURACY (MEAN AND SD) IN OUR DATASET

Sub | CSP         | FBCSP      | OSSFN      | p (OSSFN=CSP) | p (OSSFN=FBCSP)
1   | 82.0 (9.8)  | 86.7 (5.2) | 87.7 (5.5) | <0.01         | 0.25
2   | 84.7 (10.9) | 88.4 (5.6) | 90.7 (5.1) | 0.01          | <0.01
3   | 70.6 (10.9) | 79.1 (7.0) | 80.0 (7.2) | <0.01         | 0.57
4   | 87.3 (5.4)  | 87.6 (4.9) | 89.8 (5.5) | 0.02          | 0.05
5   | 64.0 (11.5) | 68.2 (6.8) | 67.8 (8.2) | 0.20          | 0.85
6   | 63.1 (13.2) | 66.2 (7.4) | 71.1 (7.2) | <0.01         | <0.01
7   | 78.1 (7.5)  | 86.0 (5.7) | 89.3 (7.2) | <0.01         | <0.01
Note: Refer to the notes under Table II or Section IV-D for explanation.
The cross-validation technique assesses how the results generated by the methods generalize to an independent dataset. Each round of fivefold cross-validation partitions the data into five subsets; in turn, four subsets are aggregated as the training set for learning, and the remaining subset serves as the test set for validating the learned model. Five rounds of cross-validation are performed using different partitions of the data in order to reduce variability. The partitions are randomly generated using the cross-validation function “crossvalind” in the MATLAB Bioinformatics Toolbox. The cross-validation study thus creates a total of 25 pairs of training and test tasks. Depending on the size of the total data for each subject, the number of EEG blocks is 160 (or 128) in each training set and 40 (or 32) in each test set for the BCI Competition data (or our data). To ensure a valid comparison between different methods, they all use the same data partitions in cross-validation. On the BCI Competition dataset in particular, a special cross-validation is performed to evaluate the method’s generalization performance. It begins with the cross-validation on the calibration set. Then the trained models are applied to both the
cross-validation test set (part of the calibration set) and the whole evaluation set. A classifier is used to assess the performance of the feature extraction methods in classifying motor imagery. The classifier learns and predicts the class labels from the features generated by CSP, FBCSP, and the proposed method, respectively, and the accuracy rate is taken as the performance measure. For the classifier, we consider the linear support vector machine (SVM), since it is widely used in the field. In particular, we use the default implementation in the LIBSVM toolbox [38] (tuning of the SVM regularization parameter is not performed, based on our experience). Tables II and III summarize the results. Out of a total of 15 cases (each case being a particular subject session), the proposed method significantly outperformed (at the 95% confidence level) CSP in 14 cases and FBCSP in 10 cases. Compared to FBCSP, the largest boost in classification accuracy was for subject “a,” whose mean accuracy rate increased from 66.5% to 89.8% in the calibration test and from 79.2% to 93.6% in the calibration–evaluation test.
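The 25-task partitioning described above can be sketched in NumPy as follows; this is a stand-in for MATLAB's `crossvalind`, and the generator seed is our choice.

```python
import numpy as np

def five_by_fivefold(n_blocks, n_rounds=5, n_folds=5, seed=0):
    """Generate the 25 train/test partitions (5 rounds of fivefold CV).

    Each round shuffles the block indices, splits them into five folds,
    and in turn uses four folds for training and the remaining one for
    testing, so all methods can share identical partitions.
    """
    rng = np.random.default_rng(seed)
    tasks = []
    for _ in range(n_rounds):
        order = rng.permutation(n_blocks)
        folds = np.array_split(order, n_folds)
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            tasks.append((train, test))
    return tasks

# BCI Competition data: 200 blocks -> 160 training / 40 test blocks per task
tasks = five_by_fivefold(200)
```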
V. DISCUSSION

The experimental results have demonstrated the efficacy of the proposed approach. Compared with the state-of-the-art FBCSP, the proposed method produced higher classification accuracy in 10 cases with statistical significance and in 3 cases without statistical significance, and it did not deteriorate classification accuracy with statistical significance in the remaining two cases. Moreover, for the proposed method, the classification accuracy in the calibration session and that in the calibration–evaluation test (calibration models applied to evaluation data) are similar. For example, the method yielded a mean classification accuracy of 89.8% for subject “a” in the cross-validation on the calibration data, while the same models applied to the evaluation set yielded a mean accuracy of 93.6%, which was even slightly higher. The slightly better performance on the evaluation set is interesting and calls for future studies.

CSP was initially designed for two-class paradigms only [8]. Extensions to multiclass paradigms have been suggested, but they are based on heuristics. In comparison, the mutual information formulation for learning naturally handles multiclass problems. Hence, further work may look into the use of the present method for multiclass motor imagery classification. We note, however, that constructing an effective subspace for the low-dimensional representation of spatial filters will be a challenge in multiclass contexts.

Besides, there is a connection between the current optimization procedure and FBCSP. The spatial filter learning algorithm runs in a low-dimensional representation subspace, instead of the original space of multichannel EEG, in order to ease the optimization problem. In the subspace, any spatial filter can be expressed as a combination of the subspace bases, and tentatively we use FBCSP to form the bases.
Nevertheless, the optimization procedure, as a general approach, is neither tailored nor dedicated to FBCSP-created subspaces. This means that one may devise more effective subspace construction methods, run the optimization procedure there, and expect improved performance.

There is a growing awareness of the importance of self-paced BCI, which allows the user to operate the BCI at any time at will, thus providing more natural and potentially faster interaction [39]. This paper, on the other hand, used a cue-based classification scheme to examine the efficacy of the optimum spatio-spectral filtering method. Nevertheless, subject-specific motor imagery models, learned through cue-based calibration, are still necessary for initializing self-paced BCI systems [36], [40], [41] for each user. Thus, it will be interesting to investigate the proposed method for self-paced BCI in future studies.

VI. CONCLUSION

In this paper, we have considered extracting spatio-spectral features of the ERD for motor imagery classification. We formulated the learning of optimum spatio-spectral filters as a maximum mutual information problem, and proposed a gradient-based optimization approach to solve it. To make the solution robust and efficient, we developed a
subspace spatial-filter learning approach in which the spatial filters are parameterized by lower dimensional vectors. The experimental results attest to the efficacy of the proposed method. Compared to CSP and FBCSP, the method produced significantly higher classification accuracy in most cases, and it did not deteriorate classification accuracy with statistical significance in the remaining few cases. We expect that more effective subspace construction methods can be devised to further improve the performance and to extend the method to multiclass motor imagery classification.

VII. ACKNOWLEDGMENT

The authors would like to thank the organizers of the BCI Competition IV [24] and the providers of BCI Competition Dataset I [41].

APPENDIX A
RELATIONSHIP OF OSSFN WITH CSP AND FBCSP

This section briefly reviews the CSP and FBCSP algorithms and then discusses their relations to the proposed OSSFN.

1) CSP: CSP computes features whose variances are optimum for discriminating two classes of EEG measurements [16]. The method is based on the simultaneous diagonalization of the covariance matrices of the two classes. In summary, the spatially filtered signal y(t) of a single-trial EEG x(t) is given by (1), where W^T is the CSP projection matrix. The spatially filtered signal y(t) maximizes the difference in the variance of the two classes of EEG, obtained by solving the eigenvalue decomposition problem [17]

    \Sigma_{x_0} W = \Sigma_{x_1} W \Lambda    (32)

where \Sigma_{x_0} (or \Sigma_{x_1}) is the covariance matrix of the band-pass filtered EEG of motor imagery class 0 (or 1), and \Lambda is the diagonal matrix containing the eigenvalues. Refer to [26, Sec. 10.2].

From a pattern classification perspective, CSP is equivalent to the optimum linear transformation that minimizes the Bhattacharyya bound for two zero-mean unimodal Gaussian classes of EEG time samples. Below is a brief explanation. The Bhattacharyya bound is an upper bound on the Bayesian classification error, given as

    \epsilon_B(x) = \sqrt{P(\omega_0) P(\omega_1)} \int \sqrt{p(x \mid \omega_0)\, p(x \mid \omega_1)}\, dx    (33)

where p(x|\omega_0) and p(x|\omega_1) are the conditional probability density functions of the multichannel EEG x for the two classes \omega_0 and \omega_1. After transforming x linearly into y using W, the Bhattacharyya bound becomes

    \epsilon_B(y) = \sqrt{P(\omega_0) P(\omega_1)} \int \sqrt{p(y \mid \omega_0)\, p(y \mid \omega_1)}\, dy.    (34)

If the conditional density functions are both zero-mean Gaussians with covariance matrices \Sigma_{x_0} and \Sigma_{x_1}, minimizing
the Bhattacharyya bound is equivalent to solving the dual problem [26, Sec. 10.2, eqs. (10.42) and (10.43)]

    \Sigma_{x_0}^{-1} \Sigma_{x_1} W = W \Sigma_{y_0}^{-1} \Sigma_{y_1}    (35)
    \Sigma_{x_1}^{-1} \Sigma_{x_0} W = W \Sigma_{y_1}^{-1} \Sigma_{y_0}    (36)

where \Sigma_{y_0} = W^T \Sigma_{x_0} W and \Sigma_{y_1} = W^T \Sigma_{x_1} W are the covariance matrices of y in classes \omega_0 and \omega_1. Note that in the implementation of [16], the covariance matrix was computed as the mean of the trial-based covariance matrices after normalization (dividing each covariance matrix by its trace). A solution to the above two equations exists when W consists of the eigenvectors of both \Sigma_{x_1}^{-1} \Sigma_{x_0} and \Sigma_{x_0}^{-1} \Sigma_{x_1}. Since these two matrices are related by \Sigma_{x_1}^{-1} \Sigma_{x_0} = (\Sigma_{x_0}^{-1} \Sigma_{x_1})^{-1}, they share the same eigenvector matrix, and the solution is given by

    \Sigma_{x_0} W = \Sigma_{x_1} W \Lambda    (37)

where \Lambda is the diagonal matrix of eigenvalues. The eigenvectors thus obtained are exactly the same as those obtained using CSP. Besides, as [26] puts it, the minimization of the Bhattacharyya bound is associated with the maximization of the measure

    J = \sum_{i=1}^{n} \log\left( \lambda_i + \frac{1}{\lambda_i} + 2 \right).    (38)

Therefore, in order to minimize the Bhattacharyya bound, the eigenvectors corresponding to the largest (\lambda_i + 1/\lambda_i + 2) terms are to be selected.

2) FBCSP: FBCSP [22] processes the input EEG with an array of band-pass filters and applies CSP to the EEG data after each band-pass filter. It then concatenates the CSP features extracted from each filter band to form a joint feature vector, from which it selects a discriminative set of features by employing a mutual-information-based feature selection algorithm.

Therefore, the relationship of OSSFN with CSP and FBCSP can be summarized as follows.
1) Spatial filtering: CSP and FBCSP produce the optimum linear transformation for EEG samples of unimodal Gaussians; OSSFN, on the other hand, employs a sample-based nonparametric mutual information estimate as the objective function and can explore complex data structures.
2) Spectral filtering: CSP itself does not address the selection of the band-pass filter. FBCSP selects pairs of spatial features produced by CSP from each band-pass filter in a filter bank, where the selection is performed to maximize a mutual information estimate. OSSFN selects the optimum band-pass filter in conjunction with the optimization of spatial filters for each band-pass filter, resulting in optimum spatio-spectral features for trial-by-trial EEG classification.
REFERENCES

[1] J. R. Wolpaw, “Brain-computer interfaces as new brain output pathways,” J. Physiol., vol. 579, no. 3, pp. 613–619, Mar. 2007.
[2] A. Nijholt and D. Tan, “Brain-computer interfacing for intelligent systems,” IEEE Intell. Syst., vol. 23, no. 3, pp. 72–79, May–Jun. 2008.
[3] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, “Brain-computer interfaces for communication and control,” Clin. Neurophysiol., vol. 113, no. 6, pp. 767–791, Jun. 2002.
[4] G. Pfurtscheller, C. Neuper, D. Flotzinger, and M. Pregenzer, “EEG-based discrimination between imagination of right and left hand movement,” Electroencephalogr. Clin. Neurophysiol., vol. 103, no. 6, pp. 642–651, Dec. 1997.
[5] J. Muller-Gerking, G. Pfurtscheller, and H. Flyvbjerg, “Designing optimal spatial filters for single-trial EEG classification in a movement task,” Clin. Neurophysiol., vol. 110, no. 5, pp. 787–798, May 1999.
[6] A. Kübler, F. Nijboer, J. Mellinger, T. M. Vaughan, H. Pawelzik, G. Schalk, D. J. McFarland, N. Birbaumer, and J. R. Wolpaw, “Patients with ALS can use sensorimotor rhythms to operate a brain-computer interface,” Neurology, vol. 64, no. 10, pp. 1775–1777, May 2005.
[7] G. Dornhege, B. Blankertz, G. Curio, and K.-R. Müller, “Boosting bit rates in noninvasive EEG single-trial classifications by feature combination and multiclass paradigms,” IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 993–1002, Jun. 2004.
[8] M. Grosse-Wentrup and M. Buss, “Multiclass common spatial patterns and information theoretic feature extraction,” IEEE Trans. Biomed. Eng., vol. 55, no. 8, pp. 1991–2000, Aug. 2008.
[9] B. Blankertz, G. Dornhege, C. Schafer, R. Krepki, J. Kohlmorgen, K.-R. Müller, V. Kunzmann, F. Losch, and G. Curio, “Boosting bit rates and error detection for the classification of fast-paced motor commands based on single-trial EEG analysis,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 11, no. 2, pp. 127–131, Jun. 2003.
[10] P. L. Nunez, R. Srinivasan, A. F. Westdorp, R. S. Wijesinghe, D. M. Tucker, R. B. Silberstein, and P. J. Cadusch, “EEG coherency I: Statistics, reference electrode, volume conduction, Laplacians, cortical imaging, and interpretation at multiple scales,” Electroencephalogr. Clin. Neurophysiol., vol. 103, no. 5, pp. 499–515, Nov. 1997.
[11] P. L. Nunez, R. B. Silberstein, Z. Shi, M. R. Carpenter, R. Srinivasan, D. M. Tucker, S. M. Doran, P. J. Cadusch, and R. S. Wijesinghe, “EEG coherency II: Experimental comparisons of multiple measures,” Clin. Neurophysiol., vol. 110, no. 3, pp. 469–486, Mar. 1999.
[12] I. I. Goncharova, D. J. McFarland, T. M. Vaughan, and J. R. Wolpaw, “EMG contamination of EEG: Spectral and topographical characteristics,” Clin. Neurophysiol., vol. 114, no. 9, pp. 1580–1593, Sep. 2003.
[13] A. Bashashati, M. Fatourechi, R. K. Ward, and G. E. Birch, “A survey of signal processing algorithms in brain-computer interfaces based on electrical brain signals,” J. Neural Eng., vol. 4, no. 2, pp. R32–R57, Jun. 2007.
[14] L. Qin, L. Ding, and B. He, “Motor imagery classification by means of source analysis for brain-computer interface applications,” J. Neural Eng., vol. 1, no. 3, pp. 135–141, 2004.
[15] M. Grosse-Wentrup, C. Liefhold, K. Gramann, and M. Buss, “Beamforming in noninvasive brain-computer interfaces,” IEEE Trans. Biomed. Eng., vol. 56, no. 4, pp. 1209–1219, Apr. 2009.
[16] H. Ramoser, J. Muller-Gerking, and G. Pfurtscheller, “Optimal spatial filtering of single trial EEG during imagined hand movement,” IEEE Trans. Rehabil. Eng., vol. 8, no. 4, pp. 441–446, Dec. 2000.
[17] B. Blankertz, R. Tomioka, S. Lemm, M. Kawanabe, and K.-R. Müller, “Optimizing spatial filters for robust EEG single-trial analysis,” IEEE Signal Process. Mag., vol. 25, no. 1, pp. 41–56, Jan. 2008.
[18] M. Naeem, C. Brunner, R. Leeb, B. Graimann, and G. Pfurtscheller, “Separability of four-class motor imagery data using independent components analysis,” J. Neural Eng., vol. 3, no. 3, pp. 208–216, Sep. 2006.
[19] S. Lemm, B. Blankertz, G. Curio, and K.-R. Müller, “Spatio-spectral filters for improving the classification of single trial EEG,” IEEE Trans. Biomed. Eng., vol. 52, no. 9, pp. 1541–1548, Sep. 2005.
[20] G. Dornhege, B. Blankertz, M. Krauledat, F. Losch, G. Curio, and K.-R. Müller, “Combined optimization of spatial and temporal filters for improving brain-computer interfacing,” IEEE Trans. Biomed. Eng., vol. 53, no. 11, pp. 2274–2281, Nov. 2006.
[21] W. Wu, X. R. Gao, B. Hong, and S. K. Gao, “Classifying single-trial EEG during motor imagery by iterative spatio-spectral patterns learning (ISSPL),” IEEE Trans. Biomed. Eng., vol. 55, no. 6, pp. 1733–1743, Jun. 2008.
[22] K. K. Ang, Z. Y. Chin, H. Zhang, and C. Guan, “Filter bank common spatial pattern (FBCSP) in brain-computer interface,” in Proc. Int. Joint Conf. Neural Netw., Hong Kong, China, 2008, pp. 2391–2398.
[23] K. K. Ang, C. Guan, K. S. G. Chua, B. T. Ang, C. W. K. Kuah, C. Wang, K. S. Phua, Z. Y. Chin, and H. H. Zhang, “A clinical evaluation of non-invasive motor imagery-based brain-computer interface in stroke,” in Proc. 30th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., Vancouver, BC, Canada, Aug. 2008, pp. 4178–4181.
[24] BCI Competition IV [Online]. Available: http://www.bbci.de/competition/
[25] H. Zhang, C. Guan, and C. Wang, “Spatio-spectral feature selection based on robust mutual information estimate for brain-computer interfaces,” in Proc. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., Minneapolis, MN, Sep. 2009, pp. 2391–2398.
[26] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic, 1990.
[27] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New York: Wiley, 2006.
[28] M. Ben-Bassat, “Use of distance measures, information measures and error bounds in feature evaluation,” in Handbook of Statistics, vol. 2, P. Krishnaiah and L. Kanal, Eds. Amsterdam, The Netherlands: North Holland, 1982, ch. 35, pp. 773–791.
[29] S. Petridis and S. J. Perantonis, “On the relation between discriminant analysis and mutual information for supervised linear feature extraction,” Pattern Recognit., vol. 37, no. 5, pp. 857–874, May 2004.
[30] J. M. Sotoca and F. Pla, “Supervised feature selection by clustering using conditional mutual information-based distances,” Pattern Recognit., vol. 43, no. 6, pp. 2068–2081, Jun. 2010.
[31] P. A. Estevez, M. Tesmer, C. A. Perez, and J. M. Zurada, “Normalized mutual information feature selection,” IEEE Trans. Neural Netw., vol. 20, no. 2, pp. 189–201, Feb. 2009.
[32] P. Viola and W. M. Wells, III, “Alignment by maximization of mutual information,” Int. J. Comput. Vis., vol. 24, no. 2, pp. 137–154, Sep. 1997.
[33] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization. New York: Wiley, 1992.
[34] A. W. Bowman and A. Azzalini, Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. New York: Oxford Univ. Press, 1997.
[35] N. Yamawaki, C. Wilke, Z. Liu, and B. He, “An enhanced time-frequency-spatial approach for motor imagery classification,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 14, no. 2, pp. 250–254, Jun. 2006.
[36] R. Scherer, F. Lee, A. Schlögl, R. Leeb, H. Bischof, and G. Pfurtscheller, “Toward self-paced brain-computer communication: Navigation through virtual worlds,” IEEE Trans. Biomed. Eng., vol. 55, no. 2, pp. 675–682, Feb. 2008.
[37] G. Pfurtscheller and C. Neuper, “Motor imagery and direct brain-computer communication,” Proc. IEEE, vol. 89, no. 7, pp. 1123–1134, Jul. 2001.
[38] C.-C. Chang and C.-J. Lin. (2001). LIBSVM: A Library for Support Vector Machines [Online]. Available: http://www.csie.ntu.edu.tw/∼cjlin/libsvm
[39] B. Obermaier, G. R. Muller, and G. Pfurtscheller, “Virtual keyboard controlled by spontaneous EEG activity,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 11, no. 4, pp. 422–426, Dec. 2003.
[40] M. Fatourechi, R. K. Ward, and G. E. Birch, “A self-paced brain-computer interface system with a low false positive rate,” J. Neural Eng., vol. 5, no. 1, pp. 9–23, Mar. 2008.
[41] B. Blankertz, G. Dornhege, M. Krauledat, K.-R. Müller, and G. Curio, “The non-invasive Berlin brain-computer interface: Fast acquisition of effective performance in untrained subjects,” NeuroImage, vol. 37, no. 2, pp. 539–550, Aug. 2007.
Haihong Zhang (M’07) received the Ph.D. degree in computer science from the National University of Singapore, Singapore, in 2005. He joined the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore, and later became a Principal Investigator for multimodal neural decoding. He is currently a Senior Research Fellow at the same institute. His current research interests include machine learning, pattern recognition, and brain signal processing for highperformance brain–computer interfaces.
63
Zheng Yang Chin (M’08) received the Master’s degree in electrical engineering from the National University of Singapore, Singapore, in 2008. He has been working on the brain–computer interface project at the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore, since 2005. His current research interests include signal processing and machine learning techniques for biomedical signal analysis.
Kai Keng Ang (S’05–M’07) received the B.A.Sc. (1st Hons.), M.Phil., and Ph.D. degrees in computer engineering from Nanyang Technological University, Singapore, in 1997, 1999, and 2008, respectively. He was a Senior Software Engineer with Delphi Automotive Systems Singapore Pte Ltd., Singapore, from 1999 to 2003, working on embedded software for automotive engine controllers. He is currently a Senior Research Fellow working on brain–computer interface at the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. His current research interests include computational intelligence, machine learning, pattern recognition, and signal processing.
Cuntai Guan (S’91–M’92–SM’03) received the Ph.D. degree in electrical and electronic engineering from Southeast University, Nanjing, China, in 1993. He is a Principal Scientist & Program Manager at Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. From 1993 to 1999, he worked on speech recognition in universities, research institutes, and industries. In 2003, he established the Brain-Computer Interface Laboratory at Institute for Infocomm Research, Singapore. His current research interests include brain-computer interface, neural signal processing, machine learning, pattern classification, and statistical signal processing, with applications to neuro-rehabilitation, health monitoring, and cognitive training. Dr. Guan is an Associate Editor of Frontiers in Neuroprosthetics.
Chuanchu Wang (M’07) received the B.Eng. and M.Eng. degrees in communication and electrical engineering from the University of Science and Technology of China, Hefei, China, in 1988 and 1991, respectively. He is currently a Research Manager at the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. His current research interests include electroencephalogram based brain–computer interfacing and related applications.
24-GOPS 4.5-mm2 Digital Cellular Neural Network for Rapid Visual Attention in an Object-Recognition SoC Seungjin Lee, Student Member, IEEE, Minsu Kim, Student Member, IEEE, Kwanho Kim, Member, IEEE, Joo-Young Kim, Member, IEEE, and Hoi-Jun Yoo, Fellow, IEEE
Abstract— This paper presents the Visual Attention Engine (VAE), which is a digital cellular neural network (CNN) that executes the VA algorithm to speed up object-recognition. The proposed time-multiplexed processing element (TMPE) CNN topology achieves high performance and small area by integrating 4800 (80 × 60) cells and 120 PEs. Pipelined operation of the PEs and single-cycle global shift capability of the cells result in a high PE utilization ratio of 93%. The cells are implemented by 6T static random access memory-based register files and dynamic shift registers to enable a small area of 4.5 mm2 . The bus connections between PEs and cells are optimized to minimize power consumption. The VAE is integrated within an object-recognition system-on-chip (SoC) fabricated in the 0.13-µm complementary metal–oxide–semiconductor process. It achieves 24 GOPS peak performance and 22 GOPS sustained performance at 200 MHz enabling one CNN iteration on an 80 × 60 pixel image to be completed in just 4.3 µs. With VA enabled using the VAE, the workload of the object-recognition SoC is significantly reduced, resulting in 83% higher frame rate while consuming 45% less energy per frame without degradation of recognition accuracy. Index Terms— Cellular neural network, object-recognition, saliency map, visual attention.
I. I NTRODUCTION ECENTLY, there has been increasing interest in vision chips for pattern recognition applications such as autonomous vehicles, mobile robots, and human–computer interfaces [1], [2]. These chips must be able to execute complex vision algorithms in real time while consuming as little power as possible for long battery life. This is difficult to achieve due to the high computational cost of vision algorithms. For example, a popular local-feature-based object-recognition algorithm [3] requires nearly 1 s to process a single 640 × 480 pixel
R
Manuscript received October 13, 2009; revised September 18, 2010; accepted September 20, 2010. Date of publication November 11, 2010; date of current version January 4, 2011. S. Lee, M. Kim, J.-Y. Kim, and H.-J. Yoo are with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea (e-mail: seungjin@eeinfo.kaist.ac.kr; [email protected]; [email protected]; [email protected]). K. Kim was with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea. He is currently with the Digital Media & Communication Research and Development Center, Samsung Electronics, Gyeonggi-do 443-742, Korea (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2010.2085443
frame on a modern personal computer. The prevalent trend for performance enhancement has been simply to increase the level of parallelism: many recently proposed systems-on-chip (SoCs) contain tens to hundreds of processors operating in multiple-instruction multiple-data mode [1] to exploit task-level parallelism, or in single-instruction multiple-data (SIMD) mode [2] to exploit data-level parallelism. However, regardless of the configuration, the increase in processor count comes at the cost of increased power consumption, which not only drains precious battery power but also increases the heat dissipation requirements of the system. Thus, a new approach that can speed up vision algorithms while simultaneously reducing power consumption is highly desirable. In this paper, we present the visual attention engine (VAE) [4], a hardware accelerator optimized for the saliency-based VA algorithm [5], which mimics the VA mechanism of the human brain [6]. The object-recognition flow with the VAE is illustrated in Fig. 1. The VAE executes the saliency-based VA algorithm prior to the detailed feature extraction and feature matching stages, thereby reducing their workload. As a result, the VAE reduces not only the processing time but also the energy required per frame of object-recognition. The VAE is implemented as part of a multiprocessor object-recognition SoC [7] that integrates the VAE together with eight processing element clusters (PECs), a matching accelerator (MA), and a host RISC processor, as shown in Fig. 2 (chip photograph shown in Fig. 15). The VAE executes the saliency-based VA algorithm on the input image to obtain regions of interest (ROIs) for further processing. The eight PECs, each responsible for one-eighth of the input image, perform scale-invariant feature transform (SIFT) [3] feature extraction on the ROIs in parallel.
The MA performs nearest-neighbor matching of the extracted SIFT features against an object database. The RISC processor performs task management and flow control. Since the VAE is an added stage in the conventional pattern recognition flow, its processing speed must be sufficiently high to minimize the latency of the VA algorithm, which itself requires a considerable amount of computation. It should also be energy efficient and occupy a small area, since it is targeted for integration within an object-recognition SoC. Cellular neural networks (CNNs) [8] are known to be a very
LEE et al.: DIGITAL CELLULAR NEURAL NETWORK FOR RAPID VISUAL ATTENTION IN AN OBJECT-RECOGNITION SoC
Fig. 1. Steps of object-recognition (a) without VA and (b) with VA. VA can speed up object-recognition by reducing the number of features that must be analyzed by subsequent steps.
efficient hardware architecture for executing the saliency-based VA algorithm. Their locally connected topology is optimally suited for local convolution operations such as Gaussian and Gabor filtering. Thanks to the CNN's propagation property, even operations requiring large kernel sizes are readily supported, as demonstrated in [9], in which Gabor-like filtering that requires a very large kernel size is achieved on CNNs using just 3 × 3 templates. For the VAE, we chose a digital CNN implementation that is easily integrated within the multiprocessor SoC. The VAE implements a new CNN architecture integrating 120 PEs interleaved with 4800 (80 × 60) cells. Named the time-multiplexed processing element (TMPE) topology, it allows the VAE to achieve very high PE utilization and thus high performance. As a result, the VAE is able to execute the VA algorithm in a short time, making the execution-time overhead of calculating VA small compared to the large reduction in execution time of the main object-recognition. This paper is organized as follows. Section II introduces the saliency-based VA algorithm, which is the basis for this paper. Section III describes the architecture of the VAE, starting with the derivation of the TMPE CNN topology. The detailed implementation of the VAE circuits is discussed in Section IV, followed by the detailed operation of the VAE in Section V. Section VI gives the implementation details with chip measurements.

II. VA ALGORITHM

In the human brain, VA [6] is the mechanism responsible for focusing the limited resources of the brain on the most important information in the visual field. The same principle can be applied to machine vision, as it is very challenging to perform object-recognition on real-time video inputs using complex algorithms such as SIFT, especially when the field of view is cluttered with distractors.
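As a rough illustration of the center-surround principle behind the saliency-based VA model, consider the following numpy sketch. It is a floating-point illustration of the idea only, not the VAE's fixed-point pipeline; the helper names and the scale choices `sigma_c`/`sigma_s` are hypothetical.

```python
import numpy as np

def _gauss1d(sigma):
    """Normalized 1-D Gaussian kernel with radius 3*sigma."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def _blur(img, sigma):
    """Separable Gaussian blur (rows then columns)."""
    k = _gauss1d(sigma)
    tmp = np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 1, tmp)

def center_surround_saliency(intensity, sigma_c=1.0, sigma_s=4.0):
    """Toy center-surround map on one intensity channel: the absolute
    difference between a fine-scale and a coarse-scale blur, followed
    by an image-wide normalization to [0, 1]."""
    center = _blur(intensity, sigma_c)       # fine scale
    surround = _blur(intensity, sigma_s)     # coarse scale
    cs = np.abs(center - surround)           # center-surround difference
    rng = cs.max() - cs.min()
    return (cs - cs.min()) / rng if rng > 0 else cs
```

A small bright spot on an 80 × 60 intensity image produces its strongest response at the spot itself, which is the behavior the saliency map exploits.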
In order for VA to be useful for object-recognition, it must be able to select target objects while rejecting regions containing distractors. This paper employs the saliency-based
Fig. 2. Object-recognition SoC including the VAE. The VAE performs saliency-based VA to reduce the overall workload of the entire SoC.
VA model proposed by Itti, Koch, and Niebur [5]. Building on the earlier work of Koch and Ullman [10], the model is based on the observation that attention in primate vision arises from competition between salient features in the visual stimuli. Testing on natural images has shown that the saliency-based model closely mimics human VA behavior [11]. In addition, its practical usefulness for object-recognition is demonstrated in [12], where it is used to pick out target objects superimposed on natural backgrounds. The detailed steps of the VA algorithm are illustrated in Fig. 3. The resolution of the input image of the algorithm is 80 × 60 pixels, or one-fourth of the 320 × 240 pixel image used for object-recognition. This one-fourth scaled image corresponds to the highest resolution required by the saliency-based VA model [5]. The most time-consuming calculations of the VA algorithm are the Gabor filtering, image-wide normalization, and Gaussian filtering. These are the main operations targeted by the VAE.

III. ARCHITECTURE

A. CNNs

CNNs are a type of neural network composed of a 2-D cell array in which each cell has synapse connections to and from all cells in its neighborhood [8]. The following differential equation describes the operation of a CNN cell:

dx^c(t)/dt = −x^c(t) + Σ_{d∈N_r} a_d y^d(t) + Σ_{d∈N_r} b_d u^d + i    (1)

y^c(t) = −1 if x^c(t) < −1;  x^c(t) if −1 ≤ x^c(t) ≤ 1;  1 if x^c(t) > 1.    (2)
The variable x^c is the internal state of the cell, y^c is its output, and u^c is its input. N_r denotes the neighborhood of the cell, which is usually the 3 × 3 window (including the cell itself) centered on the cell. Thus, x^d, y^d, and u^d denote the internal state, output, and input, respectively, of neighborhood cells. The coefficients a_d, b_d, and i are constants that make up a template, which defines the behavior of the CNN. Equations (1) and (2) describe the operation of CNNs in continuous time and can be directly modeled by analog
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
Fig. 4. Comparison of CNN implementations. (a) Conventional analog. (b) Conventional digital. (c) TMPE topology.
Fig. 3. Detailed steps of the saliency-based VA algorithm. The final resulting saliency map indicates the regions of the input image that are most likely to be important.
circuit components. In digital CNNs, the first-order Euler approximation of (1) and (2) is used, given as

x^c(k + 1) = Σ_{d∈N_r} a_d y^d(k) + Σ_{d∈N_r} b_d u^d + i    (3)

y^c(k) = −1 if x^c(k) < −1;  x^c(k) if −1 ≤ x^c(k) ≤ 1;  1 if x^c(k) > 1.    (4)
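The Euler update (3) with the saturation (4) can be sketched directly. The following numpy sketch applies one step over a whole state plane; zero-padded borders and floating-point arithmetic are assumptions of this sketch — the VAE itself operates on 8-bit fixed-point values.

```python
import numpy as np

def cnn_step(x, u, A, B, i):
    """One digital CNN Euler step per (3): the saturated output y of
    the current state (4) is combined with the input through the 3x3
    templates A and B plus the bias i. Borders are zero padded (an
    assumption of this sketch)."""
    y = np.clip(x, -1.0, 1.0)                  # output nonlinearity (4)
    H, W = x.shape
    yp, up = np.pad(y, 1), np.pad(u, 1)        # zero-padded planes
    x_new = np.full(x.shape, float(i))         # bias term i
    for dr in range(3):                        # accumulate over the 3x3 neighborhood
        for dc in range(3):
            x_new += A[dr, dc] * yp[dr:dr + H, dc:dc + W]
            x_new += B[dr, dc] * up[dr:dr + H, dc:dc + W]
    return x_new
```

For instance, with A = 0 and B the identity template (center tap 1), the new state is simply u + i, as (3) predicts.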
By examining (1)–(4), it can be seen that CNN hardware must provide memory, inter-cell communication, and processing capabilities. Memory is needed to store the state and input of the cells, inter-cell communication is needed to access the states and inputs of neighboring cells, and processing is needed to apply CNN templates. Fig. 4(a) and (b) summarizes the topologies of conventional analog and digital CNN implementations. Analog CNNs employ a fully parallel architecture in which each cell has its own set of analog multipliers and memory [13], [14]. This results in very fast operation on the order of several hundred 8-bit-equivalent GOPS [13]. However,
actual performance is limited by the I/O speed, which is only in the several hundred Mbyte/s range due to digital-to-analog and analog-to-digital converter overhead. In addition, analog CNNs suffer from switching noise and low equivalent accuracy in the range of 7 bits. Meanwhile, digital CNNs replace the analog multipliers with digital multiplier-accumulator (MAC) units, and the analog memories with static random access memory (SRAM) and flip-flops, for more robust operation. However, due to the large size of MAC units compared to analog multipliers, it is not possible to employ a fully parallel architecture. In [15], a systolic array architecture was proposed to emulate CNN operation. It provides only limited scalability due to its complex architecture, and its performance is bottlenecked by its dependence on off-chip memory accesses. More recently, a mixed-mode approach was presented [16], in which inter-cell connections and accumulation are handled in the analog domain while memory storage is handled in the digital domain. Although this approach seems to take the best of analog and digital circuits, its actual performance suffers from the long digital-to-analog and analog-to-digital conversion time. In [17], a processor array approach is used in which each processor is responsible for a portion of the image. The processors are controlled in a SIMD manner, meaning a single controller is responsible for decoding program instructions and broadcasting them to each processor. We propose a digital CNN topology, named the TMPE topology, which combines the flexibility of the digital CNN approach with the high performance of the fully parallel cell topology adopted by analog CNNs. This is achieved by integrating the required number of fully parallel cells with
Fig. 5. Block diagram of the VAE. A total of 120 PEs are shared by 4800 cells.
a smaller number of shared digital PEs, as illustrated in Fig. 4(c). In this highly scalable topology, the cell array size can easily be scaled because of the small size of each cell. Processing speed can be scaled as needed by varying the number of PEs per row of cells. Meanwhile, the parallel inter-cellular communication capability of the cells minimizes communication overhead to ensure that the PEs are always optimally utilized. The detailed operation of this topology will be further explained in Section V.

B. VAE Architecture

Fig. 5 shows the block diagram of the VAE, which implements the TMPE CNN topology. The 4800 (80 × 60) cells required to execute the VA algorithm are organized into four 20 × 60 cell arrays, while the 120 PEs that provide a peak performance of 24 GOPS at 200 MHz are organized into two columns of 60 PEs. Each column of PEs is shared by two cell arrays, which results in each PE being shared by 40 cells. A simplified block diagram of the cells and PEs is shown in Fig. 6. A strict role division between the cells and PEs was enforced: the cells are responsible for the storage and inter-cellular communication of CNN variables, while the PEs handle the calculations. This means that the PEs do not even have an accumulation register, which is normally required for CNN operation. Instead, accumulation is performed directly on each cell's register file. The lack of an accumulation register within the PE is due to the time-multiplexed sharing of each PE among different cell columns. Each cell contains a register file with four 8-bit entries for storing the variables required for CNN operation. Although (3) and (4) require each cell to store three variables, i.e., u^c, x^c, and y^c, only two variables are needed when applying the full signal range (FSR) model [18], in which y^c = x^c. As a result, the four-entry register files can store up to two real-valued CNNs.
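The quoted PE utilization of 93% can be reproduced with a back-of-envelope cycle count. The accounting below is an assumption inferred from the text (one 3 × 3 convolution = 9 MAC instructions on a 40-cell column group with a 3-stage pipeline, plus 1 shift-register load and 8 single-cycle global shifts); the paper does not spell out this exact breakdown.

```python
# Hypothetical cycle accounting for one 3x3 convolution on the VAE.
CELLS_PER_PE = 40      # each PE serves a 40-cell column group
PIPE_STAGES = 3        # read, execute, write
MACS = 9               # one MAC instruction per kernel coefficient
SHIFTS = 8             # single-cycle global shifts (spiral pattern)
LOADS = 1              # copy the register-file plane into the shift registers

instr_cycles = CELLS_PER_PE + PIPE_STAGES - 1          # 42 cycles per PE instruction
busy_cycles = MACS * CELLS_PER_PE                      # cycles the PE datapath does work
total_cycles = MACS * instr_cycles + SHIFTS + LOADS    # 387 cycles in all
utilization = busy_cycles / total_cycles
```

Under these assumptions, utilization is 360/387 ≈ 93% and the shift/load overhead is 9/387 ≈ 2.3%, close to the figures quoted in the text.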
Complex-valued CNN operation, which is required for Gabor-type filtering [9], requires all four register file entries: two for storing the real and imaginary parts of the state x^c (x^c = x^c_real + j·x^c_imag), one for storing u^c, and one as a temporary register when calculating the complex convolution. The quad-directional shift registers of each cell
Fig. 6. Cell-PE interconnection. Each PE can access 40 cells through two read buses and one write bus.
are crucial to the VAE's operation since they are the only means of inter-cell communication. For CNN operation, either the input u^c or the state x^c is loaded from the register file into the shift register and shifted to neighboring cells in the north, east, west, and south directions. Since all shift registers operate in unison, as a cell's data is shifted out to a neighbor in a certain direction, data from the neighbor in the opposite direction is shifted in. Each global shift operation takes only one cycle to complete, resulting in a very low data access overhead of only 2.4% for CNN operation. The PEs are responsible for updating the CNN states stored in the cells. For this, two read buses and one write bus connect each PE to 40 cells. When processing data, a PE reads data from the register file and shift register of a cell through read bus A and read bus B, and writes back the results to the register file through the write bus. It should be noted that for each read–execute–write operation, the PE has access to only one cell. This means it cannot by itself combine data stored in different cells; that is achieved by the shift registers in the cells. Each PE is capable of executing an 8-bit MUL or MAC with result saturation in a single cycle to accelerate CNN operation. This amounts to 24 GMACS peak performance at a 200 MHz operating frequency. The sustained performance for CNN operation is only slightly lower at 22.3 GMACS, thanks to the high PE utilization of 93%. The operation of the cells and PEs is coordinated by the control block shown at the bottom of Fig. 5. The VAE program is stored in 2 Kbytes of instruction memory. The program flow controller decodes the instructions and provides basic looping capability. The control signal sequencer generates the low-level timing signals for controlling the cells and PEs. Data I/O of the cells' data takes place through the on-chip network interface.

IV. DETAILED IMPLEMENTATION

A. VAE Cell

Area reduction is the most critical aspect of the VAE cell's design, due to the large number of cells that must be integrated. The 4800 cells account for more than 60% of the total area of the VAE. The register file and shift register of the cell are optimized to reduce area. First, the register file is implemented
using 6T SRAM cells as shown in Fig. 7. To further reduce the area, peripheral circuits such as the asynchronous timing signal generator are not included in each cell. Instead, a global asynchronous timing generator located in each 20 × 60 cell array distributes the asynchronous timing signals to each cell through a balanced H-tree buffer configuration. Power consumption of the buffer tree is minimized by gating the control signals by column, since only two columns of cells are active at once due to pipelined operation. Compared to a standard-cell D flip-flop-based implementation, a 55% cell area reduction is achieved by using SRAM-cell-based register files with global timing signal generators. Cell area is further reduced by employing the dynamic negative-channel metal-oxide-semiconductor (NMOS) pass-transistor-based MUX/DEMUX scheme shown in Fig. 8. In this scheme, the value of dynamic node D, which is precharged to the supply voltage, is evaluated through one of many possible paths selected by the signals "N_en," "E_en," "S_en," "W_en," and "load_en" before being latched by the pulsed latch. A weak keeper protects node D against noise components due to crosstalk and charge sharing. Compared to a static standard-cell MUX/DEMUX design that requires both positive-channel metal-oxide-semiconductor and NMOS transistors, this reduces cell area by 40%.

B. PE

The PEs execute the basic functions required for CNN operation, such as MAC, MUL, and ADDI, in a single cycle as shown in Fig. 9. The limiter located after the adder can simulate the nonlinear output function of CNNs. Additional functions such as ABS and MIN/MAX allow for some basic general-purpose image manipulations. Each PE is shared by a group of 40 cells through two read buses and one write bus. The read buses, which are
Fig. 8. Bit slice of a VAE cell’s shift register. NMOS pass-transistor-based I/O switching is used for small area.
Fig. 7. Bit slice of a VAE cell’s register file. 6T SRAM-based storage elements are used for small area.
Fig. 9. PE circuit and read/write buses.
single-ended to save routing resources, are operated in dynamic mode. Since NMOS pass transistors are used in place of complementary metal–oxide–semiconductor (CMOS) pass gates, cell area and bus loading can be reduced. On the other hand, the write buses are double-ended to facilitate reliable write operation to the SRAM cells of the register files. The read and write buses are split into left and right segments at the PE to reduce wire capacitance and resistance. Originally, each bus wire would have to span a width of 1750 µm, which includes the width of 40 cells and 1 PE. However, with bus splitting, this is reduced to just 570 µm,
Fig. 10. Pipelined PE operation. Consecutive read and write stages in a single cycle enable 1 op/cycle throughput of the PEs.
which is the width of just 20 cells. As a result, the bus capacitance, which includes the wire capacitance as well as the parasitic capacitance of the access transistors, is reduced by over 50%, and the read bus wire delay from the farthest cell on the bus to the PE is just 240 ps. Fig. 10 illustrates the pipelined execution of the PEs. The data in the cells are processed column by column in a three-stage read–execute–write pipeline. As a result, a throughput of 1 op/cycle is achieved, and it takes 42 cycles for the PEs to execute one instruction on the entire cell array. Pipelined operation necessitates reading and writing the register files in the same cycle. However, since control signal generation circuitry is shared among cells in an array, reading and writing of cell data cannot occur simultaneously. This is solved by allocating the first half of each cycle for reading and the second half for writing. As explained earlier, the PE does not contain an accumulating register but instead accumulates directly to a cell's register file. The downside is that only 8 bits can be used for accumulation. This poses two potential problems: accumulator overflow and degradation of precision. The overflow problem is solved by scaling the multiplication result with a barrel shifter, as shown in Fig. 9. Despite this, the degradation of precision may still limit the VAE in some applications that, for example, require large kernel sizes. However, for the VA application, 8 bits proved to be sufficient.

C. Data I/O

For data I/O, the PEs are connected to one another in a shift register configuration to shift data to and from the controller, as shown in Fig. 11. For this, the "Operand A" registers of each PE can be operated in a shift register mode through the shift_in, shift_out, and shift_clk signals, as described in Fig. 9. The "Operand A" registers of four consecutive PEs are grouped as one 4-byte shift register to match the 4-byte-per-cycle I/O throughput of the controller.
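The barrel-shift scaling and saturating 8-bit accumulation described above can be sketched as follows. The exact scaling and rounding rules are assumptions — the paper only states that a barrel shifter scales the product to prevent accumulator overflow.

```python
def mac8(acc, a, b, shift):
    """One PE accumulate step: multiply two signed 8-bit operands,
    scale the product down with an arithmetic right shift (the barrel
    shifter), then add to the 8-bit accumulator held in the cell's
    register file, saturating to the signed 8-bit range."""
    prod = (a * b) >> shift            # scaled product (Python >> floors; an assumption)
    return max(-128, min(127, acc + prod))
```

For example, mac8(100, 64, 64, 4) saturates at 127 instead of wrapping around, which is the behavior the PE's limiter provides.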
For writing to the cell array, an entire column of data is first shifted into a PE column and then written to a column of cells at once through the write
Fig. 11. Data I/O path of the VAE. All data access to the cell arrays are performed through the PEs.
bus. When reading from the cell array, data from an entire column of cells are read into a PE column through either read bus A or read bus B, and then shifted out to the controller. Each column operation takes 1 cycle for cell access and 15 cycles for shifting the data, which translates to 1280 cycles, or 6.4 µs, for an array-wide write or read.

D. Controller

The controller is responsible for decoding VAE programs into control signals for the cell and PE arrays and for data input/output. Each VAE instruction is 16 bits wide, so the 2 Kbyte instruction memory can hold 1024 instructions. The VAE instructions are fine grained, meaning that every operation, such as loads, shifts, and ALU operations, is individually programmable. Although this results in larger program size, since CNN templates must be broken down into multiple instructions, it affords the programmer high flexibility for non-CNN tasks. The task of decoding instructions is split between the flow control block and the sequencer. The flow control block takes care of instruction-level flow control, including branching and program start/halt. The sequencer decodes the current instruction into the actual sequence of control signals for the cell and PE arrays. The sequence of control signals can be as long as 42 cycles for instructions involving PE processing, like multiply-accumulate, or just 1 cycle for instructions like load and shift.

V. DETAILED OPERATION

Fig. 12 shows the programmer's model of the VAE. From the programmer's viewpoint, the VAE can be abstracted as a vector machine with four 80 × 60-dimensional 8-bit vector registers, denoted r0–r3, and one 80 × 60-dimensional 8-bit vector shift register, denoted sr. Three main types of operations are possible on the VAE. First, the vector shift register contents can be loaded from one of the vector registers. This takes place globally for all cells, and thus only one cycle is needed.
Second, the contents of the vector shift register can be shifted in the north, east, south, or west directions. This also takes just one
Fig. 12. Simplified programmer's model of the VAE. Operations: load shift register (e.g., sr = r0), 1 cycle; shift shift register (e.g., shift north), 1 cycle; PE operation (e.g., r1 = r2 + sr*8), 42 cycles.

cycle. The third operation type, the PE operation, takes up to three operands (one of r0–r3, sr, and a global coefficient), performs a PE operation, and saves the result back to one of r0–r3. This takes 42 cycles to complete as a result of the pipelined PE operation explained earlier.

A. CNN Operation

Using vector notation, a 3 × 3 CNN template T, which defines the behavior of a CNN, can be expressed as

T = {A, B, i}    (5)

where

A = [a0 a1 a2; a3 a4 a5; a6 a7 a8] and B = [b0 b1 b2; b3 b4 b5; b6 b7 b8].

Letting U equal the 80 × 60-dimensional input image matrix and X(k) equal the 80 × 60-dimensional CNN state matrix at time k, one time step of the CNN Euler approximation (3) applied to the entire 80 × 60 array can be expressed as

X(k + 1) = X(k) ∗ A + U ∗ B + i.    (6)

Although (4) originally required an additional saturation step, this step is eliminated by assuming the FSR model, in which the saturation function is always applied to the PE output, and thus y^c = x^c. Equation (6) shows that CNN operation is mainly composed of a 2-D convolution with a kernel. Fig. 13 visualizes an efficient procedure for calculating the convolution, which uses a spiral shifting motion to minimize data access overhead and maximize PE utilization. First, the register array holding the input is copied into the shift register array. Then the shift register array is shifted in a spiraling motion. From the viewpoint of the center cell, the state value of each of its neighbor cells becomes available after each shift operation. The convolution is obtained by multiplying the shift register value by the appropriate kernel coefficient and accumulating the product after each shift. Thanks to the efficient shift pattern and single-cycle shift operations, data communication overhead is only 2.4% and utilization of the PE pipelines is 93%. Convolution using larger kernels (5 × 5, 7 × 7, and so on) can also be achieved efficiently by simply extending the spiraling shift sequence.

Fig. 13. Procedure for convolution with a 3 × 3 kernel (o4 = i0·k0 + i1·k1 + i2·k2 + i3·k3 + i4·k4 + i5·k5 + i6·k6 + i7·k7 + i8·k8): 1. sr = r0; 2. r1 = r1 + sr*k4; 3. shift north; 4. r1 = r1 + sr*k7; 5. shift west; 6. r1 = r1 + sr*k6; 7. shift south; 8. r1 = r1 + sr*k3; 9. shift south; 10. r1 = r1 + sr*k0; 11. shift east; 12. r1 = r1 + sr*k1; 13. shift east; 14. r1 = r1 + sr*k2; 15. shift north; 16. r1 = r1 + sr*k5; 17. shift north; 18. r1 = r1 + sr*k8.

According to (6), a CNN iteration is completed by executing two convolutions and one addition. First, a convolution is calculated for the state output X(k), and the convolution is repeated for the input U. Then the threshold value i, which is normally negative, is added to the state to obtain the final state X(k + 1). For a 3 × 3 CNN template with all nonzero coefficients, this process takes 858 cycles or 4.3 µs at 200 MHz.

B. VA

The saliency-based VA requires Gaussian filtering, Gabor-like filtering, and center-surround operations. Gabor-like filtering, unlike the other operations, requires a complex-valued CNN and is examined in more detail here. The CNN template for a Gabor-like filter is given by

A = [0, e^{−jω_yo}, 0; e^{jω_xo}, −(4 + λ²), e^{−jω_xo}; 0, e^{jω_yo}, 0],
B = [0, 0, 0; 0, λ², 0; 0, 0, 0], and i = 0.    (7)

Here, ω_xo and ω_yo are the angular frequencies of the Gabor-like filter orthogonal to the x and y axes, respectively, and λ is a constant that is inversely proportional to the cut-off distance of the low-pass envelope of the Gabor-like filter. The matrix A contains both real and imaginary parts. Letting A_real and A_imag equal the real and imaginary parts of A, (6) can be transformed into

X_real(k + 1) = X_real(k) ∗ A_real − X_imag(k) ∗ A_imag + U ∗ B    (8)

and

X_imag(k + 1) = X_real(k) ∗ A_imag + X_imag(k) ∗ A_real.    (9)
TABLE I
EXECUTION TIME SUMMARY

Operation                Execution time (µs)   Percentage of total
Gaussian Filter          521                   21%
Gabor-type Filter        1210                  50%
Center-Surround Filter   326                   13%
Other                    383                   16%
Total                    2440
TABLE II
CHIP IMPLEMENTATION SUMMARY

Process technology         0.13-µm eight-metal CMOS
Area                       4.5 mm2
Number of cells            4800 (80 × 60)
Number of PEs              120
Operating voltage          1.2 V
Operating frequency        200 MHz
Peak performance           24 GOPS
Sustained performance      22 GOPS
Active power consumption   84 mW
Although (8) and (9) seem considerably more complex than (6), thanks to the large number of zero entries in the template, the number of MAC operations required for one step of the Gabor-type filter template is only 19, which is actually equal to that of (6). Only the storage requirement is increased, since all four registers r0–r3 are required to compute the complex CNN. As a result, one iteration of the Gabor-type filter takes 831 cycles or 4.15 µs, which is less than (6) due to the omission of the threshold addition. The impulse responses of Gabor-type filters with varying orientation are shown in Fig. 14. In all cases, the angular frequency ω is equal to 1.5 radians and λ is equal to 0.66, which corresponds to a 6-dB cutoff distance of 1.5 pixels. Fifteen iterations were used in each case, resulting in 65 µs execution time. Table I summarizes the operations performed by the VAE and the time spent on each. Gabor-type filtering, which is used for extracting specific edge orientations, takes more than half of the total execution time, due to the large number of iterations required for each operation.

VI. IMPLEMENTATION RESULTS

The 4.5-mm2 VAE was fabricated as part of the 36-mm2 object-recognition SoC [7] shown in Fig. 2, using a 0.13-µm eight-metal logic CMOS technology. The cell arrays are custom-designed to minimize area, while the PE and controller blocks are synthesized. A photograph of the resulting chip is shown in Fig. 15, and the chip features are summarized
Gabor-type filter impulse response (imaginary component).
MA
Chip photograph.
80×60 Reduced Image
0°
VAE
(Parallel Processors) 320×240 Input Image
“turtle” Recognition Result
detected local features
With VAE
Without VAE
Number of Extracted Local Features Recognition Frame Rate Energy per Frame
Fig. 16.
Without VAE 279 12 42 mJ
With VAE 100 22 23 mJ
Improvement
64% ↓ 83% ↑ 45% ↓
Object-recognition results.
in Table II. Power consumption is 84 mW when running at 200 MHz and 1.2 V. The 120 PEs have a peak performance of 24 GOPS and show a utilization rate of 93% during CNN operation, sustaining an effective performance of 22 GOPS. Fig. 16 shows the result of applying the VAE to an object-recognition SoC designed for real-time (>15 frames/s) operation on 320 × 240 video input. The VAE rapidly performs a saliency-based attention algorithm on the 80 × 60 scaled image and outputs a pixel map marking the ROIs. This map is then used by the parallel processors of the object-recognition SoC to perform detailed local-feature-based object recognition only on regions of high saliency. For object images with background clutter, the average number of local
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
features is drastically reduced when the VAE is activated, increasing the frame rate by 83% and reducing the energy per frame by 45% without degrading the recognition rate. The energy overhead of the VAE itself is only 0.2 mJ/frame, thanks to its short processing time of 2.4 ms/frame.

VII. CONCLUSION

In this paper, we have presented a CNN-based hardware acceleration block specialized for executing the VA algorithm. Named the VAE, it adopts the TMPE topology to achieve a sustained performance of 22 GOPS while consuming just 84 mW. This enables it to complete a CNN iteration on an 80 × 60 pixel image in 4.3 µs and thus complete the VA algorithm in just 2.4 ms. By applying the VAE, object-recognition speed is increased by 83% with no negative effect on recognition accuracy.

REFERENCES

[1] D. Kim, K. Kim, J.-Y. Kim, S. Lee, and H.-J. Yoo, “An 81.6 GOPS object recognition processor based on NoC and visual image processing memory,” in Proc. IEEE Custom Integr. Circuits Conf., San Jose, CA, Sep. 2007, pp. 443–446.
[2] S. Kyo, S. Okazaki, T. Koga, and F. Hidano, “A 100 GOPS in-vehicle vision processor for pre-crash safety systems based on a ring connected 128 4-way VLIW processing elements,” in Proc. IEEE Symp. VLSI Circuits, Honolulu, HI, Jun. 2008, pp. 28–29.
[3] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.
[4] S. Lee, K. Kim, M. Kim, J.-Y. Kim, and H.-J. Yoo, “The brain mimicking visual attention engine: An 80×60 digital cellular neural network for rapid global feature extraction,” in Proc. IEEE Symp. VLSI Circuits, Honolulu, HI, Jun. 2008, pp. 26–27.
[5] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, Nov. 1998.
[6] R. Desimone and J. Duncan, “Neural mechanisms of selective visual attention,” Annu. Rev. Neurosci., vol. 18, pp. 193–222, Mar. 1995.
[7] K. Kim, S. Lee, J.-Y. Kim, M. Kim, D. Kim, J.-H. Woo, and H.-J. Yoo, “A 125 GOPS 583 mW network-on-chip based parallel processor with bio-inspired visual-attention engine,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, San Francisco, CA, Feb. 2008, pp. 308–615.
[8] L. O. Chua and L. Yang, “Cellular neural networks: Theory,” IEEE Trans. Circuits Syst. I, vol. 35, no. 10, pp. 1257–1272, Oct. 1988.
[9] B. E. Shi, “Gabor-type filtering in space and time with cellular neural networks,” IEEE Trans. Circuits Syst. I, vol. 45, no. 2, pp. 121–132, Feb. 1998.
[10] C. Koch and S. Ullman, “Shifts in selective visual attention: Toward the underlying neural circuitry,” Hum. Neurobiol., vol. 4, no. 4, pp. 219–227, 1985.
[11] L. Itti and C. Koch, “A comparison of feature combination strategies for saliency-based visual attention systems,” in Proc. SPIE Conf. Hum. Vis. Electron. Imaging IV, San Jose, CA, Jan. 1999, pp. 473–482.
[12] U. Rutishauser, D. Walther, C. Koch, and P. Perona, “Is bottom-up attention useful for object recognition?” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2. Washington, D.C., Jun.–Jul. 2004, pp. 37–44.
[13] A. Rodríguez-Vázquez, G. Liñán-Cembrano, L. Carranza, E. Roca-Moreno, R. Carmona-Galán, F. Jiménez-Garrido, R. Domínguez-Castro, and S. E. Meana, “ACE16k: The third generation of mixed-signal SIMD-CNN ACE chips toward VSoCs,” IEEE Trans. Circuits Syst. I, vol. 51, no. 5, pp. 851–863, May 2004.
[14] P. Kinget and M. S. J. Steyaert, “A programmable analog cellular neural network CMOS chip for high speed image processing,” IEEE J. Solid-State Circuits, vol. 30, no. 3, pp. 235–243, Mar. 1995.
[15] P. Keresztes, Á. Zarándy, T. Roska, P. Szolgay, T. Bezák, T. Hidvégi, P. Jónás, and A. Katona, “An emulated digital CNN implementation,” J. VLSI Signal Process. Syst., vol. 23, nos. 2–3, pp. 291–303, Nov.–Dec. 1999.
[16] M. Laiho, A. Paasio, A. Kananen, and K. A. I. Halonen, “A mixed-mode polynomial cellular array processor hardware realization,” IEEE Trans. Circuits Syst. I, vol. 51, no. 2, pp. 286–297, Feb. 2004.
[17] P. Földesy, A. Zarándy, and C. Rekeczky, “Configurable 3-D-integrated focal-plane cellular sensor-processor array architecture,” Int. J. Circuit Theory Appl., vol. 36, nos. 5–6, pp. 573–588, Jul.–Sep. 2008.
[18] S. Espejo, R. Carmona, R. Domínguez-Castro, and A. Rodríguez-Vázquez, “A VLSI-oriented continuous-time CNN model,” Int. J. Circuit Theory Appl., vol. 24, no. 3, pp. 341–356, May–Jun. 1996.
Seungjin Lee (S’06) received the B.S. and M.S. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2006 and 2008, respectively. He is currently working toward the Ph.D. degree in electrical engineering and computer science at KAIST. He joined the Semiconductor System Laboratory at KAIST as a Research Assistant in 2006. His current research interests include parallel architectures for computer vision processing, low-power digital signal processors for digital hearing aids, and body area communication.
Minsu Kim (S’07) received the B.S. and M.S. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2007 and 2009, respectively. He is currently with KAIST. His current research interests include network-on-chip-based system-on-chip design and bio-inspired very-large-scale integration architecture for intelligent vision processing.
Kwanho Kim (S’04–M’09) received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology, Daejeon, Korea, in 2004, 2006, and 2009, respectively. He is currently with the Digital Media & Communication Research and Development Center, Samsung Electronics, Gyeonggi-do, Korea. His current research interests include very-large-scale integration design for object recognition, and the architecture and implementation of network-on-chip-based system-on-chip.
Joo-Young Kim (S’05–M’10) received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2005, 2007, and 2010, respectively. He has been with KAIST and involved with the development of parallel processors for computer vision since 2006. His current research interests include parallel architecture, subsystems, and very-large-scale integration implementation for bio-inspired vision processors.
LEE et al.: DIGITAL CELLULAR NEURAL NETWORK FOR RAPID VISUAL ATTENTION IN AN OBJECT-RECOGNITION SoC
Hoi-Jun Yoo (M’95–SM’04–F’08) received the Graduate degree from the Electronic Department, Seoul National University, Seoul, Korea, in 1983, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 1985 and 1988, respectively. His Ph.D. work concerned the fabrication process for GaAs vertical optoelectronic integrated circuits. He was with Bell Communications Research, Red Bank, NJ, from 1988 to 1990, where he invented the 2-D phase-locked vertical-cavity surface-emitting laser array, the front-surface-emitting laser, and the high-speed lateral heterojunction bipolar transistors. In 1991, he became a Manager of the Dynamic Random Access Memory (DRAM) Design Group, Hyundai Electronics, Kyoungki-do, Korea, and designed a family of fast 1M DRAMs and 256M synchronous DRAMs. In 1998, he joined the faculty of the Department of Electrical Engineering, KAIST, and is now a Full Professor. From 2001 to 2005, he was the Director of the System Integration and IP Authoring Research Center, Daejeon, funded by the Korean government to promote worldwide IP authoring and its
system-on-chip (SoC) to the Minister in the Ministry of Information and Communication, Korea, and National Project Manager for SoCs and Computers. In 2007, he founded the System Design Innovation & Application Research Center at KAIST to research and develop SoCs for intelligent robots, wearable computers, and bio systems. He has authored two books: DRAM Design (Seoul, Korea: Hongleung, 1996, in Korean) and High-Performance DRAM (Seoul, Korea: Sigma, 1999, in Korean), and wrote chapters for Networks-on-Chips (New York: Morgan Kaufmann, 2006). His current research interests include high-speed and low-power network-on-chips, 3-D graphics, body area networks, biomedical devices and circuits, and memory circuits and systems. Prof. Yoo is the Technical Program Committee Chair of the Asian Solid-State Circuits Conference (A-SSCC 2008). He is the recipient of the Electronic Industrial Association of Korea Award for his contribution to DRAM technology in 1994, the Hynix Development Award in 1995, the Korea Semiconductor Industry Association Award in 2002, the Best Research of KAIST Award in 2007, the Design Award of the Asia and South Pacific Design Automation Conference in 2001, and the Outstanding Design Awards of A-SSCC in 2005, 2006, and 2007. He is a member of the executive committee of the International Solid-State Circuits Conference, the Symposium on Very-Large-Scale Integration Design, and the A-SSCC.
An Augmented Echo State Network for Nonlinear Adaptive Filtering of Complex Noncircular Signals Yili Xia, Student Member, IEEE, Beth Jelfs, Marc M. Van Hulle, Senior Member, IEEE, José C. Príncipe, Fellow, IEEE, and Danilo P. Mandic, Senior Member, IEEE
Abstract— A novel complex echo state network (ESN), utilizing full second-order statistical information in the complex domain, is introduced. This is achieved through the use of the so-called augmented complex statistics, thus making complex ESNs suitable for processing the generality of complex-valued signals, both second-order circular (proper) and noncircular (improper). Next, in order to deal with nonstationary processes with large nonlinear dynamics, a nonlinear readout layer is introduced and is further equipped with an adaptive amplitude of the nonlinearity. This combination of augmented complex statistics and enhanced adaptivity within ESNs also facilitates the processing of bivariate signals with strong component correlations. Simulations in the prediction setting, on both circular and noncircular synthetic benchmark processes and on real-world noncircular and nonstationary wind signals, support the analysis. Index Terms— Augmented complex statistics, complex noncircularity, echo state networks, widely linear modeling, wind prediction.
I. INTRODUCTION

RECURRENT neural networks (RNNs) are a class of nonlinear adaptive filters with feedback, whose computational power stems from their ability to act as universal approximators for any continuous function on a compact domain [1], [2]. Owing to their rich inherent memory through feedback, RNNs have found applications in the modeling of highly nonlinear dynamic systems and the associated attractor dynamics. They are typically used in system identification [3], [4], time-series prediction [5], [6], and adaptive noise cancellation settings [7], [8], where, owing to the nonstationary and nonlinear nature of the signals and the typically long impulse responses, using the class of static feedforward networks or transversal filters would result in undermodeling [2], [9], [10]. Recently, a class of discrete-time RNNs, called echo state networks (ESNs), has been introduced, with the aim to
Manuscript received November 12, 2009; revised August 1, 2010 and September 29, 2010; accepted October 2, 2010. Date of publication November 11, 2010; date of current version January 4, 2011.
Y. Xia, B. Jelfs, and D. P. Mandic are with the Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K. (e-mail: yili.xia06; beth.jelfs05; [email protected]).
M. M. Van Hulle is with the Laboratorium Neuro- en Psychofysiologie, Katholieke Universiteit Leuven, Leuven 3000, Belgium (e-mail: [email protected]).
J. C. Príncipe is with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNN.2010.2085444
reduce the complexity of computation encountered by standard RNNs [11]. The principle behind ESNs is to separate the RNN architecture into two constituent components: a recurrent architecture, called the “dynamical reservoir” or “hidden layer,” and a memoryless output layer, called the “readout neuron.” The recurrent architecture consists of a randomly generated group of hidden neurons with a specified degree of recurrent connections, and should satisfy the so-called “echo state property” to maintain stability [12]. This way, the high computational complexity of RNNs is significantly reduced owing to the sparse connections among the hidden neurons; in addition, the learning requirements are reduced to only the weights connecting the hidden layer and the readout neuron.1 Many real-world bivariate processes, such as vector fields and directional signals with “intensity” and “direction” components, are most conveniently represented as complex-valued [13]. Consequently, in the neural network literature, several important approaches have been extended to the complex domain; examples include coherent neural networks for sensorimotor systems [14], sonar signal prediction and image enhancement by multivalued neurons [15], gray-scale image processing by complex-valued multistate neural associative memory [16], and geometric figure transformation via complex-valued backpropagation networks [17]. The first extension of ESNs into the complex domain C was proposed in [18]; this network had a linear output mapping and was trained by the complex-valued Wiener filter, thus making the network second-order optimal for the processing of circular stationary data. Results in adaptive filtering dealing with real-world complex-valued data suggest that, due to the linearity of the output mapping, the degree of universal function approximation exhibited by standard ESNs may not be sufficient.
To that end, a nonlinear output layer within ESNs, i.e., the linear mapping followed by a nonlinear activation function, has been proposed in [19]. To deal with common problems experienced in neural network training, such as saturation and slow convergence resulting from the unknown and large dynamics of inputs, the nonlinear output layer of ESNs has further been equipped with an adaptive amplitude of the nonlinearity [19]. Adaptive filtering algorithms in the complex domain C are usually considered generic extensions of their real domain counterparts. For instance, a common assumption explicitly 1 A recent special issue of Neural Networks, vol. 20, no. 3, 2007, edited by H. Jaeger, W. Maass, and J. C. Príncipe, was dedicated solely to ESNs and liquid state machines.
1045–9227/$26.00 © 2010 IEEE
XIA et al.: ECHO STATE NETWORK FOR NONLINEAR ADAPTIVE FILTERING OF COMPLEX NONCIRCULAR SIGNALS
or implicitly exists in the signal processing literature that the complete second-order statistical information of a zero-mean complex vector z is contained in the covariance matrix E[zz^H]. However, recent results in so-called augmented complex statistics show that, in general, this leads to suboptimal estimation [20], and that for the generality of complex-valued random processes both the covariance matrix E[zz^H] and the pseudo-covariance matrix E[zz^T] should be considered to completely capture the available second-order statistics. In practice, this is achieved by widely linear modeling [21], which has proved particularly advantageous when processing second-order noncircular (improper) signals, for which the probability distributions are not rotation invariant2 [13], [22]. Recently, augmented complex statistics have been introduced into several key learning algorithms; examples include the augmented complex least mean square (ACLMS) [23], the augmented complex extended Kalman filter [24], and augmented complex real-time recurrent learning [25]. Following these results, we here introduce augmented statistics into the training of complex ESNs, allowing us to make use of all the available second-order statistical information and to produce optimal estimates for second-order noncircular (improper) signals. This paper is organized as follows. In Section II, we provide an overview of widely linear estimation and second-order augmented complex statistics. In Section III, the augmented complex ESN and its nonlinear variants are derived. Simulations on both synthetic circular and noncircular signals and real-world nonstationary and noncircular wind signals are given in Section IV, demonstrating the advantage of the augmented ESN over standard complex ESNs. Finally, Section V concludes this paper. II.
WIDELY LINEAR MODELING

Consider the real-valued mean squared error (MSE) estimator

ŷ = E[y|x]   (1)

which estimates the values of signal y in terms of another observation x. For zero-mean jointly normal y and x, the solution is the linear model

ŷ = x^T h   (2)

where h = [h_1, . . . , h_N]^T is a vector of fixed filter coefficients, and the past of the observed variable is contained in the regressor vector x = [x_1, . . . , x_N]^T. In the complex domain, it is assumed that we can use the same form of conditional mean estimator as that for real-valued signals given in (1), leading to the standard complex linear minimum mean squared error (MMSE) estimator3

ŷ = z^H h   (3)

2 Circular complex processes have rotation-invariant probability distribution functions.
3 Both y = z^T h and y = z^H h are correct, yielding the same output and mutually conjugate coefficient vectors. The latter form is more common, while the former was used in the original CLMS paper [26]; in this paper, we use the first form.
where the symbol (·)^H denotes the Hermitian transpose operator. However, the real-valued linear estimator in (2) applies to both the real and imaginary parts of complex variables

ŷ_r = E[y_r | z_r, z_i],   ŷ_i = E[y_i | z_r, z_i].   (4)

A more general MSE estimator than that in (3) can be expressed as

ŷ = E[y_r | z_r, z_i] + jE[y_i | z_r, z_i].   (5)

Upon employing the identities z_r = (z + z*)/2 and z_i = (z − z*)/2j, we arrive at

ŷ = E[y_r | z, z*] + jE[y_i | z, z*]   (6)

leading to a widely linear estimator for complex-valued data, given by

y = z^T h + z^H g   (7)

where h and g are complex-valued coefficient vectors. This estimator is suitable for linear MMSE estimation of the generality of complex-valued processes (both circular and noncircular) [21], as it accounts for the complete second-order information in C, as shown below.

A. Second-Order Augmented Complex Statistics

From (7), it is clear that the covariance matrix C_zz = E[zz^H] alone does not have sufficient degrees of freedom to describe the full second-order statistics in C [20] and, in order to make use of all the available statistical information, we also need to consider the pseudo-covariance matrix P_zz = E[zz^T]. Processes whose second-order statistics can be accurately described by the covariance matrix only, i.e., those for which the pseudo-covariance P_zz = 0, are termed second-order circular (or proper). In general, the notion of circularity extends beyond second-order statistics to describe the class of signals with rotation-invariant distributions P[·], for which P[z] = P[ze^{jθ}] for θ ∈ [0, 2π). In most real-world applications, complex signals are second-order noncircular or improper, and their probability density functions are not rotation-invariant. In practice, to account for the improperness, the input vector z is concatenated with its conjugate z* to produce an augmented 2N × 1 input vector

z^a = [z^T, z^H]^T.   (8)

This augmented input, together with the augmented weight vector w^a = [h^T, g^T]^T, forms the widely linear estimate in (7), and its 2N × 2N augmented covariance matrix is given by [22]

C^a_zz = E[z^a z^{aH}] = [ C_zz  P_zz ; P*_zz  C*_zz ].   (9)

This matrix now contains the complete second-order statistical information available in the complex domain (see [13], [27] for more details).
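The circularity notions above are easy to check numerically. The sketch below is our own illustration (NumPy, scalar case), not from the paper: a circular Gaussian signal has a vanishing pseudo-covariance, while a widely linear transform of it does not.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000

# doubly white circular Gaussian noise: independent parts, equal power
z = (rng.standard_normal(T) + 1j * rng.standard_normal(T)) / np.sqrt(2)

cov = np.mean(z * np.conj(z))        # scalar analog of C_zz = E[zz^H]
pseudo = np.mean(z * z)              # scalar analog of P_zz = E[zz^T]

# a widely linear transform renders the signal noncircular (improper)
w_sig = 0.6 * z + 0.8 * np.conj(z)
pseudo_w = np.mean(w_sig * w_sig)    # now far from zero (about 0.96)
```

The covariance alone cannot distinguish the two signals; only the pseudo-covariance reveals the improperness introduced by the conjugate term.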
III. AUGMENTED ESN

A. Standard Complex ESN with a Linear Output Mapping

Fig. 1 shows the architecture of a standard ESN, which is composed of K external input neurons, L readout neurons, and N internal units. Without loss of generality, we shall address ESNs with one readout neuron (L = 1), as this facilitates the nonlinear adaptive filtering setting within the ESN architecture. The input and internal weights are stored, respectively, in the (N × K) and (N × N) weight matrices W_ip and W_in, vector w_b comprises the feedback weights connecting the readout neuron and the internal units, vector x(k) is the (N × 1) internal state vector, u(k) represents the (K × 1) input vector, and y(k) is the overall output. The network state at time instant k, denoted by q(k), is a concatenation of the input u(k), internal state x(k), and the delayed output y(k − 1)

q(k) = [u(k), . . . , u(k − K + 1), x_1(k), . . . , x_N(k), y(k − 1)]^T   (10)

whereas the internal unit dynamics are described by [12]

x(k) = f(W_ip u(k) + W_in x(k − 1) + w_b y(k − 1))   (11)
where f(·) is a vector-valued nonlinear activation function of the neurons within the reservoir. The echo state property is provided by randomly choosing an internal weight matrix W_in and scaling it so that the spectral radius ρ(W_in) < 1, thus ensuring that the network is stable; the input and feedback weights can be initialized arbitrarily [12]. For an ESN with a linear output mapping, the output y(k) is given by

y(k) = q^T(k)w(k)   (12)

where w(k) is the weight vector corresponding to the output layer. Its update can be performed, e.g., based on the CLMS algorithm, given by [26]

w(k + 1) = w(k) + µe(k)q*(k).   (13)
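Equations (10)–(13) can be sketched in a few lines. The following NumPy illustration is our own, under simplifying assumptions that are not in the paper (a bounded split tanh as the reservoir nonlinearity f(·), and a toy noncircular sinusoid as input); it is not the authors' experimental code.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 10, 1                     # internal units, input tap length
rho = 0.8                        # spectral radius (the value used in Sec. IV)

W_ip = rng.uniform(-1, 1, (N, K)) + 1j * rng.uniform(-1, 1, (N, K))
W_in = rng.uniform(-1, 1, (N, N)) + 1j * rng.uniform(-1, 1, (N, N))
W_in *= rho / np.max(np.abs(np.linalg.eigvals(W_in)))   # echo state property
w_b = rng.uniform(-1, 1, N) + 1j * rng.uniform(-1, 1, N)

def split_tanh(z):
    # bounded stand-in for the generic reservoir activation f(.)
    return np.tanh(z.real) + 1j * np.tanh(z.imag)

def esn_clms(u, d, mu=0.005):
    """Linear-readout ESN: state (10)-(11), output (12), CLMS update (13)."""
    x = np.zeros(N, complex)
    w = np.zeros(N + K + 1, complex)     # readout weights for q(k)
    y_prev = 0j
    e = np.empty(len(u), complex)
    for k in range(len(u)):
        x = split_tanh(W_ip[:, 0] * u[k] + W_in @ x + w_b * y_prev)
        q = np.concatenate(([u[k]], x, [y_prev]))   # network state q(k)
        y = q @ w                                   # linear readout (12)
        e[k] = d[k] - y
        w = w + mu * e[k] * np.conj(q)              # CLMS update (13)
        y_prev = y
    return e

# one-step-ahead prediction of a noncircular complex sinusoid (toy input)
t = np.arange(3000)
s = np.cos(0.1 * t) + 0.4j * np.sin(0.1 * t)
e = esn_clms(s[:-1], s[1:])
```

With K = 1 the state q(k) has N + 2 entries; the prediction error shrinks as the readout weights adapt, while the reservoir itself is never trained.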
B. Augmented Complex ESN with a Linear Readout Neuron

Based on the widely linear model given in Section II, we shall now derive the augmented widely linear stochastic gradient algorithm for the training of complex ESNs, thus making them suitable for processing general complex-valued signals (both circular and noncircular). To this end, we introduce the augmented network state q^a(k) as4

q^a(k) = [u(k), . . . , u(k − K + 1), x^a_1(k), . . . , x^a_N(k), y(k − 1), u*(k), . . . , u*(k − K + 1)]^T.   (14)

Since the input weights of the ESN stored in matrix W_ip are randomly chosen prior to training, we can use another matrix W^a_ip to initialize the weights associated with the conjugate input vector u*(k). The internal state transition within the augmented ESN is therefore described by

x^a(k) = f(W_ip u(k) + W_in x^a(k − 1) + w_b y(k − 1) + W^a_ip u*(k)).   (15)

4 This augmented state is not a straightforward application of the widely linear model in (7) and is specific to ESNs.
Fig. 1. Architecture of an ESN, showing the input layer u(k), . . . , u(k − K + 1) with weights W_ip, the internal layer x(k) with weights W_in, the output layer y(k) with readout weights w, and the feedback weights w_b.
Due to the specific properties of the ESN architecture, the output y(k) of the augmented ESN with a linear output mapping is governed by an asymmetric version of the widely linear model in (7), to yield

y(k) = v^T(k)h(k) + u^H(k)g(k)   (16)

where v(k) is defined as v(k) = [u(k), . . . , u(k − K + 1), x^a_1(k), . . . , x^a_N(k), y(k − 1)]^T, and h(k) and g(k) denote, respectively, the conventional and conjugate output weight vectors. Note that the ESN has a local feedback (from the output to the internal state) and thus, unlike standard feedback structures [13], [28], the output within the state vector (14) of the ESN does not require augmentation with its conjugate. Therefore, due to the local feedback, the conjugate weight vector g(k) is only associated with the conjugate input signal. The standard and conjugate weight vectors are thus of different dimensions but, as with all widely linear models, the conjugate weight vector g(k) = 0 for a circular input signal.

C. Training of Augmented ESNs

The ESN is trained based on the cost function

J(k) = (1/2)|e(k)|² = (1/2)e(k)e*(k)   (17)

where e(k) = d(k) − y(k) is the instantaneous output error and d(k) is the desired (teaching) signal. The update of the conjugate weight vector g(k) in (16) is given by

g(k + 1) = g(k) − µ∇_g J(k)   (18)

where µ is the learning rate. Note that J(k) is a real-valued function dependent on both output errors e(k) and e*(k). It can be shown that the maximum change in the cost function on the error surface occurs in the direction of the conjugate gradient ∂J(k)/∂g*(k), i.e., [29]–[31]

∇_g J(k) = ∂J(k)/∂g*(k).   (19)

Expanding the term ∂J(k)/∂g*(k) gives

∇_g J(k) = (1/2)[ e(k) ∂e*(k)/∂g*(k) + e*(k) ∂e(k)/∂g*(k) ].   (20)
Since

e*(k) = d*(k) − v^H(k)h*(k) − u^T(k)g*(k)   (21)

and ∂e(k)/∂g*(k) = 0, we obtain

∇_g J(k) = −(1/2) e(k)u(k)   (22)

giving the update of the conjugate weight vector g(k) in the form5

g(k + 1) = g(k) + µe(k)u(k).   (23)

In a similar way, for the update of the conventional weight vector h(k), we have

e(k) = d(k) − v^T(k)h(k) − u^H(k)g(k)   (24)

with the gradient

∂J(k)/∂h*(k) = (1/2)[ e(k) ∂e*(k)/∂h*(k) + e*(k) ∂e(k)/∂h*(k) ]   (25)

giving

∇_h J(k) = −(1/2) e(k)v*(k)   (26)

and the update

h(k + 1) = h(k) + µe(k)v*(k).   (27)

We can express the updates in (23) and (27) in a compact vector form by defining the augmented weight vector

w^a(k) = [h^T(k), g^T(k)]^T   (28)

to give the update of the ACLMS algorithm for ESNs

w^a(k + 1) = w^a(k) + µe(k)q^{a*}(k)   (29)

where q^a(k) is the augmented network state in (14). Owing to the use of the widely linear model and augmented complex statistics, the augmented ESN has clear theoretical advantages over the standard ESN for the processing of noncircular complex signals. However, due to the sparse nature of the connectivity within the reservoir, the linear output mapping may not be powerful enough for efficient modeling of signals with large nonlinear dynamics. To this end, in the next section, we introduce a nonlinear readout neuron equipped with a trainable amplitude of nonlinearity.

D. Complex ESN with a Nonlinear Readout Neuron

The output y(k) of the standard ESN with a nonlinear output layer is given by [12]

y(k) = Φ(net(k)) = Φ(q^T(k)w(k))   (30)

where Φ(·) is the output activation function, and the weight vector w(k) is updated by minimizing the cost function J(k), given in (17), to give

w(k + 1) = w(k) − µ∇_w J(k).   (31)

Similarly to the previous section, we can obtain the weight update of the complex nonlinear gradient descent (CNGD) algorithm [13], [32] for ESNs as6 (the factor 1/2 is absorbed in µ)

w(k + 1) = w(k) + µe(k)Φ′*(net(k))q*(k).   (32)

To derive the corresponding augmented CNGD (ACNGD) algorithm for the nonlinear output layer of the augmented ESN, recall that the output y(k) is given by

y(k) = Φ(net(k)) = Φ(v^T(k)h(k) + u^H(k)g(k)).   (33)

The update of the conjugate weight vector g(k) in (33) is given by

g(k + 1) = g(k) − µ∇_g J(k).   (34)

From (20), after setting ∂e(k)/∂g*(k) = 0, we have

∇_g J(k) = (1/2) e(k) ∂e*(k)/∂g*(k).   (35)

Since e*(k) = d*(k) − Φ*(net(k)), and for complex transcendental functions ∂Φ*(net(k))/∂g*(k) = (∂Φ(net(k))/∂g(k))*, using the chain rule we arrive at

∇_g J(k) = −(1/2) e(k)Φ′*(net(k))u(k)   (36)

giving the update for the weight vector g(k) in the form

g(k + 1) = g(k) + µe(k)Φ′*(net(k))u(k).   (37)

In a similar way, the update of the weight vector h(k) becomes

h(k + 1) = h(k) + µe(k)Φ′*(net(k))v*(k)   (38)

and the update of the augmented weight vector

w^a(k + 1) = w^a(k) + µe(k)Φ′*(net(k))q^{a*}(k).   (39)

5 The factor 1/2 in (23) has been incorporated into the learning rate µ.
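The benefit of the conjugate weight vector can be illustrated outside the full ESN. The sketch below is our own setup with hypothetical coefficients, not from the paper: it compares a strictly linear LMS filter with a widely linear filter using the update pair (23) and (27) on an improper signal.

```python
import numpy as np

rng = np.random.default_rng(2)
T, N, mu = 5000, 4, 0.01

# improper (noncircular) input: unequal real/imaginary powers
z_in = rng.standard_normal(T) + 0.3j * rng.standard_normal(T)

# teaching signal: a widely linear FIR map (hypothetical coefficients)
h_true = np.array([0.2, -0.1, 0.4, 0.3])
g_true = np.array([0.1, 0.3, -0.2, 0.1])
d = np.zeros(T, complex)
for k in range(N, T):
    z = z_in[k - N:k]
    d[k] = z @ h_true + np.conj(z) @ g_true

def run(widely_linear):
    """Strictly linear LMS vs. widely linear updates (23) and (27)."""
    h = np.zeros(N, complex)
    g = np.zeros(N, complex)
    mse = 0.0
    for k in range(N, T):
        z = z_in[k - N:k]
        y = z @ h + (np.conj(z) @ g if widely_linear else 0.0)
        e = d[k] - y
        h = h + mu * e * np.conj(z)        # conventional update (27)
        if widely_linear:
            g = g + mu * e * z             # conjugate update (23)
        if k >= T - 1000:
            mse += abs(e) ** 2 / 1000      # steady-state MSE estimate
    return mse

mse_wl, mse_l = run(True), run(False)
```

On this improper input the strictly linear filter cannot model the conjugate-dependent part of the teaching signal, so its steady-state MSE stays bounded away from zero, while the widely linear filter drives the error toward zero.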
E. ESNs with an Adaptive Amplitude of Output Nonlinearity

The nonlinear output layer has been introduced into ESNs to provide a sufficient degree of nonlinearity for enhanced modeling; however, this does not automatically guarantee optimal modeling, as some parameters, such as the amplitude of the nonlinear readout neuron, need to be chosen empirically [33]. To this end, we now introduce a gradient-adaptive amplitude into the output nonlinearity of ESNs. In this case, the activation function can be expressed as

y(k) = Φ(net(k)) = λ(k)Φ̄(net(k))   (40)

where the real-valued λ(k) > 0 denotes the amplitude of the nonlinearity Φ(net(k)), and Φ̄(net(k)) is the unit-amplitude activation function, for which λ(k) = 1. In the stochastic gradient setting, the parameter λ can be made gradient adaptive as [19], [33]

λ(k + 1) = λ(k) − η∇_λ J(k)   (41)

6 For most standard nonlinear activation functions in C (transcendental functions such as tanh), we have (Φ*)′ = Φ′*.
where η is the step size of the algorithm. The gradient is

∇_λ J(k) = ∂J(k)/∂λ(k) = (1/2) ∂(e(k)e*(k))/∂λ(k) = (1/2)[ e*(k) ∂e(k)/∂λ(k) + e(k) ∂e*(k)/∂λ(k) ]   (42)

and since λ(k) is real-valued, ∂e*(k)/∂λ(k) = (∂e(k)/∂λ(k))*, giving

∂e(k)/∂λ(k) = −Φ̄(net(k)).   (43)

The update for the amplitude of the nonlinearity within the adaptive amplitude CNGD (AACNGD) algorithm is therefore given by

λ(k + 1) = λ(k) + (η/2)[ e*(k)Φ̄(net(k)) + e(k)Φ̄*(net(k)) ]   (44)

and applies to both the standard CNGD and ACNGD learning algorithms; for more detail, see [13].
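One step of the adaptive-amplitude readout, (40)–(44), can be sketched as follows. This is our own simplified scalar-output illustration with a complex tanh as the unit-amplitude activation; the toy target, coefficients, and step sizes are hypothetical choices, not the paper's experimental setup.

```python
import numpy as np

def aacngd_step(w, lam, q, d, mu=0.1, eta=0.05):
    """One step of a nonlinear readout with gradient-adaptive amplitude."""
    net = q @ w
    phi = np.tanh(net)                   # unit-amplitude activation
    y = lam * phi                        # output (40)
    e = d - y
    # CNGD-style weight update, (32), with the derivative of lam*tanh(net)
    dphi = lam * (1.0 - phi ** 2)
    w = w + mu * e * np.conj(dphi) * np.conj(q)
    # amplitude update (44): lam += (eta/2)(e* phi + e phi*), a real quantity
    lam = lam + 0.5 * eta * (np.conj(e) * phi + e * np.conj(phi)).real
    return w, lam, e

# toy usage: fit a hypothetical target y = 0.8*tanh(q . w_true)
rng = np.random.default_rng(5)
w_true = np.array([0.3 + 0.2j, -0.4j, 0.25])
w, lam = np.zeros(3, complex), 1.0
errs = []
for _ in range(4000):
    q = rng.uniform(-0.5, 0.5, 3) + 1j * rng.uniform(-0.5, 0.5, 3)
    d = 0.8 * np.tanh(q @ w_true)
    w, lam, e = aacngd_step(w, lam, q, d)
    errs.append(abs(e))
```

The amplitude increment is η·Re{e(k)Φ̄*(net(k))}, so λ(k) remains real while the weights stay complex.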
Fig. 2. Probability density functions obtained from a doubly white circular noise after applying the widely linear model and the tanh nonlinearity. (a) Real part of WL(n(k)). (b) Imaginary part of WL(n(k)). (c) Real part of tanh(n(k)). (d) Imaginary part of tanh(n(k)).
F. Merits of Nonlinearity and the Widely Linear Model

To illustrate the effect of nonlinearity and the widely linear model, we generated a circular doubly white noise n(k), with zero mean and unit variance, and passed it through a complex-valued tanh nonlinearity, defined in (45), and through the widely linear model given by x(k) = WL(n(k)) = 0.6n(k) + 0.8n*(k). Fig. 2(a) and (b) shows that the application of the widely linear model does not change the Gaussian nature of the real and imaginary parts; it only alters their power ratio. Fig. 2(c) and (d) shows that the application of the tanh nonlinearity alters the character of the distribution, which cannot be achieved using the widely linear model. It is therefore natural to combine the widely linear model and nonlinear processing in order to deal simultaneously with these various aspects of the nature of the data. By introducing the adaptive amplitude of nonlinearity, we have an additional degree of freedom, allowing the nonlinear function to operate in a quasi-linear range if so required by the nature of the data.
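The claim about Fig. 2(a) and (b) can be checked numerically. The sketch below is our own, using the sample kurtosis as a simple Gaussianity proxy: each part of WL(n(k)) remains Gaussian (kurtosis near 3), while the real-to-imaginary power ratio becomes (0.6 + 0.8)²/(0.6 − 0.8)² = 49.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 400_000
n = (rng.standard_normal(T) + 1j * rng.standard_normal(T)) / np.sqrt(2)

wl = 0.6 * n + 0.8 * np.conj(n)      # widely linear model WL(n(k)) from the text

def kurtosis(x):
    """Sample kurtosis; equals 3 for a Gaussian."""
    x = x - np.mean(x)
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2

k_r, k_i = kurtosis(wl.real), kurtosis(wl.imag)   # both close to 3
ratio = np.var(wl.real) / np.var(wl.imag)         # close to 49
```

The tanh path of Fig. 2(c) and (d), by contrast, reshapes the distribution itself, which no widely linear (i.e., distribution-preserving up to scaling) operation can reproduce.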
IV. SIMULATIONS

To verify the potential of the proposed augmented ESNs compared to standard complex ESNs, we performed simulations on benchmark synthetic proper and improper signals, and on noncircular real-world wind data. For all signals, the experiments were conducted in the adaptive prediction setting, averaging over 200 independent trials. The nonlinearity at the nonlinear output layer of the ESNs was chosen to be the complex tanh function

Φ(x) = (e^{βx} − e^{−βx}) / (e^{βx} + e^{−βx})   (45)

with slope β = 1. Ten neurons were used in the hidden layer, with the internal connection weights having a 5% degree of connectivity. The input tap length was K = 1, with no bias input. The values of the randomly selected input, internal, and feedback weights Wip, Win, and wb were drawn from a uniform distribution in the range [−1, +1], and the spectral radius ρ(Win) was set to 0.8. The learning rate was µ = 0.005 for all the learning algorithms considered, with the initial amplitude for the AACNGD algorithm λ(0) = 1 and the step size of the adaptive amplitude update η = 0.2.

The benchmark circular signal was a stable linear autoregressive AR(4) process, given by [2]

r(k) = 1.79r(k − 1) − 1.85r(k − 2) + 1.27r(k − 3) − 0.41r(k − 4) + n(k)   (46)

driven by complex-valued doubly circular white Gaussian noise n(k) with zero mean and unit variance. The benchmark noncircular signal was a complex AR moving-average (ARMA) process, whose transfer function combines the MA model in [34] and the stable AR model in (46), given by

r(k) = 1.79r(k − 1) − 1.85r(k − 2) + 1.27r(k − 3) − 0.41r(k − 4) + 0.2r(k − 5) + 2n(k) + 0.5n*(k) + n(k − 1) + 0.9n*(k − 1)   (47)

with

E{n(k − i)n*(k − j)} = δ(i − j),  E{n(k − i)n(k − j)} = Cδ(i − j)   (48)

where n(k) is the complex-valued doubly circular white Gaussian noise7 and C = 0.95 [34]. The nonlinear and noncircular chaotic Ikeda map signal is given by [35]

x(k + 1) = 1 + u(x(k)cos[t(k)] − y(k)sin[t(k)])
y(k + 1) = u(x(k)sin[t(k)] + y(k)cos[t(k)])   (49)

where u = 0.9 and t(k) = 0.4 − 6/(1 + x²(k) + y²(k)).

7 Double whiteness implies uncorrelated real and imaginary channels; for circular white Gaussian noise, n = n_r + jn_i, σ²_{nr} = σ²_{ni}, whereas for noncircular data σ²_{nr} > σ²_{ni}, where σ²_{nr} and σ²_{ni} are, respectively, the powers of the real and imaginary parts.
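The benchmark signals in (46) and (49) can be synthesized as follows (a sketch; the initial conditions and transient length are our own choices, not specified in the paper):

```python
import numpy as np

def circular_white_noise(N, rng):
    # doubly circular white Gaussian noise: zero mean, unit variance
    return (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

def ar4(N, rng, transient=200):
    # stable circular AR(4) benchmark, eq. (46)
    n = circular_white_noise(N + transient, rng)
    r = np.zeros(N + transient, dtype=complex)
    for k in range(4, N + transient):
        r[k] = (1.79 * r[k-1] - 1.85 * r[k-2] + 1.27 * r[k-3]
                - 0.41 * r[k-4] + n[k])
    return r[transient:]            # drop the start-up transient

def ikeda(N, u=0.9, x0=0.1, y0=0.1):
    # nonlinear, noncircular Ikeda map, eq. (49), returned as x(k) + j y(k)
    x, y = x0, y0
    z = np.empty(N, dtype=complex)
    for k in range(N):
        t = 0.4 - 6.0 / (1.0 + x * x + y * y)
        x, y = (1.0 + u * (x * np.cos(t) - y * np.sin(t)),
                u * (x * np.sin(t) + y * np.cos(t)))
        z[k] = x + 1j * y
    return z

r = ar4(2000, np.random.default_rng(1))
z = ikeda(2000)
```

Note that the map in (49) rotates (x, y) by t(k), scales by u, and shifts x by 1, so its trajectory stays bounded (|z(k+1)| ≤ 1 + u|z(k)|).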
XIA et al.: ECHO STATE NETWORK FOR NONLINEAR ADAPTIVE FILTERING OF COMPLEX NONCIRCULAR SIGNALS
TABLE I
COMPARISON OF DEGREES OF NONCIRCULARITY s FOR THE VARIOUS CLASSES OF SIGNALS

Signal | Circular AR(4) (46) | Noncircular ARMA (47) | Ikeda map (49) | Wind (low) | Wind (medium) | Wind (high)
s      | 0.0016              | 0.9429                | 0.8936         | 0.1583     | 0.4305        | 0.8117
Fig. 3. Geometric view of circularity via “real-imaginary” scatter plots. (a) Circular AR(4) process (46). (b) Noncircular ARMA process (47). (c) Noncircular Ikeda map (49). (d) Wind (low) signal. (e) Wind (medium) signal. (f) Wind (high) signal.
The wind data was sampled at 50 Hz and was collected in an urban environment over a one-day period [36], represented as a vector of speed and direction in the North–East coordinate system. The wind signal was made complex by combining the wind speed v and direction ϕ into ν = ve^{jϕ}. Based on the changes in wind intensity, the noncircular wind data were segmented into regions of low, medium, and high dynamics. Here we perform simultaneous prediction of wind speed and direction; results for wind power can be found in [37] and [38]. Fig. 3 shows the scatter plots of the complex signals considered in the simulations. Observe the circular symmetry (rotation invariance) of the AR(4) signal (46) and the noncircularity of the ARMA model (47), the Ikeda map (49), and the wind signals. For a quantitative measure of the degree of noncircularity of a complex vector z, we used the index s given by [39]

s = 1 − det(C_{z^a z^a}) det^{−2}(C_{zz})   (50)
where det(·) denotes the matrix determinant. The degree of noncircularity s is normalized to within [0, 1], with the value 0 indicating perfect circularity. Table I lists the degrees of noncircularity s for the various classes of signals. Observe the excellent match between the measure of noncircularity in Table I and the scatter plots in Fig. 3; for instance, among the wind segments, the wind (low) region was the least noncircular, whereas the wind (high) region exhibited strong noncircularity.

Fig. 4. MSEs of one-step-ahead prediction for augmented and standard ESN trained by the AACNGD algorithm. (a) Noncircular Ikeda map (49). (b) Noncircular and nonstationary wind (high) signal.

The standard prediction gain R_p = 10 log₁₀(σ̂²_x/σ̂²_e) [dB] was employed to assess the performance, where σ̂²_x and σ̂²_e denote, respectively, the variance of the input signal x(k) and of the forward prediction error e(k). Table II compares the averaged prediction gains R_p (dB) and their standard deviations over 200 independent trials for the standard and augmented ESNs trained by the CLMS, CNGD, and AACNGD algorithms, as well as a dual univariate ESN trained by the LMS algorithm,8 for the complex-valued signals considered. As expected, the dual univariate ESN, which treats the real and imaginary
8 The dual univariate approach deals with complex-valued data by splitting the input signal into its real and imaginary parts and treating them as independent real-valued quantities [13]. In a dual univariate ESN, two independent reservoirs are generated to model the two components of the input.
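For a scalar time series, (50) reduces to s = |p|²/c², where c = E{|z|²} is the covariance and p = E{z²} the pseudo-covariance (the augmented covariance matrix is [[c, p], [p*, c]]). Together with the prediction gain R_p, this can be sketched as follows (our scalar specialization, not the authors' code):

```python
import numpy as np

def noncircularity_index(z):
    # eq. (50) for a scalar series: s = 1 - (c^2 - |p|^2)/c^2 = |p|^2 / c^2
    z = z - z.mean()
    c = np.mean(np.abs(z) ** 2)   # covariance        E{|z|^2}
    p = np.mean(z ** 2)           # pseudo-covariance E{z^2}
    return np.abs(p) ** 2 / c ** 2

def prediction_gain_db(x, e):
    # R_p = 10 log10( var(x) / var(e) ) in dB
    return 10.0 * np.log10(np.var(x) / np.var(e))

rng = np.random.default_rng(2)
circ = (rng.standard_normal(5000) + 1j * rng.standard_normal(5000)) / np.sqrt(2)
print(noncircularity_index(circ))            # ~0: near-perfect circularity
print(noncircularity_index(circ.real + 0j))  # 1: a real signal is maximally noncircular
```

The two limiting cases bracket the values reported in Table I: the circular AR(4) signal sits near 0, the strongly improper signals closer to 1.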
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
TABLE II
COMPARISON OF PREDICTION GAINS R_p AND THEIR STANDARD DEVIATIONS (IN BRACKETS) FOR THE VARIOUS CLASSES OF SIGNALS, AVERAGED OVER 200 INDEPENDENT INITIALIZATIONS OF THE ESN

R_p [dB]                         | Circular AR(4)  | Noncircular ARMA | Ikeda map        | Wind (low)      | Wind (medium)   | Wind (high)
Dual univariate ESN (LMS)        | 4.1928 (0.5849) | 2.8064 (0.3211)  | −0.0271 (0.1846) | 2.3519 (0.5882) | 4.0615 (0.9153) | 8.1162 (0.9124)
Standard ESN (CLMS)              | 4.6701 (0.6112) | 3.5341 (0.3541)  | 2.1134 (0.3619)  | 2.4635 (0.4516) | 4.7938 (0.7519) | 9.5917 (0.9746)
Augmented ESN (ACLMS)            | 4.5489 (0.4792) | 4.0396 (0.3624)  | 3.1967 (0.4258)  | 2.6417 (0.3620) | 5.3029 (0.7781) | 10.2063 (0.9537)
Standard ESN (CNGD)              | 4.6947 (0.5957) | 3.6744 (0.3438)  | 2.1558 (0.3005)  | 2.6653 (0.4738) | 5.0529 (0.7853) | 9.9982 (1.0130)
Augmented ESN (Augmented CNGD)   | 4.5764 (0.4665) | 4.1511 (0.3380)  | 3.2666 (0.5435)  | 2.8095 (0.3681) | 5.7438 (0.8054) | 10.7125 (0.9860)
Standard ESN (AACNGD)            | 6.6080 (0.5808) | 5.0524 (0.5013)  | 2.4679 (0.5673)  | 4.0850 (0.3472) | 6.3256 (0.7399) | 11.7789 (0.9065)
Augmented ESN (Augmented AACNGD) | 6.5357 (0.4970) | 5.2484 (0.6149)  | 3.5912 (0.3053)  | 4.2571 (0.3499) | 6.8286 (0.7968) | 12.1900 (1.0272)
TABLE III
PERCENTAGE OF ENHANCED PERFORMANCE OF AUGMENTED ESN ALGORITHMS FOR COMPLEX NONCIRCULAR SIGNALS

                                 | Ikeda map | Noncircular ARMA | Wind (low) | Wind (medium) | Wind (high)
Augmented ESN (ACLMS)            | 98.5%     | 95.5%            | 91.5%      | 94%           | 91%
Augmented ESN (Augmented CNGD)   | 98%       | 95%              | 90%        | 95%           | 94.5%
Augmented ESN (Augmented AACNGD) | 99%       | 93.5%            | 89.5%      | 92.5%         | 92%
Fig. 5. Comparison of performances of standard and augmented ESNs trained by different algorithms, over a range of degrees of connectivity, on one-step-ahead prediction of the wind (medium) and (high) signals. (a) Wind (high). (b) Wind (medium).

Fig. 6. Comparison of performances of standard and augmented ESNs trained by different algorithms, over a range of spectral radii, on one-step-ahead prediction of the noncircular wind (medium) and (high) signals. (a) Wind (high). (b) Wind (medium).
parts of complex-valued data as two independent channels, had the worst performance. For the circular AR(4) signal, the performance of the augmented complex ESN was similar to that of the standard ESN. For the noncircular signals, there was a significant improvement in the prediction gain when
the augmented ESN was employed. As desired, the advantage of the nonlinear output layer over the linear output mapping was more pronounced in the prediction of the nonlinear synthetic signal and nonlinear and nonstationary real-world wind signals. In practice, due to the randomly generated internal
Fig. 7. Comparison of performances of standard and augmented ESNs trained by different algorithms, over a range of reservoir sizes, on one-step-ahead prediction of the noncircular ARMA and wind (medium) signals. (a) Noncircular ARMA. (b) Wind (medium).
Fig. 8. Comparison of performances of standard and augmented ESNs trained by different algorithms on multiple-step-ahead prediction of the wind (medium) and (high) signals. (a) Wind (high). (b) Wind (medium).
reservoir within an ESN, the augmented ESN cannot guarantee enhanced performance over its standard version in every trial; however, as illustrated in Table III, on average the widely linear algorithms outperformed the corresponding standard ones in more than 90% of the trials. To further illustrate the advantage of using augmented complex statistics within complex-valued ESNs, we compared the MSEs of the augmented and standard ESNs with adaptive amplitude of nonlinearity on the prediction of the complex-valued synthetic nonlinear and noncircular Ikeda map and of the noncircular wind (high) signal. Fig. 4 shows that, in both cases, the augmented ESN with a nonlinear readout neuron trained by the augmented AACNGD outperformed its standard version. We next investigated the influence of two parameters related to the generation of the internal layer, the degree of connectivity and the spectral radius ρ(Win), on the performance of the standard and augmented ESNs. Figs. 5 and 6 show that, in all cases, for the prediction of the real-world wind (medium) and (high) signals, the augmented ESN trained by the augmented AACNGD algorithm achieved the best performance, and that for both learning strategies it is desirable to keep a low
degree of connectivity within the reservoir. This conforms to the ESN theory [12] that a small degree of connectivity performs a relative decoupling of subnetworks, giving rich reservoir dynamics. The size of the dynamical reservoir is another important parameter that influences the performance of ESNs, as it reflects their universal approximation ability. An ESN with a larger reservoir can learn the signal dynamics with higher accuracy [40], as shown in Fig. 7(a) for one-step-ahead prediction of the noncircular ARMA process in (47). This, however, applies to stationary signals; for fast-changing nonstationary processes, a larger reservoir caused saturation of the internal neurons, resulting in performance degradation, as shown in Fig. 7(b) for the prediction of the nonstationary wind (medium) signal. Observe that, in all cases, the augmented ESNs outperformed their standard counterparts. In the final set of simulations, we considered multistep-ahead prediction of the noncircular and nonstationary wind (medium) and (high) data. Fig. 8 shows the prediction gains of the ESNs for prediction horizons M = 1, 2, 3, 4, and 5; in all cases, the augmented ESN with adaptive amplitude of nonlinearity achieved the best performance.
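The two reservoir parameters discussed above are fixed at construction time. A minimal sketch of generating a sparse internal weight matrix with a prescribed degree of connectivity and spectral radius (our own illustration of the standard recipe, not the authors' implementation):

```python
import numpy as np

def make_reservoir(n_neurons=10, connectivity=0.05, rho=0.8, seed=0):
    # internal weights drawn uniformly from [-1, 1], sparsified so that only
    # ~`connectivity` of the connections survive, then rescaled so that the
    # spectral radius equals `rho`
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_neurons, n_neurons))
    W *= rng.random((n_neurons, n_neurons)) < connectivity
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    if radius > 0:          # a very sparse draw can have zero spectral radius
        W *= rho / radius
    return W

W = make_reservoir(100, connectivity=0.05, rho=0.8)
```

Because eigenvalues scale linearly with the matrix, the final rescaling pins the spectral radius exactly at rho without altering the (sparse) connection pattern.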
V. CONCLUSION

An augmented complex ESN has been introduced for nonlinear adaptive filtering of the generality of complex-valued signals. The proposed ESN has been derived based on augmented complex statistics, making it suitable for both second-order circular and noncircular signals. For generality, a nonlinear output layer has been introduced, and, to deal with signals with large dynamics, an adaptive amplitude has been incorporated into the output layer of the augmented ESN. The proposed augmented ESNs have been shown to exhibit theoretical and practical advantages over their conventional counterparts. This has been verified through comprehensive simulations on both synthetic noncircular data and real-world wind measurements, over a range of parameters.

ACKNOWLEDGMENT

The authors would like to thank Prof. Aihara's team at the Institute of Industrial Science, University of Tokyo, Tokyo, Japan, for providing the wind data used in the simulations. They also acknowledge Gill Instruments, Hampshire, U.K., for providing the ultrasonic anemometers used for the wind recordings.

REFERENCES

[1] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. Contr. Signals Syst., vol. 2, no. 4, pp. 303–314, 1989.
[2] D. P. Mandic and J. A. Chambers, Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. New York: Wiley, 2001.
[3] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Trans. Neural Netw., vol. 1, no. 1, pp. 4–27, Mar. 1990.
[4] C. H. Lee and C. C. Teng, "Identification and control of dynamic systems using recurrent fuzzy neural networks," IEEE Trans. Fuzzy Syst., vol. 8, no. 4, pp. 349–366, Aug. 2000.
[5] S. Haykin and L. Li, "Nonlinear adaptive prediction of nonstationary signals," IEEE Trans. Signal Process., vol. 43, no. 2, pp. 526–535, Feb. 1995.
[6] D. P. Mandic and J. A. Chambers, "Toward an optimal PRNN-based nonlinear predictor," IEEE Trans. Neural Netw., vol. 10, no. 6, pp. 1435–1442, Nov. 1999.
[7] S. A. Billings and C. F. Fung, "Recurrent radial basis function networks for adaptive noise cancellation," Neural Netw., vol. 8, no. 2, pp. 273–290, 1995.
[8] C.-M. Lin, L.-Y. Chen, and D. S. Yeung, "Adaptive filter design using recurrent cerebellar model articulation controller," IEEE Trans. Neural Netw., vol. 21, no. 7, pp. 1149–1157, Jul. 2010.
[9] J. L. Elman, "Finding structure in time," Cognitive Sci., vol. 14, no. 2, pp. 179–211, Mar. 1990.
[10] W. Liu, I. Park, and J. C. Príncipe, "An information theoretic approach of designing sparse kernel adaptive filters," IEEE Trans. Neural Netw., vol. 20, no. 12, pp. 1950–1961, Dec. 2009.
[11] H. Jaeger and H. Haas, "Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication," Science, vol. 304, no. 5667, pp. 78–80, Apr. 2004.
[12] H. Jaeger, "The echo state approach to analyzing and training neural networks," German Nat. Res. Inst. Inform. Technol., Sankt Augustin, Germany, Rep. 148, 2002.
[13] D. P. Mandic and S. L. Goh, Complex Valued Nonlinear Adaptive Filters: Noncircularity, Widely Linear and Neural Models. New York: Wiley, 2009.
[14] A. Hirose, Complex-Valued Neural Networks. New York: Springer-Verlag, 2006.
[15] I. N. Aizenberg, N. N. Aizenberg, and J. P. L. Vandewalle, Multi-Valued and Universal Binary Neurons: Theory, Learning and Applications. Norwell, MA: Kluwer Academic, 2000.
[16] S. Jankowski, A. Lozowski, and J. M. Zurada, "Complex-valued multistate neural associative memory," IEEE Trans. Neural Netw., vol. 7, no. 6, pp. 1491–1496, Nov. 1996.
[17] T. Nitta, "An extension of the back-propagation algorithm to complex numbers," Neural Netw., vol. 10, no. 8, pp. 1391–1415, Nov. 1997.
[18] S. Seth, M. C. Ozturk, and J. C. Príncipe, "Signal processing with echo state networks in the complex domain," in Proc. IEEE Workshop Mach. Learn. Signal Process., Thessaloniki, Greece, Aug. 2007, pp. 408–412.
[19] Y. Xia, D. P. Mandic, M. M. Van Hulle, and J. C. Príncipe, "A complex echo state network for nonlinear adaptive filtering," in Proc. IEEE Workshop Mach. Learn. Signal Process., Cancun, Mexico, Oct. 2008, pp. 404–408.
[20] F. D. Neeser and J. L. Massey, "Proper complex random processes with applications to information theory," IEEE Trans. Inform. Theory, vol. 39, no. 4, pp. 1293–1302, Jul. 1993.
[21] B. Picinbono and P. Chevalier, "Widely linear estimation with complex data," IEEE Trans. Signal Process., vol. 43, no. 8, pp. 2030–2033, Aug. 1995.
[22] P. J. Schreier and L. L. Scharf, "Second-order analysis of improper complex random vectors and processes," IEEE Trans. Signal Process., vol. 51, no. 3, pp. 714–725, Mar. 2003.
[23] S. Javidi, M. Pedzisz, S. L. Goh, and D. Mandic, "The augmented complex least mean square algorithm with application to adaptive prediction problems," in Proc. 1st IARP Workshop Cognitive Inform. Process., 2008, pp. 54–57.
[24] S. L. Goh and D. P. Mandic, "An augmented extended Kalman filter algorithm for complex-valued recurrent neural networks," Neural Comput., vol. 19, no. 4, pp. 1039–1055, Apr. 2007.
[25] S. L. Goh and D. P. Mandic, "An augmented CRTRL for complex-valued recurrent neural networks," Neural Netw., vol. 20, no. 10, pp. 1061–1066, 2007.
[26] B. Widrow, J. McCool, and M. Ball, "The complex LMS algorithm," Proc. IEEE, vol. 63, no. 4, pp. 719–720, Apr. 1975.
[27] P. Schreier and L. Scharf, Statistical Signal Processing of Complex-Valued Data: The Theory of Improper and Noncircular Signals. Cambridge, U.K.: Cambridge Univ. Press, 2010.
[28] C. C. Took and D. P. Mandic, "Adaptive IIR filtering of noncircular complex signals," IEEE Trans. Signal Process., vol. 57, no. 10, pp. 4111–4118, Oct. 2009.
[29] D. H. Brandwood, "A complex gradient operator and its application in adaptive array theory," IEE Proc. Commun., Radar Signal Process., vol. 130, no. 1, pp. 11–16, Feb. 1983.
[30] A. van den Bos, "Complex gradient and Hessian," IEE Proc. Vis., Image Signal Process., vol. 141, no. 6, pp. 380–383, Dec. 1994.
[31] K. Kreutz-Delgado, "The complex gradient operator and the CR-calculus," Dept. Elect. Comput. Eng., Univ. California, San Diego, Tech. Rep. ECE275A, 2006.
[32] G. M. Georgiou and C. Koutsougeras, "Complex domain backpropagation," IEEE Trans. Circuits Syst. II, vol. 39, no. 5, pp. 330–334, May 1992.
[33] E. Trentin, "Networks with trainable amplitude of activation functions," Neural Netw., vol. 14, nos. 4–5, pp. 471–493, May 2001.
[34] J. Navarro-Moreno, "ARMA prediction of widely linear systems by using the innovations algorithm," IEEE Trans. Signal Process., vol. 56, no. 7, pp. 3061–3068, Jul. 2008.
[35] K. Aihara, Applied Chaos and Applicable Chaos. Tokyo, Japan: Science-Sha, 1994.
[36] D. P. Mandic, S. Javidi, S. L. Goh, A. Kuh, and K. Aihara, "Complex-valued prediction of wind profile using augmented complex statistics," Renewable Energy, vol. 34, no. 1, pp. 196–210, Jan. 2009.
[37] G. Sideratos and N. Hatziargyriou, "An advanced statistical method for wind power forecasting," IEEE Trans. Power Syst., vol. 22, no. 1, pp. 258–265, Feb. 2007.
[38] R. J. Bessa, V. Miranda, J. C. Príncipe, A. Botterud, and J. Wang, "Information theoretic learning applied to wind power modelling," in Proc. IEEE World Congr. Comput. Intell., Barcelona, Spain, Jul. 2010, pp. 2409–2416.
[39] P. J. Schreier, "The degree of impropriety (noncircularity) of complex random vectors," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Las Vegas, NV, Apr. 2008, pp. 3909–3912.
[40] H. Jaeger, "Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the 'echo state network' approach," German Nat. Res. Inst. Inform. Technol., Sankt Augustin, Germany, Rep. 159, 2002.
Yili Xia (S’09) received the B.Eng. degree in information engineering from Southeast University, Nanjing, China, in 2006, and the M.Sc. (distinction) degree in communications and signal processing from the Department of Electrical and Electronic Engineering, Imperial College London, London, U.K., in 2007. He is currently pursuing the Ph.D. degree at Imperial College London, where he is a Research Assistant. His current research interests include linear and nonlinear adaptive filters and complex-valued statistical analysis.
Beth Jelfs received the M.Eng. degree in electronic and software engineering from the University of Leicester, Leicester, U.K. She is currently working toward the Ph.D. degree in electrical and electronic engineering at Imperial College London, London, U.K. She was with Marconi Optical Components, Northamptonshire, U.K., as a Test Technician in 2000. She is involved in a British Council Research Exchange Grant with the Technische Universität München, München, Germany, and the Max-Planck Institute for Dynamics and Self-Organization, Göttingen, Germany. Her current research interests include adaptive signal processing and signal modality characterization. Ms. Jelfs was a recipient of the British Computer Society Prize for the top graduate from the University of Leicester in 2005.
José C. Príncipe (M’83–SM’90–F’00) received the Honorary Doctorate degree from the Universita Mediterranea, Reggio di Calabria, Italy, the University of Maranhao, São Luis, Brazil, and Aalto University, Helsinki, Finland. He is a Distinguished Professor of electrical and biomedical engineering at the University of Florida, Gainesville, where he teaches advanced signal processing and machine learning. He is the BellSouth Professor and Founding Director of the Computational Neuro-Engineering Laboratory, University of Florida. He has authored five books and published more than 200 book chapters and research papers in refereed journals, and has presented over 380 conference papers. He has directed 65 Ph.D. dissertations and 67 Master’s theses. His current research interests include advanced signal processing and machine learning, brain–machine interfaces, and the modeling and applications of cognitive systems. Dr. Príncipe is a Fellow of the American Institute for Medical and Biological Engineering. He is the recipient of the INNS Gabor Award, the IEEE Engineering in Medicine and Biology Society Career Achievement Award, and the IEEE Computational Intelligence Society Neural Network Pioneer Award. He is Editor-in-Chief of the IEEE Reviews in Biomedical Engineering, Past Editor-in-Chief of the IEEE Transactions on Biomedical Engineering, current Administrative Committee Member of the IEEE Computational Intelligence Society, the IEEE Biometrics Council, and the IEEE Biomedical Engineering Society, member of the Technical Committee on Machine Learning for Signal Processing of the IEEE Signal Processing Society, member of the Executive Committee of the International Neural Network Society, and past President of the International Neural Network Society. He is a former member of the Scientific Board of the Food and Drug Administration, and a member of the Advisory Board of the McKnight Brain Institute, University of Florida.
Marc M. Van Hulle (SM’00) received the M.Sc. degree in electrotechnical engineering and the Ph.D. degree in applied sciences from the Katholieke Universiteit Leuven (K. U. Leuven), Leuven, Belgium. He also received the B.Sc.Econ. and M.B.A. degrees. He received the Doctor Technices degree from Queen Margrethe II of Denmark in 2003, and an Honorary Doctoral degree from Brest State University, Brest, Belarus, in 2009. He is currently a Full Professor at the K. U. Leuven Medical School, where he heads the Computational Neuroscience Group of the Laboratorium voor Neuro- en Psychofysiologie. In 1992, he was with the Brain and Cognitive Sciences Department, Massachusetts Institute of Technology, Boston, as a Post-Doctoral Scientist. He has authored a monograph titled Faithful Representations and Topographic Maps: From Distortion- to Information-Based Self-Organization (John Wiley, 2000; also translated into Japanese), and has published 200 technical papers. His current research interests include computational neuroscience, neural networks, computer vision, data mining, and signal processing. Dr. Van Hulle is an Executive Member of the IEEE Signal Processing Society, and an Associate Editor of the IEEE Transactions on Neural Networks, Computational Intelligence and Neuroscience, and the International Journal of Neural Systems. He is a member of the program committees of several international machine learning and signal processing workshops and conferences. In 2009, he received the SWIFT prize of the King Baudouin Foundation of Belgium for his work on the Mind Speller, which received worldwide press coverage. In 2010, he received the Red Dot Design Award, one of the most prestigious design prizes in the world, for the Mind Speller.
Danilo P. Mandic (M’99–SM’03) received the Ph.D. degree in nonlinear adaptive signal processing from Imperial College London, London, U.K., in 1999. He is a Reader in signal processing at Imperial College London. He has been a Guest Professor at the Katholieke Universiteit Leuven, Leuven, Belgium, the Tokyo University of Agriculture & Technology, Tokyo, Japan, and Westminster University, London, and a Frontier Researcher at RIKEN, Saitama, Japan. He has been working in the area of nonlinear adaptive signal processing and nonlinear dynamics. His publication record includes two research monographs titled Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability (1st ed., Aug. 2001) and Complex Valued Nonlinear Adaptive Filters: Noncircularity, Widely Linear and Neural Models (1st ed., Wiley, Apr. 2009), an edited book titled Signal Processing Techniques for Knowledge Extraction and Information Fusion (Springer, 2008), and more than 200 research papers on signal and image processing. Dr. Mandic is a member of the London Mathematical Society. He is a member of the IEEE Technical Committee on Machine Learning for Signal Processing, and an Associate Editor for the IEEE Transactions on Circuits and Systems II, the IEEE Transactions on Signal Processing, the IEEE Transactions on Neural Networks, and the International Journal of Mathematical Modelling and Algorithms. He has produced award-winning papers and products resulting from his collaboration with industry.
Learning Pattern Recognition Through Quasi-Synchronization of Phase Oscillators Ekaterina Vassilieva, Guillaume Pinto, José Acacio de Barros, and Patrick Suppes
Abstract— The idea that synchronized oscillations are important in cognitive tasks is receiving significant attention. In this view, single neurons are no longer elementary computational units. Rather, coherent oscillating groups of neurons are seen as nodes of networks performing cognitive tasks. From this assumption, we develop a model of stimulus-pattern learning and recognition. The three most salient features of our model are: 1) a new definition of synchronization; 2) demonstrated robustness in the presence of noise; and 3) pattern learning. Index Terms— Kuramoto oscillators, oscillator network, pattern recognition, phase oscillators, quasi-synchronization.
I. INTRODUCTION

Oscillator synchronization is a common phenomenon. Examples are the synchronization of pacemaker cells in the heart [1], of fireflies [1], of pendulum clocks [2], and of chemical oscillations [3]. Winfree introduced and formalized the concept of biological oscillators and their synchronization [1]. Later, Kuramoto [3] developed a solvable theory for this kind of behavior. To understand how oscillators synchronize, let us consider neural networks. Let A be a neuron that fires periodically. A is our oscillator, with natural frequency given by its firing rate. Now, if another neuron B, coupled to A, fires shortly before A is expected to fire, this will cause A to fire a little earlier than if B had not fired. If many neurons are coupled to A, each will pull A’s firing closer to its own. This is the overall idea of Kuramoto’s model [3]. In it, a phase function encodes neuron firings, and the dynamics of this phase pull it toward the phases of the other neurons. It can be shown that, if the couplings are strong enough, the neurons synchronize (for a review, see [4]). A question of current interest is the role of neural oscillations in cognitive functions. In theoretical studies, synchronous oscillations emerge from weakly interacting neurons
Manuscript received July 28, 2008; revised July 16, 2010, September 16, 2010, and September 23, 2010; accepted September 28, 2010. Date of publication November 11, 2010; date of current version January 4, 2011.
E. Vassilieva is with the Laboratoire d’Informatique de l’École Polytechnique, Palaiseau Cedex 91128, France (e-mail: [email protected]).
G. Pinto is with Parrot SA, Paris 75010, France (e-mail: [email protected]).
J. A. de Barros is with the Liberal Studies Program, San Francisco State University, San Francisco, CA 94132 USA (e-mail: [email protected]).
P. Suppes is with the Center for the Study of Language and Information, Stanford University, Stanford, CA 94305 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TNN.2010.2086476
close to a bifurcation [5], [6]. Experimentally, Gray and collaborators [7] showed that groups of neurons oscillate. Neural oscillators are apparently ubiquitous in the brain, and their oscillations are macroscopically observable in electroencephalograms [5]. Experiments show not only synchronization of oscillators in the brain [8]–[18], but also their relationship to perceptual processing [9], [10], [12], [15], [19]. Oscillators may also play a role in solving the binding problem [8], and have been used to model a range of brain functions, such as pyramidal cells [20], electric field effects in epilepsy [21], cat visual cortex activities [15], birdsong learning [22], and coordinated finger tapping [23]. However, current techniques for measuring synchronized neuronal activity in the brain are not good enough to unquestionably link oscillatory behavior to the underlying processing of cognitive tasks. During the past 15 years, researchers have tried to build oscillator and pattern recognition models inspired by biological data. As a result, diverse computational models based on networks of oscillators have been proposed. Ozawa and collaborators produced a pattern recognition model capable of learning multiple multiclass classifications online [24]. Meir and Baldi [25] were among the first to apply oscillator networks to texture discrimination. Wang did extensive work on oscillator networks, in particular with locally excitatory globally inhibitory oscillator networks [26], employing oscillator synchronization to code pixel binding. Wang and Cesmeli computed texture segmentation using pairwise coupled van der Pol oscillators [27]. Chen and Wang showed that locally coupled oscillator networks could be effective in image segmentation [28]. Borisyuk and collaborators studied a model of a network of peripheral oscillators controlled by a central one [29], and applied it to problems such as object selection [30] and novelty detection [31].
In this paper, we apply networks of weakly coupled Kuramoto oscillators to pattern recognition. Our main goal is to use oscillators in a way that allows learning. To allow for a richness of synchronization patterns, and therefore prevent the systematic synchronization of all oscillators, we work with weaker couplings than what is required for robust synchronization [4]. Such couplings require us to depart from the standard definition of synchronization, leading us to redefine synchronization in a weaker sense. This paper is organized as follows. Section II motivates our definition of quasi-synchrony in pattern recognition. Section III shows how learning can occur through changes to the oscillators' natural frequencies. Section IV applies the oscillator model to image recognition. Finally, we end with some comments.
VASSILIEVA et al.: LEARNING PATTERN RECOGNITION THROUGH QUASI-SYNCHRONIZATION OF PHASE OSCILLATORS
II. PATTERN RECOGNITION WITH WEAKLY COUPLED OSCILLATORS
\[
\frac{1}{2\pi}\,\frac{d\phi_n(t)}{dt} = f_n + \sum_{m=1}^{N} A_n A_m\, k_{nm} \sin\left[\phi_m(t) - \phi_n(t)\right]. \quad (1)
\]
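Equation (1) lends itself to direct numerical integration. The following is a minimal sketch (our own illustration, not the authors' code) that Euler-integrates (1) with NumPy; the function and argument names are ours.

```python
import numpy as np

def simulate_kuramoto(f, A, k, phi0, T=0.5, dt=1e-4):
    """Euler-integrate the amplitude-weighted Kuramoto equation (1).

    f    -- natural frequencies in Hz, shape (N,)
    A    -- constant amplitudes, shape (N,)
    k    -- symmetric nonnegative couplings, shape (N, N), zero diagonal
    phi0 -- initial phases in radians, shape (N,)
    Returns the phase trajectory, shape (steps, N).
    """
    steps = int(T / dt)
    phi = np.array(phi0, dtype=float)
    traj = np.empty((steps, len(phi)))
    for t in range(steps):
        # Pairwise coupling term: sum_m A_n A_m k_nm sin(phi_m - phi_n).
        diff = np.sin(phi[None, :] - phi[:, None])  # diff[n, m] = sin(phi_m - phi_n)
        coupling = A * np.sum(k * A[None, :] * diff, axis=1)
        # (1/2pi) dphi/dt = f + coupling  =>  dphi/dt = 2*pi*(f + coupling)
        phi = phi + 2 * np.pi * (f + coupling) * dt
        traj[t] = phi
    return traj
```

The instantaneous frequency discussed in the text can then be recovered as the finite-difference derivative of the returned phases.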
We define a stimulus s as a set of ordered triples

\[
s = \left\{\left(A_n^s,\, f_n^s,\, \phi_n^s(0)\right)\right\}_{n \in G_S} \quad (2)
\]
with each triple representing the amplitude, natural frequency, and initial phase of an oscillator. Intuitively, s is meant to be a model of the brain's sensory representation of an external stimulus. When a stimulus is presented, the phases of the stimulus oscillators, as well as their natural frequencies and amplitudes, match the values in s. In other words, for all oscillators On ∈ G_S, when s is presented, f_n = f_n^s, A_n = A_n^s, φ_n(0) = φ_n^s(0). A typical phenomenon in a network of Kuramoto oscillators is the emergence of synchronization. Two oscillators are considered synchronized if they oscillate with the same frequency and are phase-locked [33], [34]. Let us consider a six-oscillator example, with two stimulus oscillators, O1 and O2, and four recognition oscillators, O3, O4, O5, and O6, having couplings k_mn = 1, except k_12 = k_21 = 0. We set

\[
f_3 = 10\ \text{Hz}, \qquad f_4 = 15\ \text{Hz} \quad (3)
\]
\[
f_5 = 20\ \text{Hz}, \qquad f_6 = 25\ \text{Hz} \quad (4)
\]
as the natural frequencies of the recognition oscillators. Since (1) implies varying frequencies, we define the instantaneous frequency of the ith oscillator as the temporal rate of change of its phase, i.e., ω_i = dφ_i/dt. At this point we must make our notation explicit. Both f_i and ω_i are frequencies, but f_i enters (1) as the natural frequency of an oscillator and is measured in hertz, whereas ω_i is defined as the time derivative of φ_i and is measured in radians per second. We emphasize that these two frequencies are not only measured differently, but are also conceptually distinct. Usually there is no need to make such a distinction, but we will need it later on when we discuss
Fig. 1. Six-oscillator network response to stimulus f 1s = 40 Hz, f 2s = 45 Hz, As1 = 1, As2 = 1, φ1s (0) = 0, and φ2s (0) = 0. Oscillators do not synchronize. O1 and O2 are the dashed and solid gray lines, and O3 , O4 , O5 , and O6 are the dash-dot, dotted, dashed, and solid black lines.
We start with a set of N weakly coupled oscillators O1, ..., ON, and split this set into two: stimulus and recognition [32]. Formally, G = {O1, O2, O3, ..., ON} is the network of oscillators, and G_S and G_R are the stimulus and recognition subnetworks of G, such that G = G_S ⊔ G_R. For our purposes, the stimulus subnetwork represents neural excitations due to an external sensory signal, and the synchronization pattern in the recognition subnetwork represents the brain's representation of the recognized stimulus. We assume that synchronizations of oscillators represent information processed in the brain. Each oscillator On in the network is characterized by its natural frequency f_n. The couplings between oscillators are given by a set of nonnegative coupling constants, {k_nm}_{m≠n}. For simplicity, we assume symmetry, i.e., k_nm = k_mn for all n and m. Let us assume that we can represent On by a measurable quantity x_n(t). If we write x_n(t) as x_n(t) = A_n(t) cos φ_n(t), then φ_n(t) is the phase and A_n(t) the amplitude. Assuming constant amplitudes, we focus on phases satisfying Kuramoto's equation [3]
Fig. 2. Six-oscillator network response to stimulus f 1s = 14 Hz, f 2s = 21 Hz, As1 = 4, As2 = 4, φ1s (0) = 0, and φ2s (0) = 0. Oscillators synchronize completely after approximately 150 ms. O1 and O2 are the dashed and solid gray lines, and O3 , O4 , O5 , and O6 are the dash-dot, dotted, dashed, and solid black lines.
learning. Figs. 1–3 show the instantaneous frequencies of the oscillators for three different stimuli (the natural frequencies are shown as straight lines, for reference). We can readily quantify the synchronization (or lack thereof) in Figs. 1 and 2. In Fig. 3, the situation is different. There, two groups seem to emerge, with frequencies varying periodically within each group. The standard definition states that two oscillators On and Om are synchronized if their frequencies are asymptotically the same, i.e., if

\[
\lim_{t \to +\infty} \left[\frac{d\phi_n}{dt}(t) - \frac{d\phi_m}{dt}(t)\right] = 0. \quad (5)
\]

This definition works nicely for Figs. 1 and 2, but fails for Fig. 3. If we want to say that the oscillators in Fig. 3 are synchronized, we need to propose a different definition. For instance, the periodic variations in the differences between the frequencies of O3 and O4 result from various perturbations
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
\[
\operatorname{Var}\left[\sin(\phi_n - \phi_m)\right] < \epsilon, \qquad 0 < \epsilon \ll 0.5. \quad (6)
\]
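Condition (6) is straightforward to check numerically over a window of sampled phases. The sketch below is our own illustration of the test; the window length and random seed are arbitrary.

```python
import numpy as np

def quasi_synchronized(phi_n, phi_m, eps=0.35):
    """Definition 1: oscillators are eps-quasi-synchronized when the
    variance of the sine of their dephasing stays below eps."""
    return np.var(np.sin(phi_n - phi_m)) < eps
```

As the text notes, independent uniformly distributed phases give a variance near 0.5 (not quasi-synchronized), whereas a constant dephasing gives a variance of zero.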
The smaller the value of ε, the closer quasi-synchronization comes to being equivalent to (5). But it is possible for oscillators to be quasi-synchronized without satisfying (5). In fact, the example in Section IV only works if we consider quasi-synchronization. Even though ideally ε should be as close to zero as possible, throughout this paper we use ε = 0.35. This value was chosen because, in our simulations, it allows for quicker detection of synchronization (due to its high value) without loss of
Fig. 3. Six-oscillator network response to stimulus f 1s = 12.5 Hz, f 2s = 22.5 Hz, As1 = 2, As2 = 2, φ1s (0) = 0, and φ2s (0) = 0. O1 and O2 are the dashed and solid gray lines, and O3, O4, O5, and O6 are the dash-dot, dotted, dashed, and solid black lines. The varying instantaneous frequencies suggest that oscillators O1, O3, and O4 oscillate coherently as one group, and O2, O5, and O6 as another.
induced by the other oscillators to which they are connected but not synchronized. To address this point, Kazanovich and Borisyuk [30] proposed that two oscillators are synchronized if their dephasing, i.e., the difference between their phases, is bounded. This definition is not adequate for our purposes, since over a finite time all continuous functions are bounded, and we could not distinguish nonsynchronized oscillators with close natural frequencies from synchronized oscillators undergoing substantial perturbations (see Fig. 3). Therefore, we need a more flexible definition of synchronization. Let Φ and Ψ be two continuous random variables independently and uniformly distributed on the interval [0, 2π]. Then sin Φ and sin Ψ have zero expectation, and Var(sin(Φ − Ψ)) = 0.5. However, if Φ and Ψ are perfectly correlated, then Var(sin(Φ − Ψ)) = 0. For the example shown in Fig. 3, we illustrate in Fig. 4 the variance of the sine of the phase differences (dephasing). We see that for O3 and O4 the sine of the dephasing is constrained to a small interval, causing its variance to be small. On the other hand, the sine of the dephasing between O3 and O5, which are intuitively not synchronized, looks like a sine function, and its variance is approximately 0.5. So, we adopt the following.
Definition 1: Oscillators On and Om are ε-quasi-synchronized (or quasi-synchronized) if their phases, represented by φn and φm, satisfy
Fig. 4. Sines of the phase differences and their variances for the oscillator pairs O3, O4 (top) and O3, O5 (bottom). The dashed lines give the sines, and the solid lines show recursive numerical estimations of the variance.
discrimination between patterns. In other words, for our finite-time simulations, very small values of ε would take too long to converge, whereas values closer to 0.5 would not discriminate unsynchronized oscillators. To further investigate the differences between quasi- and standard synchronization, it is useful to see how our example behaves when we vary the stimulus. First, let us recall that, in the mean-field approximation, with the assumption of equal weights and all-to-all couplings, the oscillators synchronize when the mean coupling exceeds a critical value K_c = 2/(πg(ω_0)), where g(ω) is the density distribution of oscillator frequencies and ω_0 its mean (g is assumed symmetric) [3]. Our example violates those assumptions, mainly the all-to-all equal coupling, the symmetric distribution of frequencies, and the large number of oscillators. But, as Fig. 2 shows, if we pick the frequencies of O1 and O2 close to those of O3, O4, O5, and O6 in a fairly symmetric way, all oscillators synchronize. On the other hand, if O1 and O2 are far from symmetry, the oscillators do not synchronize (Fig. 1). More interestingly, there are regimes of quasi-synchronization for other frequency distributions. To make this explicit, let us look at Fig. 5, which shows the synchronization patterns for stimulus frequencies varying
Fig. 5. Synchronization regions emerging from a six-oscillator network response to varying frequencies of stimulus oscillators O1 and O2 . O3 , O4 , O5 , and O6 have couplings kmn = 1, except k12 = k21 = 0, and have frequencies given by (3) and (4). Each numbered pattern corresponds to the quasi-synchronization of the following oscillators. (1) O1 and O2 . (2) O1 , O2 , and O3 . (3) O2 , O3 , and O4 . (4) O3 , and O4 . (5) O1 with O2 and O3 with O4 (but not O1 with O3 , and so on). (6) O2 and O3 . The elliptical area around 17.5 Hz corresponds to all oscillators synchronized (as in Fig. 2), whereas the “No synch” areas correspond to no synchronization of oscillators (as in Fig. 1).
from 5 to 45 Hz. The results of Fig. 5 are fairly general, as long as we do not vary the couplings too much: very strong couplings would yield systematic synchronization, whereas very weak couplings would yield no synchronization. Including noise would only make the boundaries between the different areas less smooth. We see that, given our couplings, synchronization happens when both stimulus oscillators are around 17.5 Hz, which is the mean frequency of O3, O4, O5, and O6. In this case, all oscillators in the network synchronize, and this is the only pattern that emerges from standard synchronization. As we diverge from the original distribution given by O3, O4, O5, and O6, synchronization starts to disappear. On the other hand, if we use the criterion of quasi-synchronization, a total of eight possible patterns emerge: patterns (1)–(6), plus all oscillators synchronized, plus no oscillators synchronized. We should compare this to the binary sync/no-sync possibilities of the stricter sense of synchronization. It is often argued that the brain may use neural synchronization because it allows firing rates to reach above a certain response threshold. One possible criticism of Definition 1 is the lack of such a feature. Though this may be true in a strict sense, if we look at the simulations shown in Fig. 3, the phases of oscillators lag shortly behind each other. Thus, if we think of the oscillators as not being in the same place, time-lag effects may yield similar results. Let us now see how we can use synchronization for pattern recognition. A specific stimulus may give rise to a specific synchronization pattern of the recognition oscillators. We will consider this pattern as the recognition of the stimulus by the network. In order to compute this recognition, we set the following.
1) Parameters: The stimuli {(A_n^s, f_n^s, φ_n^s(0))}_{n∈G_S}, the recognition oscillators' natural frequencies {f_n}_{n∈G_R}, and the coupling constants {k_nm}_{n,m}.
Fig. 6. Six-oscillator network under noisy stimuli. The simulation parameters are the same as in Fig. 3, except that the frequencies of the stimulus oscillators are noisy, with f 1 and f 2 replaced by f 1 + ρ1 (t) and f 2 + ρ2 (t), and with ρi, i = 1, 2, being Gaussian white-noise processes with mean zero and variance 10.
2) Initial conditions: φ_n(0) = φ_n^s(0) if On ∈ G_S, and randomly distributed in [0, 2π] otherwise.
3) Dynamics: For 0 < t ≤ T, T constant, the phases follow Kuramoto's equation (1).
First, we address the model's robustness to noise. We start by assuming that the natural frequencies of excitation of the sensory oscillators have a stochastic component. In other words, the natural frequency depends on time as f_n(t) = f_n + ρ_n(t), where f_n is the original noise-free natural frequency of the stimulus oscillator (On ∈ G_S) and ρ_n(t) is a zero-mean Gaussian white noise. A simulation for the network of six oscillators under noisy stimuli is graphed in Fig. 6. Figs. 6 and 7 show that synchronization is not much affected by the noise, so the recognition of a stimulus seems to be robust under Gaussian noise. Starting from this observation, we conducted pattern recognition tests in noisy environments. Given a randomly generated set of M stimuli s_1, s_2, ..., s_M, each composed of N_S natural frequencies, initial phases, and amplitudes, we use a set of noisy versions of these stimuli, {s_1^(i), s_2^(i), ..., s_M^(i)}_{i=1,...,Q}, where Q is the number of samples, to obtain the recognition rate. Here we define the recognition rate as the proportion of successfully recognized stimuli over all trials. The noisy stimuli are produced in the following way. For stimulus s_j^(i), i = 1, ..., Q, j = 1, ..., M, there are N_S natural frequencies f_{j,r}^(i), r = 1, ..., N_S. The time-varying stochastic natural frequency of oscillator O_r in stimulus s_j and version i is given by f_{j,r}^(i)(t) = f_{j,r}^(i) + ρ_{j,r}^(i)(t), where the f_{j,r}^(i) are normally distributed around the frequency f_{j,r} of O_r in s_j (modeling the slight differences between different occurrences of the same stimulus), and ρ_{j,r}^(i)(t) is a zero-mean Gaussian white noise modeling the synaptic noise. To evaluate the ability of a network of oscillators to correctly recognize noisy stimuli, we ran simulations according to the following three procedures.
1) Fix a time T, the number N_R of recognition oscillators, their natural frequencies f_1, f_2, ..., f_{N_R}, and the set of symmetric coupling constants k_nm.
Fig. 8. Recognition rates as a function of the number of recognition oscillators. Rates were averaged for different random values of frequencies, couplings, and noise, for a total of 125 trials.
Fig. 7. Variance of the sine of dephasing (solid line) between O3 and O4 (top) and between O3 and O5 (bottom) under the noisy stimulus shown in Fig. 6. The dashed line shows the phase difference between oscillators. According to Definition 1, O3 and O4 synchronize, whereas O3 and O5 do not.
2) Compute a set of initial patterns P_1, P_2, ..., P_M associated to the clean stimuli s_1, s_2, ..., s_M, as described in Section III. Each P_i can be thought of as a binary matrix whose (a, b) entry is one if, at time T, oscillators O_a and O_b are ε-quasi-synchronized, and zero if they are not.
3) Compute the patterns associated to the noisy stimuli {s_1^(i), s_2^(i), ..., s_M^(i)}_{i=1,...,Q}. For each of these patterns, find which of P_1, P_2, ..., P_M is closest to it using a Hamming measure. If the closest initial pattern corresponds to the same stimulus without noise, recognition is successful; otherwise it is not. The percentage of successful recognitions is the recognition rate.
It is instructive to relate the above steps to classical conditioning theory. We think of the stimuli as the unconditioned stimuli, and the set of initial patterns P_1, ..., P_M as the unconditioned responses associated to the stimuli. Later on, we will see how we can include in the model conditioned stimuli that become associated to conditioned responses.
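Steps 2) and 3) can be sketched as follows: building the binary synchronization matrix from a phase trajectory and classifying a pattern by Hamming distance to the stored ones. This is our own illustrative reconstruction, not the authors' code.

```python
import numpy as np

def sync_pattern(traj, eps=0.35):
    """Binary matrix whose (a, b) entry is 1 when oscillators a and b
    are eps-quasi-synchronized over the phase trajectory (steps, N)."""
    N = traj.shape[1]
    P = np.zeros((N, N), dtype=int)
    for a in range(N):
        for b in range(a + 1, N):
            if np.var(np.sin(traj[:, a] - traj[:, b])) < eps:
                P[a, b] = P[b, a] = 1
    return P

def recognize(pattern, stored):
    """Index of the stored pattern closest in Hamming distance."""
    dists = [np.sum(pattern != P) for P in stored]
    return int(np.argmin(dists))
```

The Hamming measure here is simply the number of differing matrix entries, which suffices for nearest-pattern classification.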
For our simulation, we chose T = 500 ms and M = 5. The natural frequencies of each stimulus were independently and uniformly drawn between 5 and 45 Hz, a range corresponding to observed frequencies in the brain's cognitive activity [32], [35]–[38]. For simplicity, initial stimulus phases were set to zero and amplitudes to 1. We considered sets of stimuli composed of 2, 5, and 10 frequencies. The noise ρ_{j,r}^(i)(t) had a standard deviation of 10, making it of the same order of magnitude as the natural frequencies. The coupling constants between recognition oscillators and between stimulus oscillators were uniformly distributed on the interval [0, 0.002]; the couplings between stimulus and recognition oscillators should be stronger, and so were uniformly drawn from the interval [0, 2]. In Fig. 8, we show the recognition rates for different numbers of recognition oscillators and stimulus frequencies. We see that increasing the computational capacity of the recognition network, i.e., increasing the number of recognition oscillators, leads to an improvement in the recognition rate. The figure also shows that more complex stimuli, with a larger number of natural frequencies, yield better recognition results; in our simulations, this effect is stronger than that of the number of recognition oscillators. In Fig. 9, we use 30 recognition oscillators and compute the average recognition rates when T, the oscillator computation time, varies between 0.1 s and 1.0 s. Longer computation times allow oscillators to synchronize better in a noise-robust manner. Furthermore, the larger the number of stimulus oscillators, the shorter the time needed for good recognition rates. Note that most of the gain from having a longer time to synchronize occurs in the first 500 ms. In Fig. 10, we set T = 0.5 s and studied the recognition rates as a function of the mean value of the coupling constants between stimulus and recognition oscillators. This figure indicates the existence of an optimal value for these constants, nearly independent of the number of natural frequencies of the stimulus oscillators.
Below this optimal value, the network is not sensitive enough to external stimulation; above it, any excitation leads to the synchronization of all oscillators and no discrimination. This leads us to view the coupling constants as sensitivity parameters rather than as carriers of learning, contrary to the standard view of Hebb's rule for neural networks. In the next section, we discuss other
Fig. 9. Recognition rates as a function of the computation time (averaged over 125 trials). The parameters are the same as before.
Fig. 11. Recognition rates as a function of the number of stimuli in a network of 30 recognition oscillators (averaged over 125 trials). We use the same parameters as above.
Reinforcement changes the recognition oscillators' natural frequencies. To model this change, during reinforcement we postulate the following dynamics for the natural frequencies, in addition to (1):
\[
\frac{df_n(t)}{dt} = \sum_{O_m \in G_R} \mu_{nm} \left[\frac{1}{2\pi}\,\frac{d\phi_m(t)}{dt} - f_n(t)\right]. \quad (7)
\]
Fig. 10. Recognition rates as a function of the mean value of the coupling constants’ strength for 30 recognition oscillators (averaged over 125 trials).
mechanisms for learning that do not involve changes to coupling strengths. Fig. 11 shows the variation of the recognition rates as a function of the size of the set of stimuli. While an increase in the number of stimuli lowers the recognition rate, it is remarkable that 30 recognition oscillators correctly recognize 10-frequency stimuli at rates greater than 60% for a 40-stimulus set. To put this in perspective, recall that the rate of recognition by chance would be only 2.5%.

III. LEARNING PATTERN RECOGNITION

While recognition of stimuli is in itself important, one of our main interests in this paper is to have a network that learns by reinforcement to associate a stimulus to a pattern [32]. In this section we introduce such learning. Since, as discussed earlier, frequencies seem more important than couplings, in our model memory is encoded in the recognition oscillators' natural frequencies.
Equation (7) drives the natural frequency f_n toward the instantaneous frequency given by the dynamics. We emphasize that, because f_n(t) is the natural frequency of the nth oscillator, and not the time derivative of its phase, (7) is not a second-order differential equation. The coefficients {µ_nm}_{nm} (µ_nm = µ_mn) are learning parameters, chosen such that the system evolves toward the desired pattern. If O_n and O_m are to synchronize, we choose µ_nm > 0; if they are not to synchronize, we set µ_nm < 0; when it is immaterial whether they synchronize, we have µ_nm = 0. To make it explicit whether a learning parameter is bringing frequencies together (µ > 0) or pushing them apart (µ < 0), we call them µ+ and µ−, respectively. The procedure for learning may be summarized as follows.
1) Parameters: The stimulus {(A_n^s, f_n^s, φ_n^s(0))}_{n∈G_S}, the recognition oscillators' natural frequencies {f_n}_{n∈G_R}, the coupling constants {k_nm}_{n,m}, and the learning parameters {µ_nm}_{n,m}.
2) Initial conditions: For O_n ∈ G_S we set φ_n(0) = φ_n^s(0) and f_n(0) = f_n^s; otherwise, φ_n(0) is randomly distributed in [0, 2π].
3) Dynamics: For 0 < t ≤ T,

\[
\frac{1}{2\pi}\,\frac{d\phi_n}{dt} = f_n + \rho_n(t) + \sum_{m=1}^{N} A_n A_m\, k_{nm} \sin\left[\phi_m(t) - \phi_n(t)\right]. \quad (8)
\]
Fig. 12. Learning in a three-oscillator network. Oscillators initially unsynchronized become synchronized when the learning parameter is positive (µ23 = 0.3), due to changes in their frequencies. The instantaneous frequencies of O1, O2, and O3 are shown as gray solid, black dotted, and black solid lines. The straight lines depict the oscillators' natural frequencies. The instantaneous frequencies are the lines that oscillate, while the natural frequencies slowly converge to their final values.
Fig. 14. Instantaneous frequency for a six-oscillator network during learning. Parameters are given by (3) and (4), and the stimulus is the same as in Fig. 1, except for f 1s = 8 Hz and f 2s = 12 Hz. Oscillators O1, ..., O6 are represented as in Fig. 1.
Fig. 13. Oscillators’ frequencies after learning in a three-oscillator network. The same representation as Fig. 12 is used for the oscillators’ instantaneous and natural frequencies.
For all O_n ∈ G_R,

\[
\frac{df_n(t)}{dt} = \sum_{O_m \in G_R} \mu_{nm} \left[\frac{1}{2\pi}\,\frac{d\phi_m(t)}{dt} - f_n(t)\right]. \quad (9)
\]
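The coupled dynamics (8) and (9) can be integrated jointly with a simple Euler scheme. The sketch below is our own illustration (the noise term ρ_n is omitted and all names are ours): phases evolve by (8) while the natural frequencies of the recognition oscillators drift toward the instantaneous frequencies of their partners per (9).

```python
import numpy as np

def learn(f, A, k, mu, stim, phi0, T=1.0, dt=1e-4):
    """Co-evolve phases via (8) (noise term omitted here) and the
    natural frequencies of recognition oscillators via (9).

    stim -- indices of stimulus oscillators (their f stays fixed)
    mu   -- symmetric learning parameters, shape (N, N)
    Returns the final natural frequencies.
    """
    f = np.array(f, dtype=float)
    phi = np.array(phi0, dtype=float)
    rec = [n for n in range(len(f)) if n not in stim]
    for _ in range(int(T / dt)):
        diff = np.sin(phi[None, :] - phi[:, None])
        # Instantaneous angular frequency from (8): omega_n = dphi_n/dt.
        omega = 2 * np.pi * (f + A * np.sum(k * A[None, :] * diff, axis=1))
        # (9): df_n/dt = sum_{m in G_R} mu_nm ((1/2pi) dphi_m/dt - f_n)
        for n in rec:
            f[n] += dt * sum(mu[n, m] * (omega[m] / (2 * np.pi) - f[n]) for m in rec)
        phi += omega * dt
    return f
```

Running this on the three-oscillator configuration of the text (stimulus at 25 Hz, recognition oscillators at 10 and 35 Hz, µ23 = µ32 = 0.3) draws the two recognition frequencies toward each other, as in Fig. 12.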
Let us consider a fully connected three-node network, where O1 is a stimulus oscillator and O2 and O3 are recognition oscillators. In this network, only two patterns can occur: either O2 and O3 are synchronized or they are not. We choose as initial values f_2 = 10 Hz and f_3 = 35 Hz. If a stimulus with frequency f_1^s = 25 Hz occurs, no synchronization emerges. Let us now assume that we would like O2 and O3 to learn to synchronize under this stimulus. In Fig. 12, the recognition oscillators' frequencies evolve toward the stimulus's, eventually synchronizing. If we now use the newly learned frequencies, shown in Fig. 13, the stimulus results in the synchronization of
Fig. 15. Instantaneous frequencies for the six-oscillator network of Fig. 14 after learning.
O2 and O3. Finally, if we want the network to unsynchronize and forget, we can simply use a negative µ23. Let us now consider the more complicated case of the six-oscillator network studied earlier. Figs. 14–17 show the response pattern to a stimulus with frequencies f_1^s = 8 Hz and f_2^s = 12 Hz before and during learning. According to our criterion of ε-quasi-synchronization, O3 is synchronized with O4 but not with O5. For the network to learn to synchronize O3, O4, and O5, we set µ34 = µ35 = µ45 = 0.3 and µ_n6 = 0, n ∈ {3, 4, 5} (with µ_nm = µ_mn). We now go back to the problem of stimulus recognition, and we show how adapting the natural frequencies of the recognition oscillators during reinforcement improves the recognition rates. We adopt a setup similar to the one used in Section III, with the main differences that only noisy versions of the stimuli are used and that some are treated as reinforcement, based on the model described in [32, Ch. 8]. During reinforcement, the natural frequencies of the recognition oscillators evolve according to (8) and (9), and the patterns used for stimulus
Fig. 16. Variance of the sine of dephasing (solid line) and dephasing (dashed line) for the six-oscillator network in Fig. 14.
recognition are updated. More precisely, once we set the number L of noisy versions used for learning, we go through the following steps.
1) Fix the oscillator simulation time T, the number of recognition oscillators N_R, their natural frequencies f_1, f_2, ..., f_{N_R}, and the connection strengths k_nm.
2) Compute the initial synchronization patterns P_1^1, P_2^1, ..., P_M^1 associated to the noisy stimuli s_1^(1), s_2^(1), ..., s_M^(1), as described in Section III.
3) For i = 2, ..., L and l = 1, ..., M, compute the synchronization pattern P_l^(i) associated to s_l^(i). Then use the learning parameters µ_nm = µ+ if recognition oscillators m and n are synchronized for P_l^(i−1) but not for P_l^(i), µ_nm = µ− if they are not synchronized, and µ_nm = 0 otherwise (µ+ and µ− corresponding to positive and negative values). Evolve, according to (8) and (9), the natural frequencies of the recognition oscillators. The pattern at the end of this reinforcement, P_l^i, is the updated recognition of stimulus s_l.
4) For i = L + 1, ..., Q, compute the patterns associated to s_1^(i), s_2^(i), ..., s_M^(i), and find which of the updated representation patterns P_1^L, P_2^L, ..., P_M^L is closest to each according to a Hamming measure. If the closest pattern corresponds to the correct stimulus, then the recognition is correct.
We applied steps 1–4 to various sequences of learning parameters µ_nm. For simplicity, we considered only cases where µ_mn = µ+ when m and n were to synchronize, and µ_mn = µ− otherwise. We also fixed T = 500 ms, the number of stimulus oscillators to 5, and the number of recognition oscillators to 30. We defined the recognition rate as the percentage of noisy stimuli correctly matched to the original one. The noise was the same as before. The first relevant general result we obtained was that keeping the same value of the µ parameters at each trial does not improve the recognition rates.
Indeed, either the parameters are small enough that (7) is negligible and learning does not occur, or they are large enough that (7) is not negligible, in which case the frequencies of the recognition oscillators evolve
Fig. 17. Variance of the sine of dephasing (solid line) and dephasing (dashed line) for the six-oscillator network in Fig. 15.
Fig. 18. Recognition rate as a function of the number of reinforcement trials for 5 (black line) and 30 (gray line) distinct stimuli.
back and forth, without any convergence. In this case, the patterns for the stimuli at a given step do not match the patterns obtained at the next step, and the recognition rate drops to random-guess rates. However, using a decreasing sequence of learning parameters, the patterns converge, followed by a noticeable improvement in the recognition rates. We also noticed that the magnitude of µ− must be significantly smaller than µ+ for learning to happen. Fig. 18 shows (solid black line) an example with the parameters set as µ+^(1) = 7.3, µ+^(i+1) = µ+^(i)/(i + 1), and µ−^(i) = −µ+^(i)/2, where i is the trial number. One interesting characteristic of the learning shown is the initial period when the oscillators' frequencies adapt very fast. This fast adaptation leads to an initial mismatch between representations and new patterns, and to a dip in the recognition rate, before the rates finally improve by 15%. We emphasize that, because of the dynamics of the model, this dip in recognition rate will necessarily occur. Furthermore, starting with values of µ equal to those used after the fifth reinforcement, when better rates appear, has no effect, since these coefficients are too small. Fig. 18 also shows a similar computation for a
larger set of 30 stimuli (solid gray line). Values of µ smaller by a factor of two led to better learning, implying that learning more stimuli takes more reinforcement trials. Recognition rates after 10 reinforcements improved by approximately 12 percentage points, which constitutes a 37% improvement over the initial 33% before learning. One interesting aspect of our model is the mean rate of learning shown in Fig. 18. Although standard stimulus-response theories do not exhibit the dip in recognition observed in Fig. 18, this type of behavior resembles interference in psychology. The basic idea is that past learning can interfere with learning a new related concept or behavior that has serious overlap with the old. Suppes [39] studied a case in which children of about five years of age can learn rather easily when two finite sets are identical, but this learning interferes with learning the concept of two sets being equivalent. A neural network that models this result is given in [36], and the rates presented therein are quite similar to those in Fig. 18. As a last topic in this section, we focus on storage capacity. In Fig. 5, we showed that, by adopting the concept of quasi-synchronization, we were able to store eight different patterns in the network, as opposed to just two with the standard definition of synchronization. Additionally, our 5 + 30 oscillator network above was able to recognize 30 different noisy stimuli at a rate of almost 50% (see Fig. 18). So, our simulations suggest that our model with N oscillators has a storage capacity proportional to N. This should be contrasted with the storage capacity of Hopfield networks, a well-known model of artificial neural networks. In his famous 1982 paper, John Hopfield showed that a simple integrate-and-fire neural network could be used as an associative memory [40].
Hopfield determined that the storage capacity of his network, measured in terms of the number of different patterns that could be recovered, was 0.15N, where N is the number of neurons. Later, McEliece and collaborators [41] showed in more detail that the limiting storage of a Hopfield network is N/(2 log N), which corresponds to approximately five patterns for a network of 35 neurons. We thus see that, by using oscillators in the way proposed in this paper, we achieve a storage capacity that seems significantly larger than that of Hopfield networks.

IV. APPLICATION TO IMAGE RECOGNITION

We now show an example of how we can use quasi-synchronized networks of oscillators to recognize images. In this example, we investigate two cases of interest: 1) recognition of images degraded by Gaussian noise, and 2) recognition of incomplete images. The patterns used in our image recognition example are shown in Fig. 19. We start with the one-step learning performance of a 30-oscillator network. In our analysis, we use the following procedure.
1) Center the pictures and represent each as a single sequence of 0s and 1s corresponding to its 8560 pixels.
2) Compute the discrete-time Fourier transform of the sequence found above.
3) Select the 10 largest Fourier coefficients for each picture.
4) Define the noise-free stimulus for the digitized picture of 0 as s_zero = (A_n^zero, f_n^zero, φ_n^zero(0)), 1 ≤ n ≤ 10, where A_n
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
Fig. 19. Digitized version of the numbers 0, 1, and 2, with resolution 80×107. Each number is represented by 8560 black-and-white pixels.
[Fig. 20 plot: recognition rate (%) versus SNR (dB).]
Fig. 20. Correct recognition rates of the characters 0, 1, and 2 by a network of 30 quasi-synchronized oscillators as a function of the signal-to-noise ratio (SNR) under Gaussian noise. The figure shows examples of the patterns at different noise levels. We can see that at −10 dB the pattern is barely visible, yet the network correctly recognizes it more than 70% of the time.
and φ_n are the amplitude and phase of the coefficients obtained in step 3.
5) Repeat step 4 for stimuli 1 and 2.
The stimuli s_zero, s_one, and s_two obtained from steps 1–5 are noise-free. To obtain noisy versions of the stimuli, we inject noise into the pictures and repeat steps 1–5. For 1), we simulate the influence of Gaussian noise on all pixels. The recognition rate for the noisy versions as a function of the SNR is depicted in Fig. 20. To compute these rates, we drew various networks at random (response oscillators' natural frequencies and coupling constants), and to each network we presented the noise-free versions of stimuli 0, 1, and 2. Then we presented five noisy versions of the same stimuli. Whenever the synchronization pattern occurring with the noisy version of one of the three stimuli was closer (in Hamming distance) to the non-noisy version of the same stimulus than to the other two, we declared a successful recognition. The percentage of successes with respect to the number of trials is the recognition rate. We then averaged the rates over all the networks drawn at random. For 2), we studied the influence of a "hole" in the picture. White squares of various sizes were superposed on the pictures at random positions (one white square per noisy picture). The same process as in 1) was applied to obtain the recognition rates. Fig. 21 shows the decrease of the recognition rates as a function of the white square size for two situations: without any Gaussian noise, and with an SNR of −6 dB. We see that holes are more harmful to recognition rates than Gaussian noise.
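The stimulus-extraction steps 1–4 and the Hamming-distance decision rule can be sketched as follows. The function names, the keep-only-half-the-spectrum convention, and the normalized frequency bins are our illustrative assumptions, not the authors' code.

```python
import numpy as np

def stimulus_from_image(pixels, n_coeffs=10):
    """Steps 1-4: DFT of the flattened 0/1 pixel sequence, keeping the
    n_coeffs largest-magnitude coefficients as (amplitude, frequency,
    phase) triples.  Only the first half of the spectrum is examined,
    since the input is real-valued (an implementation choice on our part)."""
    spectrum = np.fft.fft(np.asarray(pixels, dtype=float))
    half = spectrum[: len(pixels) // 2]
    idx = np.argsort(np.abs(half))[-n_coeffs:]   # indices of the largest coefficients
    return np.abs(half[idx]), idx / len(pixels), np.angle(half[idx])

def recognize(pattern, stored_patterns):
    """Declare the stored synchronization pattern with the smallest Hamming
    distance to the observed pattern as the recognized stimulus."""
    dists = [int(np.sum(np.asarray(pattern) != np.asarray(s)))
             for s in stored_patterns]
    return int(np.argmin(dists))
```

A recognition trial then amounts to extracting the stimulus of a (possibly noisy) picture, letting the network settle, and calling `recognize` on the resulting binary synchronization pattern.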
VASSILIEVA et al.: LEARNING PATTERN RECOGNITION THROUGH QUASI-SYNCHRONIZATION OF PHASE OSCILLATORS
[Fig. 21 plot: recognition rate (%) versus hole edge (pixels).] Fig. 21. Correct recognition rates of the characters 0, 1, and 2 by a network of 30 quasi-synchronized oscillators as a function of the size of a blank square hole randomly positioned on the picture. The dark line shows the rates for the noise-free pictures, whereas the gray line shows them for pictures with an SNR of −6 dB. The figure shows three examples of the picture "0" for different values of the hole size and noise.

[Fig. 22 plot: recognition rate (%) versus hole edge (pixels).] Fig. 22. Recognition rates for different holes, before (gray line) and after learning (black line). The largest rate improvement was for the case when the rate jumped from 39% (almost at chance level) to 53% after learning (well beyond chance level).

When the hole is large (40 × 40 pixels), the impact of Gaussian noise seems to become negligible. Secondly, we study the influence of learning with the reinforcement process described in the previous section. We test various hole sizes with Gaussian noise at an SNR of −6 dB. To compute the recognition rates, we draw networks at random and start by computing the synchronization patterns for a noisy version of the three stimuli. Ten further noisy versions of the three stimuli are then used for learning reinforcement, and finally five other noisy versions are used to test the recognition capability of the networks that went through the learning procedure. The recognition rates are compared to mean one-step learning rates initialized with all the noisy versions used for learning. As shown in Fig. 22, learning yields an improvement of up to 14% in the recognition rates. Compared to the rate of 39% for holes of edge size 40, this increase in the recognition rates represents a 36% improvement in performance. We also emphasize that learning is more efficient when the recognition rates are initially lower (without learning).

V. FINAL REMARKS
The three main features of this paper have all been described, but we summarize them here to bring out what is most significant. First, we used a stochastic and approximate definition of synchronization suitable for the noisy environment found in many biological applications, where noise is endemic and cannot easily be removed. Second, our model simulations demonstrated robustness in the presence of noise, which is again a necessary feature for most biological applications. Third, and finally, we showed how a network of oscillators can learn to recognize a set of noisy patterns by changing their natural frequencies, rather than changing their coupling strengths, as in Hebb's rule. Given its importance in artificial neural networks, it is worth comparing some of our results with those obtained for Hopfield networks [40]. First, we saw that, by using quasi-synchronous oscillators, we were able to recognize a much larger number of patterns than if we were to use Hopfield nets. In fact, the computed theoretical limit for a 35-node Hopfield network is approximately five patterns [41], but quasi-synchronized oscillators were able to recognize 30. This indicates a higher storage capacity than that of Hopfield networks. Another important distinction between our model and Hopfield's is in the way we represent learning. In our model, the oscillators' couplings are fixed, but their natural frequencies vary. In Hopfield's model, learning happens through changes in the connections between nodes. Because it is fully connected, it is very hard to produce computer chips that mimic large-scale Hopfield networks [42]. It is possible that, by having fixed connections but changing natural frequencies, our model may face fewer difficulties in hardware implementation.

REFERENCES

[1] A. T. Winfree, "Biological rhythms and the behavior of populations of coupled oscillators," J. Theor. Biol., vol. 16, no. 1, pp. 15–42, Jul. 1967.
[2] M. Bennett, M. F. Schatz, H. Rockwood, and K. Wiesenfeld, "Huygens's clocks," Proc. R. Soc. Lond. A, vol. 458, no. 2019, pp. 563–579, Mar. 2002.
[3] Y. Kuramoto, Chemical Oscillations, Waves, and Turbulence. New York: Springer-Verlag, 1984.
[4] J. A. Acebrón, L. L. Bonilla, C. J. P. Vicente, F. Ritort, and R. Spigler, "The Kuramoto model: A simple paradigm for synchronization phenomena," Rev. Mod. Phys., vol. 77, no. 1, pp. 137–185, 2005.
[5] W. Gerstner and W. Kistler, Spiking Neuron Models. Cambridge, U.K.: Cambridge Univ. Press, 2002.
[6] E. M. Izhikevich, Dynamical Systems in Neuroscience: The Geometry of Excitability and Bursting. Cambridge, MA: MIT Press, 2007.
[7] C. M. Gray, P. König, A. K. Engel, and W. Singer, "Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties," Nature, vol. 338, pp. 334–337, Mar. 1989.
[8] R. Eckhorn, R. Bauer, W. Jordan, M. Brosch, W. Kruse, M. Munk, and H. J. Reitboeck, "Coherent oscillations: A mechanism of feature linking in the visual cortex? Multiple electrode and correlation analyses in the cat," Biol. Cybern., vol. 60, no. 2, pp. 121–130, 1988.
[9] R. W. Friedrich, C. J. Habermann, and G. Laurent, "Multiplexing using synchrony in the zebrafish olfactory bulb," Nat. Neurosci., vol. 7, no. 8, pp. 862–871, Aug. 2004.
[10] V. B. Kazantsev, V. I. Nekorkin, V. I. Makarenko, and R. Llinas, "Self-referential phase reset based on inferior olive oscillator dynamics," Proc. Nat. Acad. Sci., vol. 101, no. 52, pp. 18183–18188, Dec. 2004.
[11] A. Lutz, J.-P. Lachaux, J. Martinerie, and F. J. Varela, "Guiding the study of brain dynamics by using first-person data: Synchrony patterns correlate with ongoing conscious states during a simple visual task," Proc. Nat. Acad. Sci., vol. 99, no. 3, pp. 1586–1591, Feb. 2002.
[12] V. N. Murthy and E. E. Fetz, "Coherent 25- to 35-Hz oscillations in the sensorimotor cortex of awake behaving monkeys," Proc. Nat. Acad. Sci., vol. 89, no. 12, pp. 5670–5674, Jun. 1992.
[13] G. Rees, G. Kreiman, and C. Koch, "Neural correlates of consciousness in humans," Nat. Rev. Neurosci., vol. 3, no. 4, pp. 261–270, Apr. 2002.
[14] E. Rodriguez, N. George, J.-P. Lachaux, J. Martinerie, B. Renault, and F. J. Varela, "Perception's shadow: Long-distance synchronization of human brain activity," Nature, vol. 397, no. 6718, pp. 430–433, Feb. 1999.
[15] H. Sompolinsky, D. Golomb, and D. Kleinfeld, "Global processing of visual stimuli in a neural network of coupled oscillators," Proc. Nat. Acad. Sci., vol. 87, no. 18, pp. 7200–7204, Sep. 1990.
[16] C. Tallon-Baudry, O. Bertrand, and C. Fischer, "Oscillatory synchrony between human extrastriate areas during visual short-term memory maintenance," J. Neurosci., vol. 21, no. 20, pp. RC177-1–RC177-5, Oct. 2001.
[17] P. N. Steinmetz, A. Roy, P. J. Fitzgerald, S. S. Hsiao, K. O. Johnson, and E. Niebur, "Attention modulates synchronized neuronal firing in primate somatosensory cortex," Nature, vol. 404, no. 6774, pp. 187–190, Mar. 2000.
[18] D. L. Wang, "Emergent synchrony in locally coupled neural oscillators," IEEE Trans. Neural Netw., vol. 6, no. 4, pp. 941–948, Jul. 1995.
[19] E. Leznik, V. Makarenko, and R. Llinas, "Electrotonically mediated oscillatory patterns in neuronal ensembles: An in vitro voltage-dependent dye-imaging study in the inferior olive," J. Neurosci., vol. 22, no. 7, pp. 2804–2815, Apr. 2002.
[20] W. W. Lytton and T. J. Sejnowski, "Simulations of cortical pyramidal neurons synchronized by inhibitory interneurons," J. Neurophysiol., vol. 66, no. 3, pp. 1059–1079, Sep. 1991.
[21] E. H. Park, P. So, E. Barreto, B. J. Gluckman, and S. J. Schiff, "Electric field modulation of synchronization in neuronal networks," Neurocomputing, vols. 52–54, pp. 169–175, Jun. 2003.
[22] M. A. Trevisan, S. Bouzat, I. Samengo, and G. B. Mindlin, "Dynamics of learning in coupled oscillators tutored with delayed reinforcements," Phys. Rev. E, vol. 72, no. 1, pp. 011907-1–011907-7, Jul. 2005.
[23] J. Yamanishi, M. Kawato, and R. Suzuki, "Two coupled oscillators as a model for the coordinated finger tapping by both hands," Biol. Cybern., vol. 37, no. 4, pp. 219–225, 1980.
[24] S. Ozawa, A. Roy, and D. Roussinov, "A multitask learning model for online pattern recognition," IEEE Trans. Neural Netw., vol. 20, no. 3, pp. 430–445, Mar. 2009.
[25] P. Baldi and R. Meir, "Computing with arrays of coupled oscillators: An application to preattentive texture discrimination," Neural Comput., vol. 2, no. 4, pp. 458–471, 1990.
[26] D. L. Wang and D. Terman, "Image segmentation based on oscillatory correlation," Neural Comput., vol. 9, no. 4, pp. 805–836, May 1997.
[27] D. L. Wang and E. Cesmeli, "Texture segmentation using Gaussian–Markov random field and neural oscillator networks," IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 394–404, Mar. 2001.
[28] K. Chen and D. L. Wang, "A dynamically coupled neural oscillator network for image segmentation," Neural Netw., vol. 15, no. 3, pp. 423–439, Apr. 2002.
[29] R. Borisyuk and Y. Kazanovich, "Dynamics of neural networks with a central element," Neural Netw., vol. 12, no. 3, pp. 441–454, Apr. 1999.
[30] Y. Kazanovich and R. Borisyuk, "Object selection by an oscillatory neural network," Biosystems, vol. 67, nos. 1–3, pp. 103–111, Oct.–Dec. 2002.
[31] R. Borisyuk, M. Denham, F. Hoppensteadt, Y. Kazanovich, and O. Vinogradova, "An oscillatory neural network model of sparse distributed memory and novelty detection," BioSystems, vol. 58, nos. 1–3, pp. 265–272, Dec. 2000.
[32] P. Suppes, Representation and Invariance of Scientific Structures. Stanford, CA: CSLI Publications, 2002.
[33] E. M. Izhikevich, "Polychronization: Computation with spikes," Neural Comput., vol. 18, no. 2, pp. 245–282, Feb. 2006.
[34] E. M. Izhikevich and Y. Kuramoto, "Weakly coupled oscillators," in Encyclopedia of Mathematical Physics, J.-P. Francoise, G. Naber, and S. T. Tsou, Eds. New York: Elsevier, 2006.
[35] P. Suppes, B. Han, and Z.-L. Lu, "Brain-wave recognition of sentences," Proc. Nat. Acad. Sci., vol. 95, no. 26, pp. 15861–15866, Dec. 1998.
[36] P. Suppes and L. Liang, "Concept learning rates and transfer performance of several multivariate neural network models," in Recent Progress in Mathematical Psychology, C. E. Dowling, F. S. Roberts, and P. Theuns, Eds. Mahwah, NJ: Lawrence Erlbaum, 1998.
[37] P. Suppes, B. Han, J. Epelboim, and Z.-L. Lu, "Invariance between subjects of brain wave representations of language," Proc. Nat. Acad. Sci., vol. 96, no. 22, pp. 12953–12958, Oct. 1999.
[38] P. Suppes, Z.-L. Lu, and B. Han, "Brain wave recognition of words," Proc. Nat. Acad. Sci., vol. 94, no. 26, pp. 14965–14969, Dec. 1997.
[39] P. Suppes, "On the behavioral foundation of mathematical concepts," Monographs Soc. Res. Child Develop., vol. 30, no. 1, pp. 60–96, 1965.
[40] J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proc. Nat. Acad. Sci., vol. 79, no. 8, pp. 2554–2558, Apr. 1982.
[41] R. J. McEliece, E. C. Posner, E. R. Rodemich, and S. S. Venkatesh, "The capacity of the Hopfield associative memory," IEEE Trans. Inform. Theory, vol. 33, no. 4, pp. 461–482, Jul. 1987.
[42] J. Ohta, M. Takahashi, Y. Nitta, S. Tai, K. Mitsunaga, and K. Kyuma, "GaAs/AlGaAs optical synaptic interconnection device for neural networks," Opt. Lett., vol. 14, no. 16, pp. 844–846, Aug. 1989.
Ekaterina Vassilieva received the Graduate degree from the Department of Mechanical and Mathematical Sciences, Moscow State University, Moscow, Russia. She was awarded the Ph.D. degree in symbolic computations and effective algorithms in noncommutative algebraic structures by the same institution. She joined the French National Center for Scientific Research, Paris, France, in 2002, and is currently working in the Laboratory of Computer Science, École Polytechnique, Paris, as a Researcher. Her current research interests include algebraic combinatorics and applications of combinatorial methods in symbolic computation, telecommunications, and various fields of theoretical computer science like graph and map theory.
Guillaume Pinto received the Graduate degree from the École Polytechnique, Paris, France, and the M.Sc. degree from Stanford University, Stanford, CA. He is currently Chief Technical Officer (CTO) and Program Manager at Parrot SA, Paris, a high-tech company specializing in wireless cell phone accessories. He leads the company's Consumer Products Design, Development, and Industrialization Division. After joining the company's Digital Signal Processing Department in 2004, he was appointed to the Executive Committee as deputy CTO in January 2006.
José Acacio de Barros was born in Barra Mansa, Rio de Janeiro, Brazil. He received the B.Sc. degree in physics from the Federal University of Rio de Janeiro, Rio de Janeiro, in 1988, and the M.Sc. and Ph.D. degrees in physics from the Brazilian Center for Research, São Paulo, Brazil, in 1989 and 1991, respectively. He was a Post-Doctoral Fellow at the Institute for Mathematical Studies in the Social Sciences, Stanford University, Stanford, CA, from 1991 to 1993, and a Science Researcher at Stanford's Education Program for Gifted Youth from 1993 to 1995. In 1995, he joined the Physics Department, Federal University of Juiz de Fora, Juiz de Fora, Brazil, where he is a member of the staff (on leave). He has held Visiting Faculty positions at Stanford University, and was a Visiting Researcher at the Brazilian Center for Research in Physics. Currently, he is with the Liberal Studies Department, San Francisco State University, San Francisco, CA. He has published several research papers on the foundations of physics, cosmology, physics education, and biophysics. His current research interests include interdisciplinary physical and mathematical models of cognitive processes and foundations of quantum mechanics.
Patrick Suppes was born in Tulsa, OK. He received the B.S. degree in meteorology from the University of Chicago, Chicago, IL, in 1943, and the Ph.D. degree in philosophy from Columbia University, New York, NY, in 1950. He was a Director of the Institute for Mathematical Studies in the Social Sciences, Stanford University, Stanford, CA, from 1959 to 1992. He is currently the Lucie Stern Professor Emeritus of philosophy at the Center for the Study of Language and Information, Stanford University. He has published widely on educational uses of computers and technology in education, as well as in philosophy of science and psychology. His current research interests include
detailed physical and statistical models of electroencephalogram- and magnetoencephalogram-recorded brainwaves associated with the processing of language and visual images, as well as continued development of computer-based curriculums in mathematics, physics, and English. Prof. Suppes has been a member of the National Academy of Education since 1965, the American Academy of Arts and Sciences since 1968, the National Academy of Sciences since 1978, and the American Philosophical Society since 1991. He received the American Psychological Association's Distinguished Scientific Contribution Award in 1972, the National Medal of Science in 1990, the Lakatos Award from the London School of Economics in 2003 for his 2002 book Representation and Invariance of Scientific Structures, and the Lauener Prize in philosophy, Switzerland, in 2004.
ELITE: Ensemble of Optimal Input-Pruned Neural Networks Using TRUST-TECH Bin Wang and Hsiao-Dong Chiang, Fellow, IEEE
Abstract— The ensemble of optimal input-pruned neural networks using TRUST-TECH (ELITE) method, for constructing high-quality ensembles through an optimal linear combination of accurate and diverse neural networks, is developed. The optimization problems in the proposed methodology are solved by a global optimization method called TRansformation Under STability-reTaining Equilibrium CHaracterization (TRUST-TECH), whose main features include its capability of identifying multiple local optimal solutions in a deterministic, systematic, and tier-by-tier manner. ELITE creates a diverse population via a feature selection procedure applied to the different local optimal neural networks obtained using the tier-1 TRUST-TECH search. In addition, the capability of each input-pruned network is fully exploited through TRUST-TECH-based optimal training. Finally, finding the optimal linear combination weights for an ensemble is modeled as a nonlinear programming problem and solved using TRUST-TECH and the interior point method, where the issue of non-convexity can be effectively handled. Extensive numerical experiments have been carried out for pattern classification on synthetic and benchmark datasets. Numerical results show that ELITE consistently outperforms existing methods on the benchmark datasets, and that it can be very promising for constructing high-quality neural network ensembles. Index Terms— Feature selection, global optimization, neural network ensemble, optimal linear combination, transformation under stability-retaining equilibrium characterization (TRUST-TECH).
I. INTRODUCTION

Two well-known challenging tasks in the area of machine learning using artificial neural networks (ANNs) are those of network architecture selection and optimal weight training. In deciding the architecture of a multilayer perceptron (MLP), a large network usually provides better approximation accuracy on (training) data at the cost of generalization capability on unseen (testing) data [1]–[3]. An ensemble offers an effective way to alleviate the burden of tuning the parameters of a single ANN and usually results in improved generalization capability [4], [5]. Several factors have a direct impact on the ensemble quality, such as the
Manuscript received December 20, 2009; revised October 4, 2010; accepted October 5, 2010. Date of publication November 11, 2010; date of current version January 4, 2011. This work was supported in part by the Centers for Education and Research on Therapeutics program and in part by the National Science Foundation, USA, under Grant ECCS-0642327. The authors are with the School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853 USA (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2010.2087354
accuracy and diversity of member networks [6]–[8] and the optimal combination scheme [9], [10]. We propose a systematic methodology for the ensemble of optimal input-pruned neural networks using TRansformation Under STability-reTaining Equilibrium CHaracterization (TRUST-TECH), termed ELITE. There are four stages in ELITE, designed to achieve high accuracy and diversity of member networks and an optimal combination of the selected member networks. To construct high-quality neural network ensembles, a diverse population (of member neural networks) is produced using different feature subsets for different members, while accurate individual networks are achieved via optimal training. The global optimizer used in ELITE is based on TRUST-TECH, and it plays a critical role in achieving both the optimal training and the optimal combination of member neural networks. TRUST-TECH was developed to find high-quality solutions for general nonlinear optimization problems [11], [12]. It has been successfully applied to machine learning problems, including optimally training ANNs [13] and estimating optimal parameters for finite mixture models [14], as well as to the optimal power flow problem [15]. TRUST-TECH-based methods can escape from a local optimal solution and search for other local optimal solutions in a systematic and deterministic way. Another feature of TRUST-TECH is its effective cooperation with existing local and global methods. This cooperation starts with a global method for obtaining promising solutions. Then, by working with robust and fast local methods, TRUST-TECH efficiently searches the neighboring subspace of the promising solutions for new local optimal solutions in a tier-by-tier manner. A high-quality optimum can be found from the multiple local optimal solutions.
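The tier-by-tier escape-and-descend idea can be sketched on a toy one-dimensional objective. The exit test (step outward from a local optimum until the objective starts decreasing again, then restart a local method), the step sizes, and the toy function below are our illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

def local_minimize(f, grad, w0, lr=0.01, steps=5000):
    """Plain gradient descent, standing in for any robust local method."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def tier1_search(f, grad, w_star, directions, eps=0.1, max_step=5.0):
    """Move outward from the local optimum w_star along each search
    direction; once the objective starts to decrease, we have (heuristically)
    crossed into a neighboring stability region, so hand that point to the
    local method to locate the neighboring local optimum."""
    found = []
    for d in directions:
        d = d / np.linalg.norm(d)
        step, prev = eps, f(w_star)
        while step < max_step:
            w = w_star + step * d
            if f(w) < prev:          # past the barrier: descend from here
                found.append(local_minimize(f, grad, w))
                break
            prev, step = f(w), step + eps
    return found

# Toy multimodal objective E(w) = cos(w) + w^2/50 with several local minima.
f = lambda w: np.cos(w[0]) + w[0] ** 2 / 50.0
grad = lambda w: np.array([-np.sin(w[0]) + w[0] / 25.0])

w0 = local_minimize(f, grad, [2.0])                     # tier-0 local optimum
neighbors = tier1_search(f, grad, w0, [np.array([1.0]), np.array([-1.0])])
```

Searching both directions from the first optimum yields two distinct neighboring minima, illustrating the tier-1 expansion of the solution set.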
One important role that TRUST-TECH plays in the proposed ELITE is to find multiple high-quality member networks and to design an optimal ensemble of the found high-quality member networks. ELITE provides an effective framework for constructing high-quality neural network ensembles (see Fig. 1). Existing training methods for neural networks and methods for composing ensembles can be easily incorporated into ELITE. For instance, in ELITE, optimization problems associated with training and linearly combining neural networks are solved using TRUST-TECH and existing local methods. As illustrated in Fig. 1, ELITE creates a population of neural networks through tier-1 TRUST-TECH search. The input layer of each network is pruned and a distinct feature subset is assigned. The accuracy of the input-pruned networks is then achieved
1045–9227/$26.00 © 2010 IEEE
WANG AND CHIANG: ENSEMBLE OF OPTIMAL INPUT-PRUNED NEURAL NETWORKS USING TRUST-TECH
[Fig. 1 flowchart: the ELITE method for constructing neural network ensembles. Stage I: generate a basic neural network with optimal training (TRUST-TECH-based network structure selection). Stage II: generate a population of local optimal neural networks (TRUST-TECH tier-1 search). Stage III: generate a population of optimal, input-pruned neural networks (TRUST-TECH-based input layer pruning). Stage IV: create the optimal neural network ensemble (TRUST-TECH + IPM-based neural network combination).]
Fig. 1. Structure of the ELITE method for constructing neural network ensembles.
by the TRUST-TECH-based optimal training. Finally, a combination of TRUST-TECH with the interior point method (IPM) is used to compute the optimal (combination) weights, and the member neural networks are combined to realize the ensemble. Several distinguishing features of ELITE are described below.
1) Diversity and Accuracy: Each member neural network in the ensemble constructed by ELITE is associated with a distinct and salient feature subset and is optimally trained with TRUST-TECH. Hence, both accuracy and diversity of the population can be achieved. The deterministic feature selection and TRUST-TECH-based optimal training distinguish ELITE from existing methods in which feature subsets are randomly generated via bagging methods [16].
2) Optimality: Optimality of the ensemble constructed using ELITE is achieved by optimally combining the member neural networks. The associated quadratic programming (QP) problem is effectively solved using TRUST-TECH and IPM. This feature differs from many existing methods, such as the genetic algorithm-based method [5], in that the issue of entrapment in a local optimum is resolved in an efficient and deterministic way.
To illustrate the effectiveness of the ELITE method, numerical experiments and studies are carried out for pattern classification using an ensemble of feedforward networks. These numerical experiments and studies are quite extensive and include the following.
1) Ensemble performance on a synthetic dataset and several UCI benchmark datasets.
2) Comparison of ensemble performance by different combination schemes.
3) Comparison of diversity and accuracy by different schemes with and without using TRUST-TECH.
4) Ensemble performance with different hidden layer sizes.
5) Comparison of the performance of the proposed ELITE method with existing ensemble methods.
Numerical results show that ELITE consistently outperforms existing methods on the benchmark datasets. The performance of ELITE was compared with that of six existing methods whose performance has been reported in the literature on the same datasets. Of a total of 12 datasets, ELITE achieves the best performance on 7, while on the other 5 its performance is also comparable with the best.
The rest of this paper is organized as follows. Section II provides preliminaries of the related work. An overview of the TRUST-TECH-based method for training ANNs is given in Section III. ELITE for constructing neural network ensembles is presented and discussed in detail in Section IV. In Section V, numerical experiments are carried out and the results are studied. Section VI concludes this paper with additional remarks.

II. PRELIMINARIES

A. Neural Network Training
The performance of a neural network is usually gauged by measuring the mean square error (MSE) of its output. The goal of optimal training is to find a set of parameters that achieves the global minimum MSE [17]. For an n-dimensional dataset, the MSE over Q samples in the training set is given by E(w) =
\frac{1}{Q} \sum_{i=1}^{Q} [t_i - y(x_i, w)]^2    (1)
where t_i is the target output for the ith sample x_i, w is the weight vector, and y(·) is the network output function. The MSE, as a function of the network parameters, usually contains many local optimal solutions. Existing training methods can be categorized into local and global methods; several successful training algorithms have been extensively studied in the literature [1], [18], [19]. Local methods, such as the backpropagation (BP) algorithm, are usually deterministic and have received significant attention. However, these methods can only attain a local optimal solution close to the initial conditions [20]. On the other hand, global methods, such as simulated annealing [21] and evolutionary algorithms [22]–[24], aim to explore the entire error surface for solutions approaching the global optimum. Global methods can explore the entire solution space effectively to identify promising regions [25]. However, they may lack the ability to obtain a precise final solution and generally require local methods for fine-tuning.
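As a concrete illustration of eq. (1), the training error of a one-hidden-layer network can be computed as follows. The tanh hidden activation and linear output node are our assumptions, since the text does not fix them.

```python
import numpy as np

def mlp_output(x, W1, b1, w2, b0):
    """One-hidden-layer feedforward network: tanh hidden units and a
    linear output node (an illustrative choice of activations)."""
    return float(w2 @ np.tanh(W1 @ x + b1) + b0)

def mse(weights, X, t):
    """Training error E(w) of eq. (1): mean squared error over Q samples."""
    W1, b1, w2, b0 = weights
    y = np.array([mlp_output(x, W1, b1, w2, b0) for x in X])
    return float(np.mean((np.asarray(t) - y) ** 2))
```

Any training method discussed in this section, local or global, is searching the landscape of this `mse` function over the weight vector.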
B. Neural Network-Based Feature Selection
The task of feature selection involves selecting a subset of relevant features to build robust learning models. High-dimensional features may degrade the efficiency of learning algorithms, especially when irrelevant or redundant features exist [26]. Feature selection has been recognized as a challenging combinatorial optimization problem: it is generally computationally prohibitive to evaluate all possible combinations to find the most compact feature set. As a recent advance, semi-supervised feature selection, in which both labeled and unlabeled examples are present, has attracted special interest. In [27], Xu et al. solved this problem using convex–concave optimization with encouraging results. Neural networks can also be used as feature selectors. Neural network-based feature selection methods can be categorized as model independent or model dependent [28]. Model-independent methods perform feature selection and model building separately, while model-dependent methods attempt to optimize feature selection and model selection simultaneously. Optimal training plays an important role in both kinds of methods when neural networks are used as the feature selector.

C. Neural Network Ensemble
An ensemble provides an effective way to alleviate the burden of tuning the parameters of a single learning model. When a number of learning models are available, the best individual is usually chosen. However, the model that performs best on the training and validation sets does not necessarily perform best on the testing set. An ensemble formed by properly combining the outputs of different models usually generalizes better than any of the individual models [4], [5], [29], [30]. Accurate and diverse neural networks are prerequisites for constructing a high-quality ensemble [6]–[8].
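A minimal sketch of linearly combining member outputs: here the combination weights are fit by ordinary least squares on a validation set and then clipped and renormalized, a simple illustrative stand-in for the constrained optimization that ELITE solves with TRUST-TECH and the interior point method.

```python
import numpy as np

def ensemble_predict(member_outputs, alpha):
    """Ensemble output as a linear combination: y = sum_m alpha_m * y_m."""
    return np.asarray(member_outputs).T @ np.asarray(alpha)

def combination_weights(val_outputs, val_targets):
    """Fit combination weights by least squares on validation outputs,
    then clip to nonnegative and renormalize so the weights sum to one."""
    Y = np.asarray(val_outputs).T                      # (samples, members)
    alpha, *_ = np.linalg.lstsq(Y, np.asarray(val_targets), rcond=None)
    alpha = np.clip(alpha, 0.0, None)
    s = alpha.sum()
    return alpha / s if s > 0 else np.full(len(alpha), 1.0 / len(alpha))
```

When one member already matches the validation targets, the fit concentrates the weight on that member, showing why a data-driven combination can beat a uniform average.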
Bagging and boosting are two popular methods for creating diverse learning models by altering the training set so that each model sees a different training subset [31]. Bagging is parallel: it resamples the training set independently for each model. In contrast, boosting is sequential: the training set for each model is generated depending on previously learned models. In each case, optimal training is critical to obtaining accurate learning models. The combination scheme has a direct impact on ensemble quality [32]. Linear combination of neural networks is widely used in constructing an ensemble. The task of finding the optimal combination weights that achieve the minimum error has been formulated as an optimization problem [9], [10], [33]. One difficult issue in building an ensemble is entrapment in local optimal solutions. In the past, different global optimization methods were employed to address this issue with different degrees of success. In ELITE, this issue is resolved by a combination of TRUST-TECH and IPM.

III. TRAINING ANNS USING TRUST-TECH

This section presents an overview of the TRUST-TECH-based optimal training method. Without loss of generality, we
consider a feedforward neural network with one input layer, one hidden layer, and one output node. Given the input–output pairs (x_1, t_1), (x_2, t_2), ..., (x_Q, t_Q), the training task can be formulated as an s-dimensional optimization problem

    min_w E(w)    (2)

where s = (n + 2)k + 1, with n the number of input nodes and k the number of hidden nodes, and the weight vector w = (w_01, ..., w_0k, ..., w_n1, ..., w_nk, b_0, ..., b_k)^T includes all the network weights (w_0j is the j-th output weight, and w_ij is the weight connecting the i-th input node and the j-th hidden node) and biases (b_j is the j-th bias). The MSE to be minimized can be written as

    E(w) = (1/Q) Σ_{i=1}^{Q} [t_i − y(x_i, w)]².    (3)
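As a concrete illustration of the objective (3), the following sketch evaluates the MSE of a one-hidden-layer network with tanh activations in both layers (the activation choice follows the experimental setup described in Section V); the function and variable names are illustrative, not taken from the paper's implementation:

```python
import math

def mlp_output(x, W, b, w_out, b_out):
    # One hidden layer with tanh activations; tanh output node as well.
    # W[j][i] is the weight from input i to hidden node j, b[j] is the
    # j-th hidden bias, w_out[j] is the j-th output weight, b_out is
    # the output bias.
    h = [math.tanh(sum(Wj[i] * x[i] for i in range(len(x))) + bj)
         for Wj, bj in zip(W, b)]
    return math.tanh(sum(wj * hj for wj, hj in zip(w_out, h)) + b_out)

def mse(samples, W, b, w_out, b_out):
    # E(w) = (1/Q) * sum_i [t_i - y(x_i, w)]^2, as in (3)
    return sum((t - mlp_output(x, W, b, w_out, b_out)) ** 2
               for x, t in samples) / len(samples)

def num_params(n, k):
    # s = (n + 2)k + 1: n*k input-hidden weights, k output weights,
    # k hidden biases, and one output bias
    return (n + 2) * k + 1
```

With all weights zero the network outputs tanh(0) = 0, so a single sample with target 1 gives an MSE of exactly 1; the parameter count matches the dimension s stated after (2).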
TRUST-TECH solves an optimization problem by first defining a dynamical system such that the stable equilibrium points (SEPs) of the dynamical system are in one-to-one correspondence with the local optimal solutions of the optimization problem (2). Because of this correspondence, the problem of computing multiple local optimal solutions is transformed into that of finding multiple stability regions of the defined dynamical system, each of which contains a distinct SEP. An SEP can be computed with the trajectory method, or by a local method using a trajectory point in its stability region as the initial point [11], [12].

To solve the optimization problem (2), the desired dynamical system can be defined as the generalized gradient system

    dw/dt = −grad_R E(w) = −R(w)^{-1} · ∇E(w)    (4)

where R(w) is a positive definite symmetric matrix (also known as the Riemannian metric). This framework is quite general, since different existing training algorithms can be recovered with different choices of R(w): if R(w) = I, it is the naive error BP algorithm; if R(w) = J(w)^T J(w), it is the Gauss–Newton method; and if R(w) = J(w)^T J(w) + µI, it is the Levenberg–Marquardt (LM) method. Hence, TRUST-TECH-based methods are dynamical methods for obtaining a set of local optimal solutions of general optimization problems.

A TRUST-TECH-based algorithm for training neural network weights is detailed in Fig. 2. The value of ε is empirically chosen to be 0.1 so that w′ lies in the stability region of the neighboring SEP (i.e., near the neighboring local optimal solution). The task of selecting proper search directions in an efficient way is very challenging. In this algorithm, the search directions can be chosen as a subset of dominant eigenvectors of the objective Hessian at the SEP.
The justification for this strategy is that the local stable (unstable) manifold of an equilibrium point of (4) is tangent to the stable (unstable) eigenspace of the linearized system at that equilibrium point [34]. However, computing Hessian eigenvectors, even dominant ones, is computationally demanding, especially for large-scale problems.
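To make the role of R(w) in (4) concrete, here is a minimal sketch for a scalar least-squares model y = w·x, discretizing the gradient flow with an Euler step under the LM metric R(w) = J^T J + µI. The 1/Q scaling (chosen to match the Hessian of the MSE-normalized objective) and the step size are illustrative assumptions, not values from the paper:

```python
def grad_E(w, xs, ts):
    # Gradient of E(w) = (1/Q) * sum_i (t_i - w*x_i)^2 for the scalar
    # model y = w*x
    Q = len(xs)
    return (-2.0 / Q) * sum((t - w * x) * x for x, t in zip(xs, ts))

def lm_flow_step(w, xs, ts, mu=1e-3, dt=1.0):
    # One Euler step of dw/dt = -R(w)^{-1} * grad E(w), as in (4), with
    # the Levenberg-Marquardt metric. For this scalar model the residual
    # Jacobian entries are dr_i/dw = -x_i, so J^T J = sum_i x_i^2
    # (scaled here by 2/Q to match the Hessian of the normalized E).
    Q = len(xs)
    R = (2.0 / Q) * sum(x * x for x in xs) + mu
    return w - dt * grad_E(w, xs, ts) / R

xs, ts = [1.0, 2.0], [2.0, 4.0]   # data exactly fit by w = 2
w = 0.0
for _ in range(50):
    w = lm_flow_step(w, xs, ts)
```

Setting R(w) = I instead recovers plain gradient-descent BP, and mu = 0 recovers the Gauss–Newton step, mirroring the three special cases listed above.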
WANG AND CHIANG: ENSEMBLE OF OPTIMAL INPUT-PRUNED NEURAL NETWORKS USING TRUST-TECH
Input: a local optimal weight vector w_s0.
Output: a set W_s of next-tier SEPs.
Initialization: W_s = {w_s0}.
Algorithm:
1) Determine the search directions d_1, d_2, ..., d_k.
2) for i = 1 : k
   - Search for an exit point w_e along d_i.
   - if w_e along the search direction is found, then
     · Step forward along d_i to the point w′ = w_e + ε(w_e − w_s0), with ε being a small value; w′ will lie in the stability region of the neighboring SEP.
     · Using w′ as the initial guess, apply the local optimizer to get a tier-1 SEP, denoted w_si, lying in the neighboring stability region.
     · Update W_s as W_s = W_s ∪ {w_si}.
3) Output the set W_s of SEPs of the generalized gradient system (4).

Fig. 2. Tier-1 TRUST-TECH search algorithm.
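The mechanics of Fig. 2 can be illustrated on a 1-D double-well objective, where the exit point along a search direction is the point at which the energy stops increasing along the ray. The optimizer, step sizes, and tolerances below are illustrative stand-ins, not the paper's implementation:

```python
def energy(w):
    # Double-well objective with local minima at w = -1 and w = +1
    return (w * w - 1.0) ** 2

def d_energy(w):
    return 4.0 * w * (w * w - 1.0)

def local_min(w, lr=0.01, iters=5000):
    # Stand-in local optimizer (plain gradient descent)
    for _ in range(iters):
        w -= lr * d_energy(w)
    return w

def exit_point(ws0, d, step=0.01, tmax=5.0):
    # March along direction d from the SEP ws0; the exit point is where
    # the energy stops increasing along the ray (the crossing of the
    # stability boundary)
    t, prev = step, energy(ws0)
    while t < tmax:
        cur = energy(ws0 + t * d)
        if cur < prev:
            return ws0 + (t - step) * d
        prev, t = cur, t + step
    return None   # no exit point along this direction

ws0 = local_min(-0.8)                  # initial SEP, approx. -1
we = exit_point(ws0, 1.0)              # exit point, approx. 0
w1 = local_min(we + 0.1 * (we - ws0))  # tier-1 SEP in the neighboring basin
```

Starting from the basin of w = −1, the search crosses the energy ridge at w ≈ 0, steps forward with ε = 0.1, and the local optimizer then converges to the neighboring SEP at w = +1.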
Another choice is to use random search directions, but they need to be orthogonal to each other in order to span the search space and to maintain a diverse search. It appears that effective directions in general have a close relationship with the structure of the objective function (and, for constrained problems, the feasible set). Hence, exploiting the structure of the objective under study will prove fruitful in selecting search directions.

By exploiting TRUST-TECH's capability to escape from local optimal solutions in a systematic and deterministic way, it becomes feasible to locate multiple local optimal solutions in a tier-by-tier manner. Each local optimal solution corresponds to a local optimal neural network. As a result, distinct local optimal neural networks (local optimal weights in W_s for the same network structure) can be obtained and are denoted as N = {n_1, n_2, ..., n_L}.

IV. OPTIMAL ENSEMBLE

This section proposes a method called ELITE for constructing high-quality neural network ensembles, taking advantage of TRUST-TECH's ability to find multiple local optimal solutions. The goals in designing ELITE are twofold: first, to generate a population of accurate and diverse neural networks, and second, to combine them optimally to realize an optimal ensemble. ELITE consists of the following four stages (see Fig. 3).
1) Stage I: Determine an optimal network structure.
2) Stage II: Generate member neural networks.
3) Stage III: Perform input pruning and feature selection for the member neural networks.
4) Stage IV: Perform the optimal combination of the member neural networks.

A. Stage I: Determining an Optimal Network Structure

Since TRUST-TECH can effectively find multiple local optimal solutions to the training problem, the potential (i.e., the capability) of a neural network with a specific structure can be well explored. Hence, a neural network with a compact structure can be obtained. Considering that complexity has
a direct impact on the generalization capability of neural networks, a neural network with a compact structure is needed. This stage serves to meet this need.

An incremental growing method is used to determine a compact-structured neural network. A similar constructive strategy has been used to train feedforward neural networks with promising results [35]. This method starts from an initial network with a single hidden layer and a small number of hidden nodes. The TRUST-TECH-based training method is applied to determine the optimal weights achieving the minimum MSE value. If this value is greater than a target value (0.01 in this paper), a new hidden node is added and the network is trained again using TRUST-TECH. This process is repeated until the required MSE value is met or no significant improvement can be achieved in reducing the minimum MSE value. The neural network (with high-quality local optimal weights) thus obtained serves as the fundamental neural network for the subsequent stages.

B. Stage II: Generating Member Neural Networks

After the structure of a compact neural network has been determined, the network weights are re-trained through the tier-1 TRUST-TECH search algorithm. The objective is twofold. First, from a nonlinear system theory viewpoint, this re-training explores the stability regions of the neighboring SEPs surrounding the SEP corresponding to the basic network; hence, the possibility of finding better local optimal solutions is increased. Second, and more importantly, multiple local optimal solutions are obtained through the TRUST-TECH search, providing a population of neural networks from which to form an ensemble. As a result, a population of neural networks sharing the same structure but with different local optimal weights is obtained. This set of neural networks is denoted as N_0 = {(n_0, w_1*), (n_0, w_2*), ..., (n_0, w_L*)}.
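The control flow of the incremental growing method in Stage I can be sketched as follows; `train_mse` is a stand-in for the TRUST-TECH-based trainer, and the 1/k decay used in the demonstration call is purely illustrative:

```python
def grow_network(train_mse, k0=2, target=0.01, max_nodes=200, tol=1e-5):
    # Incremental growing (Stage I): start small and add one hidden node
    # at a time until the target MSE is met or the improvement becomes
    # insignificant. train_mse(k) stands in for TRUST-TECH-based training
    # of a k-hidden-node network and returns the best MSE found.
    k, best = k0, train_mse(k0)
    while best > target and k < max_nodes:
        new = train_mse(k + 1)
        if best - new < tol:      # no significant improvement: stop growing
            break
        k, best = k + 1, new
    return k, best

# Illustrative stand-in: pretend the achievable MSE decays as 1/k
k, best = grow_network(lambda k: 1.0 / k)
```

With the 1/k stand-in, the loop grows the hidden layer until the target MSE of 0.01 is reached; a trainer whose MSE plateaus instead triggers the no-improvement exit.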
In our previous work [13], the effort stopped here: the neural network achieving the best performance on the validation dataset was selected as the final result. However, the single neural network thus selected does not necessarily have the best performance on an unseen testing dataset. In other words, the generalization capability of the selected neural network is not guaranteed. In the proposed ELITE method, all these tier-1 neural networks are preserved. To achieve improved generalization performance, additional processes, including feature selection and network ensembling, are carried out, as described in the next two stages.

C. Stage III: Input Pruning and Population Diversity

In this stage, an improved saliency-based method is proposed to obtain optimal input-pruned neural networks using the TRUST-TECH-based training method. In the MLP, each feature is associated with an input node, and feature selection can be realized by evaluating the saliency of the input nodes. The saliency of a weight in the neural network can be approximated by the change in performance caused by setting this weight to 0. Applying the Taylor expansion on the error function E(w) with respect
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
[Fig. 3: block diagram. Stage I: network structure learning → Stage II: member network generation and training → Stage III: salient feature selection → Stage IV: optimal network combination, with TRUST-TECH underpinning each stage (escape from a local optimum, approach neighboring local optima, tier-by-tier search).]

Fig. 3. ELITE consisting of four stages for constructing an ensemble, by generating and optimally combining accurate and diverse neural networks. TRUST-TECH plays a central role in helping the local optimizers avoid entrapment in local optimal solutions to the associated optimization problems.
to the i-th weight w_i ∈ w = (w_01, ..., b_k)^T, i = 1, ..., s, we get

    ΔE_i = (∂E/∂w_i) · Δw_i + (1/2)(∂²E/∂w_i²) · Δw_i² + O(|Δw_i|³).    (5)

In order to adjust w_i to 0, we have Δw_i = −w_i; meanwhile, the higher-order term in (5) can be approximated by

    O(|Δw_i|³) ≈ (1/2) (∂E/∂w_i)² / (∂²E/∂w_i²).    (6)

Hence, we have the following representation for the saliency of the i-th weight in the neural network:

    s_i = −(∂E/∂w_i) w_i + (1/2)(∂²E/∂w_i²) w_i² + (1/2)(∂E/∂w_i)² / (∂²E/∂w_i²).    (7)

Consequently, the saliency of an input node (and, accordingly, of the corresponding input feature) can be represented as [28]

    S_i = Σ_{j ∈ fanout(i)} { −(∂E/∂w_j) w_j + (1/2)(∂²E/∂w_j²) w_j² + (1/2)(∂E/∂w_j)² / (∂²E/∂w_j²) }.    (8)

Input: a local optimal network n and the threshold p.
Output: a salient feature subset x, and the input-pruned network n.
Algorithm:
1) Calculate the input-node saliency according to (8).
2) Calculate the normalized saliency

       S̃_i = S_i / Σ_{i=1}^{n} S_i,  ∀i = 1, ..., n.    (9)

3) Sort {S̃_i} in descending order, resulting in

       {S̃_{i_j} | S̃_{i_1} ≥ S̃_{i_2} ≥ ··· ≥ S̃_{i_n}}.    (10)

4) The selected salient feature subset is

       x = {x_{i_1}, ..., x_{i_k}}    (11)

   where k is determined via

       k = max{k | 1 − Σ_{j=1}^{k} S̃_{i_j} ≥ p}.    (12)

5) Remove the redundant input nodes and the corresponding weights to obtain the input-pruned network n.

Fig. 4. Saliency-based method for feature selection.

In essence, the saliency of an input node is the accumulated saliency of its fan-out weights. The resulting feature selection procedure is described in Fig. 4. The saliency of the input nodes is computed first; then the nodal saliency is normalized by the total saliency. To balance the number of selected features, the saliency threshold p is chosen empirically as 0.15 in this paper. In other words, the minimum set of input nodes (or features) whose combined saliency accounts for at least 85% of the total saliency is selected. This feature selection procedure is carried out separately on each tier-1 local optimal neural network in the set N_0. As a result, distinct feature subsets are assigned to different local optimal neural networks. Since the low-saliency nodes have been removed, the input layer is condensed and the network structure is modified accordingly. These input-pruned neural networks are denoted as {n_i : w_i, x_i}, i = 1, ..., L, where n_i stands for the corresponding network structure, w_i corresponds to the modified weights, and x_i is the dataset composed of the selected feature subset.
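The normalize-sort-select steps of Fig. 4 reduce to taking the smallest prefix of ranked features whose combined normalized saliency reaches 1 − p (85% for p = 0.15), following the prose description above; a minimal sketch with illustrative names:

```python
def select_salient_features(saliency, p=0.15):
    # saliency[i]: saliency S_i of input node i, as in (8).
    # Normalize (9), sort in descending order (10), and keep the smallest
    # prefix whose combined normalized saliency reaches 1 - p (11)-(12).
    total = sum(saliency)
    ranked = sorted(((s / total, i) for i, s in enumerate(saliency)),
                    reverse=True)
    kept, cum = [], 0.0
    for frac, i in ranked:
        kept.append(i)
        cum += frac
        if cum >= 1.0 - p:
            break
    return sorted(kept)
```

For saliencies (5.0, 1.0, 3.0, 0.5), the two most salient nodes cover only 84.2% of the total, so a third node is kept and the fourth is pruned.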
For each structurally modified neural network n_i, it is evident that the remaining weights w_i do not necessarily remain (locally) optimal. The TRUST-TECH-based training method is carried out to find multiple local optimal weight vectors for each neural network, from which the optimal one w_i* will be selected. As a result of this stage, a population of optimal networks associated with different feature subsets is obtained and will be denoted as {n_i : w_i*, x_i}, i = 1, ..., L.

D. Stage IV: Optimal Combination

The task of finding an optimal ensemble of a family of neural networks is achieved by solving the following optimization problem:

    min_v E(v|N, x) = Σ_{i=1}^{Q} ( Σ_{j=1}^{L} v_j f_j(x_i) − t_i )²    (13)

where E(v|N, x) is the error function with respect to the combination rule v, given the set of neural networks N = {n_1, ..., n_L}. The dataset x is usually chosen as the validation dataset, f_j(x_i) is the output of the j-th neural network when the i-th sample is input, and t_i is the desired output.
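A direct transcription of the objective (13), assuming the combination is a plain weighted sum of the member outputs; names are illustrative:

```python
def ensemble_error(v, outputs, targets):
    # E(v | N, x) from (13): squared error of the linearly combined
    # ensemble output on a validation set. outputs[j][i] is f_j(x_i),
    # the output of the j-th network on the i-th sample.
    err = 0.0
    for i, t in enumerate(targets):
        pred = sum(vj * outputs[j][i] for j, vj in enumerate(v))
        err += (pred - t) ** 2
    return err
```

For two networks that each err on a different sample, the uniform weights (0.5, 0.5) drive the combined error to zero while either network alone does not, which is the effect Stage IV exploits.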
1) Optimal Linear Combination: ELITE combines the neural networks using optimal linear weights. The optimal weights are calculated by solving the following QP problem [5]:

    min_v E(v) = (1/2) v^T C v
    s.t.  v^T e = 1,  v ≥ 0    (14)

where v stands for the combination weight vector and e = (1, ..., 1)^T. C is the correlation matrix whose elements

    C_ij = ∫ p(x) [f_i(x) − t(x)] [f_j(x) − t(x)] dx    (15)

are the correlations between the outputs of the i-th and j-th neural networks, where t(x) is the target output for x. For a practical problem with Q training samples, the matrix C can be numerically evaluated with elements

    C_ij = (1/Q) Σ_{k=1}^{Q} [f_i(x_k) − t_k][f_j(x_k) − t_k],  i, j = 1, ..., L.    (16)

Since C may not always have positive eigenvalues, the quadratic optimization problem (14) is not necessarily convex. Hence, it might have multiple local optimal solutions. Iterative methods, such as the IPM [36], are very effective for solving convex quadratic optimization problems. However, if there are multiple local optimal solutions, they may get stuck in one of them. ELITE combines TRUST-TECH and the IPM to effectively find multiple local optimal solutions, from which a high-quality solution is selected for the ensemble.

2) IPM Formulation: Using the logarithmic barrier function, the augmented Lagrange function is

    L_µ(v, λ) = (1/2) v^T C v + λ(v^T e − 1) − µ Σ_{i=1}^{n} ln v_i    (17)

where µ is the barrier parameter. Hence, the Karush–Kuhn–Tucker optimality conditions are

    ∂L_µ/∂v = Cv + λe − µV^{-1}e = 0    (18a)
    ∂L_µ/∂λ = v^T e − 1 = 0    (18b)

where V = diag(v_1, v_2, ..., v_n). Multiplying both sides of (18a) by V, we have

    H_µ(v, λ) = [ VCv + λVe − µe ;  v^T e − 1 ] = 0.    (19)

The essence of the IPM is to solve a sequence of problems (18) with decreasing µ → 0. The obtained sequence of solutions approaches a local optimal solution of the original quadratic optimization problem (14), where µ = 0.

3) TRUST-TECH-Based Method: The IPM can obtain only one local optimal solution for a given initial point. TRUST-TECH is used to compute multiple local optimal solutions, with the IPM as the local solver. From these solutions, the best one is selected and used for constructing the ensemble. To this end, TRUST-TECH first builds an associated nonlinear dynamical system.

Let x = (v, λ). The generalized gradient system corresponding to problem (19) is defined as

    dx/dt = −∇H^T(x) · H(x)    (20)

and the associated energy function is

    E(x) = (1/2) ‖H(x)‖².    (21)

Since the system (20) is defined for searching multiple local optimal solutions starting from a local optimal solution of problem (14), µ = 0 and thus does not appear in (20). The TRUST-TECH-based method for calculating the optimal combination weights is presented in Fig. 5. As in Fig. 2, the value of ε is chosen to be 0.1 in this paper, to place v′ in the stability region of the neighboring SEP (i.e., near the neighboring local optimal solution). In Fig. 5, σ is used to detect nonconvex situations and is set to 10⁻⁶ in this paper.

Input: the set of neural networks N = {n_1, ..., n_L}.
Output: the optimal combination weights v*.
Initialization: the initial point v_0 = (1/L, ..., 1/L), and the set of SEPs V_s = ∅.
Algorithm:
1) Calculate the correlation matrix C using (16) and compute its eigenvalues λ_1, ..., λ_k.
2) Using v_0 as the initial point, apply the IPM to solve (14) and get an SEP v_s0. Update V_s as V_s = {v_s0}.
3) If min_{i=1,...,k} λ_i < σ (σ is a small positive value)
   - Calculate the search directions d_1, ..., d_m.
   - for i = 1 : m
     Search for an exit point v_e along d_i in the generalized gradient system (20).
     If v_e along d_i is found, then
       · Step forward along the search direction to the point v′ = v_e + ε(v_e − v_s0), with ε being a small positive number; v′ will lie in the stability region of the neighboring SEP.
       · Using v′ as the initial point, apply the IPM to get the tier-1 SEP, denoted v_si, lying in the neighboring stability region.
       · Update V_s as V_s = V_s ∪ {v_si}.
4) The optimal combination weight vector is v* = arg min_v {E(v) | v ∈ V_s}.

Fig. 5. TRUST-TECH + IPM algorithm for optimal linear combination.
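The empirical correlation matrix (16) and the QP objective of (14) can be sketched as follows (illustrative names; constraint handling is omitted here, since that is what the IPM provides in the actual method):

```python
def correlation_matrix(outputs, targets):
    # C_ij from (16): correlation between the errors of networks i and j,
    # estimated over Q samples. outputs[i][k] is f_i(x_k).
    L, Q = len(outputs), len(targets)
    errs = [[outputs[i][k] - targets[k] for k in range(Q)]
            for i in range(L)]
    return [[sum(ei[k] * ej[k] for k in range(Q)) / Q for ej in errs]
            for ei in errs]

def qp_objective(v, C):
    # E(v) = (1/2) v^T C v, the objective of the QP (14)
    return 0.5 * sum(vi * Cij * vj
                     for vi, Ci in zip(v, C)
                     for Cij, vj in zip(Ci, v))
```

For two networks with uncorrelated, equal-variance errors, C is a scaled identity and the uniform weights v = (1/2, 1/2) give a lower objective than putting all weight on either single network, which is why combining diverse members pays off.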
V. EXPERIMENTAL RESULTS

To evaluate the performance of the neural network ensembles constructed using ELITE, numerical experiments have been carried out on a variety of pattern classification problems using a synthetic dataset and the UCI benchmark datasets [37]. In this section, these experiments are described in detail and the results are presented and discussed.

A. Experimental Setup

ELITE has been implemented in MATLAB, and the ensemble is constructed by optimally combining feedforward networks with one hidden layer. The hyperbolic tangent function is used as the transfer function in both the hidden and output
TABLE I
RESULTS ON THE SYNTHETIC DATA

Network   Training MSE   MSE Improvement   Training error   Testing error
1         0.14           —                 4%               11%
2         0.15           −7.1%             4%               15%
3         1.64           −1071.43%         41%              32%
4         0.56           −300%             14%              15%
5         0.48           −242.86%          12%              20%
6         0.36           −157.14%          9%               11%
7         0.24           −71.43%           6%               14%
8         0.15           −7.1%             4%               10%
9         0.32           −128.57%          8%               10%
10        0.32           −128.57%          8%               11%
11        0.16           −14.28%           4%               12%
12        0.44           −214.28%          11%              9%
13        0.32           −128.57%          8%               12%
14        0.18           −28.57%           5%               22%
15        0.40           −185.71%          10%              13%
16        0.64           −357.14%          16%              18%
17        0.27           −92.86%           7%               11%
18        0.23           −64.28%           7%               16%
19*       0.08           75%               2%               15%
20        0.14           0%                5%               11%
21        0.26           −85.71%           7%               10%
Ensemble  —              —                 7%               10%
Fig. 6. Classification surfaces for the initial network, the tier-1 minimum-MSE network, and the final ensemble. (a) Whole dataset. (b) Initial network (training error 4.00%, testing error 11.00%). (c) Network with the best MSE (training error 2.00%, testing error 15.00%). (d) Ensemble (training error 7.00%, testing error 10.00%).
layers, and the LM method is used as the local optimizer for training. Starting from the initial local optimal solution obtained by the local optimizer, we apply the TRUST-TECH tier-1 search algorithm to locate the tier-1 local optimal solutions. The search directions from the initial local optimal solution are 20 random but mutually orthogonal directions. Hence, the largest possible ensemble size is 21 (the base neural network plus 20 tier-1 neural networks). The number of directions is determined based on the results reported in [38], where both theoretical and experimental results showed that an ensemble size between 10 and 20 is sufficient to approach the asymptotic bagging error. In addition, the experimental results in [38] showed that the performance improvement flattens when the ensemble size exceeds 20. In ELITE, if multiple networks have the same set of inputs, only the best one is included in the ensemble. Furthermore, networks with accuracy lower than 50% are removed from the combination. As a result, the final ensemble size can be smaller than 21.

After the proposed stages of feature selection and TRUST-TECH-based optimal training have been performed, a family of optimal, yet diverse, neural networks is generated. The ensemble is constructed using a set of optimal linear combination weights computed by solving the quadratic program, where the IPM solver in the interior point optimizer (Ipopt) package [39] is used as the local optimizer.

B. Experiments on a Synthetic Dataset

Experiments are first carried out on a 2-D synthetic dataset. The main purpose is to visually demonstrate the diversity
between different local optimal neural networks computed in ELITE. This dataset is composed of 200 points in the xy-plane drawn uniformly at random within the area {(x, y) | −1 < x < 1, −1 < y < 1}. Class labels are assigned to the samples according to

    c = −1 if x ≥ −y,  +1 if x < −y.    (22)

Then, zero-mean Gaussian noise with a standard deviation of 0.2 is added to corrupt the data. One-half of each class is used for training, and the remaining half is used for testing. The distribution of the synthetic dataset is shown in Fig. 6(a). The neural network has four nodes in the hidden layer. Since there are only two features in the dataset, feature selection is not conducted.

Table I summarizes the performance of the local optimal neural networks and their ensemble. We have the following observations.
1) Local optimality and diversity: The initial local optimal network has a training error of 4% and a testing error of 11%. Its classification surface, shown in Fig. 6(b), indicates that it is slightly overtrained. The performance of the different tier-1 local optimal neural networks varies significantly: the training error ranges from 2% to 41% and the testing error from 9% to 32%.
2) Minimum-MSE neural network: In terms of the objective function value of the optimization problem (2), the best MSE, 0.08, is obtained by the 19th network. However, although it achieves the best performance on the training data with the classification error being 2%,
TABLE II
DATASETS USED IN THE EXPERIMENTS

Dataset   Name              Samples   Features   Classes
1         Breast cancer       683        9         2
2         Clean               467      168         2
3         Diabetes            768        8         2
4         Glass               214        9         6
5         Ionosphere          351       33         2
6         Iris                150        4         3
7         MAGIC             19020       11         2
8         Segmentation        210       19         7
9         Sonar               208       60         2
10        SPECTF               80       44         2
11        Statlog (Heart)     270       13         2
12        Wine                178       13         3

TABLE III
PERFORMANCE OF THE TIER-1 MIN-MSE NEURAL NETWORKS

Dataset   Hidden nodes   Training µTr   Training σTr   Testing µTs   Testing σTs
1              6           0.08%          0.08%          4.96%         0.12%
2             10           0.13%          0.02%          9.67%         0.73%
3              6          13.13%          0.66%         27.04%         0.27%
4              6          18.17%          2.73%         44.14%         1.29%
5             10           0.0%           0.0%          10.13%         0.96%
6              4           0.20%          0.45%          5.39%         1.14%
7              9          12.21%          0.07%         13.23%         0.09%
8              8           1.47%          0.62%         14.57%         0.29%
9             10           0.0%           0.0%          23.87%         0.29%
10             6           0.0%           0.0%          28.94%         1.30%
11             8           2.31%          1.24%         23.20%         0.63%
12             4           0.0%           0.0%           5.46%         0.77%

µ, σ: the mean and standard deviation, respectively.

TABLE IV
PERFORMANCE OF THE INPUT-PRUNED NEURAL NETWORKS

Dataset   Feature-set size   Reduction   Training µTr   Training σTr   Testing µTs   Testing σTs
1              5.67           37.00%       0.76%          0.11%          3.24%         0.09%
2            117.2            29.40%       0.10%          0.01%          9.43%         0.94%
3              5.87           26.62%      17.70%          0.57%         24.40%         0.23%
4              6.07           32.56%      30.28%          2.54%         42.85%         1.97%
5             19.07           42.21%       0.09%          0.05%          9.52%         0.68%
6              3.07           23.25%       0.45%          0.16%          2.78%         0.67%
7              6.13           44.27%      13.20%          0.07%         13.53%         0.08%
8              9.47           50.16%       2.18%          0.68%         11.11%         1.78%
9             36.87           38.55%       0.37%          0.11%         17.08%         0.74%
10            28.40           35.45%       0.0%           0.0%          19.99%         1.40%
11             8.33           35.92%       4.08%          1.15%         19.92%         0.59%
12             8.0            38.46%       0.0%           0.0%           4.84%         0.90%
this network is not the best in terms of testing performance, which is 15%. Its classification surface, presented in Fig. 6(c), shows that this network is severely overtrained.
3) Improvements via ensemble: The ensemble enhances the generalization performance of the learning model. It results in a good balance between the classification errors on the training and testing datasets, which are 7% and 10%, respectively.
Observation 3) can be further verified by examining the ensemble's classification surface, shown in Fig. 6(d). The classification surface is well structured and very close to the ground-truth classification surface y = −x of the dataset before contamination by the noise.

The capability of ELITE to generate a diverse population of neural networks can also be verified by examining the classification surfaces of all 20 tier-1 local optimal neural networks, shown in Fig. 7. This figure reveals excellent diversity among the local optimal neural networks found by ELITE, even without feature selection. This experiment supports ELITE's strategy of using different local optimal neural networks as ensemble members.

C. Experiments on the UCI Benchmark Datasets

To evaluate the performance on real data and to facilitate comparison with other existing methods, ELITE has also been
TABLE V
PERFORMANCE OF THE ENSEMBLE

Dataset   Size µSz   Size σSz   Training µTr   Training σTr   Testing µTs   Testing σTs
1          19.12       0.47        1.60%          0.15%          1.44%         0.07%
2          20.02       0.11        0.53%          0.25%          7.44%         0.32%
3          15.66       0.38       17.28%          0.43%         19.64%         0.16%
4          12.91       0.70       23.50%          1.10%         35.27%         0.60%
5          20.26       0.34        0.39%          0.11%          2.46%         0.48%
6           6.47       0.22        0.92%          0.25%          1.41%         0.13%
7          11.97       0.45       12.73%          0.05%         13.03%         0.08%
8          18.06       0.51        0.50%          0.14%          6.55%         0.11%
9          20.81       0.10        0.92%          0.19%          6.63%         0.22%
10         19.85       0.47        1.02%          0.28%          4.87%         0.17%
11         19.87       1.01        7.10%          2.69%         12.13%         2.26%
12         20.43       0.06        0.0%           0.0%           0.19%         0.11%
tested on the UCI benchmark datasets [37]. Twelve datasets for pattern classification have been used in the experiments; they are summarized in Table II. To evaluate the generalization performance, threefold cross-validation is carried out on each dataset. The number of hidden nodes for each dataset is shown in Table III. This table also summarizes the performance of the tier-1 local optimal neural networks obtained by the tier-1 TRUST-TECH search.

Feature selection is carried out on each tier-1 neural network, with the threshold for determining redundant features set to p = 0.15 in (12). The performance of the neural networks without and with feature selection is summarized in Tables III and IV, respectively. It can be observed that the generalization performance is greatly improved through feature selection, owing to the more compact network structure. For example, on average, 37.00% of the features have been eliminated from the Breast cancer dataset; the classification error increases from 0.08% to 0.76% on the training set, while the testing error decreases from 4.96% to 3.24%. For the SPECTF dataset, feature selection has removed 35.45% of the features, and yet the training error is maintained at 0.0% while the testing error is reduced from 28.94% to 19.99%.

Table V summarizes the performance of the resultant ensemble. From the performance summarized in Tables IV and V, it
[Fig. 7: each panel shows one network's classification surface and reports its training and testing errors, matching the corresponding entries in Table I.]

Fig. 7. Classification surfaces for the 20 local optimal networks obtained by tier-1 TRUST-TECH search. Diversity of the classification surfaces is readily observed. (a) 2nd network. (b) 3rd network. (c) 4th network. (d) 5th network. (e) 6th network. (f) 7th network. (g) 8th network. (h) 9th network. (i) 10th network. (j) 11th network. (k) 12th network. (l) 13th network. (m) 14th network. (n) 15th network. (o) 16th network. (p) 17th network. (q) 18th network. (r) 19th network. (s) 20th network. (t) 21st network.
Fig. 8. Performance in stages of ELITE. The mean, standard deviation, and range of the error are presented. The labels on the x-axis stand for the four stages: S1 for the base local optimal network obtained in Stage I; S2 for the tier-1 minimum-MSE neural networks in Stage II; S3 for the optimal input-pruned neural networks in Stage III; and S4 for the final ensemble in Stage IV. Consistent improvement of the performance from stage to stage can be readily observed. (a) Cancer. (b) Clean. (c) Diabetes. (d) Glass. (e) Ionosphere. (f) Iris. (g) MAGIC. (h) Segmentation. (i) Sonar. (j) SPECTF. (k) Statlog (heart). (l) Wine.
Fig. 9. Comparison of the testing error of the three ensemble schemes (ELITE, uniform linear combination, and majority voting) on the 12 datasets. The voting performance is used as the baseline for comparison.
follows that the improvement in performance achieved by Stage IV of ELITE is quite significant. For example, for the Breast cancer dataset, the classification error on the testing set is reduced from 3.24% to 1.44%. For the Wine dataset, the improvement is more significant, with the testing error decreasing from 4.84% to 0.19%.

Generally speaking, Stage IV of ELITE constructs a high-performance ensemble by optimally combining the family of diverse, yet locally optimal, neural networks built in Stages I through III. For instance, compared with the single neural network obtained in Stage I, the ensemble created using ELITE also reduces the standard deviation σ of the classification error. For example, for the Iris dataset, σ of the testing error is 0.59% before using TRUST-TECH, which increases to 1.14% for the tier-1 best neural network and 0.67% after applying the proposed feature selection method; ELITE then achieves a significant decrease in σ, to 0.13%. The direct benefit of the reduced deviation is that the sensitivity of the classifier's performance to initialization is effectively suppressed. Hence, the improvement in both performance and robustness achieved by ELITE is well demonstrated.

Among all learning iterations in the numerical experiments on the UCI datasets, it is found that the condition in Fig. 5 (i.e., min{λ_1, ..., λ_k} < σ) is satisfied in 57.3% of the cases. These cases benefit from TRUST-TECH in acquiring the optimal combination weights. Finally, statistics regarding the ensemble size, i.e., the number of neural networks involved in the ensemble for each dataset, are also presented in Table V (columns 2 and 3). For convenience of comparison, Fig. 8 summarizes the performance at the different stages on each dataset. A consistent improvement in performance at each stage of ELITE can be readily observed.

D. Comparison with Other Ensemble Schemes

We next show the effectiveness of Stage IV of ELITE. To this end, the performance of different ensemble schemes is compared. In the past, ensembles have also been constructed using two other widely adopted schemes: uniform linear combination and majority voting. In contrast, the optimal linear combination scheme is employed in ELITE. A comparison of the performance of these three ensemble schemes is presented in Table VI. Using the performance of
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
TABLE VI
COMPARISON BETWEEN DIFFERENT ENSEMBLE SCHEMES

            ELITE             Uniform weight      Voting
Dataset   ETr      ETs      ETr      ETs        ETr      ETs
1         1.60%    1.44%    1.77%    1.99%      1.70%    1.84%
2         0.53%    7.44%    2.66%    11.19%     2.43%    10.82%
3         17.28%   19.64%   17.69%   21.28%     17.18%   21.46%
4         23.50%   35.27%   21.66%   35.94%     28.53%   37.59%
5         0.39%    2.46%    0.25%    5.09%      0.25%    5.10%
6         0.92%    1.41%    0.93%    2.11%      1.13%    1.63%
7         12.73%   13.03%   13.17%   13.52%     13.57%   13.77%
8         0.50%    6.55%    0.04%    7.45%      1.10%    7.41%
9         0.92%    6.63%    0.14%    10.84%     0.13%    10.84%
10        1.02%    4.87%    0.0%     11.00%     0.0%     10.31%
11        7.10%    12.13%   6.41%    15.01%     6.23%    14.79%
12        0.0%     0.19%    0.0%     0.89%      0.0%     0.59%
voting as the baseline, a comparison of the testing errors under the three schemes is better visualized in Fig. 9. We have the following observations based on the comparison. First, ELITE outperforms the other two schemes by using the TRUST-TECH-based optimal linear combination scheme. On all the datasets, this scheme achieves better testing performance than both uniform linear combination and majority voting. The benefit of being computationally efficient in finding optimal combination weights to construct high-quality ensembles makes the proposed ELITE method a very favorable choice in practical applications. Second, the uniform linear combination scheme is comparable with the majority voting scheme. In fact, these two ensemble schemes exhibit almost indistinguishable performance when they are used to combine the locally optimal neural networks learned in Stages I through III of ELITE.

E. Diversity and Accuracy

The relationship between the diversity and accuracy of the member networks involved in the ensemble is examined. In particular, the relationship exhibited in ELITE is numerically studied. The diversity of a family of neural networks is evaluated as the averaged double-fault measure [40]

div = 1 − \frac{2}{L(L − 1)} \sum_{i=1}^{L−1} \sum_{j=i+1}^{L} prob(n_i fails, n_j fails)    (23)
where n_i and n_j are the ith and jth member networks, respectively, and L is the total number of member networks. A numerical test is carried out on the Sonar dataset. The following six families of neural networks are produced:
1) fully connected networks trained by TRUST-TECH;
2) fully connected networks with random initialization, trained by a local optimizer without using TRUST-TECH;
3) input-pruned and retrained networks generated from 1);
4) input-pruned and retrained networks generated from 2);
5) input-pruned networks (no retraining) generated from 1);
6) input-pruned networks (no retraining) generated from 2).
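As a concrete illustration, eq. (23) can be computed from a matrix of per-sample failures. The sketch below is ours (function name, toy error matrix, and NumPy implementation are illustrative, not from the paper):

```python
import numpy as np

def double_fault_diversity(errors):
    """Averaged double-fault diversity, eq. (23):
    div = 1 - 2/(L(L-1)) * sum_{i<j} prob(n_i fails, n_j fails).

    errors: (L, N) boolean array; errors[i, k] is True when member
    network i misclassifies sample k.
    """
    L, N = errors.shape
    total = 0.0
    for i in range(L - 1):
        for j in range(i + 1, L):
            # prob(both fail) estimated as the fraction of samples
            # misclassified by both members i and j
            total += np.mean(errors[i] & errors[j])
    return 1.0 - 2.0 / (L * (L - 1)) * total

# Toy example: three members with partly overlapping mistakes
errs = np.array([[1, 0, 0, 1, 0],
                 [0, 1, 0, 1, 0],
                 [0, 0, 1, 0, 0]], dtype=bool)
print(round(double_fault_diversity(errs), 3))  # → 0.933
```

Higher values indicate that members rarely fail on the same samples, matching the interpretation of Table VII.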
TABLE VII
DIVERSITY AND ACCURACY

          Diversity          Training error     Testing error
Scheme    µdiv     σdiv     µTr      σTr       µTs       σTs
a         0.902    0.007    0.48%    0.18%     10.26%    0.74%
b         0.886    0.008    0.13%    0.09%     12.60%    0.56%
c         0.921    0.007    0.92%    0.19%     6.63%     0.22%
d         0.884    0.004    0.93%    0.21%     10.95%    0.90%
e         0.869    0.009    9.85%    0.39%     13.32%    0.68%
f         0.867    0.003    9.04%    0.87%     14.35%    0.58%
Fig. 10. Training error and testing error with different numbers of nodes in the hidden layer. Both the training error and the testing error reach their local optima when the number of hidden nodes is 10.
Ensembles are constructed using the proposed optimal combination method, and their accuracy is evaluated. The diversity and accuracy of these ensembles are shown in Table VII. We have the following observations based on our numerical results.
1) The diversity measure is highly correlated with the ensemble accuracy. In other words, higher diversity is associated with better ensemble performance.
2) Input-pruned neural networks without retraining lead to the worst-performing ensembles, both with and without using TRUST-TECH.
3) Retraining the input-pruned networks improves the ensemble performance, which also outperforms the ensemble using networks without input pruning.
4) Using TRUST-TECH in schemes 1), 3), and 5) results in better performance than schemes 2), 4), and 6), respectively.
5) The proposed ELITE method, corresponding to scheme 3), achieves the best ensemble performance among the six situations. This validates the effectiveness of the four-stage TRUST-TECH-based procedure in ELITE.

F. Performance with Hidden Layer Size

In this experiment, the influence of the hidden layer size of the member networks on the ensemble performance is studied. The experiment is carried out on the Sonar data, and the hidden layer size is varied from 2 to 11. Since the hidden layer size is fixed, only Stages II through IV of the proposed method are involved in constructing the ensembles.
WANG AND CHIANG: ENSEMBLE OF OPTIMAL INPUT-PRUNED NEURAL NETWORKS USING TRUST-TECH
TABLE VIII
COMPARISON BETWEEN ELITE AND EXISTING METHODS

Dataset   ELITE     SVM       In [13]   CNNE      In [43]   NLBP      COOP
1         1.44%     3.89%     2.63%     1.10%     1.23%     3.11%     28.11%
2         7.44%     43.07%    ×         ×         ×         ×         ×
3         19.64%    24.72%    20.58%    17.80%    19.69%    23.50%    23.72%
4         35.27%    31.38%    ×         24.60%    22.89%    ×         31.50%
5         2.46%     10.79%    6.54%     ×         ×         6.40%     4.96%
6         1.41%     4.18%     2.67%     ×         ×         ×         4.40%
7         13.03%    14.67%    ×         ×         ×         ×         ×
8         6.55%     17.07%    7.40%     ×         ×         ×         3.04%
9         6.63%     40.38%    12.98%    14.36%    14.36%    14.70%    15.60%
10        4.87%     34.17%    ×         ×         ×         ×         ×
11        12.13%    29.95%    ×         11.20%    11.96%    16.40%    18.44%
12        0.19%     5.69%     4.48%     ×         ×         ×         ×

×: data not available
Both the training error and the testing error for different numbers of hidden nodes are shown in Fig. 10. It can be observed that, as the number of hidden nodes increases, the ensemble performance improves accordingly. In addition, once the hidden layer size is larger than 5, the improvement in ensemble performance saturates, with only slight further gains. Finally, it can be observed that the performance reaches an optimal condition when the number of hidden nodes is 10. It should be noted that this optimal condition comes at a cost, i.e., more computational effort in the tier-1 searches of Stages II and III.

G. Comparison with Existing Methods

Finally, the performance of ELITE is compared with six existing methods whose performance was also reported on (part of) the same datasets as those used in this paper.
1) A solo support vector machine (SVM) using a linear kernel, as implemented in the library for SVMs (LIBSVM) [41].
2) A solo neural network that is optimally trained using a multi-tier TRUST-TECH search, which is our previous work reported in [13].
3) The constructive algorithm for training cooperative neural network ensembles (CNNE) reported in [42].
4) The cooperative coevolutive approach (COOP) for designing neural network ensembles reported in [30].
5) The infinite ensemble learning method using SVMs reported in [43].
6) The nonlinear boosting projection (NLBP) method for ensemble construction reported in [44]. The best-performing reported ensemble of a neural network, the C4.5 tree, and an SVM is considered for comparison.
The comparison results are summarized in Table VIII. For the convenience of comparison, the best performance on each dataset is highlighted. It can be seen that ELITE achieves the best performance on seven datasets. On the other five datasets, the performance of the ensemble constructed by ELITE is also comparable with the best performance.
Numerical results in this section have shown the effectiveness of ELITE in constructing high-quality neural network
ensembles. Specifically, diversity of the member networks and accuracy of the constructed ensemble have been achieved with the proposed procedures. Hence, ELITE is promising and can be a competitive choice in practical applications.

VI. CONCLUSION

In this paper, we have developed a methodology, termed ELITE, for constructing high-quality neural network ensembles. ELITE is designed to address two challenging issues in machine learning with ANNs: network architecture selection and optimal weight training. There are four stages in ELITE, in which a seed neural network is first constructed (Stage I), followed by a family of member neural networks (Stage II), each of which is optimally pruned (Stage III), and the optimal ensemble is obtained in Stage IV. Several design tasks in ELITE are formulated as optimization problems and solved by the TRUST-TECH method, which provides a systematic and deterministic way to escape from a local optimal solution and to approach multiple local optimal solutions. Distinguishing features of ELITE include the following.
1) Diversity: Ensemble members of ELITE are distinct optimal neural networks with different optimally pruned inputs. These members are generated using the proposed saliency-based feature selection method. In this manner, diversity of the ensemble members is achieved.
2) Accuracy: In ELITE, accuracy of the ensemble members is achieved by selecting high-quality neural networks from the multiple local optimal solutions obtained by the TRUST-TECH-based training method.
3) Optimality: Optimality of the ensemble in ELITE is achieved by optimally combining a set of optimal input-pruned member networks. Specifically, optimality of the member neural networks is attained by using a TRUST-TECH-based training method, and optimality of the combination weights is achieved by solving the associated quadratic optimization problem using TRUST-TECH and a local optimizer (an interior point method, IPM).
The performance of ELITE was compared with six other methods whose performance has been reported in the literature on the same datasets. Of a total of 12 datasets, ELITE achieves the best performance on 7 datasets, while on the other 5 datasets its performance is also comparable with the best one. The extensive numerical results conducted so far have shown the effectiveness of the ELITE method in constructing high-quality neural network ensembles. Hence, ELITE is promising and can be a favorable choice in practical applications. Further work on improving the generalization performance of the ensemble constructed by ELITE is desirable. Efforts on the following two aspects may prove fruitful.
1) Network Weight Pruning: In this paper, feature selection has been implemented as input node pruning. However, the obtained input-pruned neural network is still fully connected between the layers. Hence, there might still be considerable redundancy in the network connections. Additional weight pruning to further reduce the network complexity, and thereby improve the diversity of the member networks, is desirable.
2) Multiobjective Neural Network Design: The tasks of network weight training and feature selection have been solved separately. In fact, the objectives of these two tasks generally compete, and it would be better to solve them simultaneously. To this end, the neural network design method can be formulated as a multiobjective optimization problem and solved by an extended TRUST-TECH method. The development of a TRUST-TECH-based multiobjective optimization methodology is part of our future work.
ACKNOWLEDGMENT The authors are grateful to the anonymous reviewers for their constructive comments, which helped improve this paper considerably.
REFERENCES
[1] S. Haykin, Neural Networks and Learning Machines, 3rd ed. Englewood Cliffs, NJ: Prentice Hall, 2009. [2] A. K. Jain, R. P. W. Duin, and J. Mao, “Statistical pattern recognition: A review,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 1, pp. 4–37, Jan. 2000. [3] G. B. Huang, “Learning capability and storage capacity of two-hidden-layer feedforward networks,” IEEE Trans. Neural Netw., vol. 14, no. 2, pp. 274–281, Mar. 2003. [4] N. S. V. Rao, “On fusers that perform better than best sensor,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 8, pp. 904–909, Aug. 2001. [5] Z. H. Zhou, J. X. Wu, and W. Tang, “Ensembling neural networks: Many could be better than all,” Artif. Intell., vol. 137, nos. 1–2, pp. 239–263, May 2002. [6] D. W. Opitz and J. W. Shavlik, “Generating accurate and diverse members of a neural-network ensemble,” in Advances in Neural Information Processing Systems, vol. 8. Denver, CO: MIT Press, 1996, pp. 535–541. [7] G. Brown, “Diversity in neural network ensembles,” Ph.D. dissertation, School Comput. Sci., Univ. Birmingham, Birmingham, U.K., 2003. [8] T. Windeatt, “Accuracy/diversity and ensemble MLP classifier design,” IEEE Trans. Neural Netw., vol. 17, no. 5, pp. 1194–1211, Sep. 2006. [9] S. Hashem, “Optimal linear combination of neural networks,” Neural Netw., vol. 10, no. 4, pp. 599–614, 1997. [10] N. Ueda, “Optimal linear combination of neural networks for improving classification performance,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 2, pp. 207–215, Feb. 2000. [11] H. D. Chiang and C. C. Chu, “A systematic search method for obtaining multiple local optimal solutions of nonlinear programming problems,” IEEE Trans. Circuits Syst., vol. 43, no. 2, pp. 99–109, Feb. 1996. [12] J. Lee and H. D. Chiang, “A dynamical trajectory-based methodology for systematically computing multiple optimal solutions of general nonlinear programming problems,” IEEE Trans. Autom. Control, vol. 49, no. 6, pp. 888–899, Jun. 2004.
[13] H. D. Chiang and C. K. Reddy, “TRUST-TECH based neural network training,” in Proc. Int. Joint Conf. Neural Netw., Orlando, FL, Aug. 2007, pp. 90–95. [14] C. K. Reddy, H. D. Chiang, and B. Rajaratnam, “TRUST-TECH based expectation maximization for learning finite mixture models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 7, pp. 1146–1157, Jul. 2008. [15] H.-D. Chiang, B. Wang, and Q.-Y. Jiang, “Applications of TRUST-TECH methodology in optimal power flow of power systems,” in Optimization in the Energy Industry. New York: Springer-Verlag, 2009, pp. 297–318. [16] S. Sohn and C. H. Dagli, “Ensemble of evolving neural networks in classification,” Neural Process. Lett., vol. 19, no. 3, pp. 191–203, Jun. 2004. [17] Y. Shang and B. W. Wah, “Global optimization for neural network training,” IEEE Comput., vol. 29, no. 3, pp. 45–54, Mar. 1996. [18] C. M. Bishop, Neural Networks for Pattern Recognition. London, U.K.: Oxford Univ. Press, 1995.
[19] B. D. Ripley and N. L. Hjort, Pattern Recognition and Neural Networks. Cambridge, U.K.: Cambridge Univ. Press, 1995. [20] M. Gori and A. Tesi, “On the problem of local minima in backpropagation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 1, pp. 76–86, Jan. 1992. [21] S. Amato, B. Apolloni, G. Caporali, U. Madesani, and A. Zanaboni, “Simulated annealing approach in backpropagation,” Neurocomputing, vol. 3, nos. 5–6, pp. 207–220, Dec. 1991. [22] X. Yao and Y. Liu, “A new evolutionary system for evolving artificial neural networks,” IEEE Trans. Neural Netw., vol. 8, no. 3, pp. 694–713, May 1997. [23] F. H. F. Leung, H. K. Lam, S. H. Ling, and P. K. S. Tam, “Tuning of the structure and parameters of a neural network using an improved genetic algorithm,” IEEE Trans. Neural Netw., vol. 14, no. 1, pp. 79–88, Jan. 2003. [24] C. F. Juang, “A hybrid of genetic algorithm and particle swarm optimization for recurrent network design,” IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 34, no. 2, pp. 997–1006, Apr. 2004. [25] M. Milano, P. Koumoutsakos, and J. Schmidhuber, “Self-organizing nets for optimization,” IEEE Trans. Neural Netw., vol. 15, no. 3, pp. 758–765, May 2004. [26] A. L. Rendell and R. Sheshu, “Learning hard concepts through constructive induction: Framework and rationale,” Comput. Intell., vol. 6, no. 4, pp. 247–270, Nov. 1990. [27] Z. L. Xu, I. King, M. R. T. Lyu, and R. Jin, “Discriminative semisupervised feature selection via manifold regularization,” IEEE Trans. Neural Netw., vol. 21, no. 7, pp. 1033–1047, Jul. 2010. [28] P. Leray and P. Gallinari, “Feature selection with neural networks,” Behaviormetrika, vol. 26, no. 1, pp. 145–166, 1999. [29] X. Yao and Y. Liu, “Making use of population information in evolutionary artificial neural networks,” IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 28, no. 3, pp. 417–425, Jun. 1998. [30] N. Garcia-Pedrajas, C. Hervas-Martinez, and D. 
Ortiz-Boyer, “Cooperative coevolution of artificial neural network ensembles for pattern classification,” IEEE Trans. Evol. Comput., vol. 9, no. 3, pp. 271–302, Jun. 2005. [31] D. Opitz and R. Maclin, “Popular ensemble methods: An empirical study,” J. Artif. Intell. Res., vol. 11, pp. 169–198, 1999. [32] A. Sharkey, “On combining artificial neural nets,” Connect. Sci., vol. 8, nos. 3–4, pp. 299–314, Dec. 1996. [33] U. Naftaly, U. Intrator, and D. Horn, “Optimal ensemble averaging of neural networks,” Network, vol. 8, no. 3, pp. 283–296, 1997. [34] J. Guckenheimer and P. Holmes, Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. New York: Springer-Verlag, 1983. [35] D. R. Liu, T. S. Chang, and Y. Zhang, “A constructive algorithm for feedforward neural networks with incremental training,” IEEE Trans. Circuits Syst. I: Fundam. Theory Appl., vol. 49, no. 12, pp. 1876–1879, Dec. 2002. [36] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed. New York: Springer-Verlag, 2006. [37] A. Asuncion and D. J. Newman. (2007). UCI Machine Learning Repository [Online]. Available: http://archive.ics.uci.edu/ml/ [38] G. Fumera, F. Roli, and A. Serrau, “A theoretical analysis of bagging as a linear combination of classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 7, pp. 1293–1299, Jul. 2008. [39] A. Wächter and L. T. Biegler, “On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming,” in Mathematical Programming: Series A & B, vol. 106. New York: Springer-Verlag, 2006, pp. 25–57. [40] L. I. Kuncheva and C. J. Whitaker, “Measures of diversity in classifier ensembles and their relationship with ensemble accuracy,” Mach. Learn., vol. 51, no. 2, pp. 181–207, May 2003. [41] R. E. Fan, P. H. Chen, and C. J. Lin, “Working set selection using second order information for training SVM,” J. Mach. Learn. Res., vol. 6, pp. 1889–1918, Dec. 2005. [42] M. M. Islam, X. Yao, and K.
Murase, “A constructive algorithm for training cooperative neural network ensembles,” IEEE Trans. Neural Netw., vol. 14, no. 4, pp. 820–834, Jul. 2003. [43] H. T. Lin and L. Li, “Support vector machinery for infinite ensemble learning,” J. Mach. Learn. Res., vol. 9, pp. 285–312, Jun. 2008. [44] N. Garcia-Pedrajas, C. Garcia-Osorio, and C. Fyfe, “Nonlinear boosting projections for ensemble construction,” J. Mach. Learn. Res., vol. 8, pp. 1–33, May 2007.
Bin Wang received the B.E. degree from Tongji University, Shanghai, China, in 2000, and the M.S. degree from Shanghai Jiao Tong University, Shanghai, in 2003, both in electrical engineering. From 2006 to 2007, he was affiliated with the Computational Bioinformatics and Bio-Imaging Laboratory, Virginia Tech, Blacksburg, working on medical image analysis and machine learning methods for bioinformatics. He is currently pursuing the Ph.D. degree in the School of Electrical and Computer Engineering, Cornell University, Ithaca, NY. His current research interests include nonlinear systems theory, global optimization methods, and their applications to machine learning, power systems, and computer vision.
Hsiao-Dong Chiang (M’87–SM’91–F’97) received the Ph.D. degree in electrical engineering and computer science from the University of California, Berkeley, in 1986. He has been a Professor of electrical and computer engineering at Cornell University, Ithaca, NY, since 1998. He holds 10 U.S. patents and several consultant positions. He and his research team have published more than 300 papers in refereed journals and conference proceedings. His current research interests include theoretical developments and practical applications of nonlinear system theory, stability regions of general nonlinear dynamical systems, the boundary of stability region based controlling unstable equilibrium point (BCU) method and its applications, global and robust optimization techniques and their applications, and nonlinear computations and their practical applications to electric circuits, systems, signals, and images.
Approximate Confidence and Prediction Intervals for Least Squares Support Vector Regression Kris De Brabanter, Jos De Brabanter, Johan A. K. Suykens, Senior Member, IEEE, and Bart De Moor, Fellow, IEEE
Abstract— Bias-corrected approximate 100(1 − α)% pointwise and simultaneous confidence and prediction intervals for least squares support vector machines are proposed. A simple way of determining the bias without estimating higher order derivatives is formulated. A variance estimator is developed that works well in the homoscedastic and heteroscedastic case. In order to produce simultaneous confidence intervals, a simple Šidák correction and a more involved correction (based on upcrossing theory) are used. The obtained confidence intervals are compared to a state-of-the-art bootstrap-based method. Simulations show that the proposed method obtains similar intervals compared to the bootstrap at a lower computational cost. Index Terms— Bias, confidence interval, heteroscedasticity, homoscedasticity, kernel-based regression, variance.
I. INTRODUCTION

Nonparametric function estimators are very popular data analytic tools [1]–[3]. Many of their properties have been rigorously investigated and are well understood. An important activity immediately accompanying these estimators is the construction of interval estimates, e.g., confidence intervals. In the area of kernel-based regression, a popular tool to construct interval estimates is the bootstrap (see [4] and the references therein). This technique produces very accurate intervals at the cost of a heavy computational burden. In the field of neural networks, confidence and prediction intervals have been proposed in [5] and [6]. In the case of nonlinear
Manuscript received July 6, 2010; revised August 30, 2010, October 6, 2010, and October 10, 2010; accepted October 10, 2010. Date of publication November 1, 2010; date of current version January 4, 2011. This work was supported in part by the following agencies: Research Council KUL: GOA AMBioRICS, GOA MaNet, CoE EF/05/006 Optimization in Engineering (OPTEC), IOF-SCORES4CHEM, several PhD/post-doc & fellow Grants; Flemish Government: FWO: PhD/postdoc grants, projects G.0452.04 (new quantum algorithms), G.0499.04 (Statistics), G.0211.05 (Nonlinear), G.0226.06 (cooperative systems and optimization), G.0321.06 (Tensors), G.0302.07 (SVM/Kernel), G.0320.08 (convex MPC), G.0558.08 (Robust MHE), G.0557.08 (Glycemia2), G.0588.09 (Brain-machine) research communities (ICCoS, ANMMM, MLDM), G.0377.09 (Mechatronics MPC); IWT: PhD Grants, McKnow-E, Eureka-Flite+, SBO LeCoPro, SBO Climaqs, POM; Belgian Federal Science Policy Office: IUAP P6/04 (DYSCO, Dynamical systems, control and optimization, 2007-2011); EU: ERNSI; FP7-HD-MPC (INFSO-ICT-223854), COST intelliCIS, EMBOCOM; Contract Research: AMINAL; Other: Helmholtz, viCERP, ACCM, Bauknecht, Hoerbiger. K. De Brabanter, J. A. K. Suykens, and B. De Moor are with the Department of Electrical Engineering, Research Division SCD, Katholieke Universiteit Leuven, Leuven 3001, Belgium (e-mail:
[email protected]; [email protected]; [email protected]). J. De Brabanter is with the Department Industrieel Ingenieur, KaHo Sint Lieven (Associatie K.U. Leuven), Gent 9000, Belgium (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2010.2087769
regression, [7] proposed confidence intervals based on least squares estimation and the linear Taylor expansion of the nonlinear model. For Gaussian processes (GPs), several types of methods have been developed to construct interval estimates. Early works [8], [9] address the construction of interval estimates via a Bayesian approach and a Markov chain Monte Carlo method to approximate the posterior noise variance. An extension of the latter was proposed in [10]. In general, Bayesian intervals (often called Bayesian credible intervals) do not exactly coincide with frequentist confidence intervals (as proposed in this paper) for two reasons. First, credible intervals incorporate problem-specific contextual information from the prior distribution, whereas confidence intervals are based only on the data. Second, credible intervals and confidence intervals treat nuisance parameters in radically different ways (see [11] and the references therein). Recently, [12] and [13] proposed using the leave-one-out cross-validation estimate of the conditional mean when fitting the model of the conditional variance, in order to overcome the inherent bias in maximum likelihood estimates of the variance. In this paper, we address some of the difficulties in constructing these interval estimates and develop a methodology for interval estimation for least squares support vector machine (LS-SVM) regression that is not based on the bootstrap. Consider the bivariate data (X_1, Y_1), . . . , (X_n, Y_n), which form an independent and identically distributed (i.i.d.) sample from a population (X, Y). Our interest is to estimate the regression function m(X) = E[Y|X], with E[Y|X = x] = \int_{R} y f_{Y|X}(y|x) dy, where f_{Y|X} is the conditional distribution of Y given X. We regard the data as being generated from the model

Y = m(X) + σ(X)ε    (1)
where E[ε|X] = 0, Var[ε|X] = E[ε²|X] − E²[ε|X] = 1, and X and ε are independent. In setting (1), it is immediately clear that E[Y|X] = m(X) and Var[Y|X] = σ²(X) > 0. Two possible situations can occur: 1) σ²(X) = σ² is constant, and 2) the variance is a function of the random variable X. The first is called homoscedasticity and the latter heteroscedasticity. The problem that we address is the construction of confidence intervals for m. Specifically, given α ∈ (0, 1) and an estimator m̂ of m, we want to find a bound g_α such that

P( \sup_x |m̂(x) − m(x)| ≤ g_α ) ≥ 1 − α    (2)
at least for large sample sizes. Also, bear in mind that m̂ depends on n. A major difficulty in finding a solution to (2) is the fact that nonparametric estimators of m are biased (kernel estimators in particular). As a consequence, confidence interval procedures must deal with the estimator bias to ensure that the interval is correctly centered and proper coverage is attained [14]. In order to avoid the bias estimation problem, several authors have studied the limiting distribution of \sup_x |m̂(x) − m(x)| for various estimators m̂ of m. A pioneering article in this field is due to [15] for kernel density estimation. Extensions of [15] to kernel regression are given in [16]–[18]. A second way to avoid calculating the bias explicitly is to undersmooth. If we smooth less than the optimal amount, then the bias will decrease asymptotically relative to the variance. It was theoretically shown in [4] that undersmoothing in combination with a pivotal statistic based on the bootstrap results in the lowest reduction in coverage error of confidence intervals. Unfortunately, there does not seem to be a simple and practical rule for choosing just the right amount of undersmoothing. A third and more practical way is to be satisfied with indicating the level of variability involved in a nonparametric regression estimator, without attempting to adjust for the inevitable presence of bias. Bands of this type are easier to construct but require careful interpretation. Formally, the bands indicate pointwise variability intervals for E[m̂(X)|X]. Because of this, misconceptions often arise between confidence intervals and error bars. In this paper, we propose the construction of confidence intervals based on the central limit theorem for linear smoothers, combined with bias correction and variance estimation. Finally, if it is possible to obtain a reasonable bias estimate, we can use it to construct confidence intervals for m.
The application of this approach can be found in local polynomial regression [19], [20], where a bias estimate can be easily calculated. In this paper, we develop a similar approach to estimate the bias. Confidence intervals are widely used, and applications can be found, e.g., in the chemical industry, fault detection/diagnosis, and system identification/control. These intervals give the user the ability to see how well a certain model explains the true underlying process while taking the statistical properties of the estimator into account. In control applications, these intervals are used for robust design, while their applicability in fault detection is based on reducing the number of false alarms. For further reading on this topic, we refer the reader to the work of [21] and [22]. The rest of this paper is organized as follows. Bias and variance estimation for the LS-SVM are discussed in Section II. The construction of approximate 100(1 − α)% pointwise and simultaneous confidence (and prediction) intervals is discussed in Section III. Section IV briefly summarizes the ideas behind bootstrap-based methods. Simulations and a comparison with the current state-of-the-art bootstrap-based method [4] are given in Section V. Some closing comments are made in Section VI.
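As a small illustration of setting (1), heteroscedastic data can be generated as follows; the particular m(x) and σ(x) below are our own illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 1, n)

def m(x):
    # illustrative true regression function (an assumption, not from the paper)
    return np.sin(2 * np.pi * x)

def sigma(x):
    # heteroscedastic standard deviation: Var[Y|X = x] = sigma(x)**2
    return 0.1 + 0.3 * x

eps = rng.standard_normal(n)   # E[eps|X] = 0, Var[eps|X] = 1, independent of X
Y = m(X) + sigma(X) * eps      # model (1)
```

Setting `sigma` to a constant recovers the homoscedastic case 1).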
II. ESTIMATION OF BIAS AND VARIANCE

A. LS-SVM Regression and Smoother Matrix

In the primal weight space, LS-SVM is formulated as follows [23]:

\min_{w,b,e} J(w, e) = \frac{1}{2} w^T w + \frac{γ}{2} \sum_{i=1}^{n} e_i^2    (3)

s.t. Y_i = w^T ϕ(X_i) + b + e_i,    i = 1, . . . , n
where the e_i ∈ R are assumed to be i.i.d. random errors with E[e|X] = 0 and Var[e|X] < ∞, m ∈ C^z(R)¹ with z ≥ 2 is an unknown real-valued smooth function with E[Y|X] = m(X), ϕ : R^d → R^{n_h} is the feature map to the high-dimensional feature space (which can be infinite dimensional), as in the standard SVM case [24], w ∈ R^{n_h}, b ∈ R, and γ ∈ R⁺₀ is the regularization parameter. By using Lagrange multipliers, the solution of (3) can be obtained by taking the Karush-Kuhn-Tucker conditions for optimality. The result is given by the following linear system in the dual variables α:

[ 0     1_n^T            ] [ b ]   [ 0 ]
[ 1_n   Ω + (1/γ) I_n    ] [ α ] = [ Y ]    (4)

with Y = (Y_1, . . . , Y_n)^T, 1_n = (1, . . . , 1)^T, α = (α_1, . . . , α_n)^T, and Ω_{il} = ϕ(X_i)^T ϕ(X_l) = K(X_i, X_l) for i, l = 1, . . . , n, with K(·, ·) a positive definite kernel. Based on Mercer's theorem, the resulting LS-SVM model for function estimation becomes

m̂(x) = \sum_{i=1}^{n} α̂_i K(x, X_i) + b̂    (5)
where K : R^d × R^d → R and d is the number of dimensions. For example, the radial basis function (RBF) kernel is K(X_i, X_j) = exp(−‖X_i − X_j‖₂² / h²) with bandwidth h > 0. Other possibilities are the polynomial kernel K(X_i, X_j) = (X_i^T X_j + τ)^p of degree p with τ ≥ 0, and the linear kernel K(X_i, X_j) = X_i^T X_j. By noticing that the LS-SVM is a linear smoother, suitable bias and variance formulations can be found.

Definition 1 (Linear Smoother): An estimator m̂ of m is a linear smoother if, for each x ∈ R^d, there exists a vector L(x) = (l_1(x), . . . , l_n(x))^T ∈ R^n such that

m̂(x) = \sum_{i=1}^{n} l_i(x) Y_i    (6)

where m̂(·) : R^d → R. On training data, (6) can be written in matrix form as m̂ = LY, where m̂ = (m̂(X_1), . . . , m̂(X_n))^T ∈ R^n and L ∈ R^{n×n} is a smoother matrix whose ith row is L(X_i)^T, thus L_{ij} = l_j(X_i). The entries of the ith row show the weights given to each Y_i in forming the estimate m̂(X_i). Thus we have

Theorem 1: The LS-SVM estimate (5) is

m̂(x) = \sum_{i=1}^{n} l_i(x) Y_i
¹ m ∈ C^z(R) means that m is z-times continuously differentiable on R.
and L(x) = (l1 (x), . . . , ln (x))T is the smoother vector T J1T −1 −1 −1 Jn −1 ⋆T Z Z Z −Z L(x) = x + c c
C. Variance Estimation (7)
with Ω_x^⋆ = (K(x, X_1), ..., K(x, X_n))^T the kernel vector evaluated at the point x, c = 1_n^T (Ω + I_n/γ)^{-1} 1_n, Z = Ω + I_n/γ, J_n a square matrix with all elements equal to 1, and J_1 = (1, ..., 1)^T. Then the estimator, under model (1), has conditional mean

    E[m̂(x)|X = x] = Σ_{i=1}^n l_i(x) m(x_i)

and conditional variance

    Var[m̂(x)|X = x] = Σ_{i=1}^n l_i(x)^2 σ^2(x_i).                        (8)
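To make the estimator concrete, here is a minimal NumPy sketch of solving the linear system (4) and evaluating the model (5); the function names, the RBF kernel choice, and the toy data in the usage note are ours, not the paper's:

```python
import numpy as np

def rbf_kernel(A, B, h):
    # K(x_i, x_j) = exp(-||x_i - x_j||_2^2 / h^2), the RBF kernel above
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / h ** 2)

def lssvm_fit(X, Y, gamma, h):
    # Solve the KKT system (4):
    #   [0    1_n^T            ] [b    ]   [0]
    #   [1_n  Omega + I_n/gamma] [alpha] = [Y]
    n = len(Y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, h) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], Y)))
    return sol[0], sol[1:]                     # b_hat, alpha_hat

def lssvm_predict(x, X, b, alpha, h):
    # Model (5): m_hat(x) = sum_i alpha_hat_i K(x, X_i) + b_hat
    return rbf_kernel(np.atleast_2d(x), X, h) @ alpha + b
```

For instance, fitting 20 noiseless samples of sin(2πx) with a large γ nearly interpolates the data, as expected from the role of γ as a regularization parameter.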
Proof: See Appendix I.
Note that one should not confuse linear smoothers, i.e., smoothers of the form (6), with linear regression, in which one assumes that the regression function m is linear.

B. Bias Estimation

Using Theorem 1, the conditional bias can be written as

    E[m̂(x)|X = x] − m(x) = Σ_{i=1}^n l_i(x) m(x_i) − m(x).

It can be shown, in the 1-D case, by using a Taylor series expansion around the fitting point x ∈ R, that for |x_i − x| ≤ h the conditional bias is equal to

    E[m̂(x)|X = x] − m(x) = m'(x) Σ_{i=1}^n (x_i − x) l_i(x) + (m''(x)/2) Σ_{i=1}^n (x_i − x)^2 l_i(x) + O(ρ(h, γ))
where ρ is an unknown function describing the relation between the two tuning parameters. Although the above expression gives insight into how the conditional bias behaves asymptotically, it is hard to use in practice, since it involves estimating the first- and second-order derivatives of the unknown m; this can be rather complicated, especially in the multivariate case. Therefore, we opt for a procedure that does not rely completely on the asymptotic expression but stays "closer" to the exact expression for the conditional bias. As a result, it carries more information about the finite-sample bias.
Theorem 2: Let L(x) be the smoother vector evaluated at a point x and denote m̂ = (m̂(X_1), ..., m̂(X_n))^T. Then the estimated conditional bias for the LS-SVM is given by

    biaŝ[m̂(x)|X = x] = L(x)^T m̂ − m̂(x).                                  (9)

Proof: See Appendix II.
The above rationale is also closely related to the technique of double smoothing (see [25]) and to iterative bias reduction.

C. Variance Estimation

Our goal is to derive a fully automated procedure to estimate the variance function σ^2(·). Given the simple decomposition σ^2(x) = Var[Y|X = x] = E[Y^2|X = x] − {E[Y|X = x]}^2, one is tempted to use the obvious direct estimator [26]

    σ̂_d^2(x) = Ê[Y^2|X = x] − {m̂(x)}^2.

However, such an estimator has drawbacks. For example, σ̂_d^2(x) is not always nonnegative, due to estimation error, especially if different smoothing parameters are used in estimating the regression function and E[Y^2|X]. Furthermore, such a method can result in a very large bias [27]. Before stating the estimator, we first need a condition on the weights of the LS-SVM smoother (Lemma 1). The resulting variance estimator is given in Theorem 3.
Lemma 1: The weights {l_i(x)} of the LS-SVM smoother are normalized

    Σ_{i=1}^n l_i(x) = 1.

Proof: See Appendix III.
Theorem 3 (Variance Estimator): Assume model (1), and let L denote the smoother matrix corresponding to the initial smooth. Let S ∈ R^{n×n} denote the smoother matrix corresponding to a natural means of estimating σ^2(·) based on smoothing the squared residuals, and denote by S(x) the smoother vector at an arbitrary point x, where S(·) : R^d → R^n. Assume that S preserves constant vectors, i.e., S 1_n = 1_n. Then an estimator for the variance function evaluated at a point x is given by

    σ̂^2(x) = S(x)^T diag(ε̂ ε̂^T) / ( 1 + S(x)^T diag(L L^T − L − L^T) )      (10)

where ε̂ denotes the residuals and diag(A) is the column vector containing the diagonal entries of the square matrix A.
Proof: See Appendix III.
The class of variance function estimators (10) can be viewed as a generalization of those commonly used in parametric modeling (see [28]). We next approximate the conditional variance of the LS-SVM in (8). Given the estimator of the error variance function (10), an estimate of the conditional variance of the LS-SVM with heteroscedastic errors (8) is given by

    Var̂[m̂(x)|X = x] = L(x)^T Σ̂^2 L(x)                                    (11)

with Σ̂^2 = diag(σ̂^2(X_1), ..., σ̂^2(X_n)).
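As a sketch of how (9)-(11) combine in practice, the following NumPy fragment builds the training-data smoother matrix L (whose rows are the vectors of (7)), the bias estimate (9), and the variance estimates (10)-(11). For simplicity it takes S = L, which is our own assumption (the theorem only requires S 1_n = 1_n), and all names are ours:

```python
import numpy as np

def smoother_matrix(X, gamma, h):
    # Rows of L are the training-point smoother vectors of Eq. (7),
    # so that m_hat = L Y on the training data.
    n = X.shape[0]
    Omega = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / h ** 2)
    Zinv = np.linalg.inv(Omega + np.eye(n) / gamma)
    ones = np.ones((n, 1))
    c = (ones.T @ Zinv @ ones).item()
    Jn = ones @ ones.T
    return Omega @ (Zinv - Zinv @ (Jn / c) @ Zinv) + (Jn / c) @ Zinv

def bias_and_variance(X, Y, gamma, h):
    # Eq. (9): bias_hat = L m_hat - m_hat; Eqs. (10)-(11) with S = L
    # (our simplifying assumption; the paper only needs S 1_n = 1_n).
    L = smoother_matrix(X, gamma, h)
    m_hat = L @ Y
    bias = L @ m_hat - m_hat
    resid = Y - m_hat
    denom = 1.0 + L @ np.diag(L @ L.T - L - L.T)   # Eq. (10) denominator
    sigma2 = (L @ resid ** 2) / denom              # Eq. (10)
    var_mhat = (L ** 2) @ sigma2                   # Eq. (11), pointwise
    return m_hat, bias, sigma2, var_mhat
```

The row sums of L equal 1, which is exactly the normalization stated in Lemma 1.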
III. CONFIDENCE AND PREDICTION INTERVALS

A. Pointwise Confidence Intervals

The estimated bias (9) and variance (11) can be used to construct pointwise confidence intervals. Under certain regularity conditions [29], the central limit theorem for linear smoothers is valid and one can show that, asymptotically,

    ( m̂(x) − E[m̂(x)|X = x] ) / sqrt( Var̂[m̂(x)|X = x] ) →_D N(0, 1)

where →_D denotes convergence in distribution. If the estimate is conditionally unbiased, i.e., E[m̂(x)|X = x] = m(x),
DE BRABANTER et al.: APPROXIMATE CONFIDENCE AND PREDICTION INTERVALS FOR LEAST SQUARES SUPPORT VECTOR REGRESSION
approximate 100(1 − α)% pointwise (at the point x) confidence intervals may take the form

    m̂(x) ± z_{1−α/2} sqrt( Var̂[m̂(x)|X = x] )                             (12)

where z_{1−α/2} denotes the (1 − α/2)th quantile of the standard Gaussian distribution. Formally, the interval (12) is a confidence interval for E[m̂(x)|X = x]; it is a confidence interval for m(x) only under the assumption E[m̂(x)|X = x] = m(x). However, since the LS-SVM is a biased smoother, the interval (12) has to be adjusted to allow for bias. The exact bias is given by bias[m̂(x)|X = x] = E[m̂(x)|X = x] − m(x). Since the exact bias is unknown, this quantity can be estimated by (9). Therefore, a bias-corrected approximate 100(1 − α)% pointwise confidence interval is given by

    m̂_c(x) ± z_{1−α/2} sqrt( Var̂[m̂(x)|X = x] )                           (13)

with m̂_c(x) = m̂(x) − biaŝ[m̂(x)|X = x]. The reader can verify that bias-corrected approximate 100(1 − α)% pointwise confidence intervals can also be obtained for the classical ridge regression case by replacing the LS-SVM smoother matrix with that of the ridge regression estimator [30]; suitable bias and variance estimates can then be computed in the same way as before.

Fig. 1. Fossil data with two pointwise 95% confidence intervals.

B. Simultaneous Confidence Intervals

The confidence intervals presented so far in this paper are pointwise. For example, from the two pointwise confidence intervals in Fig. 1 (Fossil dataset [31]), we can make the following two statements separately: 1) (0.70743, 0.70745) is an approximate 95% pointwise confidence interval for m(105); and 2) (0.70741, 0.70744) is an approximate 95% pointwise confidence interval for m(120). However, as is well known in multiple comparison theory, it is wrong to state that m(105) is contained in (0.70743, 0.70745) and simultaneously m(120) is contained in (0.70741, 0.70744) with 95% confidence. Therefore, it is not correct to connect the pointwise confidence intervals to produce a band around the estimated function. In order to make such statements, we have to widen the intervals to obtain simultaneous confidence intervals. Three major groups of modifications exist: Monte Carlo simulations; Bonferroni, Šidák, or similar corrections; and upcrossing theory [32], [33]. The latter is also well known for Gaussian processes (see [34]). Although Monte Carlo-based modifications are accurate (even when the number of data points n is relatively small), they are computationally expensive, so we do not discuss them in this paper; the interested reader can refer to [31] and the references therein. Šidák and Bonferroni corrections are among the most popular methods, since they are easy to calculate and produce quite acceptable results. In what follows, the rationale behind the Šidák correction (a generalization of Bonferroni) is elucidated. This correction is derived by assuming that the individual tests are independent. Let the significance threshold for each test be β (the significance level of the pointwise confidence interval); then the probability that at least one of the tests is significant under
this threshold is 1 minus the probability that none of them is significant. Since the tests are assumed independent, the probability that none of them is significant is the product of the individual probabilities, so the probability that at least one is significant is 1 − (1 − β)^n. We want this probability to equal α, the significance level for the entire series of tests (i.e., for the simultaneous confidence interval). Solving for β gives β = 1 − (1 − α)^{1/n}. To obtain an approximate 100(1 − α)% simultaneous confidence interval based on a Šidák correction, (13) has to be modified as

    m̂_c(x) ± z_{1−β/2} sqrt( Var̂[m̂(x)|X = x] )
with m̂_c(x) = m̂(x) − biaŝ[m̂(x)|X = x]. The last method analytically approximates the required modification of the interval. The resulting approximate 100(1 − α)% simultaneous confidence interval is of the form

    m̂_c(x) ± ν_{1−α} sqrt( Var̂[m̂(x)|X = x] ).                            (14)
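The Šidák quantile z_{1−β/2} and an upcrossing-based ν_{1−α} can be computed with the Python standard library alone; this is a sketch with names of our choosing, and the κ_0 value in the usage note below is only illustrative:

```python
import math
from statistics import NormalDist

def sidak_quantiles(n, alpha=0.05):
    # beta = 1 - (1 - alpha)^(1/n); the simultaneous interval replaces
    # z_{1-alpha/2} in (13) by z_{1-beta/2}.
    beta = 1.0 - (1.0 - alpha) ** (1.0 / n)
    z_point = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_simul = NormalDist().inv_cdf(1.0 - beta / 2.0)
    return z_point, z_simul

def nu_upcrossing(kappa0, alpha=0.05):
    # nu_{1-alpha} = sqrt(2 log(kappa0 / (alpha * pi))), cf. (15) below
    return math.sqrt(2.0 * math.log(kappa0 / (alpha * math.pi)))
```

For the Fossil dataset (n = 106), `sidak_quantiles(106)` gives roughly (1.96, 3.49), i.e., a widening factor of about 1.78, matching the numbers quoted in the text.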
The value of ν_{1−α} is given by [35]

    ν_{1−α} = sqrt( 2 log( κ_0 / (α π) ) )                                 (15)

where

    κ_0 = ∫_X sqrt( ‖L(x)‖^2 ‖L'(x)‖^2 − (L(x)^T L'(x))^2 ) / ‖L(x)‖^2 dx

with X the set of x values of interest and L'(x) = (d/dx) L(x), the differentiation being applied elementwise. Computing κ_0 requires numerical integration and is slow in multiple dimensions or at small bandwidths h. It is shown in [32] that κ_0 is strongly related to the degrees of freedom of the fit, which leads to the good approximation

    κ_0 ≈ (π/2) (trace(L) − 1).

For the Fossil dataset (n = 106), set α = 0.05; then z_{1−α/2} = 1.96 and z_{1−β/2} = 3.49. The simultaneous intervals obtained by using a Šidák correction are therefore about 1.78 (= 3.49/1.96) times wider than the pointwise intervals. In fact, Monte Carlo simulations [or upcrossing theory (15)] resulted in a factor of 3.2 (respectively 3.13) instead of the 3.49 obtained
Algorithm 1 Confidence Intervals
1: Given the training data {(X_1, Y_1), ..., (X_n, Y_n)}, calculate m̂ on the training data using (5)
2: Calculate the bias using (9) on the training data
3: Calculate the residuals ε̂_k = Y_k − m̂(X_k), k = 1, ..., n
4: Calculate the variance of the LS-SVM using (10) and (11)
5: Set the significance level, e.g., α = 0.05
6: For pointwise confidence intervals, use (13); for simultaneous confidence intervals, use (15) and (14)
Algorithm 3 Bootstrap based on residuals (homoscedasticity)
1: Given the data {(X_1, Y_1), ..., (X_n, Y_n)}, calculate m̂(X_k) using (5)
2: Calculate the residuals ε̂_k = Y_k − m̂(X_k)
3: Recenter the residuals: ε̃_k = ε̂_k − n^{-1} Σ_{j=1}^n ε̂_j
4: Generate bootstrap samples {ε̃_k^⋆}_{k=1}^n by uniform sampling with replacement from {ε̃_k}_{k=1}^n
5: Generate {Y_k^⋆}_{k=1}^n from Y_k^⋆ = m̂(X_k) + ε̃_k^⋆
6: Calculate m̂^⋆ from {(X_1, Y_1^⋆), ..., (X_n, Y_n^⋆)}
7: Repeat steps 4-6 B times
Algorithm 2 Prediction Intervals
1: Given the training data {(X_1, Y_1), ..., (X_n, Y_n)} and test data {(X_{n+1}, Y_{n+1}), ..., (X_m, Y_m)}, calculate m̂(X_k) for k = n+1, ..., m using (5)
2: Calculate the bias using (9) on the test data
3: Calculate the residuals ε̂_k = Y_k − m̂(X_k), k = 1, ..., n
4: Calculate the variance of the LS-SVM using (10) and (11) on the test data
5: Set the significance level, e.g., α = 0.05
6: For pointwise prediction intervals, use (16); for simultaneous prediction intervals, use (15) and (17)
Algorithm 4 Bootstrap based on residuals (heteroscedasticity)
1: Given the data {(X_1, Y_1), ..., (X_n, Y_n)}, calculate m̂(X_k) using (5)
2: Calculate the residuals ε̂_k = Y_k − m̂(X_k)
3: Recenter the residuals: ε̃_k = ε̂_k − n^{-1} Σ_{j=1}^n ε̂_j
4: Generate bootstrap data ε̃_k^⋆ = ε̃_k η_k, where the η_k are Rademacher variables: η_k = 1 with probability 1/2 and η_k = −1 with probability 1/2
5: Generate {Y_k^⋆}_{k=1}^n from Y_k^⋆ = m̂(X_k) + ε̃_k^⋆
6: Calculate m̂^⋆ from {(X_1, Y_1^⋆), ..., (X_n, Y_n^⋆)}
7: Repeat steps 4-6 B times

by a Šidák correction. This is the reason why Šidák (and also Bonferroni) corrections result in conservative confidence intervals. In the rest of this paper, (15) will be used to determine the modification of the interval. We conclude this paragraph with a final remark on the difference between confidence intervals and error bars. Although the latter are very popular, they differ from confidence intervals. Statistically, pointwise confidence intervals are defined according to (2), and a bound g_α has to be found; the constructed confidence intervals must also account for estimator bias to ensure a proper coverage rate and correct centering of the interval. Error bars, on the other hand, do not take this bias into account: they only give an idea of the variability of the estimator m̂ and do not provide a solution to (2).

C. Prediction Intervals

In some cases, one may also be interested in the uncertainty of the prediction for a new observation. This requirement is fulfilled by the construction of a prediction interval. Assume that the new observation Y^⋆ at a point x^⋆ is independent of the estimation data. Then

    Var[Y^⋆ − m̂(x^⋆)|X = x^⋆] = Var[Y^⋆|X = x^⋆] + Var[m̂(x^⋆)|X = x^⋆]
                               = σ^2(x^⋆) + Σ_{i=1}^n l_i(x^⋆)^2 σ^2(x_i).

In the last step, we used the model assumptions (1) and (8). Thus, an approximate pointwise 100(1 − α)% prediction interval at a new point x^⋆ is constructed as

    m̂_c(x^⋆) ± z_{1−α/2} sqrt( σ̂^2(x^⋆) + Var̂[m̂(x^⋆)|X = x^⋆] )          (16)

with m̂_c(x^⋆) = m̂(x^⋆) − biaŝ[m̂(x^⋆)|X = x^⋆].
As before, an approximate simultaneous 100(1 − α)% prediction interval at a new point x^⋆ is given by

    m̂_c(x^⋆) ± ν_{1−α} sqrt( σ̂^2(x^⋆) + Var̂[m̂(x^⋆)|X = x^⋆] ).           (17)
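Equation (16) transcribes directly into code; the function name and argument conventions here are ours, and all inputs are per test point:

```python
import numpy as np
from statistics import NormalDist

def prediction_interval(m_c, sigma2_star, var_mhat_star, alpha=0.05):
    # Eq. (16): m_c(x*) +/- z_{1-alpha/2} sqrt(sigma2(x*) + Var[m_hat(x*)|X=x*])
    # m_c is the bias-corrected estimate m_hat - bias_hat.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = z * np.sqrt(np.asarray(sigma2_star) + np.asarray(var_mhat_star))
    return m_c - half, m_c + half
```

Replacing z_{1−α/2} by ν_{1−α} gives the simultaneous version (17).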
We conclude this section by summarizing the construction of confidence and prediction intervals in Algorithms 1 and 2, respectively.

IV. BOOTSTRAP-BASED CONFIDENCE AND PREDICTION INTERVALS
In this section, we briefly review the current state of the art in bootstrap-based confidence and prediction intervals, which are used for comparison in the experimental section.

A. Bootstrap Based on Residuals

It is shown in [36] that the standard bootstrap [37] based on residuals does not work for nonparametric heteroscedastic regression models. A technique used to overcome this difficulty is the wild (or external) bootstrap, developed in [38] following suggestions in [39] and [40]; further theoretical refinements can be found in [41]. Algorithms 3 and 4 apply when the errors are homoscedastic and heteroscedastic, respectively. Other choices for the two-point distribution of the η_k exist (see [38]); the Rademacher distribution was chosen because it was empirically shown in [42] to perform best among six alternatives.
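Algorithms 3 and 4 differ only in how the recentered residuals are perturbed. A compact sketch of the heteroscedastic (wild) variant, with `fit_predict` standing in for the LS-SVM fit of (5) and all names being ours:

```python
import numpy as np

def wild_bootstrap_fits(X, Y, fit_predict, B=200, seed=0):
    # Algorithm 4: wild bootstrap with Rademacher multipliers.
    # fit_predict(X, Y) must return the fitted values m_hat(X_k).
    rng = np.random.default_rng(seed)
    m_hat = fit_predict(X, Y)
    resid = Y - m_hat
    resid = resid - resid.mean()                    # step 3: recenter
    fits = np.empty((B, len(Y)))
    for b in range(B):
        eta = rng.choice([-1.0, 1.0], size=len(Y))  # Rademacher eta_k
        y_star = m_hat + resid * eta                # steps 4-5
        fits[b] = fit_predict(X, y_star)            # step 6
    return fits                                     # step 7: B refits
```

For the homoscedastic Algorithm 3, the `eta` line would be replaced by uniform resampling with replacement from the recentered residuals.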
Fig. 2. Pointwise and simultaneous 95% confidence intervals. The outer (inner) region corresponds to simultaneous (pointwise) confidence intervals. The full line (in the middle) is the estimated LS-SVM model. For illustration purposes, the 95% pointwise confidence intervals are connected.
Fig. 4. Simultaneous 95% confidence intervals for the Fossil dataset. The dashed lines correspond to the proposed simultaneous confidence intervals and the full lines are the bootstrap confidence intervals. The full line (in the middle) is the estimated LS-SVM model.
Fig. 3. Simultaneous 95% confidence and prediction intervals. The outer (inner) region corresponds to simultaneous prediction (confidence) intervals. The full line (in the middle) is the estimated LS-SVM model.

B. Construction of Confidence and Prediction Intervals

The construction of confidence and prediction intervals for nonparametric function estimation falls into two parts: the construction of a confidence or prediction interval based on a pivotal method for the expected value of the estimator, and bias correction through undersmoothing. A confidence interval is then constructed by using the asymptotic distribution of a pivotal statistic, which can be obtained by bootstrap. Before illustrating the construction of intervals based on the bootstrap, we give a formal definition of a key quantity used in the bootstrap approach.
Definition 2 (Pivotal Quantity): Let X = (X_1, ..., X_n) be random variables with unknown joint distribution F, and denote by T(F) a real-valued parameter of interest (e.g., the regression function). A random variable T(X, T(F)) is a pivotal quantity (or pivot) if the distribution of T(X, T(F)) is independent of all parameters.
Hall [4] suggested the following approach: estimate the distribution of the pivot

    T(m(x), m̂(x)) = ( m̂(x) − m(x) ) / sqrt( Var̂[m̂(x)|X = x] )

by the bootstrap. Depending on whether the errors are homoscedastic or heteroscedastic, Algorithm 3 or 4 should be used to estimate this distribution. The distribution of the pivotal statistic T(m(x), m̂(x)) is then approximated by the corresponding distribution of the bootstrapped statistic

    V(m̂^⋆(x), m̂_g(x)) = ( m̂^⋆(x) − m̂_g(x) ) / sqrt( Var̂^⋆[m̂(x)|X = x] )

where m̂_g denotes the undersmoothed estimator with bandwidth g and ⋆ denotes bootstrap counterparts. In practice, we choose the bandwidth g = h/2. Hence, a 100(1 − α)% pointwise confidence interval is given by (ξ_{α/2}, ξ_{1−α/2}) with

    ξ_{α/2} = m̂(x) + Q_{α/2} sqrt( Var̂[m̂(x)|X = x] )

and

    ξ_{1−α/2} = m̂(x) + Q_{1−α/2} sqrt( Var̂[m̂(x)|X = x] ).
Q_α denotes the αth quantile of the bootstrap distribution of the pivotal statistic. A 100(1 − α)% simultaneous confidence interval can be constructed by applying a Šidák correction; see Section III. Similarly to the workflow in Section III, 100(1 − α)% pointwise and simultaneous prediction intervals are obtained. A question that remains is how to determine B, the number of bootstrap replications in Algorithms 3 and 4. The construction of confidence (and prediction) intervals demands accurate information about the low and high quantiles of the limit distribution, so enough resamples are needed for the bootstrap to reproduce this distribution accurately. Typically, B is chosen in the range 1000-2000 for pointwise intervals and above 10 000 for simultaneous intervals.
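Given the B bootstrap refits, inverting the pivot amounts to taking empirical quantiles. A sketch following the displayed ξ formulas; the shapes and argument names are our assumptions, with `boot_fits` of shape (B, number of evaluation points):

```python
import numpy as np

def pivot_ci(m_hat, m_g, boot_fits, boot_se, se, alpha=0.05):
    # Bootstrapped pivot V = (m_hat* - m_g) / se*, per evaluation point;
    # the interval follows the xi_{alpha/2}, xi_{1-alpha/2} formulas above.
    V = (boot_fits - m_g) / boot_se
    q_lo = np.quantile(V, alpha / 2.0, axis=0)
    q_hi = np.quantile(V, 1.0 - alpha / 2.0, axis=0)
    return m_hat + q_lo * se, m_hat + q_hi * se
```

With B large enough, the interval endpoints stabilize; with a Šidák-corrected α, the same routine yields simultaneous intervals.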
Fig. 5. Simultaneous 95% confidence intervals for the Boston housing dataset (dots). Sorted outputs are plotted against their index.

Fig. 7. Pointwise and simultaneous 95% prediction intervals for heteroscedastic errors. The outer (inner) region corresponds to simultaneous (pointwise) prediction intervals. The full line (in the middle) is the estimated LS-SVM model. For illustration purposes, the 95% pointwise prediction intervals are connected.
Fig. 6. Simultaneous 95% prediction intervals for the Boston housing dataset (dots). Sorted outputs are plotted against their index.
V. EXAMPLES

In all simulations, the RBF kernel was used and α = 0.05. The tuning parameters (regularization parameter γ and kernel bandwidth h) of the LS-SVM were obtained via leave-one-out cross-validation.

A. Homoscedastic Examples

In the first example, data were generated from (1) using normal errors and following the regression curve

    m(x) = e^{−32(x−0.5)^2}.

The sample size is n = 200 and σ(x) = σ = 0.1. Pointwise and simultaneous 95% confidence intervals are shown in Fig. 2. The line in the middle represents the LS-SVM model; for illustration purposes, the 95% pointwise confidence intervals are connected. In a second example, we generate data from (1) following the regression curve (normal errors with σ = 0.05)

    m(x) = sin^2(2πx).
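The two homoscedastic test functions can be regenerated as follows; the paper does not state the design density, so uniform sampling on [0, 1] and the seed are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)   # seed is our choice, not the paper's

def example_one(n=200, sigma=0.1):
    # First homoscedastic example: m(x) = exp(-32 (x - 0.5)^2), sigma = 0.1
    x = rng.uniform(0.0, 1.0, n)
    m = np.exp(-32.0 * (x - 0.5) ** 2)
    return x, m + sigma * rng.standard_normal(n)

def example_two(n=200, sigma=0.05):
    # Second homoscedastic example: m(x) = sin^2(2 pi x), sigma = 0.05
    x = rng.uniform(0.0, 1.0, n)
    m = np.sin(2.0 * np.pi * x) ** 2
    return x, m + sigma * rng.standard_normal(n)
```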
Fig. 8. Variance function estimation. The full line represents the true variance function and the dashed line is the estimated variance function obtained by Theorem 3.
Fig. 3 illustrates the 95% simultaneous confidence and prediction intervals; the outer (inner) region corresponds to the prediction (confidence) interval. In a third example, we compare the proposed simultaneous confidence intervals with the bootstrap method (see Section IV) on the Fossil dataset. The number of bootstrap replications was B = 15 000. From Fig. 4, it is clear that both methods produce similar confidence intervals. Although the proposed method is based on an asymptotic result (the central limit theorem for smoothers) and the number of data points is small (n = 106), it produces good confidence intervals, close to those obtained by bootstrap. When n is very small, however, the bootstrap-based confidence intervals can be expected to be more accurate than the proposed ones, since the bootstrap reconstructs the limit distribution from the given data.
As a final homoscedastic example, consider the Boston housing dataset (a multivariate example). We randomly selected 338 training points and 168 test points. The corresponding simultaneous confidence and prediction intervals are shown in Figs. 5 and 6, respectively. The outputs on training and test data are sorted and plotted against their corresponding index, and the respective intervals are sorted accordingly.

B. Heteroscedastic Examples

The data are generated according to the following model (with normal errors):

    Y_k = sin(x_k) + sqrt( 0.05 x_k^2 + 0.01 ) ε_k,  k = 1, ..., 200

where the x_k are equally spaced over the interval [−5, 5]. Figs. 7 and 8 show the 95% pointwise and simultaneous prediction intervals for this model and the estimated (and true) variance function, respectively. The variance function was obtained by Theorem 3, which clearly demonstrates the capability of the proposed methodology for variance estimation.

As a last example, consider the Motorcycle dataset. We compare the proposed simultaneous confidence intervals with the wild bootstrap method (see Section IV). The number of bootstrap replications was B = 15 000. The result is given in Fig. 9. As before, both intervals are very close to each other. Fig. 10 shows the estimated variance function of this dataset.

Fig. 9. Simultaneous 95% confidence intervals for the Motorcycle dataset. The dashed lines correspond to the proposed simultaneous confidence intervals and the full lines are the bootstrap confidence intervals. The full line (in the middle) is the estimated LS-SVM model.

Fig. 10. Variance function estimation of the Motorcycle dataset obtained by Theorem 3.

VI. CONCLUSION

In this paper, we have studied the properties of data-driven confidence bands for kernel-based regression, more specifically for the LS-SVM in the regression context. We have shown how to compute a bias estimate for the LS-SVM without computing higher-order derivatives, and we have proposed a simple way to estimate the variance function when the errors are heteroscedastic. These two estimates can be combined to obtain approximate 100(1 − α)% pointwise and simultaneous confidence and prediction intervals. To be consistent with multiple comparison theory, a Šidák correction and a more involved result from upcrossing theory were used to construct the approximate 100(1 − α)% simultaneous confidence and prediction intervals. Furthermore, we compared our method with a state-of-the-art bootstrap-based method; the simulations show that the two produce similar intervals. However, when the number of data points is small, the bootstrap-based intervals can be expected to be more accurate than the proposed confidence (prediction) intervals, since the bootstrap reconstructs the limit distribution from the given data and does not rely on asymptotic results.

APPENDIX I
PROOF OF THEOREM 1
In matrix form, the resulting LS-SVM model (5) on the training data is given by m̂ = Ω α̂ + 1_n b̂ with

    α̂ = (Ω + I_n/γ)^{-1} (Y − 1_n b̂)

and

    b̂ = [ 1_n^T (Ω + I_n/γ)^{-1} Y ] / [ 1_n^T (Ω + I_n/γ)^{-1} 1_n ].

Plugging this into the expression above results in

    m̂ = [ Ω( Z^{-1} − Z^{-1} (J_n/c) Z^{-1} ) + (J_n/c) Z^{-1} ] Y = LY

with c = 1_n^T (Ω + I_n/γ)^{-1} 1_n, Z = Ω + I_n/γ, and J_n a square matrix with all elements equal to 1.
The above derivation is valid when all points x are considered as training data. Evaluating the LS-SVM at an arbitrary point x, however, can be written as

    m̂(x) = Ω_x^{⋆T} α̂ + b̂
          = Ω_x^{⋆T} [ Z^{-1} − Z^{-1} (J_n/c) Z^{-1} ] Y + (J_1^T/c) Z^{-1} Y
          = L(x)^T Y

with Ω_x^⋆ = (K(x, X_1), ..., K(x, X_n))^T the kernel vector evaluated at the point x and J_1 = (1, ..., 1)^T. The conditional mean and conditional variance of the LS-SVM can then be derived as follows:

    E[m̂(x)|X = x] = Σ_{i=1}^n l_i(x) E[Y_i|X = x_i] = Σ_{i=1}^n l_i(x) m(x_i)

and

    Var[m̂(x)|X = x] = Σ_{i=1}^n l_i(x)^2 Var[Y_i|X = x_i] = Σ_{i=1}^n l_i(x)^2 σ^2(x_i).
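The closed form for L(x) derived above can be checked numerically against the dual solution of (4); a short sketch with names of our own choosing:

```python
import numpy as np

def smoother_vector(x, X, gamma, h):
    # L(x)^T = Omega_x*^T [Z^-1 - Z^-1 (J_n/c) Z^-1] + (J_1^T/c) Z^-1, Eq. (7)
    n = X.shape[0]
    Omega = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / h ** 2)
    ox = np.exp(-((x - X) ** 2).sum(-1) / h ** 2)  # kernel vector Omega_x*
    Zinv = np.linalg.inv(Omega + np.eye(n) / gamma)
    ones = np.ones(n)
    c = ones @ Zinv @ ones
    Jn = np.outer(ones, ones)
    return ox @ (Zinv - Zinv @ (Jn / c) @ Zinv) + (ones / c) @ Zinv
```

Its entries sum to 1 (Lemma 1), and L(x)^T Y reproduces m̂(x) computed from the solution (b̂, α̂) of the KKT system.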
APPENDIX II
PROOF OF THEOREM 2

The exact conditional bias of the LS-SVM is given (in matrix form) by

    E[m̂|X] − m = (L − I_n) m

where m = (m(X_1), ..., m(X_n))^T and m̂ = LY. Observe that the residuals are given by

    ε̂ = Y − m̂ = (I_n − L) Y.

Taking expectations yields

    E[ε̂|X] = m − Lm = −bias[m̂|X].

This suggests estimating the conditional bias by smoothing the negative residuals

    biaŝ[m̂|X] = −L ε̂ = −L(I_n − L) Y = (L − I_n) m̂.

Therefore, evaluating the estimated conditional bias at a point x can be written as

    biaŝ[m̂(x)|X = x] = Σ_{i=1}^n l_i(x) m̂(X_i) − m̂(x) = L(x)^T m̂ − m̂(x).

APPENDIX III
PROOF OF LEMMA 1 AND THEOREM 3

Before we prove Theorem 3, we prove Lemma 1. First, we show that on the training data L 1_n = 1_n:

    L 1_n = [ Ω( Z^{-1} − Z^{-1} (J_n/c) Z^{-1} ) + (J_n/c) Z^{-1} ] 1_n
          = Ω Z^{-1} 1_n − Ω Z^{-1} (J_n/c) Z^{-1} 1_n + (J_n/c) Z^{-1} 1_n.

It suffices to show that (J_n/c) Z^{-1} 1_n = 1_n to complete the proof:

    (J_n/c) Z^{-1} 1_n = 1_n [ 1_n^T (Ω + I_n/γ)^{-1} 1_n ] / [ 1_n^T (Ω + I_n/γ)^{-1} 1_n ] = 1_n.

We can now formulate the result for any point x. Let L(x) be the smoother vector at a point x; then

    Σ_{i=1}^n l_i(x) = L(x)^T 1_n = Ω_x^{⋆T} [ Z^{-1} − Z^{-1} (J_n/c) Z^{-1} ] 1_n + (J_1^T/c) Z^{-1} 1_n.

Similarly to the derivation above, it suffices to show that (J_1^T/c) Z^{-1} 1_n = 1 to conclude the proof:

    (J_1^T/c) Z^{-1} 1_n = [ 1_n^T (Ω + I_n/γ)^{-1} 1_n ] / [ 1_n^T (Ω + I_n/γ)^{-1} 1_n ] = 1.

Theorem 3 is proved as follows. Let L ∈ R^{n×n} be the smoother matrix corresponding to an initial smooth of the data, and put ε̂ = (I_n − L)Y, the vector of residuals. A natural means of estimating the variance function σ^2(·) is then to smooth the squared residuals, which gives S diag(ε̂ ε̂^T). It is also reasonable to require that the estimator be unbiased when the errors are homoscedastic. Thus, under homoscedasticity, with Σ = E[(Y − m)(Y − m)^T|X] and B_1 = E[LY|X] − m, we obtain

    E[S diag(ε̂ ε̂^T)|X] = S E[ diag( (I_n − L) Y Y^T (I_n − L)^T ) |X ]
                        = S diag( (I_n − L) E[Y Y^T|X] (I_n − L)^T )
                        = S diag( (I_n − L)(m m^T + Σ)(I_n − L)^T )
                        = S [ diag(B_1 B_1^T) + σ^2 diag( (I_n − L)(I_n − L)^T ) ]
                        = S [ diag(B_1 B_1^T) + σ^2 (1_n + Δ) ]

where Δ = diag(L L^T − L − L^T). Since E[S diag(ε̂ ε̂^T)|X] = σ^2 (1_n + SΔ) when LY is conditionally unbiased, i.e., B_1 = 0,
and using Lemma 1, this motivates the variance estimator at a point x

    σ̂^2(x) = S(x)^T diag(ε̂ ε̂^T) / ( 1 + S(x)^T diag(L L^T − L − L^T) )
where S(x) is the smoother matrix in an arbitrary point x [see also (7)]. R EFERENCES [1] C.-J. Ong, S. Shao, and J. Yang, “An improved algorithm for the solution of the regularization path of support vector machine,” IEEE Trans. Neural Netw., vol. 21, no. 3, pp. 451–462, Mar. 2010. [2] Z. Sun, Z. Zhang, H. Wang, and M. Jiang, “Cutting plane method for continuously constrained kernel-based regression,” IEEE Trans. Neural Netw., vol. 21, no. 2, pp. 238–247, Feb. 2010. [3] A. B. Tsybakov, Introduction to Nonparametric Estimation. New York: Springer-Verlag, 2009. [4] P. Hall, “On bootstrap confidence intervals in nonparametric regression,” Ann. Statist., vol. 20, no. 2, pp. 695–711, Jun. 1992. [5] G. Chryssoloiuris, M. Lee, and A. Ramsey, “Confidence interval prediction for neural network models,” IEEE Trans. Neural Netw., vol. 7, no. 1, pp. 229–232, Jan. 1996. [6] G. Papadopoulos, P. J. Edwards, and A. F. Murray, “Confidence estimation methods for neural networks: A practical comparison,” IEEE Trans. Neural Netw., vol. 12, no. 6, pp. 1278–1287, Nov. 2001. [7] I. Rivals and L. Personnaz, “Construction of confidence intervals for neural networks based on least squares estimation,” Neural Netw., vol. 13, nos. 4–5, pp. 463–484, Jun. 2000. [8] C. M. Bishop and C. S. Qazaz, “Regression with input-dependent noise: A Bayesian treatment,” in Advances in Neural Information Processing Systems 9. Cambridge, MA: MIT Press, 1997, pp. 347–353. [9] P. W. Goldberg, C. K. Williams, and C. M. Bishop, “Regression with input-dependent noise: A Gaussian process treatment,” in Advances in Neural Information Processing Systems 10. Cambridge, MA: MIT Press, 1998. [10] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard, “Most likely heteroscedastic Gaussian process regression,” in Proc. 24th ICML, Corvalis, OR, 2007, pp. 393–400. [11] J. M. Bernardo and A. F. M. Smith, Bayesian Theory. New York: Wiley, 2000. [12] G. C. Cawley, N. L. C. Talbot, R. J. Foxall, S. R. Dorling, and D. P. 
Kris De Brabanter was born in Ninove, Belgium, on February 21, 1981. He received the Master's degree in electronic engineering from the Erasmus Hogeschool Brussel, Brussels, Belgium, in 2005, and the Master's degree in electrical engineering (data mining and automation) from the Katholieke Universiteit Leuven (K. U. Leuven), Leuven, Belgium, in 2007. Currently, he is pursuing the Ph.D. degree in the SCD-SISTA Laboratory, Department of Electrical Engineering, K. U. Leuven. His current research interests include nonparametric statistics and nonlinear systems.
Jos De Brabanter was born in Ninove, Belgium, on January 11, 1957. He received the Master's degree in electronic engineering in 1990, the Safety Engineer degree in 1992, the Master of Environment and Human Ecology degree in 1993, the Master of Artificial Intelligence degree in 1996, the Master of Statistics degree in 1997, and the Ph.D. degree in applied sciences in 2004, all from the Katholieke Universiteit Leuven (K. U. Leuven), Leuven, Belgium. He currently holds an Associated Docent position at K. U. Leuven. His current research interests include statistics and nonlinear systems.
Johan A. K. Suykens (M’02–SM’04) was born in Willebroek, Belgium, on May 18, 1966. He received the Master's degree in electro-mechanical engineering and the Ph.D. degree in applied sciences from the Katholieke Universiteit Leuven (K. U. Leuven), Leuven, Belgium, in 1989 and 1995, respectively. He was a Visiting Post-Doctoral Researcher at the University of California, Berkeley, in 1996, through the Fund for Scientific Research FWO Flanders. He is currently a Professor (Hoogleraar) at K. U. Leuven. He is the author of Artificial Neural Networks for Modelling and Control of Non-linear Systems (Kluwer Academic Publishers) and Least Squares Support Vector Machines (World Scientific), a co-author of Cellular Neural Networks, Multi-Scroll Chaos and Synchronization (World Scientific), and the editor of Nonlinear Modeling: Advanced Black Box Techniques (Kluwer Academic Publishers) and Advances in Learning Theory: Methods, Models and Applications (IOS Press). Dr. Suykens was an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS from 1997 to 1999 and from 2004 to 2007, and the IEEE TRANSACTIONS ON NEURAL NETWORKS from 1998 to 2009. He received an IEEE Signal Processing Society Best Paper (Senior) Award in 1999, and several Best Paper Awards at international conferences. He was a recipient of the International Neural Networks Society INNS Young Investigator Award in 2000 for significant contributions in the field of neural networks. He has served as a Director and Organizer of the North Atlantic Treaty Organization Advanced Study Institute on Learning Theory and Practice, Leuven, in 2002, as a Program Co-Chair for the International Joint Conference on Neural Networks in 2004 and the International Symposium on Nonlinear Theory and its Applications in 2005, and as an organizer of the International Symposium on Synchronization in Complex Networks in 2007.
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
Bart De Moor (M’86–SM’93–F’04) was born in Halle, Belgium, in 1960. He received the M.E. degree in electrical engineering and the Ph.D. degree in engineering from the Katholieke Universiteit Leuven (K. U. Leuven), Leuven, Belgium. He spent two years as a Visiting Research Associate in the Department of Electrical Engineering (Information Systems Laboratory, under Prof. Kailath) and Computer Science (under Prof. Golub), Stanford University, Stanford, CA, from 1988 to 1990. Currently, he is a Full Professor in the Department of Electrical Engineering, K. U. Leuven, in the research group SCD. He is also the Chairman of the Industrial Research Fund, of Hercules (heavy equipment funding in Flanders), and of several other scientific and cultural organizations. He is on the board of six spinoff companies (IPCOS, Data4s, TMLeuven, Silicos, Dsquare, Cartagenia), of the Flemish Interuniversity Institute for Biotechnology, of the Study Center for Nuclear Energy, and of the Institute for Broadband Technology. He is leading a research group of 30 Ph.D. students and 8 post-doctoral fellows. In the recent past, he has guided 55 Ph.D. students. Prof. De Moor was a recipient of the Leybold-Heraeus Prize in 1986, the Leslie Fox Prize in 1989, the Guillemin-Cauer Best Paper Award of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS in 1990, the Laureate of the Belgian Royal Academy of Sciences in 1992, the Biannual Siemens Award in 1994, the Best Paper Award of Automatica (the International Federation of Automatic Control) in 1996, and the IEEE Signal Processing Society Best Paper Award in 1999. He was a member of the Academic Council of K. U. Leuven and of its Research Policy Council.
Super-Resolution Method for Face Recognition Using Nonlinear Mappings on Coherent Features

Hua Huang and Huiting He
Abstract— The low resolution (LR) of face images significantly decreases the performance of face recognition. To address this problem, we present a super-resolution method that uses nonlinear mappings to infer coherent features that favor higher recognition rates of the nearest neighbor (NN) classifier for a single LR face image. Canonical correlation analysis is applied to establish coherent subspaces between the principal component analysis (PCA) based features of high-resolution (HR) and LR face images. A nonlinear mapping between HR/LR features can then be built by radial basis functions (RBFs) with lower regression errors in the coherent feature space than in the PCA feature space. Thus, we can compute the super-resolved coherent features corresponding to an input LR image efficiently and accurately according to the trained RBF model, and the face identity can be obtained by feeding these super-resolved features to a simple NN classifier. Extensive experiments on the Facial Recognition Technology, University of Manchester Institute of Science and Technology, and Olivetti Research Laboratory databases show that the proposed method outperforms state-of-the-art face recognition algorithms for a single LR image in terms of both recognition rate and robustness to facial variations of pose and expression.

Index Terms— Canonical correlation analysis, face recognition, radial basis function, super resolution.
Manuscript received June 8, 2010; revised September 12, 2010; accepted October 9, 2010. Date of publication November 9, 2010; date of current version January 4, 2011. This work was supported in part by the National Natural Science Foundation of China under Grant 60703003 and Grant 60972142, the 973 Program under Project 2010CB327900, and the Program for New Century Excellent Talents in University under Project NCET-09-0635. The authors are with the School of Electronics and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2010.2089470

I. INTRODUCTION

Face recognition has drawn great attention in recent decades, due to its wide range of commercial and law-enforcement applications [1]. Although markedly high recognition rates have been achieved for frontal faces under controlled conditions, face recognition is still an unsolved problem due to the challenges from different poses, illumination changes, and facial expressions. In addition to these facial variations, the low quality of facial images significantly degrades the performance of conventional face recognition systems [2]. These low-resolution (LR) images are common in practice, usually caused by the limited accuracy of available hardware and capturing devices. Thus, enhancement of recognition performance under LR conditions is desirable in various applications. Typical scenarios include security surveillance, where subjects are far away from the camera and their faces are quite small in the field of view. Another application of face recognition for LR images is to automatically organize group photos in digital family albums or social networking services. In this paper, we focus on improving the recognition performance in the case where only a single face "snapshot" of LR is available.

Researchers in the machine learning community strive to devise sophisticated classifiers (recognizers) in order to increase the recognition rate on inputs of low quality [3]–[6]. An alternative approach is to feed classifiers with high-resolution (HR) images (or features) reconstructed from a single or multiple LR images by super-resolution (SR) techniques [7]–[12]. Unfortunately, most SR algorithms are not designed for recognition, but for visual enhancement of images. The reconstruction of high-frequency details is the central objective for these SR algorithms even when face priors are introduced [9], [11]. On the other hand, studies on human vision systems show that high-frequency information by itself is not sufficient for recognition of low-quality facial images [13]. Therefore, explicit SR reconstruction of facial textural details in the pixel domain may not be able to significantly improve the performance of recognition algorithms for LR images, especially of those proven to be successful for HR images based on local features such as Gabor wavelets [14], [15] and local binary patterns [16]. We resort to directly reconstructing holistic features for recognition from LR inputs as SR in the feature domain, which can bring more advantages in terms of computation efficiency and robustness compared to the SR methods in the pixel or local domain, as stated in [2]. Most SR methods in the feature domain attempt to accumulate the information from a series of LR observations, such as the seminal work of Gunturk et al. [17] and its extensions. Sezer et al.
proposed an algorithm using Bayesian estimation and projection onto convex sets in the feature domain to recognize LR faces in video frames [2]. However, the Gaussian assumption applied to eigenface coefficients in [17] and [2] leads to lower recognition performance on face images with pose variations. In [18], the support vector data description is extended to multiple LR face data in order to generate discriminative features for recognition, which also achieves good performance for frontal faces. Arandjelovic and Cipolla formulated face recognition on LR videos as a person-specific generative model that separates latent factors including illumination and downsampling [19]. Their report on a large number of degraded sequences shows an increase in recognition rate of over 50%.
1045–9227/$26.00 © 2010 IEEE
Fig. 1. First two dimensions of the features of HR and LR face images (a) before CCA transformation, and (b) after CCA transformation.
As multiple LR inputs are not available, we have to further explore the information given by the training LR and HR images in order to reconstruct SR features for face recognition on a single LR input. Psychological studies suggest that familiarity increases the ability of humans to tolerate image degradations [20]. From a computational point of view, this familiarity can be encoded by the connections between LR/HR pairs in training sets, as utilized in single-image SR [21]. In the context of face recognition for a single LR image, Li et al. [22] proposed coupled mappings to project face images with different resolutions into a unified feature space, in which a promising recognition rate is obtained. Similar to the feature reconstruction algorithms for LR videos, this method cannot handle the pose variations of faces very well, since linear mappings are assumed for the projection. In order to simultaneously identify and reconstruct faces from single LR images, Hennings-Yeomans et al. [23] expressed the constraints between LR and HR images in a regularization formulation. However, their formulation, based on parametric learning and reconstruction, is quite time-consuming, since the optimization has to be executed for each test image with respect to each enrollment. Jia and Gong developed a generalized face SR method [24] for feature-domain reconstruction based on multilinear analysis, which is able to accommodate multiple factors such as lighting and pose variations. The tensor manipulations for reconstruction also demand high computational expense, since no explicit connections between LR and HR pairs are established. We apply nonlinear mappings based on linear combinations of radial basis functions (RBFs) [25] to bridge the LR and HR features for recognition of a single LR image with a wide range of variations. The RBF-based mappings are built in new feature subspaces, called the coherent features of LR/HR images, which favor the nearest neighbor (NN) classifier.
NN is a simple yet effective classifier, as its decision surfaces are nonlinear [26]. We follow the strategies for improving the performance of the NN classifier in recent research, which make neighbors belonging to the same class as close as possible [26]–[28]. Hence, we can accurately infer the neighbors of the classical principal component analysis (PCA) features of an HR image from those of its corresponding LR version. It is widely accepted that the downsampling process preserves the local topology of face manifolds, embodied by local neighborhood structures [21], [29]. However, the manifolds of LR and HR PCA features are quite different in practice, as illustrated by the positions of the 2-D points in the left panel of Fig. 1. We use canonical correlation analysis (CCA) to project LR/HR PCA features as coherent features, in which the correlation between LR and HR pairs is maximized [30]. The right part of Fig. 1 demonstrates more "overlaps" between LR and HR features, reflecting the increase of the neighborhood coherence between LR/HR features. The coherence can be quantified by the neighborhood preservation rate, which measures how many (the percentage) of the M NNs of each feature in the HR set correspond to the M NNs of the feature in the LR set [30]. Comparing the traditional PCA subspace and the coherent feature space in Fig. 1, the average rates increase from 57.02% to 65.30% (M = 1), and from 72.47% to 78.56% (M = 10). Thus, universal function approximators, e.g., RBFs, are more likely to achieve better prediction of these coherent features from the input LR features, which yields better performance for the NN classifier. Specifically, we calculate the holistic PCA features of the training HR and corresponding LR face images in the training phase. Subsequently, CCA is applied to extract coherent features that have maximal correlation between the training HR and LR features. In order to directly connect the LR features to their HR counterparts, RBFs [25] are employed to construct the nonlinear mappings between the features in the coherent subspaces. Given an input LR face image, the coherent SR feature is obtained for recognition by mapping the LR feature via the learnt RBFs in the coherent subspace. Higher recognition rates can then be achieved by a simple NN classifier.
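To make the neighborhood preservation rate concrete, the following minimal sketch (our own illustration, not the authors' code; the exact protocol in [30] may differ, e.g., in tie handling) computes the average fraction of shared M nearest neighbors between two feature sets:

```python
import numpy as np

def neighborhood_preservation_rate(F_H, F_L, M):
    """Average fraction of the M nearest neighbors (L2 distance) that each
    sample shares between the HR feature set F_H and the LR feature set F_L;
    features are stored as columns."""
    m = F_H.shape[1]
    total = 0.0
    for i in range(m):
        d_h = np.linalg.norm(F_H - F_H[:, [i]], axis=0)
        d_l = np.linalg.norm(F_L - F_L[:, [i]], axis=0)
        d_h[i] = d_l[i] = np.inf                 # exclude the sample itself
        nn_h = set(np.argsort(d_h)[:M])
        nn_l = set(np.argsort(d_l)[:M])
        total += len(nn_h & nn_l) / M
    return total / m

rng = np.random.default_rng(0)
F = rng.standard_normal((5, 50))
print(neighborhood_preservation_rate(F, F.copy(), M=3))   # identical sets -> 1.0
```

Two identical feature sets give a rate of 1.0; two independent random sets give a rate near M/(m-1), so higher values indicate more coherent neighborhoods.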
We tested the recognition on three major face databases, i.e., Facial Recognition Technology (FERET) [31], [32], University of Manchester Institute of Science and Technology (UMIST) [33], and Olivetti Research Laboratory (ORL) [34] databases. The images in the FERET database present different facial expressions, while those in UMIST and ORL have significant pose variations. The ORL database is also characterized by its small number of images for training. Given small input faces (12 × 12 for FERET, 14 × 11 for UMIST,
and 8 × 8 for ORL), the recognition rate is as high as 84.4%, 93%, and 95%, respectively. This outperforms current feature-domain SR algorithms, including coupled locality preserving mappings (CLPM) [22], Wang's method [11], and the classical method of Gunturk et al. [17], and works even better than eigenface recognition on the original HR images [35]. The rest of this paper is organized as follows. In Section II, we briefly review related works on SR for face recognition and applications of the CCA and RBF models. In Section III, the framework of our method is introduced. Section IV gives the details of our method, and is followed by extensive experiments in Section V. Section VI concludes this paper.
II. RELATED WORKS

Our work applies CCA and RBFs to feature-domain SR for the recognition of LR face images. The most relevant works are briefly reviewed. SR techniques are central to a variety of applications ranging from digital photography to publishing. Furthermore, face image SR methods are often applied to enhance the face recognition rate on LR image sequences. Lin et al. [36] applied an optical-flow SR algorithm as a preprocessing stage to improve the face recognition performance on LR face images. A sequence of video frames of a subject has also been used to create an SR image of the face with increased resolution and reduced blur for face recognition [12], [37]. Zhou et al. explicitly introduced a state variable for recognition into a Bayesian framework and achieved face recognition from LR videos by sequentially estimating the state variable based on particle filtering [38]. Given a single LR face image, Jia et al. generated the SR identity parameter vector for recognition [39] by incorporating the tensor structure that models multiple factors into a Bayesian framework similar to that in [38]. These studies strive for an effective approach to combining information from multiple images/sources into recognition. Instead, we aim at extracting SR features from a single LR image, which is suitable for performance improvement with NN classifiers.

CCA was first developed by Hotelling [40] to find bases for two sets of random vectors such that the correlation between the projections of the vectors onto the bases is maximized. Classical CCA has been generalized in various ways, such as kernel CCA to maximize nonlinear correlation [41], and tensor CCA for multiple sets of variables [42]. CCA and its extensions can be used whenever there is a need to establish a relationship between two sets of variables. In the context of machine learning, CCA is commonly applied for supervised dimensionality reduction, in which correlation is found between a label or semantic set and feature vectors [43], [44]. The main difference between CCA and PCA is that CCA is closely related to mutual information [45]. CCA can also be used to measure the similarity between two image sets for object and action recognition [42]. In this paper, we apply CCA to establish the coherent subspaces for HR and LR face images.

RBFs were introduced by Broomhead and Lowe [46] for the purpose of exact function interpolation [25], [47]. Algorithms based on RBFs are commonly applied in statistical learning [48], geometric data analysis, and pattern recognition [49], [50]. Support vector machines with Gaussian kernels and RBFs are analyzed in [49] for classification. RBFs can also be applied to the problem of reconstructing a surface from scattered points sampled on a physical shape [51]. As pointed out, RBF neural networks are best suited for learning continuous or piecewise-continuous approximations [52], and the RBF neural classifier has been applied to face recognition to cope efficiently with small training sets in a high-dimensional problem [53]. The SR face image corresponding to an input LR face feature has also been obtained by RBF mappings [21]. Here, we apply RBF-based mapping to build the regression model between the features of LR and HR face images.
III. PROBLEM FORMULATION AND ALGORITHM OVERVIEW

The holistic features, specifically the PCA features, are applied for the recognition of LR face images, since local features are no longer applicable for images of very low resolution in our applications. The problem of feature-domain SR for LR face recognition turns out to be the inference of the SR feature $c_h$ from an input LR image $I_l$, given a training set consisting of HR images $I^H$ and their corresponding LR versions $I^L$. Manifold learning theory suggests that the subspace of face images has an embedded manifold structure [21], [54], [55]. The high-dimensional structure formed by face images in the high-dimensional pixel space is homeomorphic with a geometric structure in a lower dimensional pixel space, and the downsampling process preserves the intrinsic structures in the high-dimensional image manifold. This means that the features of HR and LR face images share a common topological structure, and thus they are coherent through this structure. As stated, the PCA features are generally applied in the recognition of LR face images. However, this coherence does not always hold in practice in the PCA space [22], [30]. We need to find a feature subspace where the coherence between the topological structures of HR and LR face images is established and the HR feature can be estimated more accurately. CCA can build the linear correlation between two sets of data by finding one base for each set, such that the correlation coefficients of the two sets after projection by the bases are maximized [56]. We apply the CCA transformation to the PCA feature sets of LR and HR face images in order to find the coherent feature subspaces, in which the correlation between the topological structures of LR and HR is maximal. As shown in Fig. 1, the topological structures are more coherent and it is easier to establish the mapping relationship after the CCA transformation.
We apply the RBF-based mapping to build the regression model between the features of HR and LR face images in the coherent subspace, taking advantage of the salient properties of RBF regression such as fast learning and generalization ability [21], [25], [47], [51]. Fig. 2 provides the flowchart of the proposed method, which super-resolves features for recognition. Our approach is divided into training and testing phases. The corresponding HR and LR face image sets are used for training to obtain the base vectors of the CCA transformation and the parameters of the RBF regression. In the testing stage, we calculate the PCA
Fig. 2. Flowchart of our method. (Training: HR and LR face images → feature vector extraction → CCA → correlated features of training HR/LR; testing: input LR face image → feature vector extraction → correlated feature of input LR → RBF mappings → corresponding HR feature → NN classifier → recognition result.)
coefficients of a given LR image and project the PCA features into the coherent subspace using the learnt base vectors. Hence, the SR coherent feature corresponding to the given input LR face image can be obtained by simply applying the learnt RBF mappings, and an NN classification is performed on these super-resolved features for face recognition.

IV. ALGORITHM

In this section, we present the detailed procedure of our algorithm. As stated, the problem of feature-domain SR for face recognition is formulated as the inference of the HR-domain feature $c_h$ from an input LR image $I_l$, given the training sets of HR and LR face images, $I^H = \{I_i^H\}_{i=1}^m = [I_1^H, I_2^H, \ldots, I_m^H]$ and $I^L = \{I_i^L\}_{i=1}^m = [I_1^L, I_2^L, \ldots, I_m^L]$, where $m$ denotes the size of the training sets. The dimension of the image data, which is much larger than the number of training images, leads to huge computational costs. So, the holistic features of the face images are obtained by classical PCA, which represents a given face image by a weighted combination of eigenfaces. We define

$$x_i^H = (B^H)^T \left( I_i^H - \mu^H \right) \quad (1)$$
where $\mu^H$ is the mean face of the HR training face images and $x_i^H$ is the feature vector of face image $I_i^H$. $B^H$ is the feature extraction matrix obtained from the HR training face images, made up of the orthogonal eigenvectors of $(\hat{I}^H)^T \hat{I}^H$ corresponding to the eigenvalues ordered in descending order, where $\hat{I}^H = \{\hat{I}_i^H\}_{i=1}^m = [(I_1^H - \mu^H), (I_2^H - \mu^H), \ldots, (I_m^H - \mu^H)]$. Similarly, the feature of an LR face image is represented as

$$x_i^L = (B^L)^T \left( I_i^L - \mu^L \right) \quad (2)$$
where $B^L$ and $\mu^L$ are the feature extraction matrix and the mean face obtained from the LR training face images, respectively. Then, we have the PCA feature vectors of the HR and LR training sets as $X^H = \{x_i^H\}_{i=1}^m \in \mathbb{R}^{p \times m}$ and $X^L = \{x_i^L\}_{i=1}^m \in \mathbb{R}^{q \times m}$. The following process of our algorithm is based on these PCA feature vectors.
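As an illustration of (1)–(2), the eigenface-style PCA extraction can be sketched as follows (a minimal NumPy sketch; the array shapes and component counts in the toy usage are our own assumptions, not the paper's settings):

```python
import numpy as np

def pca_features(images, n_components):
    """images: (d, m) matrix with one vectorized face per column.
    Returns the projection matrix B, the mean face mu, and features X."""
    mu = images.mean(axis=1, keepdims=True)        # mean face
    centered = images - mu                         # I_hat
    # Eigen-decompose the small m x m matrix (I_hat^T I_hat), as in the
    # paper, map eigenvectors back to image space, sort by descending
    # eigenvalue, and keep the leading components.
    evals, evecs = np.linalg.eigh(centered.T @ centered)
    order = np.argsort(evals)[::-1][:n_components]
    B = centered @ evecs[:, order]
    B /= np.linalg.norm(B, axis=0, keepdims=True)  # unit-norm eigenfaces
    X = B.T @ centered                             # features, cf. Eq. (1)
    return B, mu, X

# Toy usage: 20 random 72 x 72 "HR faces", keeping 10 components.
rng = np.random.default_rng(0)
I_H = rng.standard_normal((72 * 72, 20))
B_H, mu_H, X_H = pca_features(I_H, n_components=10)
print(X_H.shape)   # (10, 20)
```

The same routine applied to the LR set yields $B^L$, $\mu^L$, and $X^L$ of (2).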
A. Coherent Features

In our study of feature-domain SR for LR face recognition, the relationship between HR and LR feature vectors should be learned from the training sets; thus, given an input LR face feature, the corresponding SR feature can be obtained for recognition. In the existing methods, this relationship is obtained directly from the PCA features of LR and HR face images [11], [17]. Corresponding HR and LR images of the same face differ only in resolution; thus, they are coherent through their intrinsic structures. In order to learn the relationship between HR and LR feature vectors more exactly, we apply CCA [56] to incorporate the intrinsic topological structure as a prior constraint. In the coherent subspace obtained by the CCA transformation, the solution space of the HR feature corresponding to a given LR image is reduced. Then, more exact coherent SR features can be obtained for recognition in the coherent subspace.

Specifically, from the PCA feature training sets $X^H$ and $X^L$, we first subtract their mean values $\bar{x}^H$ and $\bar{x}^L$, respectively, which yields the centralized data sets $\hat{X}^H = [\hat{x}_1^H, \hat{x}_2^H, \ldots, \hat{x}_m^H]$ and $\hat{X}^L = [\hat{x}_1^L, \hat{x}_2^L, \ldots, \hat{x}_m^L]$. CCA finds two base vectors $V^H$ and $V^L$ for the data sets $\hat{X}^H$ and $\hat{X}^L$ in order to maximize the correlation coefficient between the vectors $C^H = (V^H)^T \hat{X}^H$ and $C^L = (V^L)^T \hat{X}^L$. The correlation coefficient is defined as

$$\rho = \frac{E[C^H C^L]}{\sqrt{E[(C^H)^2]\, E[(C^L)^2]}} = \frac{E[(V^H)^T \hat{X}^H (\hat{X}^L)^T V^L]}{\sqrt{E[(V^H)^T \hat{X}^H (\hat{X}^H)^T V^H]\, E[(V^L)^T \hat{X}^L (\hat{X}^L)^T V^L]}} \quad (3)$$

where $E[\cdot]$ denotes mathematical expectation. To find the base vectors $V^H$ and $V^L$, we define $C_{11} = E[\hat{X}^H (\hat{X}^H)^T]$ and $C_{22} = E[\hat{X}^L (\hat{X}^L)^T]$ as the within-set covariance matrices of $\hat{X}^H$ and $\hat{X}^L$, respectively, and $C_{12} = E[\hat{X}^H (\hat{X}^L)^T]$ and $C_{21} = E[\hat{X}^L (\hat{X}^H)^T]$ as their between-set covariance matrices. Then, we compute

$$R_1 = C_{11}^{-1} C_{12} C_{22}^{-1} C_{21} \quad (4)$$
and

$$R_2 = C_{22}^{-1} C_{21} C_{11}^{-1} C_{12}. \quad (5)$$

$V^H$ is made up of the eigenvectors of $R_1$ when the eigenvalues of $R_1$ are ordered in descending order. Similarly, the eigenvectors of $R_2$ compose $V^L$ [56]. We obtain the corresponding projected coefficient sets $C^H = \{c_i^H\}_{i=1}^m \in \mathbb{R}^{q \times m}$ and $C^L = \{c_i^L\}_{i=1}^m \in \mathbb{R}^{q \times m}$ of the PCA feature sets $X^H$ and $X^L$ projected into the coherent subspaces using the base vectors:

$$c_i^H = (V^H)^T \hat{x}_i^H \quad (6)$$
$$c_i^L = (V^L)^T \hat{x}_i^L. \quad (7)$$

As there exists a coherent intrinsic structure embedded in the HR and LR feature sets $X^H$ and $X^L$, the correlation between the two sets $C^H$ and $C^L$ is increased and their topological structures are more coherent after the transformation. Then, the relationship between HR and LR features is more exactly established in the coherent subspace.

B. Nonlinear Mappings Between the Coherent Features of HR and LR Face Images

Once the coherent subspace is obtained, the nonlinear mapping relationship between the coherent features of HR and LR images is learned from the training features. This problem can be formulated as finding an approximating function to establish the mapping between the coherent features of HR and LR face images. RBFs are typically used to build up function approximations, so we apply RBFs to construct the mapping relationship. The RBF uses a radially symmetric function to transform the multivariate approximation problem into a unary approximation problem, and can interpolate nonuniformly distributed high-dimensional data smoothly [57]. The form of RBFs used to build up function approximations is

$$f_i(\cdot) = \sum_{j=1}^{m} w_j \, \varphi(\|t_i - t_j\|) \quad (8)$$

where the approximating function $f_i(\cdot)$ is represented as a sum of $m$ RBFs $\varphi(\cdot)$, each associated with a different center $t_j$ and weighting coefficient $w_j$. This form has been particularly used in nonlinear systems [25]. In our implementation, we apply the multiquadric basis function $\varphi(\cdot) = \sqrt{\|t_i - t_j\|^2 + 1}$.

In order to apply the RBFs, we first train the weighting coefficients on the training coherent features of HR and LR face images. The value we want to approximate is the coherent HR feature, while the input value is the coherent feature of the LR face image. So, in the training stage, we substitute the coherent features of the LR face images $c_i^L$ and $c_j^L$ for $t_i$ and $t_j$, and the coherent HR feature $c_i^H$ corresponding to $c_i^L$ for $f_i$. The aim of the RBFs is to establish the nonlinear mappings between $c_i^H$ and $c_i^L$. In our implementation, the value of $m$ in (8) is the size of the training set. Thus, the centers of these RBFs are the corresponding training LR coherent features $C^L$.

Fig. 3. Error in feature computation. (RBF regression error to the true HR feature, per subject index, for 100-D coherent features, 100-D PCA features, and 500-D PCA features.)
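The coherent-subspace construction of (3)–(7) can be sketched numerically as follows (our own illustration on synthetic data; the toy latent structure, `n_dims`, and the small ridge term `eps` added for numerical stability are assumptions, not the paper's settings):

```python
import numpy as np

def cca_bases(Xh, Xl, n_dims, eps=1e-6):
    """Xh (p, m) and Xl (q, m): centered feature sets, columns = samples.
    Returns base matrices V_H, V_L spanning the coherent subspaces."""
    m = Xh.shape[1]
    C11 = Xh @ Xh.T / m + eps * np.eye(Xh.shape[0])   # within-set covariances
    C22 = Xl @ Xl.T / m + eps * np.eye(Xl.shape[0])   # (eps ridge for stability)
    C12 = Xh @ Xl.T / m                               # between-set covariance
    C21 = C12.T
    R1 = np.linalg.solve(C11, C12) @ np.linalg.solve(C22, C21)  # Eq. (4)
    R2 = np.linalg.solve(C22, C21) @ np.linalg.solve(C11, C12)  # Eq. (5)
    e1, V_H = np.linalg.eig(R1)
    e2, V_L = np.linalg.eig(R2)
    # Keep leading eigenvectors; signs of paired directions are arbitrary.
    V_H = np.real(V_H[:, np.argsort(-np.real(e1))[:n_dims]])
    V_L = np.real(V_L[:, np.argsort(-np.real(e2))[:n_dims]])
    return V_H, V_L

# Toy data sharing a 5-D latent structure with per-dimension noise,
# mimicking coherent HR/LR features.
rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 200))
noise = np.array([0.05, 0.3, 0.6, 1.0, 1.5])[:, None]
Xh = np.vstack([Z, 0.5 * rng.standard_normal((3, 200))])
Xl = np.vstack([Z + noise * rng.standard_normal((5, 200)),
                0.5 * rng.standard_normal((2, 200))])
Xh -= Xh.mean(axis=1, keepdims=True)
Xl -= Xl.mean(axis=1, keepdims=True)
V_H, V_L = cca_bases(Xh, Xl, n_dims=3)
C_H, C_L = V_H.T @ Xh, V_L.T @ Xl                 # coherent features, (6)-(7)
print(abs(np.corrcoef(C_H[0], C_L[0])[0, 1]))
```

The leading pair of coherent coordinates is strongly correlated on this toy data, which is exactly the "overlap" effect illustrated in Fig. 1.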
The matrix form of the RBFs in (8) is represented as $F = W\Phi$, specifically

$$[f_1, \ldots, f_m] = [w_1, \ldots, w_m] \begin{bmatrix} \varphi(\|t_1 - t_1\|) & \cdots & \varphi(\|t_m - t_1\|) \\ \vdots & \ddots & \vdots \\ \varphi(\|t_1 - t_m\|) & \cdots & \varphi(\|t_m - t_m\|) \end{bmatrix}. \quad (9)$$

Then, the weighting coefficient matrix $W$ is solved as

$$W = F \cdot \Phi^{-1} \quad (10)$$

by setting $F = C^H$ and $t_i = c_i^L$. Note that, since $\Phi$ is not always invertible, we need to perform a regularization operation, that is, $\Phi + \tau I$, where $\tau$ is set to a small positive value such as $\tau = 10^{-3}$, and $I$ is the identity matrix. Based on the trained RBFs, the SR coherent feature for a given LR coherent feature can be obtained.

In our algorithm, the RBF mappings are performed on the coherent subspaces to obtain more exact recognition features. In order to illustrate the effectiveness of the coherent subspace, and to verify that it works as predicted, the errors between the HR features and the SR features produced by the RBF mappings are calculated in the PCA and coherent subspaces, respectively. We define the true HR feature as $s$ and the SR feature as $\hat{s}$. Then, the RBF regression error between these two features is defined as

$$D(s, \hat{s}) = \frac{\|s - \hat{s}\|_2}{\|s\|_2}. \quad (11)$$
In order to obtain the errors for comparison, the experiments are performed on the FERET expression database: the standard gallery is used as the training set, and 50 other images in the probe set are used for testing. We apply PCA for feature extraction, retaining the top 500 components for HR face images and the top 100 components for LR face images. The coherent features obtained by the CCA transformation then have 100 dimensions for both LR and HR face images. The errors for the 50 subjects are calculated in the PCA subspace and the coherent subspace, respectively. In order to compare the errors more fairly, the errors for 100-D features in the PCA subspace are also calculated. As shown in Fig. 3, the RBF regression errors to the true HR features in the coherent subspace are small, and these values
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
are much smaller in the coherent subspace than in the PCA subspace. The mean and variance over the 50 subjects are 0.0122 and 8.2389 × 10⁻⁵ in the coherent subspace, compared with 0.0865 and 0.0038 in the PCA subspace for 100D features. Thus, more exact features, which are then used for recognition, are estimated in the coherent subspace.

C. Super Resolution for Recognition

We feed the coherent features super-resolved from the features of LR faces to an NN classifier to achieve face recognition. In the testing phase, given an LR face image I_l, the PCA feature vector x_l of the input face image is computed as

  x_l = B_L^T (I_l − µ_L).   (12)

In our algorithm, we execute the recognition process in the coherent subspaces, so the PCA feature vector x_l is transformed to the coherent subspace using

  c_l = V_L^T (x_l − x̄_L).   (13)
The coherent SR feature c_h is obtained by feeding the coherent feature c_l of the LR face image to the trained RBF mapping in (8)

  c_h = W · [ϕ(c_1^L, c_l), …, ϕ(c_m^L, c_l)]^T.   (14)
Finally, we apply the coherent feature c_h and C^H = {c_i^H}, i = 1, …, m, for recognition based on NN classification with the L2 norm

  g_k(c_h) = min_i ||c_h − c_{ik}^H||_2,  i = 1, 2, …, m   (15)

where c_{ik}^H represents the ith sample in the kth class in C^H.
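The test-phase pipeline (12)–(15) can be sketched as follows, assuming NumPy and a Gaussian ϕ; every name and the toy data layout below are illustrative assumptions, not the authors' interfaces:

```python
import numpy as np

def super_resolve_and_classify(I_l, B_L, mu_L, V_L, xbar_L, W, centers_L,
                               gallery_H, labels, width=1.0):
    """Sketch of eqs. (12)-(15): PCA -> coherent subspace -> RBF SR -> NN.

    I_l        : flattened LR input image
    B_L, mu_L  : LR PCA basis (pixels x d) and mean image
    V_L, xbar_L: LR CCA projection and mean of LR PCA features
    W          : trained RBF weight matrix from eq. (10)
    centers_L  : (m, k) LR coherent-feature centers c_i^L
    gallery_H  : (n, k) HR coherent gallery features c_i^H
    labels     : class label of each gallery feature
    """
    phi = lambda r: np.exp(-(r ** 2) / (2.0 * width ** 2))
    x_l = B_L.T @ (I_l - mu_L)                    # (12) PCA feature
    c_l = V_L.T @ (x_l - xbar_L)                  # (13) coherent feature
    acts = phi(np.linalg.norm(centers_L - c_l, axis=1))
    c_h = W @ acts                                # (14) SR coherent feature
    d = np.linalg.norm(gallery_H - c_h, axis=1)   # (15) NN with L2 norm
    return labels[np.argmin(d)], c_h
```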
Fig. 4. Face images of one individual in the FERET face database. (a) HR training face image with size 72 × 72. (b) LR face image for training with size 12 × 12 from the training image set. (c) LR input face image for testing with size 12 × 12 from the testing image set.
Fig. 5. Recognition results with different feature dimensions for the FERET expression database.
V. EXPERIMENTS AND ANALYSIS

Our experiments are performed on the FERET face database [31], [32], the UMIST database [33], and the ORL database [34]. In order to demonstrate the effectiveness of our algorithm, we compare the face recognition rate of our method with those of the CLPM method [22]; Wang's method [11]; Gunturk's method [17]; the method that applies RBF to learn the relationship between training LR/HR PCA coefficient pairs and then, based on this relationship, obtains the interpolated PCA coefficients of the input LR images for recognition (PCA-RBF); the method that applies RBF to obtain the interpolated HR face image of the input LR face image and extracts the features of the interpolated image for recognition (RBF-PCA); the method that recognizes the features extracted from the interpolated image obtained by enlarging the input LR image to the HR size by nearest-neighbor interpolation (LR-PCA); and the method using the original HR face images (HR-PCA). The CLPM method uses coupled mappings to obtain a unified feature space that favors the classification task. Wang's method applies eigentransformation to obtain high-quality face images for recognition. Gunturk's method uses eigenface-domain SR to obtain HR features for recognition. To further analyze the advantages of our method, we compare the mean time taken by each method on the three databases, as well as the recognition rate at different downsampling rates on the relatively large FERET database. We explain these experiments in detail below.
A. FERET Face Database for Recognition

We evaluate our method on the FERET expression database [31], [32]. The influence of changing expression on the performance of these methods is analyzed in these experiments. The standard gallery, which contains 1196 images corresponding to 1196 individuals, is used as the training set, and the probe set fafb, which contains 1195 images corresponding to 1195 individuals, is used as the testing set. In the experiments, the HR face images, with a size of 72 × 72 pixels, are aligned with the positions of the two eyes. The LR images, with a size of 12 × 12 pixels, are generated by smoothing and downsampling. The face images of one individual in the FERET expression database are shown in Fig. 4. As the figure shows, the low-quality LR face image is difficult to recognize. For the CLPM method, we choose |N(i)| = 1 for the best performance [22]. For the other methods, PCA is employed to extract the features. For Gunturk's method, the number of iterations is set to 7 and λ = 0.5. In these experiments, the images are adjusted to zero mean and unit variance for normalization. Unless otherwise stated, the same parameters are used in the following experiments on the other face databases. In Fig. 5, the recognition rates with different feature dimensions are plotted. Because our method, the CLPM method, Wang's method, Gunturk's method, and the PCA-RBF method need to
HUANG AND HE: FACE RECOGNITION USING NONLINEAR MAPPINGS ON COHERENT FEATURES
127
Fig. 6. Cumulative recognition results for the FERET expression database.
obtain the features from LR face images with a size of 12 × 12 pixels, the largest dimension that can be chosen for these five methods is 144. Our method with 144D features achieves the highest recognition rate, 0.844, among the compared methods. The recognition rates achieved by the CLPM method, Wang's method, and PCA-RBF with 144D features are 0.803, 0.701, and 0.784, respectively. For Gunturk's method, the recognition rate with 120D features is 0.655. On the other hand, the recognition rates of RBF-PCA, LR-PCA, and HR-PCA with 350D features are only 0.697, 0.369, and 0.695. Based on Fig. 5, the recognition rate of our method is slightly higher than that of HR-PCA. A possible explanation is that the neighborhood constraints between HR and LR images introduced by CCA make the reconstructed features and their neighborhoods more coherent, which favors the NN classifier. Using the dimensions selected above, we plot the cumulative recognition results of the different methods in Fig. 6. For the calculation of cumulative recognition with rank number k, an input image is considered correctly recognized if at least one of the k nearest neighbors of the LR input's SR feature in the training HR coherent feature domain belongs to the same individual as the LR input; the recognition rate then follows directly. As shown in Fig. 6, our method outperforms the other compared methods of LR face recognition. The reason is that our method obtains more coherent features than the CLPM, Wang's, Gunturk's, and PCA-RBF methods, which favors the NN classifier. Our method reconstructs input LR images in the feature space where recognition is performed, while RBF-PCA interpolates input LR images in the pixel space. Our method also uses CCA to improve the neighborhood preservation rate, while RBF-PCA does not. A more fitting reconstruction space and a higher neighborhood preservation rate allow our method to obtain a higher recognition rate than RBF-PCA.
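The rank-k cumulative recognition criterion described above can be made concrete with a short sketch (NumPy assumed; the function name and data layout are illustrative):

```python
import numpy as np

def cumulative_recognition_rate(sr_features, sr_labels, gallery, gallery_labels, k):
    """Rank-k rate: an input counts as recognized if any of its k nearest
    gallery neighbors (L2 distance, in the HR coherent feature domain)
    belongs to the same individual."""
    hits = 0
    for f, lab in zip(sr_features, sr_labels):
        d = np.linalg.norm(gallery - f, axis=1)
        nearest = np.argsort(d)[:k]          # indices of k nearest gallery features
        hits += int(np.any(gallery_labels[nearest] == lab))
    return hits / len(sr_labels)
```

By construction the rate is nondecreasing in k, which is why the curves in Fig. 6 and Tables II and IV rise with the rank number.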
More detailed information recovered by the SR technique makes our method better than the LR-PCA method.

B. UMIST Database for Recognition

The UMIST face image database consists of 564 images of 20 individuals, each covering a wide range of poses from profile to frontal views. The influence of pose variations on the performance of these methods is studied in
Fig. 7. Face images of one individual in the UMIST database. (a) HR training face images with size 56 × 46. (b) LR training face images with size 14 × 11. (c) LR input face images with size 14 × 11.

TABLE I
RECOGNITION RESULTS WITH DIFFERENT DIMENSIONS FOR THE UMIST DATABASE

Dimension         70    90    100   120   154
Our method        0.92  0.93  0.93  0.93  0.93
CLPM              0.81  0.86  0.87  0.87  0.88
Wang's method     0.88  0.89  0.91  0.91  0.91
Gunturk's method  0.49  0.44  0.45  0.44  0.43
PCA-RBF           0.91  0.91  0.91  0.91  0.91
RBF-PCA           0.91  0.91  0.91  0.91  0.91
LR-PCA            0.86  0.87  0.89  0.88  0.90
HR-PCA            0.91  0.91  0.91  0.91  0.92
these experiments. For each individual in the UMIST database, we chose 10 images for training and the other 5 images for testing. In the experiments, the HR face images have a size of 56 × 46 pixels, and the LR images, with a size of 14 × 11 pixels, are generated by smoothing and downsampling. The face images of one individual in the UMIST database are shown in Fig. 7; the views of the other individuals are similar. The recognition rates with different feature dimensions are listed in Table I. Our method with 90D features achieves the highest recognition rate, 0.93, among the compared methods. The recognition rates achieved by CLPM, LR-PCA, and HR-PCA with 154D features are 0.88, 0.90, and 0.92, respectively. For Gunturk's method, the recognition rate with 70D features is only 0.49. On the other hand, the recognition rate of PCA-RBF and RBF-PCA with 70D features, and of Wang's method with 100D features, is 0.91. From these results, our method achieves the highest recognition rate with fewer feature dimensions. The performance of the CLPM method and Gunturk's method is even worse than that of the RBF-PCA and LR-PCA methods. The reason is that the CLPM method applies a linear mapping to obtain the recognition features, which is not
TABLE II
CUMULATIVE RECOGNITION RESULTS FOR THE UMIST DATABASE

Rank              1     2     3     4     5
Our method        0.93  0.96  0.96  0.96  0.97
CLPM              0.88  0.88  0.90  0.91  0.93
Wang's method     0.91  0.94  0.94  0.95  0.95
Gunturk's method  0.49  0.53  0.58  0.60  0.64
PCA-RBF           0.91  0.94  0.95  0.96  0.98
RBF-PCA           0.91  0.94  0.94  0.95  0.96
LR-PCA            0.90  0.93  0.94  0.94  0.95
HR-PCA            0.92  0.95  0.95  0.96  0.98
suitable for face images with nonlinear pose variations, and the Gaussian assumption on the PCA features in Gunturk's method is not applicable to face images with pose variations. Using the dimensions selected above, the cumulative recognition results are shown in Table II. We can see that our method again obtains a higher recognition rate than the other compared methods, so our method is robust to pose variations. In these experiments, the SR recognition methods other than ours hardly have any advantage over the RBF-PCA and LR-PCA methods. The reasons are the following. 1) There are only 20 people in this experiment, each with 10 training images. 2) For every person, the testing and training LR images are very similar; once the input LR image matches one of the corresponding 10 training images, it is correctly recognized. Both the small size of the database and the many training images available per input cause the high recognition rate of the LR-PCA method, but such a special experimental condition is rare in practical recognition. Based on the other experiments in this paper, when the recognition condition is harder, i.e., more individuals and fewer images per individual in the training sets, the SR methods are better than the RBF-PCA and LR-PCA methods.

C. ORL Database for Recognition

We also show results on the ORL database. The ORL database includes 40 individuals, each with 10 different face images. Across the 400 images, there are large variations in lighting, facial expression, and pose. For each individual in the ORL database, we choose five images for training and the other five for testing, so there are only 200 face images for training. In the experiments, the HR face images, with a size of 32 × 32 pixels, and the LR images, with a size of 8 × 8 pixels, are generated by smoothing and downsampling. The face images of one individual in the ORL database are shown in Fig. 8.
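The smoothing-and-downsampling step used to generate LR images (e.g., 32 × 32 → 8 × 8 in the ORL setting) can be sketched as follows; the paper does not specify the smoothing kernel, so the box filter and all names here are assumptions:

```python
import numpy as np

def generate_lr(hr, factor, ksize=None):
    """Smooth an HR image with a box filter, then subsample by `factor`.

    With factor = 4, a 32x32 HR image yields an 8x8 LR image,
    matching the ORL experimental setting.
    """
    k = ksize or factor
    kernel = np.ones((k, k)) / (k * k)   # assumed box smoothing kernel
    h, w = hr.shape
    pad = k // 2
    padded = np.pad(hr, pad, mode='edge')
    smoothed = np.empty_like(hr, dtype=float)
    for i in range(h):
        for j in range(w):
            smoothed[i, j] = (padded[i:i + k, j:j + k] * kernel).sum()
    return smoothed[::factor, ::factor]  # keep every factor-th row/column
```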
In Table III, the recognition rates with different feature dimensions are listed. From the results, we see that our method achieves the highest recognition rate, 0.950, with 50D features. The recognition rates of Gunturk's method and HR-PCA are 0.910 and 0.915, respectively, with 50D features. CLPM, Wang's method, the PCA-RBF method, and the RBF-PCA method obtain recognition rates of 0.915, 0.910, 0.890, and 0.880 with 40D features, and LR-PCA obtains 0.845 with 64D features. Using the dimensions selected above, we list the cumulative recognition results in Table IV. In our
Fig. 8. Face images of one individual in the ORL database. (a) HR training face images with size 32 × 32. (b) LR training face images with size 8 × 8. (c) LR input face images with size 8 × 8.

TABLE III
RECOGNITION RESULTS WITH DIFFERENT DIMENSIONS FOR THE ORL DATABASE

Dimension         20     30     40     50     64
Our method        0.910  0.920  0.930  0.950  0.945
CLPM              0.845  0.905  0.915  0.915  0.910
Wang's method     0.865  0.895  0.910  0.905  0.905
Gunturk's method  0.855  0.880  0.900  0.910  0.910
PCA-RBF           0.870  0.885  0.890  0.885  0.890
RBF-PCA           0.850  0.875  0.880  0.880  0.880
LR-PCA            0.815  0.840  0.840  0.835  0.845
HR-PCA            0.880  0.895  0.905  0.915  0.910

TABLE IV
CUMULATIVE RECOGNITION RESULTS FOR THE ORL DATABASE

Rank              1      2      3      4      5
Our method        0.950  0.965  0.980  0.985  0.990
CLPM              0.915  0.930  0.955  0.960  0.965
Wang's method     0.910  0.930  0.945  0.950  0.975
Gunturk's method  0.910  0.935  0.950  0.955  0.970
PCA-RBF           0.890  0.910  0.945  0.970  0.970
RBF-PCA           0.880  0.910  0.930  0.965  0.965
LR-PCA            0.845  0.890  0.905  0.940  0.950
HR-PCA            0.910  0.930  0.970  0.975  0.980
method, the nonlinear variations can be captured by the RBF, which provides the nonlinear mappings between the coherent features of HR and LR face images. Its performance is therefore better than those of the other compared methods.

D. Time Complexity Analysis

In this experiment, the time complexity of each method is discussed. Table V gives the mean runtime of every method over five runs on the FERET, UMIST, and ORL databases. It can be seen that the time consumed by our method is closest to the simple PCA-RBF, LR-PCA, and HR-PCA methods. The times consumed by the CLPM, Wang's, and RBF-PCA methods range from roughly 1.5 to 9 times ours. The time consumed by Gunturk's method is the longest, about 60 times as long as ours. Based on these results, the complexity of our method, which obtains the highest recognition rate, is the lowest among the
TABLE V
TIME COMPLEXITY OF EACH METHOD (IN S)

Database          FERET   UMIST  ORL
Our method        14.04   0.14   0.62
CLPM              21.29   0.33   3.21
Wang's method     128.71  0.64   1.56
Gunturk's method  863.82  9.24   183.12
PCA-RBF           14.01   0.14   0.46
RBF-PCA           57.96   0.99   1.74
LR-PCA            8.89    0.13   0.44
HR-PCA            8.90    0.13   0.49
achieve high recognition rates in the coherent subspaces. Compared to other feature-domain SR methods, our method is more robust to variations in expression, pose, lighting, and downsampling rate, and has a higher recognition rate. In this paper, CCA was applied to the classical PCA features to form the coherent features for recognition, but the approach is also applicable to other holistic face recognition features, such as independent component analysis and discrete cosine transform features [58], which might further improve recognition performance.
TABLE VI
RECOGNITION RESULTS WITH DIFFERENT DOWNSAMPLING RATES ON THE FERET DATABASE

Downsampling rate  4      5      6      7      8
Our method         0.835  0.844  0.841  0.786  0.803
CLPM               0.802  0.701  0.674  0.674  0.582
Wang's method      0.696  0.693  0.699  0.520  0.520
Gunturk's method   0.703  0.662  0.700  0.475  0.535
PCA-RBF            0.792  0.795  0.791  0.785  0.777
RBF-PCA            0.677  0.687  0.692  0.687  0.682
LR-PCA             0.458  0.408  0.369  0.302  0.260
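The stability ranges quoted in Section V-E (maximum minus minimum recognition rate across downsampling rates) can be checked directly from the Table VI values; a small script, with the numbers transcribed from the table:

```python
# Recognition rates from Table VI, downsampling rates 4-8.
table_vi = {
    "Our method":       [0.835, 0.844, 0.841, 0.786, 0.803],
    "CLPM":             [0.802, 0.701, 0.674, 0.674, 0.582],
    "Wang's method":    [0.696, 0.693, 0.699, 0.520, 0.520],
    "Gunturk's method": [0.703, 0.662, 0.700, 0.475, 0.535],
    "LR-PCA":           [0.458, 0.408, 0.369, 0.302, 0.260],
}

# Range = maximum minus minimum over the five downsampling rates.
ranges = {m: round(max(v) - min(v), 3) for m, v in table_vi.items()}
# "Our method" varies by only 0.058; the others by roughly 0.18-0.23.
```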
ACKNOWLEDGMENT

The authors would like to thank B. Li for providing the source code of his method [22]. Portions of the research in this paper use the FERET database of facial images collected under the FERET program, sponsored by the Department of Defense Counterdrug Technology Development Program Office. The authors would also like to express their sincere gratitude to the anonymous reviewers for their invaluable comments and constructive suggestions.

REFERENCES
SR recognition methods and is close to the direct LR and HR recognition methods.

E. Impact of Downsampling Rate

In this experiment, we study the impact of the downsampling rate on each SR recognition method. With downsampling rates of 4, 5, 6, 7, and 8, all methods were applied to the relatively large FERET database. Table VI gives the corresponding results. We can see that, in general, the larger the downsampling rate, the lower the recognition rate for every method. At all downsampling rates, our method obtains the highest recognition rate. With the downsampling rate changing from 4 to 8, the range of the recognition rate (the maximum minus the minimum) of our method is only 0.058 (i.e., 0.844 minus 0.786), which shows that our method is very stable. The corresponding ranges of the CLPM, Wang's, Gunturk's, and LR-PCA methods are 0.220, 0.179, 0.228, and 0.198, respectively. Although the PCA-RBF and RBF-PCA methods are also stable, their recognition rates are 3% and 4% lower than ours, on average. Thus, our method is the best when considering both stability and effectiveness.

VI. CONCLUSION

To address the reduced recognition rates caused by LR face images, an SR method in the feature domain for face recognition was proposed in this paper. CCA was applied to obtain the coherent subspaces between the holistic features of HR and LR face images, and an RBF model was used to construct the nonlinear mapping relationship between the coherent features. Then, the SR feature in the HR space of the single input LR face image was obtained for recognition. Experiments show that even the simple NN classifier can
[1] W. Zhao, R. Chellappa, P. Phillips, and A. Rosenfeld, “Face recognition: A literature survey,” ACM Comput. Surveys (CSUR), vol. 35, no. 4, pp. 399–458, Dec. 2003. [2] O. Sezer, Y. Altunbasak, and A. Ercil, “Face recognition with independent component-based super-resolution,” in Proc. SPIE Visual Commun. Image Process., vol. 6077. San Francisco, CA, 2006, pp. 52–66. [3] J. Lu, K. Plataniotis, A. Venetsanopoulos, and S. Li, “Ensemblebased discriminant learning with boosting for face recognition,” IEEE Trans. Neural Netw., vol. 17, no. 1, pp. 166–178, Jan. 2006. [4] J. Lu, X. Yuan, and T. Yahagi, “A method of face recognition based on fuzzy c-means clustering and associated sub-NNs,” IEEE Trans. Neural Netw., vol. 18, no. 1, pp. 150–160, Jan. 2007. [5] S. Phung and A. Bouzerdoum, “A pyramidal neural network for visual pattern recognition,” IEEE Trans. Neural Netw., vol. 18, no. 2, pp. 329– 343, Mar. 2007. [6] K.-C. Kwak and W. Pedrycz, “Face recognition using an enhanced independent component analysis approach,” IEEE Trans. Neural Netw., vol. 18, no. 2, pp. 530–541, Mar. 2007. [7] J. V. Ouwerkerk, “Image super-resolution survey,” Image Vis. Comput., vol. 24, no. 10, pp. 1039–1052, Oct. 2006. [8] S. C. Park, M. K. Park, and M. G. Kang, “Super-resolution image reconstruction: A technical overview,” IEEE Signal Process. Mag., vol. 20, no. 3, pp. 21–36, May 2003. [9] S. Baker and T. Kanade, “Hallucinating faces,” in Proc. 4th IEEE Int. Conf. Autom. Face Gesture Recognit., Grenoble, France, Mar. 2000, pp. 83–88. [10] F. Lin, C. Fookes, V. Chandran, and S. Sridharan, “Super-resolved faces for improved face recognition from surveillance video,” in Advances in Biometrics (Lecture Notes in Computer Science), vol. 4642, New York: Springer-Verlag, 2007, pp. 1–10. [11] X. Wang and X. Tang, “Hallucinating face by eigentransformation,” IEEE Trans. Syst., Man, Cybern., Part C: Appl. Rev., vol. 35, no. 3, pp. 425–434, Aug. 2005. [12] F. Wheeler, X. Liu, and P. 
Tu, “Multi-frame super-resolution for face recognition,” in Proc. IEEE Conf. Biometrics: Theory, Appl. Syst., Crystal City, VA, Sep. 2007, pp. 1–6. [13] P. Sinha, B. Balas, Y. Ostrovsky, and R. Russell, “Face recognition by humans: Nineteen results all computer vision researchers should know about,” Proc. IEEE, vol. 94, no. 11, pp. 1948–1962, Nov. 2006. [14] L. Wiskott, J.-M. Fellous, N. Krüger, and C. von der Malsburg, “Face recognition by elastic bunch graph matching,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 775–779, Jul. 1997. [15] H. Zhang, B. Zhang, W. Huang, and Q. Tian, “Gabor wavelet associative memory for face recognition,” IEEE Trans. Neural Netw., vol. 16, no. 1, pp. 275–278, Jan. 2005.
[16] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971–987, Jul. 2002. [17] B. Gunturk, A. Batur, Y. Altunbasak, M. Hayes, and R. Mersereau, “Eigenface-domain super-resolution for face recognition,” IEEE Trans. Image Process., vol. 12, no. 5, pp. 597–606, May 2003. [18] S.-W. Lee, J. Park, and S.-W. Lee, “Low resolution face recognition based on support vector data description,” Pattern Recognit., vol. 39, no. 9, pp. 1809–1812, Sep. 2006. [19] O. Arandjelovic and R. Cipollam, “A manifold approach to face recognition from low quality video across illumination and pose using implicit super-resolution,” in Proc. IEEE Int. Conf. Comput. Vis., Rio de Janeiro, Brazil, Oct. 2007, pp. 1–8. [20] A. Burton, S. Wilson, M. Cowan, and V. Bruce, “Face recognition in poor-quality video: Evidence from security surveillance,” Psychol. Sci., vol. 10, no. 3, pp. 243–247, 1999. [21] Y. Zhuang, J. Zhang, and F. Wu, “Hallucinating faces: LPH superresolution and neighbor reconstruction for residue compensation,” Pattern Recognit., vol. 40, no. 11, pp. 3178–3194, Nov. 2007. [22] B. Li, H. Chang, S. Shan, and X. Chen, “Low-resolution face recognition via coupled locality preserving mappings,” IEEE Signal Process.Lett., vol. 17, no. 1, pp. 20–23, Jan. 2010. [23] P. Hennings-Yeomans, S. Baker, and B. Kumar, “Simultaneous superresolution and feature extraction for recognition of low resolution faces,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Anchorage, AK, Jun. 2008, pp. 1–8. [24] K. Jia and S. Gong, “Generalized face super-resolution,” IEEE Trans. Image Process., vol. 17, no. 6, pp. 873–886, Jun. 2008. [25] M. Buhmann, “Radial basis functions,” Acta Numer., vol. 9, pp. 1–38, Mar. 2001. [26] J. Goldberger, S. Roweis, G. Hinton, and R. 
Salakhutdinov, “Neighbourhood components analysis,” in Advances in Neural Information Processing Systems, Cambridge, MA: MIT Press, 2004, pp. 513–520. [27] D. Masip and J. Vitria, “Shared feature extraction for nearest neighbor face recognition,” IEEE Trans. Neural Netw., vol. 19, no. 4, pp. 586–595, Apr. 2008. [28] R. Salakhutdinov, “Learning deep generative models,” Ph.D. dissertation, Graduate Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, 2009. [29] H. Chang, D.-Y. Yeung, and Y. Xiong, “Super-resolution through neighbor embedding,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1. Washington D.C., Jun.–Jul. 2004, pp. 275–282. [30] H. Huang, H. He, X. Fan, and J. Zhang, “Super-resolution of human face image using canonical correlation analysis,” Pattern Recognit., vol. 43, no. 7, pp. 2532–2543, Jul. 2010. [31] P. Phillips, H. Wechsler, J. Huang, and P. Rauss, “The FERET database and evaluation procedure for face recognition algorithms,” Image Vis. Comput., vol. 16, no. 5, pp. 295–306, Apr. 1998. [32] P. Phillips, H. Moon, S. Rizvi, and P. Rauss, “The FERET evaluation methodology for face recognition algorithms,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 10, pp. 1090–1104, Oct. 2000. [33] The UMIST Face Database [Online]. Available: http://www.shef.ac. uk/eee/research/vie/research/face.html [34] The ORL Face Database [Online]. Available: http://www.cl.cam.ac. uk/research/dtg/attarchive/facedatabase.html [35] M. Turk and A. Pentland, “Eigenfaces for recognition,” J. Cognitive Neurosci., vol. 3, no. 1, pp. 71–86, 1991. [36] F. Lin, J. Cook, V. Chandran, and S. Sridharan, “Face recognition from super-resolved images,” in Proc. 8th Int. Symp. Signal Process. Appl., vol. 2. Sydney, NSW, Australia, Aug. 2005, pp. 667–670. [37] M. Al-Azzeh, A. Eleyan, and H. Demirel, “PCA-based face recognition from video using super-resolution,” in Proc. 23rd Int. Symp. Comput. Inform. Sci., Istanbul, Turkey, Oct. 2008, pp. 1–4. [38] S. 
Zhou, V. Krueger, and R. Chellappa, “Probabilistic recognition of human faces from video,” Comput. Vis. Image Understanding, vol. 91, nos. 1–2, pp. 214–245, Jul.–Aug. 2003. [39] K. Jia and S. Gong, “Multi-modal tensor face for simultaneous superresolution and recognition,” in Proc. IEEE Int. Conf. Comput. Vis., vol. 2. Beijing, China, Oct. 2005, pp. 1683–1690. [40] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, nos. 3–4, pp. 321–377, 1936. [41] P. L. Lai and C. Fyfe, “Kernel and nonlinear canonical correlation analysis,” Int. J. Neural Syst., vol. 10, no. 5, pp. 365–377, Oct. 2000.
[42] T.-K. Kim, S.-F. Wong, and R. Cipolla, “Tensor canonical correlation analysis for action classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Minneapolis, MN, Jun. 2007, pp. 1–8. [43] L. Sun, S. Ji, and J. Ye, “A least squares formulation for canonical correlation analysis,” in Proc. 25th Int. Conf. Mach. Learn., Helsinki, Finland, Jul. 2008, pp. 1024–1031. [44] W. Zheng, X. Zhou, C. Zou, and L. Zhao, “Facial expression recognition using kernel canonical correlation analysis (KCCA),” IEEE Trans. Neural Netw., vol. 17, no. 1, pp. 233–238, Jan. 2006. [45] M. Borga, “Learning multidimensional signal processing,” Ph.D. dissertation, Dept. Electr. Eng., Linköping Univ., Sweden, 1998. [46] D. Broomhead and D. Lowe, “Radial basis functions, multi-variable functional interpolation and adaptive networks,” Complex Syst., vol. 2, pp. 321–355, Mar. 1988. [47] R. Schaback, “Multivariate interpolation by polynomials and radial basis functions,” Constructive Approx., vol. 21, no. 3, pp. 293–317, 2005. [48] S. Chen, C. Cowan, P. Grant et al., “Orthogonal least squares learning algorithm for radial basis function networks,” IEEE Trans. Neural Netw., vol. 2, no. 2, pp. 302–309, Mar. 1991. [49] B. Scholkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik, “Comparing support vector machines with Gaussian kernels to radial basis function classifiers,” IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2758–2765, Nov. 1997. [50] B. Le Roux and H. Rouanet, Geometric Data Analysis. New York: Springer-Verlag, 2004. [51] M. Samozino, M. Alexa, P. Alliez, and M. Yvinec, “Reconstruction with Voronoi centered radial basis functions,” in Proc. Symp. Geometry Process., Cagliari, Sardinia, Italy, Jun. 2006, pp. 51–60. [52] J. Moody and C. Darken, “Fast learning in networks of locally-tuned processing units,” Neural Comput., vol. 1, no. 2, pp. 281–294, 1989. [53] M. Er, S. Wu, J. Lu, and H. 
Toh, “Face recognition with radial basis function (RBF) neural networks,” IEEE Trans. Neural Netw., vol. 13, no. 3, pp. 697–710, May 2002. [54] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, Dec. 2000. [55] L. K. Saul and S. T. Roweis, “Think globally, fit locally: Unsupervised learning of low dimensional manifolds,” J. Mach. Learn. Res., vol. 4, pp. 119–155, Dec. 2004. [56] M. Borga. (2001), Canonical Correlation a Tutorial [Online]. Available: http://people.imt.liu.se/magnus/cca [57] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer-Verlag, 2006. [58] M. J. Er, W. Chen, and S. Wu, “High-speed face recognition based on discrete cosine transform and RBF neural networks,” IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 679–691, May 2005.
Hua Huang was born in 1975. He received the B.S., M.S., and Ph.D. degrees from Xi’an Jiaotong University, Xi’an, China, in 1996, 2001, and 2006, respectively. He is currently a Professor at the School of Electronics and Information Engineering, Xi’an Jiaotong University. His current research interests include image and video processing, machine learning, and pattern recognition.
Huiting He received the B.S. degree from Chang’an University, Xi’an, China, in 2007. She is currently a post-graduate student in the School of Electronics and Information Engineering, Xi’an Jiaotong University, Xi’an. Her current research interests include image super resolution and manifold learning.
Minimum Complexity Echo State Network

Ali Rodan, Student Member, IEEE, and Peter Tiňo
Abstract— Reservoir computing (RC) refers to a new class of state-space models with a fixed state transition structure (the reservoir) and an adaptable readout from the state space. The reservoir is supposed to be sufficiently complex so as to capture a large number of features of the input stream that can be exploited by the reservoir-to-output readout mapping. The field of RC has been growing rapidly with many successful applications. However, RC has been criticized for not being principled enough. Reservoir construction is largely driven by a series of randomized model-building stages, with both researchers and practitioners having to rely on a series of trials and errors. To initialize a systematic study of the field, we concentrate on one of the most popular classes of RC methods, namely echo state networks, and ask: What is the minimal complexity of reservoir construction for obtaining competitive models, and what is the memory capacity (MC) of such simplified reservoirs? On a number of widely used time series benchmarks of different origin and characteristics, as well as by conducting a theoretical analysis, we show that a simple deterministically constructed cycle reservoir is comparable to the standard echo state network methodology. The (short-term) MC of linear cyclic reservoirs can be made arbitrarily close to the proved optimal value.

Index Terms— Echo state networks, memory capability, neural networks, reservoir computing, simple recurrent neural networks, time-series prediction.
I. INTRODUCTION

Recently, there has been an outburst of research activity in the field of reservoir computing (RC) [1]. RC models are dynamical models for processing time series that make a conceptual separation of the temporal data processing into two parts: 1) representation of temporal structure in the input stream through a nonadaptable dynamic reservoir, and 2) a memoryless, easy-to-adapt readout from the reservoir. For a comprehensive recent review of RC, see [2]. Perhaps the simplest form of the RC model is the echo state network (ESN) [3]–[6]. Roughly speaking, an ESN is a recurrent neural network with a nontrainable sparse recurrent part (reservoir) and a simple linear readout. Connection weights in the ESN reservoir, as well as the input weights, are randomly generated. The reservoir weights are scaled so as to ensure the echo state property (ESP): the reservoir state is an "echo" of the entire input history. Typically, the spectral radius of the reservoir's

Manuscript received January 21, 2010; accepted October 9, 2010. Date of publication November 11, 2010; date of current version January 4, 2011. The authors are with the School of Computer Science, University of Birmingham, Birmingham B15 2TT, U.K. (e-mail:
[email protected];
[email protected]). Digital Object Identifier 10.1109/TNN.2010.2089641
weight matrix W is made smaller than 1.¹ ESNs have been successfully applied in time-series prediction tasks [6], speech recognition [7], noise modeling [6], dynamic pattern classification [5], reinforcement learning [8], and language modeling [9]. Many extensions of the classical ESN have been suggested in the literature, e.g., intrinsic plasticity [10], [11], decoupled reservoirs [12], refined training algorithms [6], leaky-integrator reservoir units [13], support vector machines [14], filter neurons with delay-and-sum readout [15], etc. However, there are still serious problems preventing the ESN from becoming a widely accepted tool. 1) There are properties of the reservoir that are poorly understood [12]. 2) Specification of the reservoir and input connections requires numerous trials and even luck [12]. 3) Strategies to select different reservoirs for different applications have not been devised [16]. 4) Imposing a constraint on the spectral radius of the reservoir matrix is a weak tool to properly set the reservoir parameters [16]. 5) The random connectivity and weight structure of the reservoir is unlikely to be optimal and does not give a clear insight into the reservoir dynamics organization [16]. Indeed, it is not surprising that part of the scientific community is skeptical about ESNs being used for practical applications [17]. Typical model construction decisions that an ESN user must make include the following: setting the reservoir size; setting the sparsity of the reservoir and input connections; setting the ranges for the random input and reservoir weights; and setting the reservoir matrix scaling parameter α. The dynamical part of the ESN responsible for input stream coding is treated as a black box, which is unsatisfactory from both theoretical and empirical standpoints. First, it is difficult to put a finger on what it actually is in the reservoir's dynamical organization that makes the ESN so successful.
Second, the user is required to tune parameters whose function is not well understood. In this paper, we clarify the reservoir construction by systematic investigation: namely, we show that a very simple ESN organization is sufficient to obtain performances comparable to those of the classical ESN. We argue that for a variety of tasks it is sufficient to consider: 1) a simple fixed nonrandom reservoir topology with full connectivity from inputs to the reservoir; 2) a single fixed absolute weight value r for all reservoir connections; and 3) a single weight value v for input connections, with a (deterministically generated) aperiodic pattern of input signs. In contrast to the complex trial-and-error ESN construction, our approach leaves the user with only two free parameters to
1 Note that this is not a necessary and sufficient condition for the ESP.
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
Fig. 1. ESN architecture. [Diagram: K input units s(t) connect through the input weights V to a dynamical reservoir of N internal units x(t) with recurrent weights W; the readout U maps the reservoir to L output units y(t).]

be set, i.e., r and v. This not only considerably simplifies the ESN construction, but also enables a more thorough theoretical analysis of the reservoir properties. This could open the door to a wider acceptance of the ESN methodology among both practitioners and theoreticians working in the field of time-series modeling/prediction. In addition, our simple deterministically constructed reservoir models can serve as useful baselines in future RC studies. This paper is organized as follows. Section II gives an overview of ESN design and training. In Section III, we present our simplified reservoir topologies. Experimental results are presented in Section IV. We analyze both theoretically and empirically the short-term memory capacity (MC) of our simple reservoir in Section V. Finally, our work is discussed and concluded in Sections VI and VII, respectively.

II. ESNs

An ESN is a recurrent discrete-time neural network with K input units, N internal (reservoir) units, and L output units. The activations of the input, internal, and output units at time step t are denoted by s(t) = (s_1(t), ..., s_K(t))^T, x(t) = (x_1(t), ..., x_N(t))^T, and y(t) = (y_1(t), ..., y_L(t))^T, respectively. The connections between the input units and the internal units are given by an N × K weight matrix V, connections between the internal units are collected in an N × N weight matrix W, and connections from the internal units to the output units are given in an L × N weight matrix U. The internal units are updated according to²

x(t + 1) = f(V s(t + 1) + W x(t))    (1)

where f is the reservoir activation function (typically tanh or some other sigmoidal function). The linear readout is computed as³

y(t + 1) = U x(t + 1).    (2)

Elements of W and V are fixed prior to training with random values drawn from a uniform distribution over a (typically) symmetric interval. To account for the ESP, the reservoir connection matrix W is typically scaled as W ← αW/|λ_max|, where |λ_max| is the spectral radius⁴ of W and 0 < α < 1 is a scaling parameter [5]. The ESN memoryless readout can be trained both offline (batch) and online by minimizing any suitable loss function. We use the normalized mean square error (NMSE) to train and evaluate the models

NMSE = ⟨‖ŷ(t) − y(t)‖²⟩ / ⟨‖y(t) − ⟨y(t)⟩‖²⟩    (3)

2 There are no feedback connections from the output to the reservoir and no direct connections from the input to the output.
3 The reservoir activation vector is extended with a fixed element accounting for the bias term.
where ŷ(t) is the readout output, y(t) is the desired output (target), ‖·‖ denotes the Euclidean norm, and ⟨·⟩ denotes the empirical mean. To train the model in offline mode, we: 1) initialize W with a scaling parameter α < 1 and run the ESN on the training set; 2) dismiss data from the initial washout period and collect the remaining network states x(t) row-wise into a matrix X;⁵ and 3) calculate the readout weights using, e.g., ridge regression [18]

U = (X^T X + λ² I)^{−1} X^T y    (4)

where I is the identity matrix, y is a vector of the target values, and λ > 0 is a regularization factor.

III. SIMPLE ESN RESERVOIRS

To simplify the reservoir construction, we propose several easily structured topology templates and compare them with the classical ESN. We consider both linear reservoirs, consisting of neurons with the identity activation function, and nonlinear reservoirs, consisting of neurons with the commonly used hyperbolic tangent (tanh) activation function. Linear reservoirs are fast to simulate but often lead to inferior performance compared to nonlinear ones [19].

A. Reservoir Topology

Besides the classical ESN reservoir introduced in the last section (Fig. 1), we consider the following three reservoir templates (model classes) with fixed topologies (Fig. 2).
1) Delay line reservoir (DLR), composed of units organized in a line. Only elements on the lower subdiagonal of the reservoir matrix W have nonzero values, W_{i+1,i} = r for i = 1, ..., N − 1, where r is the weight of all the feedforward connections.
2) DLR with feedback connections (DLRB), which has the same structure as the DLR, but each reservoir unit is also connected to the preceding neuron. Nonzero elements of W are on the lower subdiagonal, W_{i+1,i} = r, and the upper subdiagonal, W_{i,i+1} = b, where b is the weight of all the feedback connections.
3) Simple cycle reservoir (SCR), in which units are organized in a cycle. Nonzero elements of W are on the lower subdiagonal, W_{i+1,i} = r, and at the upper-right corner, W_{1,N} = r.

4 The largest among the absolute values of the eigenvalues of W.
5 In the case of direct input-output connections, the matrix X collects the inputs s(t) as well.
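The construction and training steps of Section II can be sketched end to end. The following is a minimal illustration, not the paper's experimental setup: the sizes, the scaling α = 0.9, the regularization λ, and the toy one-step-memory target are all illustrative assumptions.

```python
# Minimal ESN sketch: update (1)-(2), spectral-radius scaling of W, and the
# ridge-regression readout (4). All numeric choices below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, T, washout = 100, 1000, 100

s = rng.uniform(0.0, 0.5, size=T)                  # univariate input stream
d = np.roll(s, 1)                                  # toy target: d(t) = s(t-1)

V = rng.uniform(-0.1, 0.1, size=N)                 # input weights
W = rng.uniform(-1.0, 1.0, size=(N, N))            # reservoir weights
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))    # W <- alpha W / |lambda_max|

# Run the reservoir: x(t+1) = tanh(V s(t+1) + W x(t)), collecting states row-wise.
X = np.zeros((T, N))
x = np.zeros(N)
for t in range(T):
    x = np.tanh(V * s[t] + W @ x)
    X[t] = x

# Dismiss the washout, then U = (X^T X + lambda^2 I)^{-1} X^T y.
Xw, dw = X[washout:], d[washout:]
lam = 1e-4
U = np.linalg.solve(Xw.T @ Xw + lam**2 * np.eye(N), Xw.T @ dw)

yhat = Xw @ U
nmse = np.mean((yhat - dw) ** 2) / np.var(dw)      # NMSE of (3)
```

Recovering a one-step-delayed input is an easy memory task, so the resulting NMSE should be very small; harder targets (e.g., NARMA) are used in Section IV.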
RODAN AND TIŇO: MINIMUM COMPLEXITY ECHO STATE NETWORK
Fig. 2. (a) DLR. (b) DLRB. (c) SCR. [Each panel: an input unit s(t) connects through V to a dynamical reservoir of N internal units x(t) with weights W; U maps the reservoir to the output unit y(t).]
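The three templates of Fig. 2 translate directly into sparse N × N weight matrices; a sketch, where N, r, and b are the free parameters discussed in the text:

```python
# Fixed reservoir templates: DLR, DLRB, and SCR as weight matrices.
import numpy as np

def dlr(N, r):
    """Delay line reservoir: W_{i+1,i} = r on the lower subdiagonal."""
    W = np.zeros((N, N))
    W[np.arange(1, N), np.arange(N - 1)] = r
    return W

def dlrb(N, r, b):
    """DLR with feedback: lower subdiagonal r, upper subdiagonal b."""
    W = dlr(N, r)
    W[np.arange(N - 1), np.arange(1, N)] = b
    return W

def scr(N, r):
    """Simple cycle reservoir: lower subdiagonal r plus W_{1,N} = r."""
    W = dlr(N, r)
    W[0, N - 1] = r
    return W
```

A useful property of the SCR matrix is that it is r times a cyclic permutation, so all its eigenvalues have modulus exactly r; no spectral-radius rescaling step is needed.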
B. Input Weight Structure

The input layer is fully connected to the reservoir. For the ESN, the input weights are (as usual) generated randomly from a uniform distribution over an interval [−a, a]. In the case of the simple reservoirs (DLR, DLRB, and SCR), all input connections have the same absolute weight value v > 0; the sign of each input weight is determined by a random draw from a Bernoulli distribution of mean 1/2 (unbiased coin). The values of v and a are chosen on the validation set.

IV. EXPERIMENTS

A. Datasets

We use a range of time series covering a wide spectrum of memory structure and widely used in the ESN literature [3], [4], [6], [10], [11], [19]–[21]. For each dataset, we denote the lengths of the training, validation, and test sequences by L_trn, L_val, and L_tst, respectively. The first L_v values of the training, validation, and test sequences are used as the initial washout period.

1) NARMA System: The nonlinear autoregressive moving average (NARMA) system is a discrete-time system introduced in [22]. The current output depends on both the input and the previous outputs. In general, modeling this system is difficult due to the nonlinearity and possibly long memory.

a) Fixed-order NARMA time series: NARMA systems of order O = 10, 20 are given by (5) and (6), respectively

y(t + 1) = 0.3 y(t) + 0.05 y(t) Σ_{i=0}^{9} y(t − i) + 1.5 s(t − 9) s(t) + 0.1    (5)

y(t + 1) = tanh(0.3 y(t) + 0.05 y(t) Σ_{i=0}^{19} y(t − i) + 1.5 s(t − 19) s(t) + 0.01)    (6)

where y(t) is the system output at time t, and s(t) is the system input at time t (an i.i.d. stream of values generated uniformly from the interval [0, 0.5]) [21], [22].

b) Random 10th-order NARMA time series: This system is generated by

y(t + 1) = tanh(α y(t) + β y(t) Σ_{i=0}^{9} y(t − i) + γ s(t − 9) s(t) + φ)    (7)

where α, β, γ, and φ are assigned random values taken from a ±50% interval around their original values in [21, eq. (5)]. Since the system is not stable, we used the nonlinear saturation function tanh [21]. The input s(t) and target data y(t) are shifted by −0.5 and scaled by 2 as in [10]. The networks were trained on a system identification task to output y(t) based on s(t), with L_trn = 2000, L_val = 3000, L_tst = 3000, and L_v = 200.

2) Laser Dataset: The Santa Fe Laser dataset [13] is a cross-cut through periodic to chaotic intensity pulsations of a real laser. The task is to predict the next laser activation y(t + 1), given the values up to time t; L_trn = 2000, L_val = 3000, L_tst = 3000, and L_v = 200.

3) Hénon Map: The Hénon Map dataset [23] is generated by

y(t) = 1 − 1.4 y(t − 1)² + 0.3 y(t − 2) + z(t)    (8)

where y(t) is the system output at time t and z(t) is normal white noise with a standard deviation of 0.05 [24]. We used L_trn = 2000, L_val = 3000, L_tst = 3000, and L_v = 200. The dataset is shifted by −0.5 and scaled by 2. Again, the task is to predict the next value y(t + 1), given the values up to time t.

4) Nonlinear Communication Channel: The dataset was created as follows [6]. First, an i.i.d. sequence d(t) of symbols transmitted through the channel is generated by randomly choosing values from {−3, −1, 1, 3} (uniform distribution). Then, the d(t) values are used to form a sequence q(t) through a linear filter

q(t) = 0.08 d(t + 2) − 0.12 d(t + 1) + d(t) + 0.18 d(t − 1) − 0.1 d(t − 2) + 0.09 d(t − 3) − 0.05 d(t − 4) + 0.04 d(t − 5) + 0.03 d(t − 6) + 0.01 d(t − 7).    (9)

Finally, a nonlinear transformation is applied to q(t) to produce the signal s(t)

s(t) = q(t) + 0.0036 q(t)² − 0.11 q(t)³.    (10)
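The fixed 10th-order NARMA system (5) is straightforward to generate; a sketch, with the i.i.d. input drawn uniformly from [0, 0.5] as stated above (the seed and sequence length are arbitrary choices):

```python
# Generator for the fixed 10th-order NARMA system (5).
import numpy as np

def narma10(T, seed=0):
    rng = np.random.default_rng(seed)
    s = rng.uniform(0.0, 0.5, size=T)      # i.i.d. input stream on [0, 0.5]
    y = np.zeros(T)
    for t in range(9, T - 1):
        # y(t+1) = 0.3 y(t) + 0.05 y(t) sum_{i=0}^{9} y(t-i)
        #          + 1.5 s(t-9) s(t) + 0.1
        y[t + 1] = (0.3 * y[t]
                    + 0.05 * y[t] * np.sum(y[t - 9:t + 1])
                    + 1.5 * s[t - 9] * s[t]
                    + 0.1)
    return s, y
```

The order-20 variant (6) differs only in the summation range, the input lag, the constant term, and the outer tanh saturation.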
Following [6], the input signal s(t) was shifted by +30. The task is to output d(t − 2) when s(t) is presented at the network input; L_trn = 2000, L_val = 3000, L_tst = 3000, and L_v = 200.

5) IPIX Radar: The sequence (used in [12]) contains 2000 values, with L_trn = 800, L_val = 500, L_tst = 700, and L_v = 100. The target signal is the sea clutter data (the radar backscatter from an ocean surface). The task was to predict y(t + 1) and y(t + 5) (1- and 5-step-ahead prediction) when y(t) is presented at the network input.

6) Sunspot Series: The dataset (obtained from [25]) contains 3100 sunspot numbers from January 1749 to April 2007, where L_trn = 1600, L_val = 500, L_tst = 1000, and L_v = 100. The task was to predict the next value y(t + 1) based on the history of y up to time t.

7) Nonlinear System with Observational Noise: This system was studied in [26] in the context of Bayesian sequential state estimation. The data are generated by

s(t) = 0.5 s(t − 1) + 25 s(t − 1) / (1 + s²(t − 1)) + 8 cos(1.2 (t − 1)) + w(t)    (11)

y(t) = s²(t) / 20 + v(t)    (12)
where the initial condition is s(0) = 0.1, and w(t) and v(t) are zero-mean Gaussian noise terms with variances taken from {1, 10}, i.e., (σ_w², σ_v²) ∈ {1, 10}². L_trn = 2000, L_val = 3000, L_tst = 3000, and L_v = 200. The task was to predict the value y(t + 5), given the values from t − 5 up to time t presented at the network input.

8) Isolated Digits: This dataset⁶ is a subset of the TI46 dataset containing 500 spoken isolated digits (0–9), where each digit is spoken 10 times by five female speakers. These 500 digits are randomly split into training (N_trn = 250) and test (N_tst = 250) sets. Because of the limited amount of data, model selection was performed using 10-fold cross-validation on the training set. The Lyon passive ear model [27] is used to convert the spoken digits into 86 frequency channels. Following the ESN literature using this dataset, model performance is evaluated using the word error rate (WER): the number of incorrectly classified words divided by the total number of presented words. The 10 output classifiers are trained to output 1 if the corresponding digit is uttered and −1 otherwise. Following [28], the temporal mean over the complete sample of each spoken digit is calculated for the 10 output classifiers. The winner-take-all methodology is then applied to estimate the spoken digit's identity. We use this dataset to demonstrate the modeling capabilities of different reservoir models on high-dimensional (86 input channels) time series.
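The data-generating process (11)-(12) can be reproduced in a few lines; the variance pair, seed, and sequence length below are illustrative choices:

```python
# Generator for the noisy nonlinear system (11)-(12), with s(0) = 0.1.
import numpy as np

def noisy_system(T, var_w=1.0, var_v=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, np.sqrt(var_w), size=T)    # state noise
    v = rng.normal(0.0, np.sqrt(var_v), size=T)    # observation noise
    s = np.zeros(T)
    y = np.zeros(T)
    s[0] = 0.1
    y[0] = s[0] ** 2 / 20.0 + v[0]
    for t in range(1, T):
        s[t] = (0.5 * s[t - 1]
                + 25.0 * s[t - 1] / (1.0 + s[t - 1] ** 2)
                + 8.0 * np.cos(1.2 * (t - 1))
                + w[t])
        y[t] = s[t] ** 2 / 20.0 + v[t]
    return s, y
```

Only y(t) is observed by the models; the latent state s(t) never exceeds a moderate bound because the 25 s/(1 + s²) term saturates.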
B. Training

We trained the classical ESN as well as the SCR, DLR, and DLRB models (with linear and tanh reservoir nodes) on the time series described above, minimizing the NMSE.
6 Obtained from http://snn.elis.ugent.be/rctoolbox.
The model fitting was done using ridge regression,⁷ where the regularization factor λ was tuned per reservoir and per dataset on the validation set. For each model, we calculate the average NMSE⁸ over 10 simulation runs. Our experiments are organized along four degrees of freedom: 1) reservoir topology; 2) reservoir activation function; 3) input weight structure; and 4) reservoir size.

C. Results

For each dataset and each model class (ESN, DLR, DLRB, and SCR), we picked on the validation set a model representative to be evaluated on the test set. Ten randomizations of each model representative were then tested on the test set. For the DLR, DLRB, and SCR architectures, the model representatives are defined by the input weight value v and the reservoir weight r (for the DLRB network we also need to specify the value b of the feedback connections). The randomization was performed solely by randomly generating the signs of the individual input weights;⁹ the reservoir itself was left intact. For the ESN architecture, the model representative is specified by the input weight scaling, reservoir sparsity, and spectral radius of the weight matrix. For each model setting (e.g., for the ESN: input weight scaling, reservoir sparsity, and spectral radius), we generate 10 randomized models and calculate their average validation-set performance. The best-performing model setting on the validation set is then used to generate another set of 10 randomized models that are fitted on the training set and subsequently tested on the test set. For some datasets, the performance of linear reservoirs was consistently inferior to that of nonlinear ones. Due to space limitations, in such cases the performance of linear reservoirs is not reported. Linear reservoirs are explicitly mentioned only when they achieve results competitive with (or even better than) their nonlinear counterparts.

Figs. 3, 4, and 5(a) show the average test set NMSE (across ten randomizations) achieved by the selected model representatives. Fig. 3 presents results for the four model classes using nonlinear reservoirs on the Laser, Hénon Map, and Nonlinear Communication Channel datasets. On those time series, the test NMSE for linear reservoirs was an order of magnitude worse than the NMSE achieved by the nonlinear ones. While the ESN architecture slightly outperforms the simplified reservoirs on the Laser and Hénon Map time series, for the Nonlinear Communication Channel the best-performing architecture is the simple delay line network (DLR). The SCR reservoir is consistently the second-best performing architecture. Even though the differences in NMSE are in most cases statistically significant, from the
7 We also tried other forms of offline and online readout training, such as the Wiener-Hopf methodology (e.g., [16]), the pseudoinverse solution (e.g., [3]), singular value decomposition (e.g., [20]), and recursive least squares (e.g., [21]). Ridge regression led to the best results. We are thankful to the anonymous referee for suggesting the inclusion of ridge regression in our repertoire of batch training methods.
8 WER in the case of the Isolated Digits dataset.
9 Strictly speaking, we randomly generated the signs for input weights and input biases. However, as usual in the neural network literature, the bias terms can be represented as input weights from a constant input +1.
Fig. 3. Test set performance of ESN, SCR, DLR, and DLRB topologies with tanh transfer function on the Laser, Hénon Map, and Nonlinear Communication Channel datasets. [Three panels plot test NMSE against reservoir size (50–200) for the four models.]
Fig. 4. Test set performance of ESN, SCR, DLR, and DLRB topologies with tanh transfer function on the 10th order, random 10th order, and 20th order NARMA datasets. [Three panels plot test NMSE against reservoir size (50–200) for the four models.]
Fig. 5. Test set performance of ESN, SCR, DLR, and DLRB topologies on the Isolated Digits (speech recognition) task using two ways of generating input connection sign patterns. (a) Using initial digits of π. (b) Random generation (i.i.d. Bernoulli distribution with mean 1/2). [Two panels plot WER against reservoir size (50–200).] Reservoir nodes with tanh transfer function f were used.

TABLE I
Mean NMSE for ESN, DLR, DLRB, and SCR across 10 simulation runs (standard deviations in parentheses), and SCR topologies with deterministic input sign generation, on the IPIX Radar and Sunspot series. The results are reported for prediction horizon ν and models with nonlinear reservoirs of size N = 80 (IPIX Radar) and linear reservoirs with N = 200 nodes (Sunspot series).

Dataset     ν | ESN                | DLR                | DLRB               | SCR                | SCR-PI  | SCR-EX  | SCR-Log
IPIX Radar  1 | 0.00115 (2.48E-05) | 0.00112 (2.03E-05) | 0.00110 (2.74E-05) | 0.00109 (1.59E-05) | 0.00109 | 0.00109 | 0.00108
IPIX Radar  5 | 0.0301 (8.11E-04)  | 0.0293 (3.50E-04)  | 0.0296 (5.63E-04)  | 0.0291 (3.20E-04)  | 0.0299  | 0.0299  | 0.0297
Sunspot     1 | 0.1042 (8.33E-05)  | 0.1039 (9.19E-05)  | 0.1040 (7.68E-05)  | 0.1039 (5.91E-05)  | 0.1063  | 0.1065  | 0.1059
practical point of view they are minute. Note that the Nonlinear Communication Channel can be modeled rather well with a simple Markovian DLR and no complex ESN reservoir structure is needed. Nonlinearity in the reservoir activation and the reservoir size seem to be two important factors for successful learning on those three datasets.
Fig. 4 presents results for the four model classes on the three NARMA time series, namely fixed NARMA of order 10, 20 and random NARMA of order 10. The performance of linear reservoirs does not improve with increasing reservoir size. Interestingly, within the studied reservoir range (50–200), linear reservoirs beat the nonlinear ones on 20th
TABLE II
NMSE for ESN, DLR, DLRB, and SCR across 10 simulation runs (standard deviations in parentheses), and SCR topologies with deterministic input sign generation, on the Nonlinear System with Observational Noise dataset. Reservoirs had N = 100 internal nodes with tanh transfer function f.

var w  var v | ESN              | DLR              | DLRB             | SCR              | SCR-PI | SCR-EX | SCR-Log
1      1     | 0.4910 (0.0208)  | 0.4959 (0.0202)  | 0.4998 (0.0210)  | 0.4867 (0.0201)  | 0.5011 | 0.5094 | 0.5087
10     1     | 0.7815 (0.00873) | 0.7782 (0.00822) | 0.7797 (0.00631) | 0.7757 (0.00582) | 0.7910 | 0.7902 | 0.7940
1      10    | 0.7940 (0.0121)  | 0.7671 (0.00945) | 0.7789 (0.00732) | 0.7655 (0.00548) | 0.7671 | 0.7612 | 0.7615
10     10    | 0.9243 (0.00931) | 0.9047 (0.00863) | 0.9112 (0.00918) | 0.9034 (0.00722) | 0.8986 | 0.8969 | 0.8965
order NARMA.10 For all NARMA series, the SCR network is either the best performing architecture or is not worse than the best performing architecture in a statistically significant manner. Note that NARMA time series constitute one of the most important and widely used benchmark datasets used in the ESN literature (e.g., [3], [4], [6], [10], [11], [19]–[21]). The results for the high-dimensional dataset Isolated Digits are presented in Fig. 5(a). Except for the reservoir size 50, the performances of all studied reservoir models are statistically the same (see Table IV in Appendix A). When compared to ESN, the simplified reservoir models seem to work equally well on this high-dimensional input series. For IPIX Radar, Sunspot Series, and Nonlinear System with Observational Noise, the results are presented in Tables I and II, respectively. On these datasets, the ESN performance did not always monotonically improve with the increasing reservoir size. That is why for each dataset we determined the best performing ESN reservoir size on the validation set (N = 80, N = 200, N = 100 for IPIX Radar, Sunspot series, and Nonlinear System with Observational Noise, respectively). The performance of the other model classes (DLR, DLRB, and SCR) with those reservoir sizes was then compared to that of ESN. In line with most RC studies using the Sunspot dataset (e.g., [29]), we found that linear reservoirs were on par11 with the nonlinear ones. For all three datasets, the SCR architecture performs slightly better than standard ESN, even though the differences are in most cases not statistically significant. Ganguli, Huh, and Sompolinsky [30] quantified and theoretically analyzed MC of nonautonomous linear dynamical systems (corrupted by a Gaussian state noise) using Fisher information between the state distributions at distant times. 
They found out that the optimal Fisher memory is achieved for so-called nonnormal networks with DLR or DLRB topologies and derived the optimal input weight vector for those linear reservoir architectures. We tried setting the input weights to the theoretically derived values, but the performance did not improve over our simple strategy of randomly picked signs of input weights followed by model selection on the validation set. Of course, the optimal input weight considerations of [30] hold for linear reservoir models only. Furthermore, according to [30], the linear SCR belongs to the class of so-called normal networks, which are shown to be inferior to the 10 The situation changes for larger reservoir sizes. For example, nonlinear ESN and SCR reservoirs of size 800 lead to the average NMSE of 0.0468 (std 0.0087) and 0.0926 (std 0.0039), respectively. 11 They were sometimes better (within the range of reservoir sizes considered in our experiments).
nonnormal ones. Interestingly enough, in our experiments the performance of the linear SCR was not worse than that of nonnormal networks.

D. Further Simplifications of Input Weight Structure

The only random element of the SCR architecture is the distribution of the input weight signs. We found that any attempt to impose a regular pattern on the input weight signs (e.g., a periodic structure of the form + − − + − − ...) led to performance deterioration. Interestingly enough, it appears to be sufficient to relate the sign pattern to a single deterministically generated aperiodic sequence. Any simple pseudo-random generation of signs with a fixed seed is fine. Such sign patterns worked universally well across all benchmark datasets used in this paper. For demonstration, we generated the universal input sign patterns in two ways. 1) The input signs are determined from the decimal expansion d_0.d_1 d_2 d_3 ... of irrational numbers [in our case π (PI) and e (EX)]. The first N decimal digits d_1, d_2, ..., d_N are thresholded at 4.5: if 0 ≤ d_n ≤ 4, the nth input connection sign (linking the input to the nth reservoir unit) is −; if 5 ≤ d_n ≤ 9, it is +. 2) (Log): The input signs are determined by the first N iterates in the binary symbolic dynamics of the logistic map f(x) = 4x(1 − x) in a chaotic regime (initial condition 0.33; generating partition for the symbolic dynamics with cut-value at 1/2). The results shown in Figs. 6 (NARMA, Laser, Hénon Map, and Nonlinear Communication Channel datasets) and 5(b) (Isolated Digits), as well as Tables I and II (IPIX Radar, Sunspot, and Nonlinear System with Observational Noise), indicate that comparable performances of our SCR topology can be obtained without any stochasticity in the input weight generation, by consistent use of the same sign-generating algorithm across a variety of datasets. Detailed results are presented in Table V (Appendix A).
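Both deterministic sign-generation schemes can be reproduced in a few lines. In the sketch below, PI_DIGITS hardcodes the first 50 decimal digits of π after the decimal point (any digit source would do), and the threshold at 4.5 follows the description above:

```python
# Deterministic input-sign generation: digit thresholding and logistic map.
import numpy as np

PI_DIGITS = "14159265358979323846264338327950288419716939937510"

def signs_from_digits(digits, N):
    """Digit d_n in 0..4 gives sign -1; d_n in 5..9 gives +1."""
    return np.array([1 if int(d) >= 5 else -1 for d in digits[:N]])

def signs_from_logistic(N, x0=0.33):
    """Binary symbolic dynamics of f(x) = 4x(1-x), cut-value at 1/2."""
    signs, x = [], x0
    for _ in range(N):
        signs.append(1 if x >= 0.5 else -1)
        x = 4.0 * x * (1.0 - x)
    return np.array(signs)
```

Either sign vector, scaled by the single input weight v, then fully specifies the SCR input layer with no random draws at all.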
We tried using this simple deterministic input sign generation strategy for the other simplified reservoir models (DLR and DLRB); the results were consistent with our findings for the SCR. We also tried to simplify the input weight structure further by connecting the input to a single reservoir unit only. However, this simplification either did not improve or deteriorated the model performance.

E. Sensitivity Analysis

We tested the sensitivity of the model performance on five-step-ahead prediction with respect to variations in the (construction)
Fig. 6. Test set performance of the SCR topology using four different ways of generating pseudo-randomized sign patterns: initial digits of π and Exp; the logistic map trajectory; and random generation (i.i.d. Bernoulli distribution with mean 1/2). [Four panels plot test NMSE against reservoir size (50–200).] The results are reported for the 20th order NARMA, Laser, Hénon Map, and Nonlinear Communication Channel datasets. Reservoir nodes with tanh transfer function f were used.
parameters.¹² The reservoir size is N = 100 for the 10th order NARMA dataset. In the case of the ESN, we varied the input scaling as well as the spectral radius and connectivity of the reservoir matrix. In Fig. 7(a), we show how the performance depends on the spectral radius and connectivity of the reservoir matrix; the input scaling is kept fixed at the optimal value determined on the validation set. Performance variation with respect to changes in the input scaling (while connectivity and spectral radius are kept fixed at their optimal values) is reported in Table III. For the SCR and DLR models, Fig. 7(c) and (d) illustrates the performance sensitivity with respect to changes in the only two free parameters, the input and reservoir weights v and r, respectively. In the case of the DLRB model, Fig. 7(b) presents the performance sensitivity with respect to changes in the reservoir weights r and b, while keeping the input weight fixed at its optimal value.¹³ We performed the same analysis on the Laser and IPIX Radar datasets and obtained similar stability patterns. In general, all the studied reservoir models show robustness with respect to small (construction) parameter fluctuations around the optimal parameter setting.
12 We are thankful to the anonymous reviewer for making the suggestion.
13 Note that Fig. 7(a) and (c) or (d) are not directly comparable, since the model parameters that get varied are different for each model (e.g., connectivity and spectral radius for the ESN versus input and reservoir weights for the SCR). In this sense, only Fig. 7(c) and (d) can be compared directly.
TABLE III
Best connectivity and spectral radius for ESN with different input scalings on the 10th order NARMA dataset.

Inp. scaling | Connectivity | Spectral radius | NMSE
0.05 | 0.18 | 0.85 | 0.1387 (0.0101)
0.1  | 0.18 | 0.85 | 0.1075 (0.0093)
0.5  | 0.18 | 0.85 | 0.2315 (0.0239)
1    | 0.18 | 0.85 | 0.6072 (0.0459)
V. SHORT-TERM MC OF SCR ARCHITECTURE

In his report, Jaeger [4] quantified the inherent capacity of recurrent network architectures to represent past events through a measure correlating the past events in an i.i.d. input stream with the network output. In particular, assume that the network is driven by a univariate stationary input signal s(t). For a given delay k, we consider the network with optimal parameters for the task of outputting s(t − k) after seeing the input stream ... s(t − 1) s(t) up to time t. The goodness of fit is measured in terms of the squared correlation coefficient between the desired output (the input signal delayed by k time steps) and the observed network output y(t)

MC_k = Cov²(s(t − k), y(t)) / (Var(s(t)) Var(y(t)))    (13)

where Cov denotes the covariance and Var the variance operators. The short-term memory (STM) capacity is then
Fig. 7. Sensitivity of (a) ESN, (b) DLRB, (c) DLR, and (d) SCR topologies on the 10th order NARMA dataset. [Surface plots of NMSE over spectral radius and connectivity for the ESN; over r and b for the DLRB; and over v and r for the DLR and SCR.] The input sign patterns for SCR, DLR, and DLRB nonlinear reservoirs were generated using initial digits of π.
given by [4]

MC = Σ_{k=1}^{∞} MC_k.    (14)
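The quantities in (13)-(14) can be estimated empirically. The following is a sketch of such an estimate for a linear SCR under the settings described below in the text (N = 20, r = 0.5, input weights of absolute value 0.5 with random signs, pseudo-inverse readout), with the infinite sum truncated at an assumed k_max = 40; the sequence length and washout are illustrative choices.

```python
# Empirical short-term MC estimate (13)-(14) for a linear SCR.
import numpy as np

def scr_memory_capacity(N=20, r=0.5, T=6000, washout=200, k_max=40, seed=0):
    rng = np.random.default_rng(seed)
    s = rng.uniform(-0.5, 0.5, size=T)            # i.i.d. input on [-0.5, 0.5]
    V = 0.5 * rng.choice([-1.0, 1.0], size=N)     # |v| = 0.5, random signs
    W = np.zeros((N, N))                          # simple cycle reservoir
    W[np.arange(1, N), np.arange(N - 1)] = r
    W[0, N - 1] = r
    X = np.zeros((T, N))
    x = np.zeros(N)
    for t in range(T):                            # linear update: x = V s + W x
        x = V * s[t] + W @ x
        X[t] = x
    states = X[washout:]
    pinv = np.linalg.pinv(states)                 # pseudo-inverse readout
    mc = 0.0
    for k in range(1, k_max + 1):                 # sum MC_k of (13) as in (14)
        target = s[washout - k:T - k]             # delayed input s(t - k)
        yhat = states @ (pinv @ target)
        mc += np.corrcoef(target, yhat)[0, 1] ** 2
    return mc
```

With these settings the estimate should come out close to the theoretical value N − (1 − r^{2N}) ≈ 19 for N = 20, matching the empirical figures reported below.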
Jaeger [4] proved that, for any recurrent neural network with N recurrent neurons, under the assumption of an i.i.d. input stream, the STM capacity cannot exceed N. We prove (under the assumption of a zero-mean i.i.d. input stream) that the STM capacity of the linear SCR architecture with N reservoir units can be made arbitrarily close to N. Since there is a single input (univariate time series), the input matrix V is an N-dimensional vector V = (V_1, V_2, ..., V_N)^T. Consider a vector rotation operator rot_1 that cyclically rotates vectors by one place to the right, e.g., rot_1(V) = (V_N, V_1, V_2, ..., V_{N−1})^T. For k ≥ 1, the k-fold application of rot_1 is denoted by rot_k. The N × N matrix with kth column equal to rot_k(V) is denoted by Ω, i.e., Ω = (rot_1(V), rot_2(V), ..., rot_N(V)).

Theorem 1: Consider a linear SCR network with reservoir weight 0 < r < 1 and an input weight vector V such that the matrix Ω is regular. Then the SCR network MC is equal to MC = N − (1 − r^{2N}).

The proof can be found in Appendixes B and C. We empirically evaluated the short-term MC of the ESN and our three simplified topologies. The networks were trained to memorize the inputs delayed by k = 1, 2, ..., 40. We used 1 input node, 20 linear reservoir nodes, and 40 output nodes (one for each k). The input consisted of random values sampled from a uniform distribution in the range [−0.5, 0.5]. The input weights for the ESN and our simplified topologies have the same absolute value 0.5 with randomly selected signs. The elements of the ESN recurrent weight matrix are set to 0 (80% of weights), 0.47 (10% of weights), or −0.47 (10% of weights), i.e., a reservoir weight connection fraction of 0.2, with spectral radius λ = 0.9 [16]. The DLR and SCR weight r was fixed and set to the
value r = 0.5. For the DLRB, r = 0.5 and b = 0.05. The output weights were computed using the pseudo-inverse solution. The empirically determined MC values for the ESN, DLR, DLRB, and SCR models were (averaged over 10 simulation runs, standard deviations in parentheses) 18.25 (1.46), 19.44 (0.89), 18.42 (0.96), and 19.48 (1.29), respectively. Note that the empirical MC values for the linear SCR are in good agreement with the theoretical value of 20 − (1 − 0.5^40) ≈ 19.

VI. DISCUSSION

A large number of models designed for time-series processing, forecasting, or modeling follow a state-space formulation. At each time step t, all relevant information in the driving stream processed by the model up to time t is represented in the form of a state (at time t). The model output depends on the past values of the driving series and is implemented as a function of the state, the so-called readout function. The state space can take many different forms, e.g., a finite set, a countably infinite set, an interval, etc. A crucial aspect of state-space model formulations is the requirement that the state at time t + 1 can be determined in a recursive manner from the state at time t and the current element of the driving series (the state-transition function). Depending on the application domain, numerous variations on the state-space structure, as well as on the state-transition/readout function formulations, have been proposed. One direction of research into data-driven state-space model construction imposes a state-space structure (e.g., an N-dimensional interval) and a semiparametric formulation of both the state-transition and readout functions. The parameter fitting is then driven by a cost functional E measuring the appropriateness of alternative parameter settings for the given task. Recurrent neural networks are examples of this type of approach [22]. If E is differentiable, one can employ the gradient of E in the parameter-fitting process. However, there
RODAN AND TIŇO: MINIMUM COMPLEXITY ECHO STATE NETWORK
is a well-known problem associated with parameter fitting in the state-transition function [31]: briefly, in order to "latch" an important piece of past information for future use, the state-transition dynamics should have an attractive set. In the neighborhood of such a set, the derivatives vanish and hence cannot be propagated through time in order to reliably bifurcate into a useful latching set. A class of approaches referred to as reservoir computing tries to avoid this problem by fixing the state-transition function; only the readout is fitted to the data [2], [32]. The state space with the associated state-transition structure is called the reservoir. The reservoir is supposed to be sufficiently complex so as to capture a large number of features of the input stream that can potentially be exploited by the readout. RC models differ in how the fixed reservoir is constructed and what form the readout takes. For example, ESNs [3] typically have a linear readout and a reservoir formed by a fixed recurrent neural-network-type dynamics. Liquid state machines (LSMs) [33] also mostly have a linear readout, and their reservoirs are driven by the dynamics of a set of coupled spiking neuron models. Fractal prediction machines (FPMs) [34] have been suggested for processing symbolic sequences. Their reservoir dynamics is driven by fixed affine state transitions over an N-dimensional interval. The readout is constructed as a collection of multinomial distributions over next symbols. Many other (sometimes quite exotic) reservoir formulations have been suggested (e.g., [11], [35]–[37]). The field of RC has been growing rapidly, with dedicated special sessions at conferences and special issues of journals [38]. RC has been successfully applied in many practical applications [3]–[6], [9], [39]. However, it is sometimes criticized for not being principled enough [17].
There have been several attempts to address the question of what exactly is a "good" reservoir for a given application [16], [40], but no coherent theory has yet emerged. The largely black-box character of reservoirs prevents us from performing a deeper theoretical investigation of the dynamical properties of successful reservoirs. Reservoir construction is often driven by a series of (more or less) randomized model-building stages, with both researchers and practitioners having to rely on trial and error. Sometimes, reservoirs have been evolved in a costly and difficult-to-analyze evolutionary computation setting [8], [14], [41], [42]. In an attempt to initialize a systematic study of the field, we have concentrated on three research questions. 1) What is the minimal complexity of the reservoir topology and parameterization so that performance levels comparable to those of standard RC models, such as ESN, can be recovered? 2) What degree of randomness (if any) is needed to construct competitive reservoirs? 3) If simple competitive reservoirs constructed in a completely deterministic manner exist, how do they compare in terms of MC with established models such as recurrent neural networks? On a number of widely used time-series benchmarks of different origin and characteristics, as well as through a theoretical analysis, we have shown the following. 1) A very simple cycle topology of reservoir is often sufficient for
obtaining performances comparable to those of ESN. Except for the NARMA datasets, nonlinear reservoirs were needed. 2) Competitive reservoirs can be constructed in a completely deterministic manner. The reservoir connections all have the same weight value. The input connections have the same absolute value, with sign distribution following one of the universal deterministic aperiodic patterns. 3) The memory capacity of linear cyclic reservoirs with a single reservoir weight value r can be made arbitrarily close to the proved optimal value of N, where N is the reservoir size. In particular, given an arbitrarily small ε ∈ (0, 1), for

r = (1 − ε)^{1/(2N)}

the MC of the cyclic reservoir is N − ε.

Even though the theoretical analysis of the SCR has been done for the linear reservoir case, the requirement that all cyclic rotations of the input vector be linearly independent seems to apply to the nonlinear case as well. Indeed, under the restriction that all input connections have the same absolute weight value, the linear independence condition translates to the requirement that the input sign vector follow an aperiodic pattern. Of course, from this point of view, a simple standard basis pattern (+1, −1, −1, …, −1) is sufficient. Interestingly enough, we found that the best performance levels were obtained when the input sign pattern contained roughly equal numbers of positive and negative signs. At the moment, we have no satisfactory explanation for this phenomenon, and we leave it as an open question for future research. Jaeger [4] argues that if the vectors W^i V, i = 1, 2, …, N, are linearly independent, then the MC of a linear reservoir with N units is N. Note that for the SCR reservoir

rot_k(V) = (W^k V) / r^k,   k = 1, 2, …, N

and so the condition that W^i V, i = 1, 2, …, N, are linearly independent directly translates into the requirement that the matrix Ω is regular. As r → 1, the MC of the SCR indeed approaches the optimal MC N. According to Theorem 1, the MC measure depends on the spectral radius of W (in our case, r). Interestingly enough, in the verification experiments of [4] with a reservoir of size N = 20 and a reservoir matrix of spectral radius 0.98, the empirically obtained MC value was 19.2. Jaeger commented that a conclusive analysis of the disproportion between the theoretical and empirical values of MC was not possible; however, he suggested that the disproportion may be due to numerical errors, as the condition number of the reservoir weight matrix W was about 50. Using our result, MC = N − (1 − r^{2N}) with N = 20 and r = 0.98 yields MC = 19.4. It is certainly true that, for smaller spectral radius values, the empirically estimated MC values of linear reservoirs decrease, as verified in several studies (e.g., [19]), and this may indeed be at least partially due to numerical problems in calculating higher powers of W. Moreover, empirical estimates of MC tend to fluctuate rather strongly, depending on the actual i.i.d. driving stream used in the estimation (see [16]). Even though Theorem 1 suggests that the spectral radius of W should have an influence on the MC
value for linear reservoirs, its influence becomes negligible for large reservoirs, since (provided Ω is regular) the MC of the SCR is provably bounded within the interval (N − 1, N). The MC of a reservoir is a representative member of the class of reservoir measures that quantify the amount of information that can be preserved in the reservoir about the past. For example, Ganguli, Huh, and Sompolinsky [30] proposed a different (but related) quantification of memory capacity for linear reservoirs (corrupted by Gaussian state noise). They evaluated the Fisher information between the reservoir activation distributions at distant times. Their analysis shows that the optimal Fisher memory is achieved for reservoir topologies corresponding, e.g., to our DLR or DLRB reservoir organizations. Based on the Fisher memory theory, the optimal input weight vector for those linear reservoir architectures was derived. Interestingly enough, when we tried setting the input weights to the theoretically derived values, the performance in our experiments did not improve over our simple strategy for obtaining the input weights. While in the setting of [30] the memory measure does not depend on the distribution of the source generating the input stream, the MC measure of [4] is heavily dependent on the generating source. For the case of an i.i.d. source (where no dependences between the time-series elements can be exploited by the reservoir), an MC of N − 1 can be achieved by a very simple model: a DLR reservoir with unit weight r = 1, one input connection of weight 1 connecting the input with the first reservoir unit, and, for k = 1, 2, …, N − 1, one output connection of weight 1 connecting the (k + 1)th reservoir unit with the output. The linear SCR, on the other hand, can get arbitrarily close to the theoretical limit MC = N. In cases of non-i.i.d. sources, the temporal dependences in the input stream can increase the MC beyond the reservoir size N [4].
The simple nature of our SCR reservoir can enable a systematic study of the MC measure for different kinds of input stream sources, and this is a matter for our future research. Compared to the traditional ESN, recent extensions and reformulations of reservoir models have often achieved improved performances [11], [12], [36], at the price of even less transparent models and less interpretable dynamical organization. We stress that the main purpose of this paper is not the construction of yet another reservoir model achieving an (incremental or more substantial) improvement over the competitors on the benchmark datasets. Instead, we would like to propose as simplified a reservoir construction as possible, without any stochastic component, which, while competitive with standard ESN, yields transparent models that are more amenable to theoretical analysis than the reservoir models proposed in the literature so far. Such reservoir models can potentially help us to answer the question: just what is it in the organization of the nonautonomous reservoir dynamics that leads to the often impressive performances of reservoir computation? Our simple deterministic SCR model can be used as a useful baseline in future reservoir computation studies. It is the level of improvement over the SCR baseline that has the potential to truly unveil the performance gains achieved by the more (and sometimes much more) complex model constructions.
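The deterministic SCR baseline discussed above is simple enough to write down in a few lines. The following sketch is our own illustration, not the authors' code: a single cycle with one shared weight r, and input weights of equal magnitude whose signs follow the initial decimal digits of π thresholded at 4.5 (one of the deterministic aperiodic schemes considered in the paper; the exact digit convention used here is an assumption of the sketch).

```python
import numpy as np

def scr_reservoir(N, r=0.5, v=0.5):
    """Simple cycle reservoir (SCR): one shared weight r on a single cycle,
    input signs taken from the initial decimal digits of pi (threshold 4.5)."""
    W = np.zeros((N, N))
    for i in range(N):
        W[(i + 1) % N, i] = r                      # unit i feeds unit i+1 (mod N)
    digits = "14159265358979323846264338327950288419716939937510"
    signs = np.array([1.0 if int(d) > 4.5 else -1.0 for d in digits[:N]])
    return W, v * signs

def run_reservoir(W, V, s, f=np.tanh):
    """Drive the reservoir with a scalar input stream s and collect the states."""
    x = np.zeros(W.shape[0])
    X = np.empty((len(s), W.shape[0]))
    for t, u in enumerate(s):
        x = f(W @ x + V * u)                       # x(t+1) = f(W x(t) + V s(t))
        X[t] = x
    return X

W, V = scr_reservoir(N=20, r=0.5)
X = run_reservoir(W, V, np.random.default_rng(0).uniform(-0.5, 0.5, 200))
print(X.shape)  # (200, 20)
```

A convenient side effect of the cycle topology is that the reservoir matrix has spectral radius exactly r, so the dynamics are controlled by a single parameter.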
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
VII. CONCLUSION

RC learning machines are state-space models with a fixed state-transition structure (the "reservoir") and an adaptable readout from the state space. The reservoir is supposed to be sufficiently complex so as to capture a large number of features of the input stream that can be exploited by the reservoir-to-output readout mapping. Even though the field of RC has been growing rapidly, with many successful applications, both researchers and practitioners have to rely on trial and error. To initialize a systematic study of the field, we have concentrated on three research issues. 1) What is the minimal complexity of the reservoir topology and parametrization so that performance levels comparable to those of standard RC models, such as ESN, can be recovered? 2) What degree of randomness (if any) is needed to construct competitive reservoirs? 3) If simple competitive reservoirs constructed in a completely deterministic manner exist, how do they compare in terms of MC with established models such as recurrent neural networks? On a number of widely used time-series benchmarks of different origins and characteristics, as well as through a theoretical analysis, we have shown the following. 1) A simple cycle reservoir topology is often sufficient for obtaining performances comparable to those of ESN. 2) Competitive reservoirs can be constructed in a completely deterministic manner. 3) The MC of simple linear cyclic reservoirs can be made arbitrarily close to the proved optimal MC value.

APPENDIX A
DETAILED RESULTS

Detailed results, including standard deviations across repeated experiments (as described in Section IV), are shown in Tables IV and V.

APPENDIX B
NOTATION AND AUXILIARY RESULTS

We consider an ESN with a linear reservoir endowed with the cycle topology (SCR). The reservoir weight is denoted by r. Since we consider a single input, the input matrix V is an N-dimensional vector V_{1…N} = (V_1, V_2, …, V_N)^T.
By V_{N…1} we denote the "reverse" of V_{1…N}, i.e., V_{N…1} = (V_N, V_{N−1}, …, V_2, V_1)^T. Consider the vector rotation operator rot_1 that cyclically rotates vectors by one place to the right, e.g., given a vector a = (a_1, a_2, …, a_n)^T, rot_1(a) = (a_n, a_1, a_2, …, a_{n−1})^T. For k ≥ 0, the k-fold application of rot_1 is denoted by rot_k (rot_0 is the identity mapping). The N × N matrix whose kth column is rot_k(V_{N…1}) is denoted by Ω:

Ω = (rot_1(V_{N…1}), rot_2(V_{N…1}), …, rot_N(V_{N…1})).
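The rotation operator and the matrix Ω are straightforward to realize numerically. The sketch below is our own illustration (the helper names rot and omega are ours): it builds Ω from an input vector and checks its regularity. An aperiodic sign pattern yields a regular Ω, whereas a periodic pattern such as (+1, −1, +1, −1, …) makes the cyclic rotations linearly dependent and Ω singular.

```python
import numpy as np

def rot(a, k):
    """k-fold cyclic rotation to the right: rot(a, 1) = (a_n, a_1, ..., a_{n-1})."""
    return np.roll(a, k)

def omega(v):
    """N x N matrix whose k-th column is rot_k of the reversed input vector."""
    v_rev = v[::-1]
    return np.column_stack([rot(v_rev, k) for k in range(1, len(v) + 1)])

# equal-magnitude input weights, differing only in their sign pattern
aperiodic = 0.5 * np.array([+1, +1, +1, -1, +1, -1], dtype=float)
periodic = 0.5 * np.array([+1, -1, +1, -1, +1, -1], dtype=float)

print(np.linalg.matrix_rank(omega(aperiodic)))  # 6: regular
print(np.linalg.matrix_rank(omega(periodic)))   # 1: rotations linearly dependent
```

The rank deficiency in the periodic case is exactly the failure of the linear-independence condition discussed in Section VI.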
TABLE IV
TEST SET PERFORMANCE OF ESN, DLR, DLRB, AND SCR TOPOLOGIES ON DIFFERENT DATASETS FOR INTERNAL NODES WITH tanh TRANSFER FUNCTION

Dataset                    Size   ESN                   DLR                   DLRB                  SCR
10th order NARMA            50    0.166 (0.0171)        0.163 (0.0138)        0.158 (0.0152)        0.160 (0.0134)
                           100    0.0956 (0.0159)       0.112 (0.0116)        0.105 (0.0131)        0.0983 (0.0156)
                           150    0.0514 (0.00818)      0.0618 (0.00771)      0.0609 (0.00787)      0.0544 (0.00793)
                           200    0.0425 (0.0166)       0.0476 (0.0104)       0.0402 (0.0110)       0.0411 (0.0148)
10th order random NARMA     50    0.131 (0.0165)        0.133 (0.0132)        0.130 (0.00743)       0.129 (0.0111)
                           100    0.0645 (0.0107)       0.0822 (0.00536)      0.0837 (0.00881)      0.0719 (0.00501)
                           150    0.0260 (0.0105)       0.0423 (0.00872)      0.0432 (0.00933)      0.0286 (0.00752)
                           200    0.0128 (0.00518)      0.0203 (0.00536)      0.0201 (0.00334)      0.0164 (0.00412)
20th order NARMA            50    0.297 (0.0563)        0.232 (0.0577)        0.221 (0.0456)        0.238 (0.0507)
                           100    0.235 (0.0416)        0.184 (0.0283)        0.174 (0.0407)        0.183 (0.0196)
                           150    0.178 (0.0169)        0.171 (0.0152)        0.163 (0.0127)        0.175 (0.0137)
                           200    0.167 (0.0164)        0.165 (0.0158)        0.158 (0.0121)        0.160 (0.0153)
laser                       50    0.0184 (0.00231)      0.0210 (0.00229)      0.0215 (0.00428)      0.0196 (0.00219)
                           100    0.0125 (0.00117)      0.0132 (0.00116)      0.0139 (0.00121)      0.0131 (0.00105)
                           150    0.00945 (0.00101)     0.0107 (0.00114)      0.0112 (0.00100)      0.0101 (0.00109)
                           200    0.00819 (5.237E-04)   0.00921 (9.122E-04)   0.00913 (9.367E-04)   0.00902 (6.153E-04)
Hénon Map                   50    0.00975 (0.000110)    0.0116 (0.000214)     0.0110 (0.000341)     0.0106 (0.000185)
                           100    0.00894 (0.000122)    0.00982 (0.000143)    0.00951 (0.000120)    0.00960 (0.000124)
                           150    0.00871 (4.988E-05)   0.00929 (6.260E-05)   0.00893 (6.191E-05)   0.00921 (5.101E-05)
                           200    0.00868 (8.704E-05)   0.00908 (9.115E-05)   0.00881 (9.151E-05)   0.00904 (9.250E-05)
Nonlinear communication     50    0.0038 (4.06E-4)      0.0034 (2.27E-4)      0.0036 (2.26E-4)      0.0035 (2.55E-4)
channel                    100    0.0021 (4.42E-4)      0.0015 (1.09E-4)      0.0016 (1.07E-4)      0.0015 (1.23E-4)
                           150    0.0015 (4.01E-4)      0.0011 (1.12E-4)      0.0011 (1.08E-4)      0.0012 (1.23E-4)
                           200    0.0013 (1.71E-4)      0.00099 (6.42E-5)     0.0010 (7.41E-5)      0.0010 (7.28E-5)
Isolated Digits             50    0.0732 (0.0193)       0.0928 (0.0177)       0.1021 (0.0204)       0.0937 (0.0175)
                           100    0.0296 (0.0063)       0.0318 (0.0037)       0.0338 (0.0085)       0.0327 (0.0058)
                           150    0.0182 (0.0062)       0.0216 (0.0052)       0.0236 (0.0050)       0.0192 (0.0037)
                           200    0.0138 (0.0042)       0.0124 (0.0042)       0.0152 (0.0038)       0.0148 (0.0050)
TABLE V
TEST SET PERFORMANCE OF SCR TOPOLOGY ON DIFFERENT DATASETS USING THREE DIFFERENT WAYS OF GENERATING PSEUDO-RANDOMIZED INPUT SIGN PATTERNS: INITIAL DIGITS OF π AND EXP; SYMBOLIC DYNAMICS OF THE LOGISTIC MAP

Dataset                    Size   ESN                   SCR-PI                SCR-Ex                SCR-Log
20th order NARMA            50    0.297 (0.0563)        0.233 (0.0153)        0.232 (0.0175)        0.196 (0.0138)
                           100    0.235 (0.0416)        0.186 (0.0166)        0.175 (0.0136)        0.169 (0.0172)
                           150    0.178 (0.0169)        0.175 (0.00855)       0.158 (0.0103)        0.156 (0.00892)
                           200    0.167 (0.0164)        0.166 (0.00792)       0.157 (0.00695)       0.155 (0.00837)
laser                       50    0.0184 (0.00231)      0.0204                0.0187                0.0181
                           100    0.0125 (0.00117)      0.0137                0.0153                0.0140
                           150    0.00945 (0.00101)     0.0115                0.0111                0.0126
                           200    0.00819 (5.237E-04)   0.00962               0.00988               0.0107
Hénon Map                   50    0.00975 (0.000110)    0.00986               0.00992               0.00998
                           100    0.00894 (0.000122)    0.00956               0.00985               0.00961
                           150    0.00871 (4.988E-05)   0.00917               0.00915               0.00920
                           200    0.00868 (8.704E-05)   0.00892               0.00883               0.00898
Nonlinear communication     50    0.0038 (4.06E-4)      0.0036 (1.82E-04)     0.0026 (6.23E-05)     0.0033 (1.09E-04)
channel                    100    0.0021 (4.42E-4)      0.0016 (7.96E-05)     0.0017 (1.04E-04)     0.0015 (8.85E-5)
                           150    0.0015 (4.01E-4)      0.0012 (7.12E-05)     0.0011 (6.10E-05)     0.0012 (4.56E-05)
                           200    0.0013 (1.71E-4)      0.00088 (2.55E-05)    0.00090 (3.05E-05)    0.00093 (3.33E-05)
We will need the diagonal matrix with diagonal elements 1, r, r^2, …, r^{N−1}

Γ = diag(1, r, r^2, …, r^{N−1}).

Furthermore, we will denote the matrix Ω^T Γ^2 Ω by A,

A = Ω^T Γ^2 Ω

and (provided A is invertible) the quantity

(rot_k(V_{1…N}))^T A^{−1} rot_k(V_{1…N}) = (rot_{k mod N}(V_{1…N}))^T A^{−1} rot_{k mod N}(V_{1…N}),   k ≥ 0

by ζ_k.

Lemma 1: If Ω is a regular matrix, then ζ_N = 1 and ζ_k = r^{−2k}, k = 1, 2, …, N − 1.
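Lemma 1, and through it Theorem 1, lends itself to a direct numerical check. The sketch below is our own illustration: it constructs Γ and A = Ω^T Γ^2 Ω, evaluates ζ_k, and recovers MC = N − (1 − r^{2N}). The statements only require Ω to be regular, so a generic input vector V suffices; the simple ramp used here is an assumption of this illustration, not the paper's ±0.5 sign vectors.

```python
import numpy as np

def rot(a, k):
    """k-fold cyclic rotation to the right."""
    return np.roll(a, k)

def omega(v):
    """N x N matrix whose k-th column is rot_k of the reversed vector."""
    v_rev = v[::-1]
    return np.column_stack([rot(v_rev, k) for k in range(1, len(v) + 1)])

N, r = 8, 0.7
V = 0.1 * np.arange(1, N + 1, dtype=float)    # generic V: Omega is regular
Om = omega(V)
Gamma2 = np.diag(r ** (2 * np.arange(N)))     # Gamma^2 = diag(1, r^2, ..., r^{2(N-1)})
A = Om.T @ Gamma2 @ Om

# zeta_k = rot_k(V)^T A^{-1} rot_k(V), k = 1..N
zeta = [rot(V, k) @ np.linalg.solve(A, rot(V, k)) for k in range(1, N + 1)]

print(np.allclose(zeta[:-1], [r ** (-2 * k) for k in range(1, N)]))  # Lemma 1: True
print(np.isclose(zeta[-1], 1.0))                                     # zeta_N = 1: True
MC = sum(r ** (2 * k) * z for k, z in zip(range(1, N + 1), zeta))
print(np.isclose(MC, N - (1 - r ** (2 * N))))                        # Theorem 1: True
```

The final sum is exactly the expression Σ_{k=1}^{N} r^{2k} ζ_k reached at the end of the proof of Theorem 1.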
Proof: Denote the standard basis vector (1, 0, 0, …, 0)^T in ℝ^N by e_1. The following holds:

rot_k(V_{1…N}) = Ω^T rot_k(e_1),   k = 1, 2, …, N − 1.

This can be easily shown, as Ω^T rot_k(e_1) selects the (k + 1)st column of Ω^T (the (k + 1)th row of Ω), which is formed by the (k + 1)st elements of the vectors rot_1(V_{N…1}), rot_2(V_{N…1}), …, rot_N(V_{N…1}). This vector is equal to the kth rotation of V_{1…N}. It follows that for k = 1, 2, …, N − 1

(rot_k(V_{1…N}))^T Ω^{−1} = (rot_k(e_1))^T

and so

ζ_k = (rot_k(V_{1…N}))^T A^{−1} rot_k(V_{1…N})
    = (rot_k(V_{1…N}))^T Ω^{−1} Γ^{−2} (Ω^{−1})^T rot_k(V_{1…N})
    = (rot_k(e_1))^T Γ^{−2} rot_k(e_1)
    = r^{−2k}.

APPENDIX C
PROOF OF THEOREM 1

Given an i.i.d. zero-mean real-valued input stream s(…t) = … s(t−2) s(t−1) s(t) emitted by a source P, the activations of the reservoir units at time t are given by

x_1(t) = V_1 s(t) + r V_N s(t−1) + r^2 V_{N−1} s(t−2) + r^3 V_{N−2} s(t−3) + ⋯ + r^{N−1} V_2 s(t−(N−1)) + r^N V_1 s(t−N) + r^{N+1} V_N s(t−(N+1)) + ⋯ + r^{2N−1} V_2 s(t−(2N−1)) + r^{2N} V_1 s(t−2N) + r^{2N+1} V_N s(t−(2N+1)) + ⋯

x_2(t) = V_2 s(t) + r V_1 s(t−1) + r^2 V_N s(t−2) + r^3 V_{N−1} s(t−3) + ⋯ + r^{N−1} V_3 s(t−(N−1)) + r^N V_2 s(t−N) + r^{N+1} V_1 s(t−(N+1)) + ⋯ + r^{2N−1} V_3 s(t−(2N−1)) + r^{2N} V_2 s(t−2N) + r^{2N+1} V_1 s(t−(2N+1)) + r^{2N+2} V_N s(t−(2N+2)) + ⋯

⋮

x_N(t) = V_N s(t) + r V_{N−1} s(t−1) + r^2 V_{N−2} s(t−2) + ⋯ + r^{N−1} V_1 s(t−(N−1)) + r^N V_N s(t−N) + r^{N+1} V_{N−1} s(t−(N+1)) + ⋯ + r^{2N−1} V_1 s(t−(2N−1)) + r^{2N} V_N s(t−2N) + r^{2N+1} V_{N−1} s(t−(2N+1)) + r^{2N+2} V_{N−2} s(t−(2N+2)) + ⋯.

For the task of recalling the input from k time steps back, the optimal least-squares readout vector U is given by

U = R^{−1} p_k    (15)

where

R = E_{P(s(…t))}[x(t) x^T(t)]

is the covariance matrix of reservoir activations and

p_k = E_{P(s(…t))}[x(t) s(t−k)].

The covariance matrix R can be obtained in an analytical form. For example, because of the zero-mean and i.i.d. nature of the source P, the element R_{1,2} can be evaluated as follows:

R_{1,2} = E[V_1 V_2 s^2(t) + r^2 V_N V_1 s^2(t−1) + r^4 V_{N−1} V_N s^2(t−2) + ⋯ + r^{2(N−1)} V_2 V_3 s^2(t−(N−1)) + r^{2N} V_1 V_2 s^2(t−N) + r^{2(N+1)} V_N V_1 s^2(t−(N+1)) + ⋯ + r^{2(2N−1)} V_2 V_3 s^2(t−(2N−1)) + r^{4N} V_1 V_2 s^2(t−2N) + ⋯]
      = V_1 V_2 Var[s(t)] + r^2 V_N V_1 Var[s(t−1)] + r^4 V_{N−1} V_N Var[s(t−2)] + ⋯ + r^{2N} V_1 V_2 Var[s(t−N)] + ⋯
      = σ^2 (V_1 V_2 + r^2 V_N V_1 + r^4 V_{N−1} V_N + ⋯ + r^{2(N−1)} V_2 V_3 + r^{2N} V_1 V_2 + ⋯)
      = σ^2 (V_1 V_2 + r^2 V_N V_1 + r^4 V_{N−1} V_N + ⋯ + r^{2(N−1)} V_2 V_3) Σ_{j=0}^{∞} r^{2Nj}    (16)

where σ^2 is the variance of P. The expression (16) for R_{1,2} can be written in a compact form as

R_{1,2} = (σ^2 / (1 − r^{2N})) (rot_1(V_{N…1}))^T Γ^2 rot_2(V_{N…1}).    (17)

In general

R_{i,j} = (σ^2 / (1 − r^{2N})) (rot_i(V_{N…1}))^T Γ^2 rot_j(V_{N…1}),   i, j = 1, 2, …, N    (18)

and

R = (σ^2 / (1 − r^{2N})) Ω^T Γ^2 Ω = (σ^2 / (1 − r^{2N})) A.    (19)

By analogous arguments

p_k = r^k σ^2 rot_k(V_{1…N}).    (20)

Hence, the optimal readout vector reads [see (15)]

U = (1 − r^{2N}) r^k A^{−1} rot_k(V_{1…N}).    (21)

The ESN output at time t is

y(t) = x(t)^T U = (1 − r^{2N}) r^k x(t)^T A^{−1} rot_k(V_{1…N}).

Covariance of the ESN output with the target can be evaluated as

Cov(y(t), s(t−k)) = (1 − r^{2N}) r^k Cov(x(t)^T, s(t−k)) A^{−1} rot_k(V_{1…N})
                = r^{2k} (1 − r^{2N}) σ^2 (rot_k(V_{1…N}))^T A^{−1} rot_k(V_{1…N})
                = r^{2k} (1 − r^{2N}) σ^2 ζ_k.
Variance of the ESN output is determined as

Var(y(t)) = U^T E[x(t) x(t)^T] U
          = U^T R U
          = p_k^T R^{−1} p_k
          = r^{2k} (σ^2)^2 (rot_k(V_{1…N}))^T R^{−1} rot_k(V_{1…N})
          = Cov(y(t), s(t−k)).

We can now calculate the squared correlation coefficient between the desired output (the input signal delayed by k time steps) and the network output y(t):

MC_k = Cov^2(s(t−k), y(t)) / (Var(s(t)) Var(y(t)))
     = Var(y(t)) / σ^2
     = r^{2k} (1 − r^{2N}) ζ_k.

The MC of the ESN is given by MC = MC_{≥0} − MC_0, where

MC_{≥0} = Σ_{k=0}^{∞} MC_k
        = (1 − r^{2N}) (Σ_{k=0}^{N−1} r^{2k} ζ_k + Σ_{k=N}^{2N−1} r^{2k} ζ_k + Σ_{k=2N}^{3N−1} r^{2k} ζ_k + ⋯)
        = Σ_{k=0}^{N−1} r^{2k} ζ_k.

Hence

MC = Σ_{k=0}^{N−1} r^{2k} ζ_k − (1 − r^{2N}) ζ_0
   = ζ_0 (1 − (1 − r^{2N})) + Σ_{k=1}^{N−1} r^{2k} ζ_k
   = ζ_0 r^{2N} + Σ_{k=1}^{N−1} r^{2k} ζ_k
   = ζ_N r^{2N} + Σ_{k=1}^{N−1} r^{2k} ζ_k
   = Σ_{k=1}^{N} r^{2k} ζ_k.

By Lemma 1, r^{2k} ζ_k = 1 for k = 1, 2, …, N − 1, and r^{2N} ζ_N = r^{2N}. It follows that MC = N − 1 + r^{2N}.
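The closed-form MC can also be checked against an empirical estimate in the style of Section V: drive a linear SCR with an i.i.d. uniform input stream, fit one pseudo-inverse readout per delay, and sum the squared correlation coefficients. The sketch below is our own illustration; it uses generic random input weights (the theorem only requires Ω to be regular, which generic weights satisfy almost surely) rather than the paper's fixed-magnitude sign vectors.

```python
import numpy as np

def empirical_mc(W, V, T=20000, washout=200, max_delay=40, seed=1):
    """Estimate MC = sum_k corr^2(s(t-k), y_k(t)) for a linear reservoir,
    training one pseudo-inverse readout per delay k."""
    rng = np.random.default_rng(seed)
    s = rng.uniform(-0.5, 0.5, T)                    # i.i.d. input stream
    N = W.shape[0]
    x, X = np.zeros(N), np.empty((T, N))
    for t in range(T):
        x = W @ x + V * s[t]                         # linear activation
        X[t] = x
    mc = 0.0
    for k in range(1, max_delay + 1):
        Xk, yk = X[washout:], s[washout - k:T - k]   # target: input delayed by k
        U, *_ = np.linalg.lstsq(Xk, yk, rcond=None)  # pseudo-inverse readout
        mc += np.corrcoef(Xk @ U, yk)[0, 1] ** 2     # squared correlation
    return mc

N, r = 20, 0.5
W = np.zeros((N, N))
for i in range(N):
    W[(i + 1) % N, i] = r                            # SCR cycle with weight r
rng = np.random.default_rng(7)
V = rng.uniform(0.1, 1.0, N) * rng.choice([-1.0, 1.0], N)
mc = empirical_mc(W, V)
print(mc, N - (1 - r ** (2 * N)))
```

For N = 20 and r = 0.5 the theoretical value is N − (1 − r^{2N}) ≈ 19; as noted in the discussion, the empirical estimate fluctuates around it depending on the actual driving stream.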
REFERENCES
[1] M. Lukosevicius and H. Jaeger, “Overview of reservoir recipes,” School Eng. Sci., Jacobs Univ., Bremen, Germany, Tech. Rep. 11, 2007. [2] M. Lukosevicius and H. Jaeger, “Reservoir computing approaches to recurrent neural network training,” Comput. Sci. Rev., vol. 3, no. 3, pp. 127–149, 2009. [3] H. Jaeger, “The ‘echo state’ approach to analysing and training recurrent neural networks,” German Nat. Res. Center Inform. Technol., St. Augustin, Germany, Tech. Rep. GMD 148, 2001. [4] H. Jaeger, “Short term memory in echo state networks,” German Nat. Res. Center Inform. Technol., St. Augustin, Germany, Tech. Rep. GMD 152, 2002. [5] H. Jaeger, “A tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the ‘echo state network’ approach,” German Nat. Res. Center Inform. Technol., St. Augustin, Germany, Tech. Rep. GMD 159, 2002. [6] H. Jaeger and H. Haas, “Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless telecommunication,” Science, vol. 304, no. 5667, pp. 78–80, Apr. 2004. [7] M. Skowronski and J. Harris, “Minimum mean squared error time series classification using an echo state network prediction model,” in Proc. IEEE Int. Symp. Circuits Syst., Island of Kos, Greece, May 2006, pp. 3153–3156. [8] K. Bush and C. Anderson, “Modeling reward functions for incomplete state representations via echo state networks,” in Proc. Int. Joint Conf. Neural Netw., Montreal, QC, Canada, Jul. 2005, pp. 2995–3000. [9] M. H. Tong, A. Bicket, E. Christiansen, and G. Cottrell, “Learning grammatical structure with echo state network,” Neural Netw., vol. 20, no. 3, pp. 424–432, Apr. 2007. [10] B. Schrauwen, M. Wardermann, D. Verstraeten, J. Steil, and D. Stroobandt, “Improving reservoirs using intrinsic plasticity,” Neurocomputing, vol. 71, nos. 7–9, pp. 1159–1171, Mar. 2008. [11] J. Steil, “Online reservoir adaptation by intrinsic plasticity for backpropagation-decorrelation and echo state learning,” Neural Netw., vol. 20, no. 3, pp.
353–364, Apr. 2007. [12] Y. Xue, L. Yang, and S. Haykin, “Decoupled echo state networks with lateral inhibition,” Neural Netw., vol. 20, no. 3, pp. 365–376, Apr. 2007. [13] H. Jaeger, M. Lukosevicius, D. Popovici, and U. Siewert, “Optimization and applications of echo state networks with leaky-integrator neurons,” Neural Netw., vol. 20, no. 3, pp. 335–352, Apr. 2007. [14] J. Schmidhuber, D. Wierstra, M. Gagliolo, and F. Gomez, “Training recurrent networks by evolino,” Neural Comput., vol. 19, no. 3, pp. 757–779, Mar. 2007. [15] G. Holzmann and H. Hauser, “Echo state networks with filter neurons and a delay&sum readout,” Neural Netw., vol. 23, no. 2, pp. 244–256, Mar. 2010. [16] M. C. Ozturk, D. Xu, and J. Principe, “Analysis and design of echo state network,” Neural Comput., vol. 19, no. 1, pp. 111–138, Jan. 2007. [17] D. Prokhorov, “Echo state networks: Appeal and challenges,” in Proc. Int. Joint Conf. Neural Netw., Montreal, QC, Canada, 2005, pp. 1463– 1466. [18] F. Wyffels, B. Schrauwen, and D. Stroobandt, “Stable output feedback in reservoir computing using ridge regression,” in Proc. 18th Int. Conf. Artif. Neural Netw., Prague, Czech Republic, 2008, pp. 808–817. [19] D. Verstraeten, B. Schrauwen, M. D’Haene, and D. Stroobandt, “An experimental unification of reservoir computing methods,” Neural Netw., vol. 20, no. 3, pp. 391–403, Apr. 2007. [20] M. Cernansky and P. Tino, “Predictive modelling with echo state networks,” in Proc. 18th Int. Conf. Artif. Neural Netw., Prague, Czech Republic, 2008, pp. 778–787. [21] H. Jaeger, “Adaptive nonlinear systems identification with echo state network,” in Advances in Neural Information Processing Systems, vol. 15. Cambridge, MA: MIT Press, 2003, pp. 593–600. [22] A. F. Atiya and A. G. Parlos, “New results on recurrent network training: Unifying the algorithms and accelerating convergence,” IEEE Trans. Neural Netw., vol. 11, no. 3, pp. 697–709, May 2000. [23] M. Henon, “A 2-D mapping with a strange attractor,” Commun. Math. 
Phys., vol. 50, no. 1, pp. 69–77, 1976. [24] M. Slutzky, P. Cvitanovic, and D. Mogul, “Manipulating epileptiform bursting in the rat hippocampus using chaos control and adaptive techniques,” IEEE Trans. Bio-Med. Eng., vol. 50, no. 5, pp. 559–570, May 2003. [25] National Geophysical Data Center. (2007). Sunspot Numbers [Online]. Available: http://www.ngdc.noaa.gov/stp/iono/sunspot.html
[26] N. J. Gordon, D. J. Salmond, and A. F. M. Smith, “Novel approach to nonlinear/non-Gaussian Bayesian state estimation,” IEE Proc. Radar Signal Process., vol. 140, no. 2, pp. 107–113, Apr. 1993. [27] R. Lyon, “A computational model of filtering, detection and compression in the cochlea,” in Proc. IEEE ICASSP, Paris, France, May 1982, pp. 1282–1285. [28] B. Schrauwen, J. Defour, D. Verstraeten, and J. Van Campenhout, “The introduction of time-scales in reservoir computing, applied to isolated digits recognition,” in Proc. 17th Int. Conf. Artif. Neural Netw., vol. 4668. Porto, Portugal, 2007, pp. 471–479. [29] F. Schwenker and A. Labib, “Echo state networks and neural network ensembles to predict sunspots activity,” in Proc. Eur. Symp. Artif. Neural Netw.-Adv. Comput. Intell. Learn., Bruges, Belgium, Apr. 2009, pp. 22–24. [30] S. Ganguli, D. Huh, and H. Sompolinsky, “Memory traces in dynamical systems,” Proc. Nat. Acad. Sci., vol. 105, no. 48, pp. 18970–18975, Dec. 2008. [31] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994. [32] B. Schrauwen, D. Verstraeten, and J. Campenhout, “An overview of reservoir computing: Theory, applications and implementations,” in Proc. Eur. Symp. Artif. Neural Netw., Bruges, Belgium, 2007, pp. 471–482. [33] W. Maass, T. Natschlager, and H. Markram, “Real-time computing without stable states: A new framework for neural computation based on perturbations,” Neural Comput., vol. 14, no. 11, pp. 2531–2560, Nov. 2002. [34] P. Tino and G. Dorffner, “Predicting the future of discrete sequences from fractal representations of the past,” Mach. Learn., vol. 45, no. 2, pp. 187–217, Nov. 2001. [35] B. Jones, D. Stekel, J. Rowe, and C. Fernando, “Is there a liquid state machine in the bacterium Escherichia coli?” in Proc. IEEE Symp. Artif. Life, Alife, Honolulu, HI, Apr. 2007, pp. 187–191. [36] Z. Deng and Y. 
Zhang, “Collective behavior of a small-world recurrent neural system with scale-free distribution,” IEEE Trans. Neural Netw., vol. 18, no. 5, pp. 1364–1375, Sep. 2007. [37] K. Dockendorf, I. Park, H. Ping, J. Principe, and T. DeMarse, “Liquid state machines and cultured cortical networks: The separation property,” Biosystems, vol. 95, no. 2, pp. 90–97, Feb. 2009. [38] H. Jaeger, W. Maass, and J. Principe, “Introduction to the special issue on echo state networks and liquid state machines,” Neural Netw., vol. 20, no. 3, pp. 287–289, 2007. [39] W. Maass, T. Natschlager, and H. Markram, “Fading memory and kernel properties of generic cortical microcircuit models,” J. Physiol., vol. 98, nos. 4–6, pp. 315–330, Jul.–Nov. 2004. [40] S. Hausler, H. Markram, and W. Maass, “Perspectives of the high-dimensional dynamics of neural microcircuits from the point of view of low-dimensional readouts,” Complexity (Special Issue Complex Adapt. Syst.), vol. 8, no. 4, pp. 39–50, 2003.
[41] K. Ishii, T. van der Zant, V. Becanovic, and P. Ploger, “Identification of motion with echo state network,” in Proc. OCEANS MTTS/IEEE TECHNO-OCEAN, vol. 3. Piscataway, NJ, Nov. 2004, pp. 1205–1210. [42] A. A. Rad, M. Jalili, and M. Hasler, “Reservoir optimization in recurrent neural networks using Kronecker kernels,” in Proc. IEEE ISCAS, Seattle, WA, May 2008, pp. 868–871.
Ali Rodan (S’08) received the B.Sc. and M.Sc. degrees in computer science from Princess Sumaya University for Technology, Amman, Jordan, in 2004, and Oxford Brookes University, Oxford, U.K., in 2005, respectively. He is currently working toward the Ph.D. degree in computer science at the University of Birmingham, Birmingham, U.K. His current research interests include recurrent neural networks, support vector machines, reservoir computing, and data mining.
Peter Tiňo received the M.Sc. degree from the Slovak University of Technology, Bratislava, Slovakia, in 1988, and the Ph.D. degree from the Slovak Academy of Sciences, Bratislava, in 1997. He was a Fulbright Fellow at the NEC Research Institute, Princeton, NJ, from 1994 to 1995, a Post-Doctoral Fellow at the Austrian Research Institute for Artificial Intelligence, Vienna, Austria, from 1997 to 2000, and a Research Associate at Aston University, Birmingham, U.K., from 2000 to 2003. He has been with the School of Computer Science, University of Birmingham, Birmingham, since 2003, and is currently a Senior Lecturer. He is on the editorial boards of several journals. His current research interests include probabilistic modeling and visualization of structured data, statistical pattern recognition, dynamical systems, evolutionary computation, and fractal analysis. Dr. Tiňo is a recipient of the U.K.–Hong Kong Fellowship for Excellence, 2008. He was awarded the Outstanding Paper of the Year by IEEE TRANSACTIONS ON NEURAL NETWORKS with T. Lin, B. G. Horne, and C. L. Giles in 1998 for the work on recurrent neural networks. He won the Best Paper Award in 2002 at the International Conference on Artificial Neural Networks with B. Hammer. In 2010, he co-authored a paper with S. Y. Chong and X. Yao that won the 2011 IEEE Computational Intelligence Society Award for the Outstanding Paper Published in IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION.
Bounded H∞ Synchronization and State Estimation for Discrete Time-Varying Stochastic Complex Networks Over a Finite Horizon Bo Shen, Zidong Wang, Senior Member, IEEE, and Xiaohui Liu
Abstract— In this paper, new synchronization and state estimation problems are considered for an array of coupled discrete time-varying stochastic complex networks over a finite horizon. A novel concept of bounded H∞ synchronization is proposed to handle the time-varying nature of the complex networks. Such a concept captures the transient behavior of the time-varying complex network over a finite horizon, where the degree of bounded synchronization is quantified in terms of the H∞-norm. A general sector-like nonlinear function is employed to describe the nonlinearities existing in the network. By utilizing a time-varying real-valued function and the Kronecker product, criteria are established that ensure the bounded H∞ synchronization in terms of a set of recursive linear matrix inequalities (RLMIs), where the RLMIs can be computed recursively by employing available MATLAB toolboxes. The bounded H∞ state estimation problem is then studied for the same complex network, where the purpose is to design a state estimator to estimate the network states through available output measurements such that, over a finite horizon, the dynamics of the estimation error is guaranteed to be bounded with a given disturbance attenuation level. Again, an RLMI approach is developed for the state estimation problem. Finally, two simulation examples are exploited to show the effectiveness of the results derived in this paper.
Index Terms— Bounded H∞ synchronization, complex networks, discrete-time networks, finite horizon, recursive linear matrix inequalities, stochastic networks, time-varying networks, transient behavior.
I. INTRODUCTION

Complex networks are made up of interconnected nodes and are used to describe many systems of the real world, such as the World Wide Web, telephone call
Manuscript received May 17, 2010; revised August 28, 2010; accepted October 28, 2010. Date of publication November 18, 2010; date of current version January 4, 2011. This work was supported in part by the Engineering and Physical Sciences Research Council of the U.K. under Grant GR/S27658/01, the National Natural Science Foundation of China under Grant 61028008 and Grant 60974030, the National 973 Program of China under Grant 2009CB320600, the International Science and Technology Cooperation Project of China under Grant 2009DFA32050, and the Alexander von Humboldt Foundation of Germany.
B. Shen is with the School of Information Science and Technology, Donghua University, Shanghai 200051, China (e-mail: [email protected]).
Z. Wang is with the Department of Information Systems and Computing, Brunel University, Uxbridge, Middlesex UB8 3PH, U.K. He is also with the School of Information Science and Technology, Donghua University, Shanghai 200051, China (e-mail: [email protected]).
X. Liu is with the Department of Information Systems and Computing, Brunel University, Uxbridge, Middlesex UB8 3PH, U.K. (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNN.2010.2090669
graphs, neural networks, scientific citation webs, etc. Since the discoveries of the "small-world" and "scale-free" properties of complex networks [1], [2], complex networks have become a focus of research and have attracted increasing attention from various fields of science and engineering. In particular, special attention has been paid to the synchronization problem for dynamical complex networks, in which each node is regarded as a dynamical element. It has been shown that synchronization is ubiquitous in many system models of the natural world, for example, large-scale complex networks of chaotic oscillators [3]–[10], coupled systems exhibiting spatiotemporal chaos and autowaves [11], [12], and arrays of coupled neural networks [13]–[21]. Recently, the synchronization problem for discrete-time stochastic complex networks has drawn much research attention, since it is rather challenging to understand the interaction topology of complex networks because of the discrete and random nature of the network topology [22]. On one hand, discrete-time networks are more suitable for modeling digitally transmitted signals in many application areas such as image processing, time-series analysis, quadratic optimization, and system identification. On the other hand, the stochastic disturbances over a real complex network may result from probabilistic causes such as the release of neurotransmitters [23], random phase-coupled oscillators [24], and packet dropouts [25]. A great number of results on the stochastic synchronization problem for discrete-time complex networks have appeared in the recent literature. For example, in [26], the problem of stochastic synchronization analysis has been investigated for a new array of coupled discrete-time stochastic complex networks with randomly occurring nonlinearities and time delays.
The synchronization stability problem has been studied in [27] for a class of complex dynamical networks with Markovian jumping parameters and mixed time delays. In [28], the delay-distribution-dependent stability has been discussed for stochastic discrete-time neural networks with randomly mixed time-varying delays. Although the synchronization problem for discrete-time stochastic complex networks is now attracting increasing research attention, several open problems still deserve further investigation. In the real world, virtually all complex networks are time-varying, that is, the network parameters are explicitly dependent on time. For example, a major challenge in biology is to understand and model, quantitatively, the dynamic topological and functional properties of biological networks. Such time- or condition-specific biological circuitries are referred to as time-varying networks or structurally nonstationary networks, which are common in biological systems. The synchronization problem for time-varying complex networks has received some scattered research interest, where most of the literature has focused on time-varying coupling or time-varying delay terms. For example, in [29], a time-varying complex dynamical network model has been introduced, and it has been revealed that the synchronization of such a model is completely determined by the inner-coupling matrix, and by the eigenvalues and corresponding eigenvectors of the coupling configuration matrix of the network. Very recently, in [30], a class of controlled time-varying complex dynamical networks with similarity has been investigated, and a decentralized holographic-structure controller has been designed to stabilize the network asymptotically at its equilibrium states. It should be pointed out that, up to now, general synchronization results for complex networks with time-varying network parameters have been very few, especially when the networks exhibit both discrete-time and stochastic natures. In fact, for a truly time-varying discrete stochastic complex network, it is often theoretically difficult and practically unnecessary to establish easy-to-verify criteria ensuring global or asymptotic synchronization (steady-state behavior). Instead, we are more interested in the transient behavior over a finite time interval, e.g., the boundedness of the synchronization errors in the mean square and the disturbance-rejection attenuation level of the error evolution. For example, in biological networks, gene promoters can be in various epigenetic states and undergo interactions with many molecules in a highly transient, probabilistic, and combinatorial way, and therefore the resulting complex dynamics can only be analyzed within a finite period [31].
Despite its clear engineering insight, the synchronization problem for time-varying discrete stochastic complex networks poses some fundamental difficulties. 1) How can we define the synchronization concept over a finite horizon? 2) How can we quantify the attenuation level of the synchronization against exogenous disturbances? 3) How can we develop an effective technique to derive mathematically verifiable synchronization criteria? These questions may well explain why the synchronization problem for time-varying complex networks with or without stochastic disturbances is still open, and such a situation constitutes the first motivation of our current investigation. Closely associated with the synchronization problem is the so-called state estimation problem for complex networks. For large-scale complex networks, it is quite common that only partial information about the network nodes (states) is accessible from the network outputs. Therefore, in order to make use of key network nodes in practice, it becomes necessary to estimate the network states through available measurements. Note that the state estimation problem for neural networks (a special class of complex networks) was first addressed in [32] and has since drawn particular research interest (see [33], [34]), where the networks are deterministic and continuous-time. Recently, the state estimation problem for complex networks has also gained much attention; see [35]. When it comes to the transient behaviors of time-varying complex networks,
similar to the synchronization problem, two natural questions are how to define the estimation error over a finite horizon in a quantitative way and how to establish existence conditions for the desired estimators. It is, therefore, the second motivation of our paper to offer satisfactory answers to these two questions. In this paper, we aim to deal with the synchronization and state estimation problems for an array of coupled discrete time-varying stochastic complex networks over a finite horizon. The contribution of this paper is mainly twofold: 1) a novel concept of bounded H∞ synchronization is proposed to reflect the time-varying nature of the complex networks and to quantify the attenuation level of the disturbance rejection via the H∞-norm, and 2) both the synchronization and state estimation problems are solved by utilizing a time-varying real-valued function, the Kronecker product, and recursive linear matrix inequalities (RLMIs). Rather than the commonly used Lipschitz-type function, a more general sector-like nonlinear function is employed to describe the nonlinearities existing in the network. We first define the concept of bounded H∞ synchronization for stochastic complex networks in the discrete-time domain. By utilizing a time-varying real-valued function and the Kronecker product, we show that the addressed synchronization problem can be converted into the feasibility problem of a set of RLMIs. We then turn to the state estimation problem for the same complex networks. Through available output measurements, we aim to design a state estimator to estimate the network states such that the dynamics of the estimation error is bounded in an H∞ sense. Again, an RLMI approach is used, with the main proof omitted, for the state estimation case. Two simulation examples are provided to show the usefulness of the proposed synchronization and state estimation schemes.
Notation: The notation used here is fairly standard except where otherwise stated.
ℝⁿ denotes the n-dimensional Euclidean space. ‖A‖ refers to the norm of a matrix A defined by ‖A‖ = √(trace(AᵀA)). The notation X ≥ Y (respectively, X > Y), where X and Y are real symmetric matrices, means that X − Y is positive semidefinite (respectively, positive definite). Mᵀ represents the transpose of the matrix M. I denotes the identity matrix of compatible dimension. diag{···} stands for a block-diagonal matrix, and the notation diagₙ{∗} is employed to stand for diag{∗, …, ∗} with n blocks. Moreover, we fix a probability space (Ω, F, Prob), where Prob, the probability measure, has total mass 1. E{x} stands for the expectation of the stochastic variable x with respect to the given probability measure Prob. The asterisk ∗ in a matrix is used to denote a term induced by symmetry. Matrices, if they are not explicitly specified, are assumed to have compatible dimensions.
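The matrix norm used here is thus the Frobenius norm, which reduces to the square root of the sum of squared entries. A minimal sketch (the function name is ours, not from the paper):

```python
import math

def frob_norm(A):
    """Frobenius norm ||A|| = sqrt(trace(A^T A)) of a matrix given as a
    list of rows; trace(A^T A) is simply the sum of all squared entries."""
    return math.sqrt(sum(a * a for row in A for a in row))

A = [[3.0, 0.0],
     [4.0, 0.0]]
print(frob_norm(A))  # 5.0
```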
II. PROBLEM FORMULATION AND PRELIMINARIES

Let a finite discrete time horizon be given as [0 N] := {0, 1, 2, …, N}. Consider the following array of stochastic discrete time-varying complex networks consisting of M coupled nodes:

  x_i(k + 1) = f(k, x_i(k)) + Σ_{j=1}^{M} w_ij Γ x_j(k) + B_i(k)v(k) + g_i(k, x_i(k))ω(k),  i = 1, 2, …, M   (1)

with outputs

  z_i(k) = E(k)x_i(k),  i = 1, 2, …, M   (2)

where x_i(k) ∈ ℝⁿ is the state vector of the ith node, z_i(k) ∈ ℝᵐ is the controlled output of the ith node, Γ = diag{r₁, r₂, …, rₙ} is the inner-coupling matrix linking the jth state variable if r_j ≠ 0, and W = (w_ij)_{M×M} is the coupling configuration matrix of the network with w_ij ≥ 0 (i ≠ j) but not all zero. As usual, the coupling configuration matrix W is symmetric (i.e., W = Wᵀ) and satisfies

  Σ_{j=1}^{M} w_ij = Σ_{j=1}^{M} w_ji = 0,  i = 1, 2, …, M.   (3)
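Condition (3), together with symmetry, can be checked mechanically for any candidate coupling matrix. A small Python sketch (the −0.3/0.1 values anticipate the 4-node example of Section V):

```python
M = 4  # number of coupled nodes

# Coupling configuration matrix W = (w_ij): symmetric, nonnegative
# off-diagonal entries, and diagonal chosen so that every row (and, by
# symmetry, every column) sums to zero, as required by condition (3).
W = [[-0.3 if i == j else 0.1 for j in range(M)] for i in range(M)]

assert all(W[i][j] == W[j][i] for i in range(M) for j in range(M))          # W = W^T
assert all(abs(sum(W[i])) < 1e-12 for i in range(M))                        # zero row sums
assert all(abs(sum(W[i][j] for i in range(M))) < 1e-12 for j in range(M))   # zero column sums
```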
ω(k) is a one-dimensional, zero-mean Gaussian white noise sequence on a probability space (Ω, F, Prob) with E{ω²(k)} = 1. Let (Ω, F, {F_k}_{k∈[0 N]}, Prob) be a filtered probability space, where {F_k}_{k∈[0 N]} is the family of sub-σ-algebras of F generated by {ω(k)}_{k∈[0 N]}. In fact, each F_k is assumed to be the minimal σ-algebra generated by {ω(i)}_{0≤i≤k−1}, while F₀ is assumed to be some given sub-σ-algebra of F, independent of F_k for all 1 ≤ k ≤ N [36], and the initial value x_i(0) (i = 1, 2, …, M) belongs to F₀. For the exogenous disturbance input v(k) ∈ ℝ^q, it is assumed that v = {v(k)}_{k∈[0 N]} ∈ l₂([0 N], ℝ^q), where l₂([0 N], ℝ^q) is the space of nonanticipatory square-summable stochastic processes v = {v(k)}_{k∈[0 N]} with respect to {F_k}_{k∈[0 N]} with the norm

  ‖v‖²_{[0 N]} = E{ Σ_{k=0}^{N} ‖v(k)‖² } = Σ_{k=0}^{N} E{‖v(k)‖²}.

The nonlinear vector-valued function f : [0 N] × ℝⁿ → ℝⁿ is assumed to be continuous and to satisfy the following sector-bounded condition [26], [35]:

  [f(k, x) − f(k, y) − U₁(k)(x − y)]ᵀ [f(k, x) − f(k, y) − U₂(k)(x − y)] ≤ 0,  ∀x, y ∈ ℝⁿ   (4)

for all k ∈ [0 N], where U₁(k) and U₂(k) are real matrices of appropriate dimensions. The noise intensity function g_i : [0 N] × ℝⁿ → ℝⁿ is continuous and satisfies the conditions

  g_i(k, 0) = 0,  ‖g_i(k, x) − g_j(k, y)‖² ≤ ‖V(k)(x − y)‖²,  ∀x, y ∈ ℝⁿ   (5)

for all k ∈ [0 N] and i, j = 1, 2, …, M, where V(k) is a constant matrix.
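The sector condition (4) is easy to verify numerically for scalar instances. The sketch below samples the scalar nonlinearity f(x) = 0.25x − tanh(0.15x) used in the example of Section V; its derivative stays in [0.1, 0.25], so it lies in the sector with u₁ = 0.25 and u₂ = 0.1 (scalar stand-ins for the matrices U₁(k) and U₂(k)):

```python
import math
import random

def f(x):
    # f'(x) = 0.25 - 0.15*sech^2(0.15 x), which lies in [0.1, 0.25]
    return 0.25 * x - math.tanh(0.15 * x)

u1, u2 = 0.25, 0.1  # sector bounds

random.seed(0)
for _ in range(10000):
    x, y = random.uniform(-5.0, 5.0), random.uniform(-5.0, 5.0)
    d = f(x) - f(y)
    e = x - y
    # Scalar form of (4): the product must be nonpositive.
    assert (d - u1 * e) * (d - u2 * e) <= 1e-12
```

By the mean value theorem, d = f′(ξ)e for some ξ, so the product equals e²(f′(ξ) − u₁)(f′(ξ) − u₂) ≤ 0 whenever f′ stays inside [u₂, u₁].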
For the purpose of simplicity, we introduce the following notations:

  x(k) = [x₁ᵀ(k) x₂ᵀ(k) ⋯ x_Mᵀ(k)]ᵀ
  B(k) = [B₁ᵀ(k) B₂ᵀ(k) ⋯ B_Mᵀ(k)]ᵀ
  F(k, x(k)) = [fᵀ(k, x₁(k)) fᵀ(k, x₂(k)) ⋯ fᵀ(k, x_M(k))]ᵀ
  G(k, x(k)) = [g₁ᵀ(k, x₁(k)) g₂ᵀ(k, x₂(k)) ⋯ g_Mᵀ(k, x_M(k))]ᵀ.   (6)

By using the Kronecker product, the complex network (1) can be rewritten in the following compact form:

  x(k + 1) = F(k, x(k)) + (W ⊗ Γ)x(k) + B(k)v(k) + G(k, x(k))ω(k).   (7)
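The equivalence between the node-wise form (1) and the Kronecker compact form (7) can be sanity-checked numerically. A self-contained Python sketch with a hypothetical 3-node, 2-state network (all names and values are illustrative):

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

M, n = 3, 2
W = [[-0.2, 0.1, 0.1], [0.1, -0.2, 0.1], [0.1, 0.1, -0.2]]  # coupling matrix
G = [[0.1, 0.0], [0.0, 0.1]]                                 # inner-coupling Gamma
x = [0.3, -0.1, 0.2, 0.5, -0.4, 0.1]                         # stacked x = [x_1; x_2; x_3]

# Compact form: (W (x) Gamma) x
lhs = matvec(kron(W, G), x)

# Node-wise form: sum_j w_ij * Gamma * x_j, stacked over i
xs = [x[i * n:(i + 1) * n] for i in range(M)]
rhs = []
for i in range(M):
    acc = [0.0] * n
    for j in range(M):
        gx = matvec(G, xs[j])
        acc = [a + W[i][j] * g for a, g in zip(acc, gx)]
    rhs += acc

assert all(abs(l - r) < 1e-9 for l, r in zip(lhs, rhs))
```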
To proceed, we introduce the following definition of bounded H∞ synchronization.
Definition 1: The stochastic discrete time-varying complex network (1) or (7) is said to be boundedly H∞-synchronized with a disturbance attenuation γ over a finite horizon [0 N] if the following holds:

  Σ_{1≤i<j≤M} ‖z_i − z_j‖²_{[0 N]} ≤ γ²{ ‖v‖²_{[0 N]} + E{xᵀ(0)Sx(0)} }   (8)

for the given positive scalar γ > 0 and positive definite matrix S = Sᵀ > 0.
Remark 1: In the past few years, the synchronization problems of complex networks have been well studied over the infinite time horizon (see [35]), where all synchronization errors between the subsystems of a complex network are required to approach zero asymptotically. However, for the inherently time-varying complex networks addressed in this paper, we are more interested in the transient behavior of the synchronization over a specified time interval. In other words, we would like to examine the transient behavior over a finite horizon rather than the steady-state property over an infinite horizon. For this purpose, we make one of the first few attempts to define the notion of bounded H∞ synchronization with a disturbance attenuation level so as to characterize the performance requirement of the synchronization over a finite horizon. Notice that, if the constraint (8) is met, then the synchronization error between any pair of subsystems of the complex network is guaranteed to be bounded. Furthermore, the H∞ performance index γ > 0 is used to quantify the attenuation level of the synchronization error dynamics against exogenous disturbances. In this paper, our aim is to investigate the bounded H∞ synchronization problem and establish easy-to-verify criteria for the stochastic discrete time-varying complex network (1) over a finite time horizon. Later, we shall address the finite-horizon H∞ state estimation problem by designing finite-horizon H∞ estimators for the stochastic discrete time-varying complex network (1).
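On sampled trajectories, inequality (8) can be evaluated directly. The sketch below is illustrative only (the function name and data layout are ours, and a single sample path stands in for the expectations; in practice one would average over many noise realizations):

```python
def sq_norm(v):
    return sum(vi * vi for vi in v)

def check_bounded_hinf_sync(z, v, x0, S, gamma):
    """Evaluate inequality (8) on one sampled trajectory.

    z[i][k]: output of node i at time k; v[k]: disturbance at time k;
    x0: stacked initial state; S: positive definite weight matrix;
    gamma: prescribed disturbance attenuation level."""
    M = len(z)
    steps = len(v)  # time points in [0, N]
    lhs = sum(sq_norm([a - b for a, b in zip(z[i][k], z[j][k])])
              for i in range(M) for j in range(i + 1, M) for k in range(steps))
    x0_S_x0 = sum(x0[r] * S[r][c] * x0[c]
                  for r in range(len(x0)) for c in range(len(x0)))
    rhs = gamma ** 2 * (sum(sq_norm(vk) for vk in v) + x0_S_x0)
    return lhs <= rhs

# Toy check: two nodes with identical outputs are trivially synchronized.
z = [[[0.0], [0.0]], [[0.0], [0.0]]]
v = [[0.1], [0.1]]
assert check_bounded_hinf_sync(z, v, x0=[0.0, 0.0],
                               S=[[1.0, 0.0], [0.0, 1.0]], gamma=1.0)
```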
III. BOUNDED H∞ SYNCHRONIZATION OF DISCRETE TIME-VARYING COMPLEX NETWORKS

In this section, we deal with the bounded H∞ synchronization problem for the stochastic discrete time-varying complex network (1) with a given disturbance attenuation level over a finite time horizon. The following lemma is important and will be used in the sequel.
Lemma 1 [35]: Let U = (α_ij)_{M×M}, P ∈ ℝ^{n×n}, x = [x₁ᵀ x₂ᵀ ⋯ x_Mᵀ]ᵀ, and y = [y₁ᵀ y₂ᵀ ⋯ y_Mᵀ]ᵀ with x_i, y_i ∈ ℝⁿ (i = 1, 2, …, M). If Uᵀ = U and each row sum of U is zero, then

  xᵀ(U ⊗ P)y = − Σ_{1≤i<j≤M} α_ij (x_i − x_j)ᵀ P (y_i − y_j).   (9)

The following theorem provides a sufficient condition under which the complex network (1) is boundedly H∞-synchronized with the given disturbance attenuation level over a finite time horizon.
Theorem 1: Let the positive scalar γ > 0 and the positive definite matrix S = Sᵀ > 0 be given. The stochastic discrete time-varying complex network (1) or (7) is boundedly H∞-synchronized with the disturbance attenuation γ over a finite horizon [0 N] if there exist a family of positive definite matrices {P(k)}_{0≤k≤N+1} and two families of positive scalars {λ₁(k)}_{0≤k≤N}, {λ₂(k)}_{0≤k≤N} satisfying the initial condition

  Σ_{1≤i<j≤M} E{(x_i(0) − x_j(0))ᵀ P(0) (x_i(0) − x_j(0))} ≤ γ² E{xᵀ(0)Sx(0)}   (10)

and the RLMIs

  Ω_ij(k) = [ Φ⁽¹⁾_ij(k)   Φ⁽²⁾_ij(k)           0                    Φ⁽³⁾_ij(k)
              ∗            P(k+1) − λ₁(k)I      0                    Φ⁽⁴⁾_ij(k)
              ∗            ∗                    P(k+1) − λ₂(k)I      0
              ∗            ∗                    ∗                    Φ⁽⁵⁾_ij(k) ] ≤ 0   (11)

for all 0 ≤ k ≤ N and 1 ≤ i < j ≤ M, where

  Φ⁽¹⁾_ij(k) = −M w⁽²⁾_ij Γᵀ P(k+1) Γ − P(k) + Eᵀ(k)E(k) − λ₁(k)Ũ₁(k) + λ₂(k)Vᵀ(k)V(k)
  Φ⁽²⁾_ij(k) = −M w_ij Γᵀ P(k+1) − λ₁(k)Ũ₂(k)
  Φ⁽³⁾_ij(k) = −M w_ij Γᵀ P(k+1) B_ij(k)
  Φ⁽⁴⁾_ij(k) = P(k+1) B_ij(k)
  Φ⁽⁵⁾_ij(k) = −(2γ² / (M(M − 1))) I + B_ijᵀ(k) P(k+1) B_ij(k)
  Ũ₁(k) = (U₁ᵀ(k)U₂(k) + U₂ᵀ(k)U₁(k)) / 2
  Ũ₂(k) = −(U₁ᵀ(k) + U₂ᵀ(k)) / 2
  B_ij(k) = B_i(k) − B_j(k),   w⁽²⁾_ij = Σ_{l=1}^{M} w_il w_lj.   (12)

Proof: See Appendix I.
Remark 2: It should be pointed out that the RLMI technique [37], [38] serves as an effective approach to investigating H∞ filtering and control problems over a finite time horizon. In Theorem 1, the RLMI approach has been applied, for the first time, to the synchronization problem for discrete time-varying stochastic complex networks, and a criterion has been derived for testing bounded H∞ synchronization in terms of a set of RLMIs.
Remark 3: Different from the infinite-time-horizon case, the asymptotic behavior of the synchronization error need not be analyzed for a time-varying complex network over a finite time horizon; the synchronization criterion in Theorem 1 therefore takes care of the boundedness of the synchronization error but does not actually guarantee its convergence. In case the considered complex network is time-invariant and its steady-state property over an infinite horizon is of concern, an LMI-based asymptotic synchronization criterion can easily be deduced from the RLMIs (11) by taking the variables P(k), λ₁(k), and λ₂(k) to be constants P, λ₁, and λ₂, respectively.

IV. FINITE-HORIZON H∞ STATE ESTIMATION FOR DISCRETE TIME-VARYING COMPLEX NETWORKS

In this section, the finite-horizon H∞ state estimation problem is first formulated for the stochastic discrete time-varying complex network (1), and then an array of time-varying H∞ estimators is designed by using the RLMI approach. Suppose that the measurements of the complex network (1) are of the form

  y_i(k) = C_i(k)x_i(k) + D_i(k)v(k),  i = 1, 2, …, M   (13)

where y_i(k) ∈ ℝʳ is the measured output vector of the ith node of the complex network. Based on the measurements y_i(k) (i = 1, 2, …, M), we construct the following state estimator:

  x̂_i(k + 1) = f(k, x̂_i(k)) + Σ_{j=1}^{M} w_ij Γ x̂_j(k) + K_i(k)(y_i(k) − C_i(k)x̂_i(k))
  ẑ_i(k) = E(k)x̂_i(k),  i = 1, 2, …, M   (14)

where x̂_i(k) ∈ ℝⁿ is the estimate of the network state x_i(k), ẑ_i(k) ∈ ℝᵐ is the estimate of the output z_i(k), and K_i(k) ∈ ℝ^{n×r} is the estimator parameter to be designed. The initial values of the estimators are assumed to be zero, i.e., x̂_i(0) = 0 for all i = 1, 2, …, M. By setting the estimation error e_i = x_i − x̂_i and the filtering error z̃_i = z_i − ẑ_i, the error dynamics of the complex network can
be obtained from (1), (13), and (14) as follows:

  e_i(k + 1) = −K_i(k)C_i(k)e_i(k) + f̃(k, e_i(k)) + Σ_{j=1}^{M} w_ij Γ e_j(k) + (B_i(k) − K_i(k)D_i(k))v(k) + g_i(k, e_i(k) + x̂_i(k))ω(k)
  z̃_i(k) = E(k)e_i(k)   (15)

where f̃(k, e_i(k)) = f(k, x_i(k)) − f(k, x̂_i(k)). Introducing the notations

  x̂(k) = [x̂₁ᵀ(k) x̂₂ᵀ(k) ⋯ x̂_Mᵀ(k)]ᵀ
  e(k) = [e₁ᵀ(k) e₂ᵀ(k) ⋯ e_Mᵀ(k)]ᵀ
  z̃(k) = [z̃₁ᵀ(k) z̃₂ᵀ(k) ⋯ z̃_Mᵀ(k)]ᵀ
  K(k) = diag{K₁(k), K₂(k), …, K_M(k)}
  C(k) = diag{C₁(k), C₂(k), …, C_M(k)}
  D(k) = [D₁ᵀ(k) D₂ᵀ(k) ⋯ D_Mᵀ(k)]ᵀ
  Ē(k) = diag_M{E(k)}
  F̃(k, e(k)) = [f̃ᵀ(k, e₁(k)) f̃ᵀ(k, e₂(k)) ⋯ f̃ᵀ(k, e_M(k))]ᵀ   (16)

we can rewrite the error dynamics (15) in the following compact form:

  e(k + 1) = (−K(k)C(k) + W ⊗ Γ)e(k) + F̃(k, e(k)) + (B(k) − K(k)D(k))v(k) + G(k, e(k) + x̂(k))ω(k)
  z̃(k) = Ē(k)e(k)   (17)

where B(k) and G(k, x(k)) are defined in (6). We aim to design the time-varying estimators (14) for the stochastic discrete time-varying complex network (1) such that the filtering error z̃(k) satisfies the following H∞ performance constraint:

  ‖z̃‖²_{[0 N]} ≤ γ²{ ‖v‖²_{[0 N]} + E{eᵀ(0)Se(0)} }   (18)

for the given disturbance attenuation level γ > 0 and positive definite matrix S = Sᵀ > 0. For later use, we also define

  Π₁(k) = −P(k) + Ēᵀ(k)Ē(k) − ε₁(k)Ũ₁(k) + ε₂(k)V̄ᵀ(k)V̄(k)
  Π₂(k) = −Cᵀ(k)Kᵀ(k)P(k + 1) + (W ⊗ Γ)ᵀP(k + 1)
  Π₃(k) = Bᵀ(k)P(k + 1) − Dᵀ(k)Kᵀ(k)P(k + 1)
  Π₄(k) = µ(k + 1) − µ(k) + ε₂(k)x̂ᵀ(k)V̄ᵀ(k)V̄(k)x̂(k).

In the following theorem, a sufficient condition is given to guarantee that the filtering error satisfies the H∞ performance constraint (18).
Theorem 2: Let the scalar γ > 0, the positive definite matrix S = Sᵀ > 0, and the estimator parameters K_i(k) (i = 1, 2, …, M) be given. The filtering error z̃(k) satisfies the H∞ performance constraint (18) if there exist a family of positive definite matrices {P(k)}_{0≤k≤N+1} and three families of positive scalars {ε₁(k)}_{0≤k≤N}, {ε₂(k)}_{0≤k≤N}, {µ(k)}_{0≤k≤N+1} satisfying the initial condition

  E{eᵀ(0)P(0)e(0)} + µ(0) ≤ γ² E{eᵀ(0)Se(0)}   (19)

and the RLMIs

  [ Π₁(k)   −ε₁(k)Ũ₂(k)   0       0          ε₂(k)V̄ᵀ(k)V̄(k)x̂(k)   Π₂(k)      0
    ∗        −ε₁(k)I       0       0          0                      P(k+1)     0
    ∗        ∗             −γ²I    0          0                      Π₃(k)      0
    ∗        ∗             ∗       −ε₂(k)I    0                      0          P(k+1)
    ∗        ∗             ∗       ∗          Π₄(k)                  0          0
    ∗        ∗             ∗       ∗          ∗                      −P(k+1)    0
    ∗        ∗             ∗       ∗          ∗                      ∗          −P(k+1) ] ≤ 0   (20)

for all 0 ≤ k ≤ N, where

  Ũ₁(k) = (Ū₁ᵀ(k)Ū₂(k) + Ū₂ᵀ(k)Ū₁(k)) / 2
  Ũ₂(k) = −(Ū₁ᵀ(k) + Ū₂ᵀ(k)) / 2
  Ū₁(k) = diag_M{U₁(k)},  Ū₂(k) = diag_M{U₂(k)},  V̄(k) = diag_M{V(k)}.   (21)

Proof: See Appendix II.
After establishing the analysis results, we are now ready to deal with the design of the finite-horizon H∞ estimators for the stochastic complex network (1). The following result can readily be derived from Theorem 2, and its proof is therefore omitted to save space.
Theorem 3: Let the scalar γ > 0 and the positive definite matrix S = Sᵀ > 0 be given. The finite-horizon H∞ estimation problem is solvable for the time-varying stochastic complex network (1) if there exist a family of positive definite block-diagonal matrices {P(k) = diag{P₁(k), P₂(k), …, P_M(k)}}_{0≤k≤N+1}, a family of block-diagonal matrices {X(k) = diag{X₁(k), X₂(k), …, X_M(k)}}_{0≤k≤N}, and three families of positive scalars {ε₁(k)}_{0≤k≤N}, {ε₂(k)}_{0≤k≤N}, {µ(k)}_{0≤k≤N+1} satisfying the initial condition (19) and the RLMIs

  [ Π₁(k)   −ε₁(k)Ũ₂(k)   0       0          ε₂(k)V̄ᵀ(k)V̄(k)x̂(k)   Π̄₂(k)      0
    ∗        −ε₁(k)I       0       0          0                      P(k+1)     0
    ∗        ∗             −γ²I    0          0                      Π̄₃(k)      0
    ∗        ∗             ∗       −ε₂(k)I    0                      0          P(k+1)
    ∗        ∗             ∗       ∗          Π₄(k)                  0          0
    ∗        ∗             ∗       ∗          ∗                      −P(k+1)    0
    ∗        ∗             ∗       ∗          ∗                      ∗          −P(k+1) ] ≤ 0   (22)
for all 0 ≤ k ≤ N, where

  Π̄₂(k) = −Cᵀ(k)Xᵀ(k) + (W ⊗ Γ)ᵀP(k + 1)
  Π̄₃(k) = Bᵀ(k)P(k + 1) − Dᵀ(k)Xᵀ(k)   (23)

and Π₁(k), Π₄(k), Ũ₂(k), and V̄(k) are defined in Theorem 2. Furthermore, if (19) and (22) are true, the desired estimators are given by (14) with the parameters

  K_i(k) = P_i⁻¹(k + 1)X_i(k),  i = 1, 2, …, M   (24)

for all 0 ≤ k ≤ N.
Remark 4: In Theorem 3, a criterion is established to ensure the existence of the desired estimator gains, and the explicit expression of these gains is characterized in terms of the solution to a set of RLMIs. Note that such RLMIs can be solved and checked effectively by algorithms such as the interior-point method. The state estimate at the current time is involved in the RLMIs (22), which means that more current information is used to estimate the state at the next time instant. In this sense, the estimator design scheme in terms of the RLMIs (22) can potentially improve the accuracy of the state estimation.

V. ILLUSTRATIVE EXAMPLES

In this section, two simulation examples are presented to demonstrate the effectiveness of the established criteria for the bounded H∞ synchronization and finite-horizon H∞ state estimation problems for the complex network (1). Consider a stochastic time-varying complex network (1) with four nodes over a given finite time horizon k ∈ [0 25]. The coupling configuration matrix is assumed to be W = (w_ij)_{M×M} with

  w_ij = { −0.3, i = j
           0.1,  i ≠ j

and the inner-coupling matrix is given as Γ = diag₄{0.1}. The nonlinear time-varying function f(k, x_i(k)) is chosen as

  f(k, x_i(k)) = { [−0.15x_i1(k) + 0.1x_i2(k) + tanh(0.1x_i1(k));  0.25x_i2(k) − tanh(0.1x_i2(k))],  0 ≤ k < 10
                   [0.25x_i1(k) − tanh(0.15x_i1(k));  0.1x_i2(k)],  10 ≤ k ≤ 25

and the disturbance matrices are taken as

  B₁(k) = [0.14 + 0.1 sin(6(k − 1)); 0.12],  B₂(k) = [−0.13; 0.1],  B₃(k) = [0; −0.15],  B₄(k) = [0.15; 0.1].

The noise intensity function is simplified to g_i(k, x_i(k)) = V_i(k)x_i(k) with

  V_i(k) = [0.09 −0.117; −0.045 0.135],  i = 1, 2, 3, 4.

Then, it is easily verified that

  U₁(k) = { [−0.15 0.1; 0.25 0],  0 ≤ k < 10
            [0.25 0; 0 0.1],      10 ≤ k ≤ 25

  U₂(k) = { [−0.05 0.1; 0.15 0],  0 ≤ k < 10
            [0.1 0; 0 0.1],       10 ≤ k ≤ 25

and V(k) = [0.09 −0.117; −0.045 0.135]. We are now ready to deal with the bounded H∞ synchronization problem as well as the finite-horizon H∞ state estimation problem over the given finite horizon for the complex network (1) with the above parameters.
Example 1: In this example, we test the bounded H∞ synchronization of the complex network based on the established criterion. Set the initial values of the complex network as x₁(0) = [0.1 −0.15]ᵀ, x₂(0) = [0.15 −0.1]ᵀ, x₃(0) = [0.2 −0.1]ᵀ, and x₄(0) = [0.1 −0.2]ᵀ. Let the disturbance attenuation level and the positive definite matrix be γ = 0.7071 and S = diag₈{1}, respectively. In order to check whether this complex network is boundedly H∞-synchronized with the given disturbance attenuation level γ, we first choose the initial positive definite matrix P(0) = diag₂{1} to satisfy the initial condition (10). Then the set of RLMIs (11) in Theorem 1 can be solved recursively by using MATLAB (with YALMIP 3.0), and a set of feasible solutions is obtained as shown in Table I. According to Theorem 1, the array of stochastic discrete time-varying complex networks can reach bounded H∞ synchronization with the given disturbance attenuation level γ. In the simulation, the exogenous disturbance input v(k) is selected as a random variable uniformly distributed over [−0.25 0.25]. The simulation results are presented in Fig. 1, which plots the synchronization errors between the output z₁(k) and the outputs z_i(k) (i = 2, 3, 4). It can be seen from Fig. 1 that all synchronization errors are indeed bounded, which verifies the effectiveness of the synchronization criterion proposed in Theorem 1.

Fig. 1. Synchronization errors between z₁(k) and z_i(k) (i = 2, 3, 4).

Remark 5: Recently, considerable research efforts have been devoted to the synchronization problems of complex networks, and various synchronization concepts have been
TABLE I
VARIABLES P(k), λ₁(k), AND λ₂(k)

  k = 0:  P(k) = [1 0; 0 1],                        λ₁(k) = 1.0879, λ₂(k) = 1.1482
  k = 1:  P(k) = [0.1922 −0.0019; −0.0019 0.2030],  λ₁(k) = 0.7973, λ₂(k) = 0.8180
  k = 2:  P(k) = [0.1410 −0.0008; −0.0008 0.1480],  λ₁(k) = 0.8296, λ₂(k) = 0.8484
  k = 3:  P(k) = [0.1477 −0.0003; −0.0003 0.1544],  λ₁(k) = 0.8351, λ₂(k) = 0.8544
  k = 4:  P(k) = [0.1496 0.0000; 0.0000 0.1560],    λ₁(k) = 0.8367, λ₂(k) = 0.8561
  k = 5:  P(k) = [0.1503 0.0002; 0.0002 0.1565],    λ₁(k) = 0.8372, λ₂(k) = 0.8565
  ⋮
  k = 21: P(k) = [0.1790 −0.0045; −0.0045 0.1797],  λ₁(k) = 0.8217, λ₂(k) = 0.8318
  k = 22: P(k) = [0.1721 −0.0036; −0.0036 0.1709],  λ₁(k) = 0.8926, λ₂(k) = 0.9029
  k = 23: P(k) = [0.1904 −0.0032; −0.0032 0.1874],  λ₁(k) = 0.8364, λ₂(k) = 0.8471
  k = 24: P(k) = [0.1825 −0.0024; −0.0024 0.1786],  λ₁(k) = 0.8314, λ₂(k) = 0.8418
  k = 25: P(k) = [0.1832 −0.0019; −0.0019 0.1786],  λ₁(k) = 0.8328, λ₂(k) = 0.8432

TABLE II
ESTIMATOR PARAMETERS K_i(k) (i = 1, 2, 3, 4)

  k = 0:  K₁ = [0.1208; 0.1461], K₂ = [−0.0697; 0.1414], K₃ = [−0.0604; −0.2137], K₄ = [0.1615; 0.1459]
  k = 1:  K₁ = [0.0847; 0.1299], K₂ = [−0.1207; 0.1047], K₃ = [−0.0077; −0.1569], K₄ = [0.1452; 0.1127]
  k = 2:  K₁ = [0.0634; 0.1300], K₂ = [−0.1134; 0.1090], K₃ = [−0.0462; −0.1892], K₄ = [0.1673; 0.1344]
  k = 3:  K₁ = [0.0469; 0.1191], K₂ = [−0.1230; 0.1046], K₃ = [−0.0051; −0.1542], K₄ = [0.1485; 0.1019]
  k = 4:  K₁ = [0.0411; 0.1209], K₂ = [−0.1282; 0.1009], K₃ = [−0.0016; −0.1512], K₄ = [0.1496; 0.1013]
  k = 5:  K₁ = [0.0398; 0.1253], K₂ = [−0.1093; 0.1137], K₃ = [−0.0193; −0.1640], K₄ = [0.1473; 0.1089]
  ⋮
  k = 21: K₁ = [0.1473; 0.1198], K₂ = [−0.1364; 0.1018], K₃ = [0.0063; −0.1510], K₄ = [0.1522; 0.1001]
  k = 22: K₁ = [0.1174; 0.1201], K₂ = [−0.1302; 0.1001], K₃ = [0.0002; −0.1501], K₄ = [0.1502; 0.1001]
  k = 23: K₁ = [0.0933; 0.1198], K₂ = [−0.1370; 0.1024], K₃ = [0.0062; −0.1514], K₄ = [0.1512; 0.1007]
  k = 24: K₁ = [0.0689; 0.1200], K₂ = [−0.1308; 0.1001], K₃ = [0.0008; −0.1500], K₄ = [0.1504; 0.1000]
  k = 25: K₁ = [0.0528; 0.1201], K₂ = [−0.1317; 0.1001], K₃ = [0.0015; −0.1501], K₄ = [0.1512; 0.1001]

proposed, such as asymptotical synchronization [14], [29], exponential synchronization [4], [35], and exponential H∞ synchronization [16]. However, as far as we know, all the synchronization concepts in the existing literature are concerned with the infinite time horizon, and only the asymptotic behavior of the synchronization has been analyzed. As a distinguishing feature, the notion of bounded H∞ synchronization proposed in this paper can be used to characterize the transient behavior of the synchronization over a specified time interval. In other words, the derived bounded H∞ synchronization criterion guarantees that: 1) the synchronization error over a given time interval is bounded, and 2) the influence of the external disturbances and the initial states on the synchronization error is attenuated with a given H∞-norm level γ. This has been well verified by the simulation results of Example 1.
Example 2: In this example, we deal with the finite-horizon H∞ state estimation problem. The initial values of the complex network are set as x₁(0) = [0.1 0.2]ᵀ, x₂(0) = [−0.2 0.1]ᵀ, x₃(0) = [−0.1 −0.15]ᵀ, and x₄(0) = [−0.15 −0.1]ᵀ, the disturbance attenuation level is given as γ = 1, and the positive definite matrix is taken as S = diag₈{5}. We choose the initial positive definite matrices P₁(0) = P₂(0) = P₃(0) = P₄(0) = diag₂{1} and the positive scalar µ(0) = 0.5 to satisfy the initial condition (19). By using MATLAB (with YALMIP 3.0) again, the set of RLMIs (22) in Theorem 3 can be solved recursively, and all desired estimator parameters can be derived. Table II lists all estimator parameters K_i(k) (i = 1, 2, 3, 4), and the variables P_i(k) (i = 1, 2, 3, 4) and µ(k) are shown in Table III. In the simulation, the exogenous disturbance input v(k) is the same as that used in Example 1. The simulation results are presented in Figs. 2–5, which show the outputs z_i(k) and their estimates ẑ_i(k) (i = 1, 2, 3, 4). The simulation confirms that the designed estimators perform very well.

Fig. 2. Output z₁(k) and its estimate ẑ₁(k).

Remark 6: From the above simulation examples, it can be seen that the developed RLMI-based algorithms are implemented with initial variable matrices chosen beforehand to satisfy the conditions (10) and (19). For the synchronization algorithm, the selection of the initial matrices is independent of the initial values of the complex network, which can be seen from the condition (10). In other words, the H∞ synchronization of the complex network depends only on the given attenuation
152
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
TABLE III
VARIABLES Pi(k) (i = 1, 2, 3, 4) AND µ(k)

k = 0:  P1 = [1, 0; 0, 1]; P2 = [1, 0; 0, 1]; P3 = [1, 0; 0, 1]; P4 = [1, 0; 0, 1]; µ = 0.5000
k = 1:  P1 = [6.4331, −1.7597; −1.7597, 6.3939]; P2 = [5.9195, −1.8174; −1.8174, 6.5502]; P3 = [6.4184, −1.4750; −1.4750, 6.8707]; P4 = [6.5249, −1.5846; −1.5846, 6.5494]; µ = 0.4990
k = 2:  P1 = [2.7170, −0.0017; −0.0017, 2.7147]; P2 = [2.7140, −0.0033; −0.0033, 2.7117]; P3 = [2.7162, −0.0021; −0.0021, 2.7155]; P4 = [2.7168, −0.0009; −0.0009, 2.7160]; µ = 0.4987
k = 3:  P1 = [13.3071, −0.5008; −0.5008, 13.3598]; P2 = [11.0833, −1.8382; −1.8382, 12.5355]; P3 = [10.9266, −2.2962; −2.2962, 11.9898]; P4 = [13.1279, −0.6367; −0.6367, 13.2111]; µ = 0.4986
k = 4:  P1 = [41.7495, −1.2458; −1.2458, 41.6259]; P2 = [38.2524, −3.7762; −3.7762, 39.5097]; P3 = [40.6841, −1.6534; −1.6534, 41.3739]; P4 = [42.3505, −0.5179; −0.5179, 42.2990]; µ = 0.4984
k = 5:  P1 = [5.0967, −0.0018; −0.0018, 5.0968]; P2 = [5.0951, −0.0022; −0.0022, 5.0956]; P3 = [5.0962, −0.0017; −0.0017, 5.0968]; P4 = [5.0968, −0.0013; −0.0013, 5.0972]; µ = 0.4980
…
k = 21: P1 = [14.6042, 0.0066; 0.0066, 14.6504]; P2 = [14.5695, 0.0003; 0.0003, 14.6490]; P3 = [14.5482, −0.0053; −0.0053, 14.6513]; P4 = [14.6075, 0.0079; 0.0079, 14.6509]; µ = 0.4942
k = 22: P1 = [75.0034, 0.4867; 0.4867, 77.0554]; P2 = [72.4101, 0.8626; 0.8626, 76.8426]; P3 = [70.4913, 0.2790; 0.2790, 77.1546]; P4 = [75.2171, 0.5131; 0.5131, 77.0856]; µ = 0.4940
k = 23: P1 = [12.1434, 0.0001; 0.0001, 12.1438]; P2 = [12.1444, −0.0004; −0.0004, 12.1438]; P3 = [12.1445, −0.0001; −0.0001, 12.1439]; P4 = [12.1434, 0.0002; 0.0002, 12.1438]; µ = 0.4938
k = 24: P1 = [98.8228, 5.9542; 5.9542, 116.0138]; P2 = [98.8232, 4.0458; 4.0458, 116.3961]; P3 = [91.1770, 1.8811; 1.8811, 117.8131]; P4 = [101.6412, 5.5271; 5.5271, 116.3522]; µ = 0.4937
k = 25: P1 = [39.6120, 0.0099; 0.0099, 39.6777]; P2 = [39.6102, −0.0011; −0.0011, 39.6776]; P3 = [39.5736, −0.0083; −0.0083, 39.6794]; P4 = [39.6328, 0.0087; 0.0087, 39.6789]; µ = 0.4934

Fig. 3. Output z2(k) and its estimate ẑ2(k).
Fig. 4. Output z3(k) and its estimate ẑ3(k).
level γ, but is not affected by the initial values. Although a smaller attenuation level leads to a smaller synchronization error, there exists a lower bound for the attenuation level γ, especially when certain complexities such as parameter uncertainties are present. For the complex network in Example 1, the minimum γ can be computed as γ = 0.4425. On the other hand, for the H∞ estimation algorithm, it can be seen from (19) that the estimation algorithm depends not only on the attenuation level γ but also on the initial values of the complex network. In order to show the effects on the filtering performance caused by different initial values and attenuation levels, some comparative simulation results are presented in Figs. 6–13. Figs. 6–9 plot the filtering errors z̃i (i = 1, 2, 3, 4) with different attenuation levels (γ = 1 and γ = 3), which shows that a smaller attenuation level indeed results in better filtering performance. Moreover, the filtering errors z̃i (i = 1, 2, 3, 4) with different initial values are depicted in Figs. 10–13.

Remark 7: Note that the RLMI approach developed in this paper is based on LMIs. The standard LMI system has a polynomial-time complexity, which is bounded by O(M N³ log(V/ε)), where M is the total row size of the LMI system, N is the total number of scalar decision variables, V is a data-dependent scaling factor, and ε is the relative accuracy set for the algorithm. The computational complexity of the developed RLMI-based algorithm can be easily obtained via the time
SHEN et al.: SYNCHRONIZATION AND STATE ESTIMATION FOR DISCRETE TIME-VARYING STOCHASTIC COMPLEX NETWORKS
Fig. 5. Output z4(k) and its estimate ẑ4(k).
Fig. 8. Filtering error z̃3(k) with different attenuation levels.
Fig. 6. Filtering error z̃1(k) with different attenuation levels.
Fig. 9. Filtering error z̃4(k) with different attenuation levels.
Fig. 7. Filtering error z̃2(k) with different attenuation levels.
complexity of the standard LMI system. For example, let us look at the bounded H∞-synchronization criterion for the complex network (1) (as described in Theorem 1), where the number of network nodes is M, the length of the finite time horizon is N + 1, and the dimensions of the network variables can be seen from xi(k) ∈ R^n, zi(k) ∈ R^m (i = 1, 2, …, M),
Fig. 10. Filtering error z̃1(k) with different initial values.
v(k) ∈ R^q, and ω(k) ∈ R. The RLMI-based synchronization criterion is implemented recursively for N + 1 steps and, at every step, M(M − 1)/2 standard LMIs given by (11) need to be solved. For each of these LMIs, the row size is M = 3n + q and the number of scalar decision variables is N = (n² + n + 4)/2. Therefore, the
Obviously, the computational complexity of the RLMI-based algorithms depends linearly on the length of the finite time horizon and polynomially on the dimensions of the network variables, which means that the overall computational burden is mainly caused by the complexity of the LMI computation. Fortunately, research on LMI optimization is a very active area in the applied mathematics, optimization, and operations research communities, and substantial speedups can be expected in the future.
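As a small worked example of the counts quoted above (our own illustration; the function names are ours, not the paper's): for each of the M(M − 1)/2 LMIs solved at every one of the N + 1 recursion steps, the row size is M = 3n + q and the number of scalar decision variables is N = (n² + n + 4)/2.

```python
def lmi_sizes(n, q):
    """Row size and number of scalar decision variables of one LMI in (11),
    following the counts quoted for Theorem 1: M = 3n + q, N = (n^2 + n + 4)/2."""
    rows = 3 * n + q
    decision_vars = (n * n + n + 4) // 2
    return rows, decision_vars

def total_lmis(M, N):
    """Total number of standard LMIs solved by the recursive criterion:
    M(M - 1)/2 LMIs at each of the N + 1 steps."""
    return (N + 1) * (M * (M - 1) // 2)

# Example 1's setting: n = 2 states per node, q = 1, M = 4 nodes, horizon N = 25.
rows, dvars = lmi_sizes(n=2, q=1)
count = total_lmis(M=4, N=25)
```

The linear dependence on the horizon is visible directly: doubling N (roughly) doubles `count`, while the per-LMI cost is fixed by `rows` and `dvars`.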
VI. CONCLUSION
Fig. 11. Filtering error z̃2(k) with different initial values.
In this paper, we have addressed a novel synchronization problem for a class of discrete time-varying stochastic complex networks over a finite horizon. A notion of bounded H∞-synchronization has first been defined to characterize the transient performance of synchronization. Then, a testing criterion for bounded H∞-synchronization has been established for the considered complex networks in terms of a set of RLMIs. Subsequently, the finite-horizon H∞ state estimation problem has been considered for the complex networks under consideration. By using the RLMI approach, a sufficient condition under which the filtering error satisfies the H∞ performance constraint has been obtained, and then all the desired finite-horizon H∞ estimators have been designed. Finally, two simulation examples have been employed to demonstrate the effectiveness of the results derived in this paper. Further research topics include the extension of our results to more general complex networks with various time delays and to the H∞ estimation problem for complex networks with multiple coupled sensors.

APPENDIX I
Fig. 12. Filtering error z̃3(k) with different initial values.
PROOF OF THEOREM 1
Proof: Define the real-valued function

V(k, x(k)) = x^T(k) (U ⊗ P(k)) x(k)   (25)

where {P(k)}_{0≤k≤N+1} is the solution to the RLMIs (11) with the initial condition (10) and U = (α_ij)_{M×M} with

α_ij = M − 1 if i = j, and α_ij = −1 if i ≠ j.
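A useful identity behind this choice of U, added here for clarity (our addition; it follows directly from U = M I_M − 1 1^T, where 1 is the all-ones vector):

```latex
x^{T}(k)\bigl(U \otimes P(k)\bigr)x(k)
  \;=\; \sum_{1 \le i < j \le M}
  \bigl(x_{i}(k)-x_{j}(k)\bigr)^{T} P(k)\,\bigl(x_{i}(k)-x_{j}(k)\bigr).
```

This is why the increments of V(k, x(k)) below decompose into sums over node pairs (i, j).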
Fig. 13. Filtering error z̃4(k) with different initial values.

computational complexity of the RLMI-based synchronization criterion algorithm can be represented as O(n³M²N + n²M²Nq). Similarly, it is not difficult to calculate that the time complexity of the finite-horizon H∞ state estimation algorithm is O(n³M²N + n²qMN + n²rM²N + nqrMN).

We can calculate

E{V(k + 1, x(k + 1))} − E{V(k, x(k))} + Σ_{1≤i<j≤M} E{‖z_i(k) − z_j(k)‖²} − γ² E{‖v(k)‖²}
= E{ F^T(k, x(k)) (U ⊗ P(k + 1)) F(k, x(k))
   + x^T(k)(W ⊗ Γ)^T (U ⊗ P(k + 1)) (W ⊗ Γ)x(k)
   + G^T(k, x(k)) (U ⊗ P(k + 1)) G(k, x(k))
   + v^T(k)B^T(k) (U ⊗ P(k + 1)) B(k)v(k)
   + 2F^T(k, x(k)) (U ⊗ P(k + 1)) (W ⊗ Γ)x(k)
   + 2F^T(k, x(k)) (U ⊗ P(k + 1)) B(k)v(k)
   + 2x^T(k)(W ⊗ Γ)^T (U ⊗ P(k + 1)) B(k)v(k)
   − x^T(k) (U ⊗ P(k)) x(k)
   + Σ_{1≤i<j≤M} ‖z_i(k) − z_j(k)‖² − γ²‖v(k)‖² }.   (26)

For the purpose of notation simplicity, set

x_ij(k) = x_i(k) − x_j(k)
f_ij(k) = f(k, x_i(k)) − f(k, x_j(k))   (27)
g_ij(k) = g_i(k, x_i(k)) − g_j(k, x_j(k)).

By using Lemma 1 and noting (6), we can obtain that

E{V(k + 1, x(k + 1))} − E{V(k, x(k))} + Σ_{1≤i<j≤M} E{‖z_i(k) − z_j(k)‖²} − γ² E{‖v(k)‖²}
= Σ_{1≤i<j≤M} E{ f_ij^T(k)P(k + 1)f_ij(k)
   − M w_ij^(2) x_ij^T(k)Γ^T P(k + 1)Γ x_ij(k)
   + g_ij^T(k)P(k + 1)g_ij(k)
   + v^T(k)B_ij^T(k)P(k + 1)B_ij(k)v(k)
   − 2M w_ij f_ij^T(k)P(k + 1)Γ x_ij(k)
   + 2f_ij^T(k)P(k + 1)B_ij(k)v(k)
   − 2M w_ij x_ij^T(k)Γ^T P(k + 1)B_ij(k)v(k)
   − x_ij^T(k)P(k)x_ij(k)
   + x_ij^T(k)E^T(k)E(k)x_ij(k)
   − (2γ²/(M(M − 1)))‖v(k)‖² }
= Σ_{1≤i<j≤M} E{ ξ_ij^T(k) Ξ̄_ij(k) ξ_ij(k) }   (28)

where ξ_ij(k) = [x_ij^T(k)  f_ij^T(k)  g_ij^T(k)  v^T(k)]^T and

Ξ̄_ij(k) = [ Ξ̄_ij^(1)(k)   −M w_ij Γ^T P(k + 1)   0          Ξ̄_ij^(3)(k)
             ∗              P(k + 1)              0          Ξ̄_ij^(4)(k)
             ∗              ∗                     P(k + 1)   0
             ∗              ∗                     ∗          Ξ̄_ij^(5)(k) ]

with Ξ̄_ij^(1)(k) = −M w_ij^(2) Γ^T P(k + 1)Γ − P(k) + E^T(k)E(k). Subsequently, using the notations in (27), we rewrite (4) as

[x_ij(k); f_ij(k)]^T [ Ũ_1(k)  Ũ_2(k); ∗  I ] [x_ij(k); f_ij(k)] ≤ 0.   (29)

Similarly, (5) can also be rewritten as

[x_ij(k); g_ij(k)]^T [ −V^T(k)V(k)  0; ∗  I ] [x_ij(k); g_ij(k)] ≤ 0.   (30)

Therefore, by noting (11), it follows from (28)–(30) that

E{V(k + 1, x(k + 1))} − E{V(k, x(k))} + Σ_{1≤i<j≤M} E{‖z_i(k) − z_j(k)‖²} − γ² E{‖v(k)‖²}
≤ Σ_{1≤i<j≤M} E{ ξ_ij^T(k) Ξ̄_ij(k) ξ_ij(k)
   − λ_1(k) [x_ij(k); f_ij(k)]^T [ Ũ_1(k)  Ũ_2(k); ∗  I ] [x_ij(k); f_ij(k)]
   − λ_2(k) [x_ij(k); g_ij(k)]^T [ −V^T(k)V(k)  0; ∗  I ] [x_ij(k); g_ij(k)] }
= Σ_{1≤i<j≤M} E{ ξ_ij^T(k) Ξ_ij(k) ξ_ij(k) }
≤ 0.   (31)

Summing up (31) from 0 to N with respect to k yields

Σ_{1≤i<j≤M} ‖z_i − z_j‖²_{[0,N]} ≤ γ²‖v‖²_{[0,N]} + E{ x^T(0) (U ⊗ P(0)) x(0) }.   (32)

By considering the initial condition (10), the inequality (8) follows from (32) immediately and, consequently, the proof of this theorem is complete.

APPENDIX II
PROOF OF THEOREM 2

Proof: Let the real-valued function be

V(k, e(k)) = e^T(k)P(k)e(k) + µ(k)‖v(k)‖²   (33)

where {P(k)}_{0≤k≤N+1} and {µ(k)}_{0≤k≤N+1} are the solutions to the RLMIs (20) with the initial condition (19). For notation simplicity, we denote
ζ(k) = [ e^T(k)  F̃^T(k, e(k))  v^T(k)  G^T(k, e(k) + x̂(k))  1 ]^T
A(k) = [ −K(k)C(k) + W ⊗ Γ   I   B(k) − K(k)D(k)   0   0 ]
H = [ 0  0  0  I  0 ].   (34)

Tedious but straightforward calculation shows that

E{V(k + 1, e(k + 1))} − E{V(k, e(k))} + E{‖z̃(k)‖²} − γ² E{‖v(k)‖²}
= E{ [A(k)ζ(k) + Hζ(k)ω(k)]^T P(k + 1) [A(k)ζ(k) + Hζ(k)ω(k)]
   − e^T(k)P(k)e(k) + e^T(k)E^T(k)E(k)e(k) − γ²v^T(k)v(k) + µ(k + 1) − µ(k) }
= E{ ζ^T(k) [ Φ_1(k) + A^T(k)P(k + 1)A(k) + H^T P(k + 1)H ] ζ(k) }   (35)

where

Φ_1(k) = [ −P(k) + E^T(k)E(k)   0   0      0   0
            ∗                   0   0      0   0
            ∗                   ∗   −γ²I   0   0
            ∗                   ∗   ∗      0   0
            ∗                   ∗   ∗      ∗   µ(k + 1) − µ(k) ].

From (4) and (5), we have

[ e(k); F̃(k, e(k)) ]^T [ Ũ_1(k)  Ũ_2(k); ∗  I ] [ e(k); F̃(k, e(k)) ] ≤ 0   (36)

and

[ e(k); G(k, e(k) + x̂(k)); 1 ]^T [ −V^T(k)V(k)  0  −V^T(k)V(k)x̂(k); ∗  I  0; ∗  ∗  −x̂^T(k)V^T(k)V(k)x̂(k) ] [ e(k); G(k, e(k) + x̂(k)); 1 ] ≤ 0   (37)

respectively. By considering (35) and (37), we can obtain

E{V(k + 1, e(k + 1))} − E{V(k, e(k))} + E{‖z̃(k)‖²} − γ² E{‖v(k)‖²}
≤ E{ ζ^T(k) [ Φ_1(k) + A^T(k)P(k + 1)A(k) + H^T P(k + 1)H ] ζ(k)
   − ε_1(k) [ e(k); F̃(k, e(k)) ]^T [ Ũ_1(k)  Ũ_2(k); ∗  I ] [ e(k); F̃(k, e(k)) ]
   − ε_2(k) [ e(k); G(k, e(k) + x̂(k)); 1 ]^T [ −V^T(k)V(k)  0  −V^T(k)V(k)x̂(k); ∗  I  0; ∗  ∗  −x̂^T(k)V^T(k)V(k)x̂(k) ] [ e(k); G(k, e(k) + x̂(k)); 1 ] }
= E{ ζ^T(k) [ Φ_2(k) + A^T(k)P(k + 1)A(k) + H^T P(k + 1)H ] ζ(k) }   (38)

where

Φ_2(k) = [ Φ̄_1(k)   −ε_1(k)Ũ_2(k)   0      0          ε_2(k)V^T(k)V(k)x̂(k)
            ∗         −ε_1(k)I        0      0          0
            ∗         ∗               −γ²I   0          0
            ∗         ∗               ∗      −ε_2(k)I   0
            ∗         ∗               ∗      ∗          Φ̄_4(k) ].

By using the Schur complement formula and noting (20), we can easily obtain from (38)

E{V(k + 1, e(k + 1))} − E{V(k, e(k))} + E{‖z̃(k)‖²} − γ² E{‖v(k)‖²} ≤ 0.   (39)
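The Schur complement step invoked here is the standard equivalence (stated for reference; our addition, not spelled out in the original):

```latex
\begin{bmatrix} A & B \\ B^{T} & C \end{bmatrix} \prec 0
\quad\Longleftrightarrow\quad
C \prec 0 \ \text{and}\ A - B\,C^{-1}B^{T} \prec 0 ,
```

which removes the quadratic terms A^T(k)P(k + 1)A(k) and H^T P(k + 1)H from (38) in favor of the linear conditions checked in (20).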
Then, the rest of the proof can be easily accomplished by following the methods used in the proof of Theorem 1 and is therefore omitted.

REFERENCES

[1] A. L. Barabasi and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, Oct. 1999.
[2] D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’ networks,” Nature, vol. 393, no. 6684, pp. 440–442, Jun. 1998.
[3] J. Jost and M. P. Joy, “Spectral properties and synchronization in coupled map lattices,” Phys. Rev. E, vol. 65, no. 1, pp. 061201-1–061201-9, Jan. 2002.
[4] J. Liang, Z. Wang, Y. Liu, and X. Liu, “Robust synchronization of an array of coupled stochastic discrete-time delayed neural networks,” IEEE Trans. Neural Netw., vol. 19, no. 11, pp. 1910–1921, Nov. 2008.
[5] W. Lu and T. Chen, “Synchronization of coupled connected neural networks with delays,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 12, pp. 2491–2503, Dec. 2004.
[6] W. Lu and T. Chen, “Global synchronization of discrete-time dynamical network with a directed graph,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 2, pp. 136–140, Feb. 2007.
[7] J. Lu and D. W. C. Ho, “Globally exponential synchronization and synchronizability for general dynamical networks,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 2, pp. 350–361, Apr. 2010.
[8] R. Palm, “Synchronization of decentralized multiple-model systems by market-based optimization,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 1, pp. 665–672, Feb. 2004.
[9] F. Souza and R. Palhares, “Synchronisation of chaotic delayed artificial neural networks: An H∞ control approach,” Int. J. Syst. Sci., vol. 40, no. 9, pp. 937–944, Sep. 2009.
[10] X. F. Wang and G. Chen, “Synchronization in small-world dynamical networks,” Int. J. Bifurc. Chaos, vol. 12, no. 1, pp. 187–192, 2002.
[11] V. Perez-Munuzuri, V. Perez-Villar, and L. O. Chua, “Autowaves for image processing on a 2-D CNN array of excitable nonlinear circuits: Flat and wrinkled labyrinths,” IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 40, no. 3, pp. 174–181, Mar. 1993.
[12] A. L. Zheleznyak and L. O. Chua, “Coexistence of low- and high-dimensional spatio-temporal chaos in a chain of dissipatively coupled Chua’s circuits,” Int. J. Bifurc. Chaos, vol. 4, no. 3, pp. 639–674, 1994.
[13] Z. Fei, H. Gao, and W. X. Zheng, “New synchronization stability of complex networks with an interval time-varying coupling delay,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 56, no. 6, pp. 499–503, Jun. 2009.
[14] H. Gao, J. Lam, and G. Chen, “New criteria for synchronization stability of general complex dynamical networks with coupling delays,” Phys. Lett. A, vol. 360, no. 2, pp. 263–273, Dec. 2006.
[15] X. Hu and J. Wang, “Design of general projection neural networks for solving monotone linear variational inequalities and linear and quadratic optimization problems,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 37, no. 5, pp. 1414–1421, Oct. 2007.
[16] H. R. Karimi and H. Gao, “New delay-dependent exponential H∞ synchronization for uncertain neural networks with mixed time delays,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 1, pp. 173–185, Feb. 2010.
[17] Z. Li and G. Chen, “Global synchronization and asymptotic stability of complex dynamical networks,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 53, no. 1, pp. 28–33, Jan. 2006.
[18] S. Mou, H. Gao, J. Lam, and W. Qiang, “A new criterion of delay-dependent asymptotic stability for Hopfield neural networks with time delay,” IEEE Trans. Neural Netw., vol. 19, no. 3, pp. 532–535, Mar. 2008.
[19] L. M. Pecora and T. L. Carroll, “Synchronization in chaotic systems,” Phys. Rev. Lett., vol. 64, no. 8, pp. 821–824, 1990.
[20] R. Yang, Z. Zhang, and P. Shi, “Exponential stability on stochastic neural networks with discrete interval and distributed delays,” IEEE Trans. Neural Netw., vol. 21, no. 1, pp. 169–175, Jan. 2010.
[21] R. Yang, H. Gao, and P. Shi, “Novel robust stability criteria for stochastic Hopfield neural networks with time delays,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 2, pp. 467–474, Apr. 2009.
[22] Z. Toroczkai, “Complex networks: The challenge of interaction topology,” Los Alamos Sci., vol. 29, pp. 94–109, 2005.
[23] J. Buhmann and K. Schulten, “Influence of noise on the function of a physiological neural network,” Biol. Cybern., vol. 56, nos. 5–6, pp. 313–327, 1987.
[24] K. Wood, C. Van den Broeck, R. Kawai, and K. Lindenberg, “Continuous and discontinuous phase transitions and partial synchronization in stochastic three-state oscillators,” Phys. Rev. E, vol. 76, no. 4, pp. 041132-1–041132-9, Oct. 2007.
[25] Z. Wang, D. W. C. Ho, Y. Liu, and X. Liu, “Robust H∞ control for a class of nonlinear discrete time-delay stochastic systems with missing measurements,” Automatica, vol. 45, no. 3, pp. 684–691, Mar. 2009.
[26] Z. Wang, Y. Wang, and Y. Liu, “Global synchronization for discrete-time stochastic complex networks with randomly occurred nonlinearities and mixed time delays,” IEEE Trans. Neural Netw., vol. 21, no. 1, pp. 11–25, Jan. 2010.
[27] H. Li and D. Yue, “Synchronization of Markovian jumping stochastic complex networks with distributed time delays and probabilistic interval discrete time-varying delays,” J. Phys. A: Math. Theor., vol. 43, no. 10, pp. 105101-1–105101-26, 2010.
[28] Y. Tang, J. Fang, M. Xia, and D. Yu, “Delay-distribution-dependent stability of stochastic discrete-time neural networks with randomly mixed time-varying delays,” Neurocomputing, vol. 72, nos. 16–18, pp. 3830–3838, Oct. 2009.
[29] J. Lü and G. Chen, “A time-varying complex dynamical network model and its controlled synchronization criteria,” IEEE Trans. Autom. Control, vol. 50, no. 6, pp. 841–846, Jun. 2005.
[30] W. Zhong, J. D. Stefanovski, G. M. Dimirovski, and J. Zhao, “Decentralized control and synchronization of time-varying complex dynamical network,” Kybernetika, vol. 45, no. 1, pp. 151–167, 2009.
[31] A. Coulon, O. Gandrillon, and G. Beslon, “On the spontaneous stochastic dynamics of a single gene: Complexity of the molecular interplay at the promoter,” BMC Syst. Biol., vol. 4, no. 2, pp. 1–18, Jan. 2010.
[32] Z. Wang, D. W. C. Ho, and X. Liu, “State estimation for delayed neural networks,” IEEE Trans. Neural Netw., vol. 16, no. 1, pp. 279–284, Jan. 2005.
[33] Y. He, Q. Wang, M. Wu, and C. Lin, “Delay-dependent state estimation for delayed neural networks,” IEEE Trans. Neural Netw., vol. 17, no. 4, pp. 1077–1081, Jul. 2006.
[34] Y. Liu, Z. Wang, and X. Liu, “Design of exponential state estimators for neural networks with mixed time delays,” Phys. Lett. A, vol. 364, no. 5, pp. 401–412, May 2007.
[35] Y. Liu, Z. Wang, J. Liang, and X. Liu, “Synchronization and state estimation for discrete-time complex networks with distributed delays,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 5, pp. 1314–1325, Oct. 2008.
[36] N. Berman and U. Shaked, “H∞ control for discrete-time nonlinear stochastic systems,” IEEE Trans. Autom. Control, vol. 51, no. 6, pp. 1041–1046, Jun. 2006.
[37] E. Gershon, A. Pila, and U. Shaked, “Difference LMIs for robust H∞ control and filtering,” in Proc. Eur. Control Conf., Porto, Portugal, 2001, pp. 3469–3474.
[38] E. Gershon, U. Shaked, and I. Yaesh, H∞ Control and Estimation of State-Multiplicative Linear Systems. New York: Springer-Verlag, 2005.
Bo Shen received the B.Sc. degree in mathematics from Northwestern Polytechnical University, Xi’an, China, in 2003. He is currently pursuing the Ph.D. degree in the School of Information Science and Technology, Donghua University, Shanghai, China. He is also now a visiting Ph.D. student in the Department of Information Systems and Computing, Brunel University, West London, U.K. He was a Research Assistant in the Department of Electrical and Electronic Engineering, University of Hong Kong, Hong Kong, China, from August 2009 to February 2010. His current research interests include nonlinear control and filtering, stochastic control and filtering, complex networks, and genetic regulatory networks. Dr. Shen is a very active reviewer for many international journals.
Zidong Wang (SM’03) was born in Jiangsu, China, in 1966. He received the B.Sc. degree in mathematics from Suzhou University, Suzhou, China, in 1986, and the M.Sc. degree in applied mathematics in 1990 and the Ph.D. degree in electrical and computer engineering in 1994, both from Nanjing University of Science and Technology, Nanjing, China. He is currently a Professor of Dynamical Systems and Computing at Brunel University, West London, U.K. He has published more than 120 papers in refereed international journals. His current research interests include dynamical systems, signal processing, bioinformatics, and control theory and applications. Prof. Wang is currently serving as an Associate Editor for 12 international journals, including the IEEE TRANSACTIONS ON AUTOMATIC CONTROL, the IEEE TRANSACTIONS ON NEURAL NETWORKS, the IEEE TRANSACTIONS ON SIGNAL PROCESSING, the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART C: APPLICATIONS AND REVIEWS, and the IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY.
Xiaohui Liu received the B.E. degree in computing from Hohai University, Nanjing, China, in 1982, and the Ph.D. degree in computer science from Heriot-Watt University, Edinburgh, U.K., in 1988. He is currently a Professor of Computing at Brunel University, West London, U.K. He leads the Intelligent Data Analysis (IDA) Group, performing interdisciplinary research involving artificial intelligence, dynamic systems, image and signal processing, and statistics, particularly for applications in biology, engineering, and medicine. Prof. Liu serves on the editorial boards of four computing journals, founded the biennial international conference series on IDA in 1995, and has given numerous invited talks at bioinformatics, data mining, and statistics conferences.
Brief Papers

Extended Input Space Support Vector Machine

Ricardo Santiago-Mozos, Member, IEEE, Fernando Pérez-Cruz, Senior Member, IEEE, and Antonio Artés-Rodríguez, Senior Member, IEEE
Abstract— In some applications, the probability of error of a given classifier is too high for its practical application, but we are allowed to gather more independent test samples from the same class to reduce the probability of error of the final decision. From the point of view of hypothesis testing, the solution is given by the Neyman–Pearson lemma. However, there is no equivalent result to the Neyman–Pearson lemma when the likelihoods are unknown, and we are given a training dataset. In this brief, we explore two alternatives. First, we combine the soft (probabilistic) outputs of a given classifier to produce a consensus labeling for K test samples. In the second approach, we build a new classifier that directly computes the label for K test samples. For this second approach, we need to define an extended input space training set and incorporate the known symmetries in the classifier. This latter approach gives more accurate results, as it only requires an accurate classification boundary, while the former needs an accurate posterior probability estimate for the whole input space. We illustrate our results with well-known databases. Index Terms— Classifier output combination, multiple sample classification, Neyman–Pearson, support vector machines.
I. INTRODUCTION

We are given a set of K samples, T = {x_1∗, …, x_K∗}, where x_j∗ ∈ R^d, and we are told that all of them belong to one of two possible alternatives. If the competing hypotheses are represented by their density functions, respectively, p_1(x) and p_−1(x), the most powerful test is given by the Neyman–Pearson lemma, which compares the product of the likelihood ratios for each x_j∗ to a threshold, which is determined by the size of the test [1]. The Type II error of the test decreases exponentially with K, and the error exponent is given by the Kullback–Leibler divergence between p_1(x) and p_−1(x) [2]. Therefore, the Neyman–Pearson lemma provides a tradeoff between the number of samples that we take before deciding and the achievable probability of error.

This problem is standard in many applications of simple hypothesis testing. For example, in radar [3], several samples are collected prior to declaring whether a target is present. In medical applications, a subject is tested several times before a disease can be diagnosed, because some tests are unreliable and present high false-positive and misdetection rates. A neuron collects several spikes [4] before it detects a certain pattern. Taking a decision with several samples allows the reduction of the probability of error (misdetections and false alarms) at the cost of gathering more information and/or waiting longer. In all these applications, the samples are known to come from the same class and are gathered to increase reliability.

In classification problems, the likelihoods p_1(x) and p_−1(x) are unknown and we are given a set of samples (i.e., the training set) that describes each likelihood. If p_1(x) and p_−1(x) are known to belong to a parametric family, we could estimate those parameters and apply the likelihood ratio to decide the best hypothesis. Nevertheless, for these estimates the properties described by the Neyman–Pearson lemma do not hold. And more often than not, p_1(x) and p_−1(x) cannot be described by known parametric distributions, and we would need to resort to nonparametric estimation methods. In any case, if we want to classify T into two alternative hypotheses, we would be ill-advised to estimate p_1(x) and p_−1(x) and apply a product of likelihood ratios test (because it is an ill-posed problem [5]) instead of directly building a classifier from the training set, which assigns a label to our test data.

Manuscript received September 14, 2009; revised July 29, 2010 and October 26, 2010; accepted October 27, 2010. Date of publication November 18, 2010; date of current version January 4, 2011. This work was supported in part by Ministerio de Educación of Spain under projects DEIPRO TEC2009-14504C02-01 and COMONSENS CSD2008-00010. F. Pérez-Cruz was supported in part by Marie Curie Fellowship 040883-AI-COM. R. Santiago-Mozos was supported in part by a Marie Curie Transfer of Knowledge Fellowship of the EU Sixth Framework Programme under contract CT-2005-029611 and by Fundación Española para la Ciencia y la Tecnología, Ministerio de Educación of Spain.
R. Santiago-Mozos is with the College of Engineering and Informatics, National University of Ireland Galway, Galway, Ireland (e-mail: [email protected]; [email protected]).
F. Pérez-Cruz and A. Artés-Rodríguez are with the Signal Theory and Communication Department, Universidad Carlos III de Madrid, Madrid 28903, Spain (e-mail: [email protected]; [email protected]).
Color versions of one or more of the figures in this brief are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNN.2010.2090668
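When the likelihoods are known, the Neyman–Pearson product test described above is straightforward to carry out. A minimal sketch with two one-dimensional Gaussian hypotheses (the densities, means, and threshold are our own illustrative choices, not from the brief):

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def np_decision(samples, threshold=1.0):
    """Neyman-Pearson test for K samples known to share one class:
    compare the product of likelihood ratios p1(x)/p-1(x) to a threshold
    (computed in log-space for numerical stability)."""
    log_lr = sum(
        math.log(gauss_pdf(x, mu=1.0, sigma=1.0))
        - math.log(gauss_pdf(x, mu=-1.0, sigma=1.0))
        for x in samples
    )
    return 1 if log_lr > math.log(threshold) else -1

label = np_decision([0.9, 1.2, 0.7])
```

For these two unit-variance Gaussians, the per-sample log-likelihood ratio reduces to 2x, so the test is simply a threshold on the sample sum; the point of the brief is precisely that no such closed form is available when only a training set is given.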
When the likelihoods are unknown and we are given a training dataset, there is no equivalent result to the Neyman– Pearson lemma, which tells us how to take a unique decision for K test samples that are known to come from the same class, because they have been gathered that way. Two possible alternatives come to mind. First, train any classifier and combine its outputs for each of the K test samples to come up with a unique decision for these samples. We refer to this solution as the consensus decision and we explore it in Section II of this brief. Second, we propose to build a classifier that directly classifies the K test samples belonging to the same class. This direct option works with an extended input space that takes K samples at once and trains the desired classifier. In order to do this, we need to transform the original d-dimensional input space into a K d-dimensional space that represents the same problem and build the classifier with it. We refer to this solution as the direct decision and we explore it in Section III.
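The consensus alternative can be sketched as follows (our own illustration; a trained soft-output classifier is assumed and is represented here simply by its per-sample posterior estimates p(y = 1|x_k)). It multiplies the per-sample posteriors and divides by the prior raised to the (K − 1)th power, which is the rule Section II derives:

```python
def consensus_posterior(posteriors_pos, prior_pos=0.5):
    """Combine per-sample posterior estimates p(y=1|x_k) for K test samples
    that are known to share one label. Returns p(y = 1 | x_1, ..., x_K)."""
    K = len(posteriors_pos)
    prior_neg = 1.0 - prior_pos
    score_pos, score_neg = 1.0, 1.0
    for p in posteriors_pos:
        score_pos *= p            # product of p(y=1|x_k)
        score_neg *= (1.0 - p)    # product of p(y=-1|x_k)
    score_pos /= prior_pos ** (K - 1)   # divide by prior^(K-1)
    score_neg /= prior_neg ** (K - 1)
    return score_pos / (score_pos + score_neg)

def consensus_label(posteriors_pos, prior_pos=0.5):
    """Consensus decision: label +1 if the combined posterior exceeds 1/2."""
    return 1 if consensus_posterior(posteriors_pos, prior_pos) > 0.5 else -1
```

Note that the accuracy of this rule hinges on the posterior estimates being well calibrated over their whole range, which is exactly the weakness discussed below.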
1045–9227/$26.00 © 2010 IEEE
We need to exploit the structure embedded in our problem to obtain a reduced-complexity, high-performance classifier. The set T of test samples is not ordered, and any ordering should produce the same outcome, as all the samples come from the same class. To exploit this symmetry in the extended input space, we have decided to adapt a support vector machine (SVM) [6] to this framework. SVMs are state-of-the-art, versatile nonlinear classifiers that are well understood and easy to train, although our procedure also holds for most classification algorithms of interest. We refer to our extended input space SVM as ESVM; it takes the set T and gives a binary decision. The consensus and direct decisions present different pros and cons, which make each technique desirable in different scenarios. The consensus decision needs accurate posterior probability estimates over the whole range (which might be hard to come by for high-confidence decisions); however, it only needs to work with an input space of dimension d. The direct decision, on the one hand, operates with an input space of dimension Kd that has been extended artificially without having more training samples to build the classifier, so it might be more exposed to the curse of dimensionality. On the other hand, the direct decision only needs to return a decision, not a posterior probability estimate. Also, the consensus decision can be used for any value of K, while in the direct decision K is fixed in advance. It is not clear-cut which algorithm is preferable, although in the experiments carried out it seems that the direct approach might be preferable (lower probability of error). The rest of this brief is organized as follows. In Section II, we establish how to combine the posterior probability estimates from each test sample to reach a consensus label. We present the direct decision in Section III and the extended input space SVM in Section IV.
We introduce an illustrative example in Section V together with the performance of the proposed extended input space SVM on well-known databases. We conclude in Section VI.

II. CONSENSUS DECISION

We want to classify T into two possible alternatives, and we have a soft-output classifier whose output can be interpreted as a posterior probability estimate, i.e.,

p(y∗|x∗),  y∗ ∈ {±1}.   (1)

We could be tempted to compute the posterior probability estimate as

p(y∗ = 1|T) = [∏_{k=1}^{K} p(y_k∗ = 1|x_k∗)] / [∏_{k=1}^{K} p(y_k∗ = 1|x_k∗) + ∏_{k=1}^{K} p(y_k∗ = −1|x_k∗)]   (2)

and decide accordingly. But (2) relies on each of the test samples being independent, and they are not, because all the test samples belong to the same class. To assign a consensus label to the test set with K samples, we proceed as follows:

p(y∗ = 1|T) = p(y∗ = 1) p(x_1∗, …, x_K∗ | y∗ = 1) / p(x_1∗, …, x_K∗)
            = p(y∗ = 1) ∏_k p(x_k∗ | y∗ = 1) / p(x_1∗, …, x_K∗)
            = p(y∗ = 1) ∏_k [p(y∗ = 1|x_k∗) p(x_k∗) / p(y∗ = 1)] / p(x_1∗, …, x_K∗)
            = [∏_k p(y∗ = 1|x_k∗) / p(y∗ = 1)^{K−1}] / [∏_k p(y∗ = 1|x_k∗) / p(y∗ = 1)^{K−1} + ∏_k p(y∗ = −1|x_k∗) / p(y∗ = −1)^{K−1}].   (3)

To obtain the second equality, we have applied the fact that the K test samples are independent given the label, and for the third equality we have applied Bayes' rule, as we did for the first. To decide for either hypothesis, we need to multiply the posterior probability estimates for each sample and divide by the prior probabilities raised to the (K − 1)th power. We assign the consensus label to the K-sample test set by the following rule:

y∗ = 1 if ∏_k p(y∗ = 1|x_k∗) / p(y∗ = 1)^{K−1} > ∏_k p(y∗ = −1|x_k∗) / p(y∗ = −1)^{K−1}, and y∗ = −1 otherwise.   (4)

III. DIRECT DECISION

In this section, we describe an algorithm for extending the training set to a Kd-dimensional space, in which we can classify the K test samples directly. Given a training set D = {x_1^1, …, x_{n+}^1, x_1^{−1}, …, x_{n−}^{−1}}, where x_i^c ∈ R^d and c ∈ {±1} denotes the class label, we define the initial extended input space training set for class +1 as

Z_o^1 = [z_1^1  z_2^1  ⋯  z_{n̄+}^1] = [ x_{ℓ11}^1  x_{ℓ12}^1  ⋯  x_{ℓ1n̄+}^1
                                        x_{ℓ21}^1  x_{ℓ22}^1  ⋯  x_{ℓ2n̄+}^1
                                           ⋮          ⋮      ⋱       ⋮
                                        x_{ℓK1}^1  x_{ℓK2}^1  ⋯  x_{ℓKn̄+}^1 ]   (5)
which is a K d × n¯ + matrix containing in its columns the samples of the extended input space with n¯ + = ⌊n + /K ⌋. The indices li j are samples without replacement for 1, . . . , n + , consequently, in each column of Z1o we have an independent representation of the training set. We can similarly obtain an extended input space training set for class −1, namely Z−1 o . Building the training set this way presents a severe limitation, as it reduces the number of training samples by a factor of K and makes the direct classifier harder to train. But there are symmetries in this problem that can be readily included in the training set that increases the number of training samples. These symmetries additionally impose constraints in the optimization problem, which simplifies its training procedure, as we explain in the next section. Let us use a representative example to illustrate this point. Suppose that K = 2 and n + = 10, a possible Z1o might be x1 x1 x1 x1 x1 1 1 1 1 1 1 4 10 8 6 . Zo = z1 z2 z3 z4 z5 = 11 (6) x7 x21 x51 x31 x91
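As a minimal sketch of the construction in (5) and (6) (the NumPy implementation, function name, and toy data below are our own assumptions, not from the brief), the extended training set for one class can be built by sampling indices without replacement and stacking K samples per column:

```python
import numpy as np

def build_extended_set(X, K, rng=None):
    """Stack K distinct samples of one class into each extended column.

    X is an (n, d) array of samples from a single class; the result has
    floor(n / K) columns, each a K*d-dimensional stacked sample, with the
    indices l_ij drawn without replacement as in (5).
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    n_bar = n // K
    idx = rng.permutation(n)[: n_bar * K].reshape(K, n_bar)
    # Column j stacks X[idx[0, j]], ..., X[idx[K-1, j]] into one K*d vector.
    return np.concatenate([X[idx[k]] for k in range(K)], axis=1).T  # (K*d, n_bar)

X_pos = np.arange(20.0).reshape(10, 2)   # hypothetical class-(+1) data, n+ = 10, d = 2
Z_o = build_extended_set(X_pos, K=2, rng=0)
print(Z_o.shape)  # (4, 5): K*d rows, floor(n+/K) columns, as in (6)
```

Since n+ is a multiple of K here, every original sample appears exactly once across the columns, mirroring the sampling-without-replacement construction.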
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 1, JANUARY 2011
For example, in this set we have the extended sample z_1^1 = [x_1^1; x_7^1], and we should expect the sample [x_7^1; x_1^1] to present the same label, because the ordering of the samples should not matter. Therefore, we can extend the initial training set by including this permutation:

    Z_p^1 = [z_{11}^1 z_{21}^1 z_{31}^1 z_{41}^1 z_{51}^1 z_{12}^1 z_{22}^1 z_{32}^1 z_{42}^1 z_{52}^1]
          = ⎡ x_1^1  x_4^1  x_{10}^1  x_8^1  x_6^1  x_7^1  x_2^1  x_5^1    x_3^1  x_9^1 ⎤        (7)
            ⎣ x_7^1  x_2^1  x_5^1    x_3^1  x_9^1  x_1^1  x_4^1  x_{10}^1  x_8^1  x_6^1 ⎦
In this notation, the first subindex in z_{iℓ} identifies the sample and the second the permutation. This procedure can be applied for any number of training samples n+ and n− and any number of test samples K. For any K, Z_p^1 and Z_p^{−1} contain, respectively, n̄+K! and n̄−K! columns of dimension Kd, as we need to incorporate all possible permutations of z_i^1. Finally, to further increase the sample diversity of the training procedure, we randomly create R samples of Z_o^1 and Z_o^{−1} from D. Then, we build the final training matrix for the direct classifier as follows:

    Z = [Z^1 Z^{−1}] = [Z_{p1}^1 Z_{p2}^1 ... Z_{pR}^1 Z_{p1}^{−1} Z_{p2}^{−1} ... Z_{pR}^{−1}].        (8)

IV. EXTENDED INPUT SPACE SVM
Now we can use the matrix Z in (8) to train any binary nonlinear classifier of our liking, and we can then classify K test samples directly. But by doing so, we would be ignoring the symmetries that we have included in this extended training set, from which the learning machine should benefit. Adding these symmetries to the learning procedure is classifier dependent. Therefore, we operate from now on with SVMs, although, as mentioned earlier, similar steps can be applied to most classifiers of interest. We have chosen SVMs because they are state-of-the-art learning machines that can be easily trained and have been used in many different applications with considerable success. In what follows, we assume the reader is already familiar with soft-margin nonlinear SVMs and their primal and dual representations; in any case, a detailed presentation can be found in [6], [7]. The SVM primal optimization functional solves

    min_{w,b,ξ_i}  (1/2)||w||² + C Σ_i ξ_i                                          (9)

subject to

    y_i (w^⊤ φ(z_{iℓ}) + b) ≥ 1 − ξ_i                                               (10)
    ξ_i ≥ 0                                                                          (11)

where we have removed the superindex in z_{iℓ}, which determines the class label, and have replaced it by y_i, which also takes the values +1 or −1, depending on the class label.¹ In (10) and (11), i = 1, ..., N, with N = R(n̄+ + n̄−), and ℓ = 1, ..., K! runs over all the possible permutations of any training sample. The function φ(·), which can be characterized by its kernel k_φ(·,·) = φ(·)^⊤φ(·), is a nonlinear mapping of the input to a higher dimensional space [6]. In the standard SVM formulation, the slack variables ξ_i and the class label y_i would also be indexed by ℓ, because there can be a nonzero slack for each training sample, and each sample has its own class label. But by construction, we have included all the symmetries in the training set, so y_i is identical for any ℓ and w^⊤φ(z_{iℓ}) is independent of ℓ. To clearly understand this last point, see the example of a training set in (7), in which the training set presents all permutations of each training sample; consequently, given the symmetries, the learning machine has no information to provide different outputs for each permutation. The Lagrangian for this problem becomes

    L(w, b, ξ_i, α_i, η_i) = (1/2)||w||² + C Σ_i ξ_i − Σ_i η_i ξ_i − Σ_i α_i [ y_i ( w^⊤ Σ_ℓ φ(z_{iℓ}) + b ) − 1 + ξ_i ]        (12)

which has to be minimized with respect to the primal variables (w, b, and ξ_i) and maximized with respect to the dual variables (α_i and η_i). We can now compute the Karush–Kuhn–Tucker conditions [8]

    ∂L/∂w = w − Σ_i α_i y_i ϕ(z_i) = 0                                              (13)
    ∂L/∂b = Σ_i α_i y_i = 0                                                         (14)
    ∂L/∂ξ_i = C − η_i − α_i = 0                                                     (15)

where we have defined

    ϕ(z_i) = Σ_ℓ φ(z_{iℓ})                                                          (16)

with z_i being any z_{iℓ}. We can use standard optimization tools to obtain the dual formulation

    max_{α_i}  Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j k_ϕ(z_i, z_j)        (17)

subject to (14) and 0 ≤ α_i ≤ C. The kernel is given by

    k_ϕ(z_i, z_j) = ϕ(z_i)^⊤ ϕ(z_j) = Σ_{ℓ1=1}^{K!} Σ_{ℓ2=1}^{K!} φ(z_{iℓ1})^⊤ φ(z_{jℓ2})
                  = Σ_{ℓ1=1}^{K!} Σ_{ℓ2=1}^{K!} k_φ(z_{iℓ1}, z_{jℓ2}) = K! Σ_{ℓ=1}^{K!} k_φ(z_i, z_{jℓ}).        (18)

¹We have modified the notation in this section to make it compatible with standard SVM notation, as it makes this section easier to understand.
This nonlinear transformation is quite interesting because it embeds the symmetry directly in the dual: we train with all the possible symmetries of each training sample without needing to add a support vector for each new symmetric sample, and we use a single nonlinear transformation.
For large K , the computational complexity can be quite high and we use LibSVM [9] to train the SVM. There are other approaches that can train the SVM with over a million training samples [10], [11] and that can also be applied to solve the ESVM.
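The symmetrized kernel in (18) can be sketched as follows (our own illustration; the RBF base kernel, function names, and toy vectors are assumptions, not from the brief). The Gram matrix of this kernel is what a precomputed-kernel SVM trainer such as LibSVM would then consume:

```python
from itertools import permutations
from math import factorial

import numpy as np

def rbf(u, v, gamma=0.5):
    # Base kernel k_phi on the extended K*d-dimensional space.
    return np.exp(-gamma * np.sum((np.asarray(u) - np.asarray(v)) ** 2))

def symmetrized_kernel(zi, zj, K, d, gamma=0.5):
    """k_varphi(z_i, z_j) = K! * sum_l k_phi(z_i, z_{jl}), as in (18):
    the sum runs over the K! permutations of z_j's K blocks of dimension d."""
    blocks = np.asarray(zj).reshape(K, d)
    total = sum(rbf(zi, blocks[list(p)].reshape(-1), gamma)
                for p in permutations(range(K)))
    return factorial(K) * total

# The symmetrized kernel is invariant to permuting the blocks of either
# argument, so no extra support vectors are needed per permutation.
z1 = np.array([0.0, 1.0, 2.0, 3.0])          # K = 2 blocks of d = 2
z2 = np.array([4.0, 5.0, 6.0, 7.0])
z2_swapped = np.array([6.0, 7.0, 4.0, 5.0])  # blocks of z2 exchanged
k_a = symmetrized_kernel(z1, z2, K=2, d=2)
k_b = symmetrized_kernel(z1, z2_swapped, K=2, d=2)
print(np.isclose(k_a, k_b))  # True: the permutation symmetry is built in
```

Because only K! base-kernel evaluations are needed per pair (rather than (K!)² support vectors), the permutation symmetry comes at the cost of a single kernel evaluation loop, which is what makes large K expensive but tractable.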
V. EXPERIMENTS

We first show a toy 1-D example, in which we illustrate the benefits of classifying K samples directly, instead of combining the individual decisions. We then move on to classify different well-known databases, which allows the drawing of some general conclusions about the performance of our extended input space SVM. In the figures and tables, we denote by SVM_K the solution reached by consensus, combining K SVM outputs. To obtain the SVM_K solution, we have used (4) and have transformed the SVM soft outputs into posterior probability estimates using Platt's method [12]. We use LibSVM [9] to train the SVM, and the implementation of Platt's method is given in [13]. We denote by ESVM_K the solution of the extended input space SVM with K samples. We also use LibSVM to train the ESVM. In all the experiments, the hyperparameters of the SVM and ESVM have been computed by cross-validation, and we have used a radial basis function kernel. We have set N = Kn (i.e., R = N/(n̄+ + n̄−)), where n is the number of training samples in the original database. We have found empirically that increasing the number of training samples makes the ESVM_K predictions more stable. In any case, the information in the training sets for the SVM and ESVM is the same.

A. Toy Example

We are given n samples generated from a zero-mean unit-variance Gaussian that describes the class +1, and for the class −1 we are also given n samples from a unit-mean Gaussian with variance 4. We first train a nonlinear SVM with 50 samples from each class. We show in Fig. 1(a) the posterior estimate given by the SVM using Platt's method (dashed line) and the true posterior probability (solid line). As we mentioned in the introduction, we notice that the SVM accurately predicts whether a sample is correctly classified, because, if we threshold the decisions at 0.5 in Fig. 1(a), we can see that the SVM predictions and the true posterior decisions almost coincide. But the SVM posterior probability estimates are far from accurate when we are away from the classification boundary, and it does not accurately estimate extreme posterior probabilities [e.g., x > 4 or x < −4 in Fig. 1(a)]. We have plotted in Fig. 1(b) the Bayes (solid), the SVM2 (dashed), and the ESVM2 (dash-dotted) decision boundaries. We see that the ESVM2 is closer to the optimal decision function and does not have the artifacts that the SVM2 presents, due to its inaccurate posterior probability estimates.

Fig. 1. (a) True posterior probability and its estimate using an SVM with 50 training samples and Platt's method. (b) Optimal decision boundary together with the SVM2 and ESVM2 classification functions.

In Fig. 2(a) and (b), we show the probability of error as a function of K for 20 and 50 training samples per class, respectively. (For this experiment, we did not impose the symmetries in the kernel and we only train with the
extended input space training sets Z_o^1 and Z_o^{−1}, as K! grows exponentially with K.) To obtain these plots, we have run 10^4 independent trials with 10^5 test samples. The probability of error reduces as we increase K. In the figures, we also see that the performance gap between the SVM_K and the ESVM_K increases with K. This is an expected result, because as K increases the inaccuracies in the SVM posterior probability estimate are more noticeable when we compute the consensus label for the test set. The ESVM_K only focuses on the decision boundary, and it does not need to give accurate soft outputs. It is also noteworthy that the probability of error reduces even when K becomes larger than the number of training samples per class. It can be seen that for larger K, the ESVM_K and SVM_K solutions are better and closer to each other and to the Bayes solution, but still the ESVM_K outperforms the SVM_K for all K. Finally, in Table I, we show the performance of the SVM2 and ESVM2 as we increase the number of training samples
Fig. 2. Error rate for the toy example as a function of K for (a) 40 and (b) 100 training samples.

TABLE I
PROBABILITY OF ERROR OF SVM2 AND ESVM2, AS THE NUMBER OF TRAINING SAMPLES PER CLASS IS INCREASED, TOGETHER WITH THEIR STANDARD DEVIATION

# samples   20              50              200
SVM2        0.333 ± 0.154   0.249 ± 0.030   0.240 ± 0.020
ESVM2       0.264 ± 0.040   0.237 ± 0.016   0.225 ± 0.006

TABLE II
SUMMARY OF THE 13 DATABASES USED IN THIS SECOND EXPERIMENT: THEIR NAME, DIMENSION, NUMBER OF TRAINING AND TEST PATTERNS, AND THE SVM PREDICTION ERROR

Name           Dim   # Train   # Test   SVM
Titanic         3      150      2051    2.28e-1 ± 1.2e-2
Flare-solar     9      666       400    3.23e-1 ± 1.8e-2
Banana          2      400      4900    1.09e-1 ± 5.6e-3
Breast-cancer   9      200        77    2.52e-1 ± 4.5e-2
Diabetes        8      468       300    2.32e-1 ± 1.7e-2
Waveform       21      400      4600    9.80e-2 ± 4.4e-3
Ringnorm       20      400      7000    1.50e-2 ± 9.5e-4
Twonorm        20      400      7000    2.43e-2 ± 1.4e-3
Thyroid         5      140        75    4.62e-2 ± 2.1e-2
German         20      700       300    2.41e-1 ± 2.2e-2
Heart          13      170       100    1.55e-1 ± 3.4e-2
Splice         60     1000      2175    1.08e-1 ± 7.4e-3
Image          18     1300      1010    3.24e-2 ± 6.1e-3

TABLE III
SVM3 SOLUTION COMPARED WITH THE ESVM3 FOR 13 DATABASES. THE HSVM3 SOLUTION SHOWS THE SVM3 PERFORMANCE WITH HARD OUTPUTS

Name           SVM3               ESVM3              HSVM3
Titanic        1.85e-1 ± 2.2e-2   1.36e-1 ± 2.3e-2   1.88e-1 ± 2.3e-2
Flare-solar    2.31e-1 ± 3.7e-2   1.87e-1 ± 3.0e-2   2.46e-1 ± 3.8e-2
Banana         1.78e-2 ± 3.9e-3   1.53e-2 ± 3.3e-3   3.53e-2 ± 4.3e-3
Breast-cancer  1.89e-1 ± 6.8e-2   1.80e-1 ± 6.5e-2   2.46e-1 ± 6.0e-2
Diabetes       1.23e-1 ± 3.1e-2   1.18e-1 ± 2.8e-2   1.74e-1 ± 3.1e-2
Waveform       1.15e-2 ± 2.9e-3   1.04e-2 ± 2.8e-3   3.05e-2 ± 5.0e-3
Ringnorm       3.52e-4 ± 4.7e-4   5.14e-5 ± 1.4e-4   8.75e-4 ± 5.8e-4
Twonorm        3.64e-4 ± 3.6e-4   3.04e-4 ± 3.1e-4   1.83e-3 ± 7.9e-4
Thyroid        4.00e-4 ± 4.0e-3   4.17e-4 ± 4.2e-3   1.03e-2 ± 1.8e-2
German         1.46e-1 ± 3.1e-2   1.47e-1 ± 3.4e-2   1.99e-1 ± 3.7e-2
Heart          5.28e-2 ± 3.7e-2   5.80e-2 ± 4.0e-2   7.13e-2 ± 4.0e-2
Splice         1.65e-2 ± 5.0e-3   2.04e-2 ± 4.5e-3   3.17e-2 ± 6.7e-3
Image          1.04e-3 ± 1.7e-3   4.17e-3 ± 4.6e-3   2.68e-3 ± 2.9e-3
per class. Both methods converge to the same solution, which corresponds to the Bayes classifier, whose error is 0.22 for this problem with K = 2. The results in this table have been obtained with 10^4 independent trials with 10^5 test samples.

B. Real Databases

We have also carried out experiments with the 13 databases in [14]. Each database has been preprocessed to present zero mean and unit variance, and 100 training and test sample sets have been generated, except for Splice and Image, which
only use 20. In Table II, we present the databases and some of their key features, together with the best SVM solution. For all the experiments in this section, we have set K = 3, although the results can be readily extended to larger values of K. To build the test set for the extended input space with K = 3, we first split the test database into two parts, one for the class +1 samples and the other for the class −1 samples. We then take three consecutive examples from each part without replacement until all the samples have been used. To compute the prior probabilities for the consensus decision, we use the relative frequencies in the training set. In Table III, we report the probability of error for SVM3 and ESVM3. In Table IV, we compare with two statistics the errors in Table III to measure the difference between SVM3 and ESVM3, and we report whether these differences are statistically significant. The first one is the classic t-test [15] and the second one is a more conservative corrected resampled t-test [16]. We have used boldface to denote that ESVM3 is better than
SVM3, and we have used italics when SVM3 is better than ESVM3. For the t-test, there are seven databases in which ESVM3 is superior to SVM3 and four otherwise. For the more conservative test, only six databases pass the statistically significant threshold. This is an expected result, as the ESVM has some limitations and it should not always be better than SVM_K, but it is clear that in some cases it is much better, and for some problems it might be the way to improve the solution as we gather more examples. In Table III, we also report the SVM3 performance with hard outputs, denoted by HSVM3. These results show that, even though Platt's method is inaccurate for short training sequences, it is better than not using the SVM soft output at all. Also, if we compare the results for the SVM in Table II and the ESVM3 or SVM3 in Table III, we can see a significant gain in all cases. Either of the proposed methods would improve the performance of the SVM, if we can gather more independent samples. Finally, we show the probability of error of six representative databases for ESVM3 in Fig. 3 when we increase the number of training samples of the extended input space. We have generated training sets with 0.25n, 0.5n, n, 2n, 4n, 8n, and 16n samples, where n is the number of training patterns in Table II. We notice that once we use n training samples, there is little improvement, as the amount of information to learn the classifier is limited by n, not by the number of repetitions that we use.

TABLE IV
TWO STATISTICS COMPUTED TO COMPARE WHETHER THE DIFFERENCES BETWEEN ESVM_K AND SVM_K ARE SIGNIFICANT. BOLDFACE VALUES INDICATE ESVM_K IS SUPERIOR AND ITALICS ARE USED OTHERWISE

Name           t-Test in [15]   Test in [16]
Titanic        5.26e-200        5.30e-101
Flare-solar    2.23e-195        2.00e-096
Banana         7.25e-073        1.84e-006
Breast-cancer  3.31e-127        1.83e-033
Diabetes       4.49e-102        5.28e-017
Waveform       1.35e-040        0.0279
Ringnorm       1.80e-008        0.543
Twonorm        0.224            0.903
Thyroid        0.730            0.973
German         3.21e-037        0.0451
Heart          9.58e-104        6.90e-018
Splice         7.53e-019        2.62e-007
Image          4.57e-017        5.55e-006

Fig. 3. Probability of error as we artificially increase the extended input space training set for six representative databases.

VI. CONCLUSION

When the likelihoods are unknown and we are given a training dataset, there is no result equivalent to the Neyman–Pearson lemma that tells us how to make a unique decision for K test samples. We explored two alternatives to solve this problem. The consensus decision takes the posterior probability estimates to predict a single label for the set T, and the direct decision builds a classifier that classifies T in one take. We have also shown how the symmetries of the extended input space can be added to SVMs to give more accurate and reduced-complexity classifiers.

REFERENCES
[1] L. Wasserman, All of Statistics. New York: Springer-Verlag, 2004.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[3] H. V. Poor, An Introduction to Signal Detection and Estimation, 2nd ed. New York: Springer-Verlag, 1994.
[4] P. Dayan and L. F. Abbott, Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. Cambridge, MA: MIT Press, 2005.
[5] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[6] B. Schölkopf and A. Smola, Learning with Kernels. Cambridge, MA: MIT Press, 2001.
[7] F. Perez-Cruz and O. Bousquet, "Kernel methods and their potential use in signal processing," IEEE Signal Process. Mag., vol. 21, no. 3, pp. 57–65, May 2004.
[8] R. Fletcher, Practical Methods of Optimization, 2nd ed. New York: Wiley, 1987.
[9] R.-E. Fan, P.-H. Chen, and C.-J. Lin, "Working set selection using the second order information for training SVM," J. Mach. Learn. Res., vol. 6, pp. 1889–1918, Dec. 2005.
[10] T. Joachims, "Training linear SVMs in linear time," in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Philadelphia, PA, 2006, pp. 217–226.
[11] S. S. Keerthi and D. DeCoste, "A modified finite Newton method for fast solution of large scale linear SVMs," J. Mach. Learn. Res., vol. 6, pp. 341–361, Dec. 2005.
[12] J. C. Platt, "Probabilities for SV machines," in Advances in Large Margin Classifiers, A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, Eds. Cambridge, MA: MIT Press, 2000, pp. 61–73.
[13] H. Lin, C.-J. Lin, and R. C. Weng, "A note on Platt's probabilistic outputs for support vector machines," Mach. Learn., vol. 68, no. 3, pp. 267–276, Oct. 2007.
[14] G. Rätsch, B. Schölkopf, A. J. Smola, S. Mika, T. Onoda, and K.-R. Müller, "Robust ensemble learning," in Advances in Large Margin Classifiers, A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, Eds. Cambridge, MA: MIT Press, 2000, pp. 207–220.
[15] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. San Mateo, CA: Morgan Kaufmann, 2005.
[16] C. Nadeau and Y. Bengio, "Inference for the generalization error," Mach. Learn., vol. 52, no. 3, pp. 239–281, Sep. 2003.
Robust Stability Criterion for Discrete-Time Uncertain Markovian Jumping Neural Networks with Defective Statistics of Modes Transitions Ye Zhao, Lixian Zhang, Shen Shen, and Huijun Gao
Abstract— This brief is concerned with the robust stability problem for a class of discrete-time uncertain Markovian jumping neural networks with defective statistics of modes transitions. The parameter uncertainties are considered to be norm-bounded, and the stochastic perturbations are described in terms of Brownian motion. Defective statistics means that the transition probabilities of the multimode neural networks are not exactly known, as is usually assumed. This scenario is more practical, and such defective transition probabilities comprise three types: known, uncertain, and unknown. By invoking the property of the transition probability matrix and the convexity of uncertain domains, a sufficient stability criterion for the underlying system is derived. Furthermore, a monotonicity is observed in the maximum value of a given scalar, which bounds the stochastic perturbation that the system can tolerate, as the level of the defectiveness varies. Numerical examples are given to verify the effectiveness of the developed results.

Index Terms— Markovian jumping neural network, stability, transition probability matrix.
Manuscript received June 2, 2010; accepted October 31, 2010. Date of publication December 3, 2010; date of current version January 4, 2011. This work was supported in part by the Open Project of State Key Laboratory of Urban Water Resource and Environment, Harbin Institute of Technology under Project QAK201008, the National Natural Science Foundation of China under Grant 60904001, the Foundation of Science and Technology Innovative Talents of Harbin City under Project 2010RFLXS007, the Outstanding Youth Science Fund of China under Grant 60825303, 973 Project 2009CB320600 in China, and the Key Laboratory of Integrated Automation for the Process Industry (Northeastern University), Ministry of Education. L. Zhang and H. Gao are with the State Key Laboratory of Urban Water Resources and Environment, Harbin Institute of Technology (HIT), Harbin 150090, China, and also with the Space Control and Inertial Technology Research Center, HIT, Harbin 150080, China (e-mail: [email protected]; [email protected]). Y. Zhao and S. Shen are with the Space Control and Inertial Technology Research Center, Harbin Institute of Technology, Harbin 150080, China (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TNN.2010.2093151

I. INTRODUCTION

The past decades have witnessed extensive research on neural networks (NNs) in both the mathematics and control communities, e.g., [1] and [2]. These studies are motivated by numerous applications of NNs in diverse fields such as associative memory, pattern recognition, image processing, etc. As a major concern, the stability problem of NNs has drawn much attention, and a great number of efficient analysis approaches have been proposed in the literature, e.g., [3] and [4]. Meanwhile, for NNs involving parameter uncertainties and/or stochastic perturbations, which frequently lead to poor performance or even instability of the system, the corresponding stability analyses have also been widely investigated and many useful results have been obtained, e.g., [5] and [6] and the references therein. Moreover, NNs often display a feature of network mode jumpings, and such jumpings are commonly considered to be determined by an ideal homogeneous Markov chain in most of the literature. With the aid of analysis and synthesis methodologies for dynamic systems with Markovian jumping parameters, i.e., Markov jump linear systems (MJLSs), several significant results on Markovian jumping neural networks (MJNNs) have been reported, e.g., [7] and [8].

It is worth mentioning that a recent interesting consideration for MJLSs is that the transition probabilities (TPs) that form the Markov chain are assumed to be not exactly known. The scenario containing such defective TPs is more general, and the underlying MJLSs are thereby more practicable. Consequently, a few meaningful studies have been carried out, e.g., [9]–[12], and two concepts have been proposed so far, namely, the partially unknown TPs [11] and the uncertain TPs [10]. Also, the idea of the partially unknown TPs has recently been applied to the MJNNs [13]. For the concept of uncertain TPs, the elements in a transition probability matrix (TPM) are uncertain within an interval, and two description methods, namely the norm-bounded and the polytopic uncertainty descriptions, have been proposed. Correspondingly, the true elements in a TPM are unknown but belong to a given range with lower and upper bounds [10], or to a given polytope with a certain number of vertices [14]. It should be noted that such given information is assumed obtainable when perfect statistics of the modes transitions is targeted in practical samplings and computations. On the other hand, the concept of partially unknown TPs assumes that some elements in a TPM are known, and others are not (even without any further given information of the statistics) [11]. Therefore, it can be well understood that the concept of partially unknown TPs is more general and the concept of uncertain TPs is less conservative, since more information is "contrived" in the latter case. In fact, the uncertain TPs can be considered as unknown ones with further knowledge offered from statistics.
In reverse, the unknown TPs can also be viewed as uncertain ones within their "natural" intervals, which can be calculated from the known TPs and the property that the sum of each row in a TPM is 1. In other words, the two concepts of defective statistics are mathematically interrelated. Nevertheless, from a different viewpoint, these two concepts actually reflect different levels of defectiveness. Note that, so far, these two lines of attack on the defective statistics of modes transitions are still dealt with separately. In fact, a more practical scenario that designers may encounter is that some TPs are known, some are uncertain with tighter intervals, and others are unknown with "natural" intervals. However, the issues on MJLSs taking account of the two aforesaid concepts of defective TPs in a composite manner are largely open, let alone applications to the MJNN area. In this brief, we aim to address the robust stability problem for a class of discrete-time uncertain MJNNs with defective statistics of modes transitions. The parameter uncertainties are assumed to be norm-bounded, and the stochastic perturbations are described in terms of Brownian motion. The main contribution of this brief is that a framework incorporating the two concepts of partially unknown TPs and uncertain TPs is proposed for the first time, and the underlying MJNN is studied under this framework. A sufficient stability criterion for the underlying system is obtained by using the property of
the TPM and the convexity of uncertain domains. A monotonicity concerning the maximum value of a given scalar, which bounds the stochastic perturbations affecting the system stability, is observed as the level of the defectiveness varies. The remainder of this brief is organized as follows. In Section II, the mathematical model of the system concerned is formulated and some preliminary results are given. Section III is devoted to establishing the stability criterion for the underlying system and deriving several corollaries for the different simplified cases of the system. Numerical examples are provided in Section IV, and this brief is concluded in Section V.

Notation: The notations used in this brief are quite standard. R^n and R^{m×n} refer to, respectively, the n-dimensional Euclidean space and the set of all m × n real matrices, and N+ stands for the set of positive integers. The notation P > 0 (≥ 0) means P is real symmetric positive (semi-positive) definite, and the superscript "T" denotes the transpose of vectors or matrices. Moreover, let (Ω, F, P) be a complete probability space, in which Ω is the sample space, F is the σ-algebra of subsets of the sample space, and P is the probability measure on F. In addition, in symmetric block matrices or long matrix expressions, we use ∗ as an ellipsis for the terms that are introduced by symmetry, and diag{···} stands for a block-diagonal matrix. Matrices, if their dimensions are not explicitly stated, are assumed to be compatible for algebraic operations. E[·] stands for the mathematical expectation, and M_i is adopted to denote M(i) for brevity. I and 0 represent, respectively, the identity matrix and the zero matrix with appropriate dimensions.

II. PRELIMINARIES AND PROBLEM FORMULATION

Consider an n-neuron discrete-time uncertain Markovian jumping neural network, defined in a complete probability space (Ω, F, P):

    y(k + 1) = (A(r(k)) + ΔA(r(k))) y(k) + (B(r(k)) + ΔB(r(k))) f(y(k)) + σ(y(k), k) w(k)        (1)

where y(k) = (y_1(k), y_2(k), ..., y_n(k))^T ∈ R^n is the state vector associated with the n neurons, f(y(k)) = (f_1(y_1(k)), f_2(y_2(k)), ..., f_n(y_n(k)))^T ∈ R^n denotes the nonlinear activation function with the initial condition f(0) = 0, A(r(k)) = diag{a_1(r(k)), a_2(r(k)), ..., a_n(r(k))} has positive entries a_m(r(k)) < 1, ∀m = 1, 2, ..., n, and the real matrix B(r(k)) is the constant connection weight matrix. In addition, ΔA(r(k)) and ΔB(r(k)) are time-varying parameter uncertainties. The Markov chain {r(k), k ∈ N+} orchestrating the mode jumpings of the NNs takes values in a finite set I ≜ {1, ..., N} with mode TPs Pr(r(k+1) = j | r(k) = i) = π_ij, where π_ij ≥ 0, ∀i, j ∈ I, and Σ_{j=1}^{N} π_ij = 1. Correspondingly, the Markovian transition probability matrix is defined by

        ⎡ π_11  π_12  ···  π_1N ⎤
    Π = ⎢ π_21  π_22  ···  π_2N ⎥        (2)
        ⎢   ⋮                ⋮  ⎥
        ⎣ π_N1  π_N2  ···  π_NN ⎦
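As an illustration only (our own sketch, not part of the brief; the TPM values are hypothetical), a mode trajectory r(k) governed by a TPM as in (2) can be simulated as follows, assuming for the moment that all entries are known:

```python
import numpy as np

def simulate_modes(Pi, r0, steps, rng=None):
    """Sample a mode trajectory r(0), ..., r(steps) from a TPM Pi.

    Each row of Pi must be a probability distribution over the N modes,
    matching the row-sum property of (2).
    """
    Pi = np.asarray(Pi)
    assert np.allclose(Pi.sum(axis=1), 1.0), "each TPM row must sum to 1"
    rng = np.random.default_rng(rng)
    r = [r0]
    for _ in range(steps):
        # Next mode drawn according to the row of the current mode.
        r.append(int(rng.choice(len(Pi), p=Pi[r[-1]])))
    return r

Pi = [[0.7, 0.3],
      [0.4, 0.6]]                     # a hypothetical 2-mode TPM
traj = simulate_modes(Pi, r0=0, steps=100, rng=1)
print(len(traj))  # 101
```

The defective-statistics setting studied below corresponds to some of the entries of Pi above being only interval-bounded or entirely unknown, so such a simulation is only possible once specific values are fixed.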
The set I contains the N modes of (1) and, for r(k) = i ∈ I, the system matrices of the i-th mode are denoted by (A_i + ΔA_i, B_i + ΔB_i), which are real and known with compatible dimensions. Accounting for the stochastic perturbations in the form σ(y(k), k)w(k), w(k) is a scalar Brownian motion on (Ω, F, P) such that E[w(k)] = 0 and E[w²(k)] = 1. We now recall a necessary assumption for our derivation.

Assumption 1: The function σ: R^n × R → R^n is a Borel measurable function which satisfies

    σ^T(x, k) σ(x, k) ≤ ρ x^T x,   ∀x ∈ R^n        (3)
where ρ is a positive constant, which bounds the stochastic perturbations that the system can tolerate. More details on using ρ to describe the stochastic perturbations can be found in [15]. In addition, the parameter uncertainties, as commonly adopted in the literature, e.g., [2] and [16], are assumed to have the structure, ∀r(k) = i, [ΔA_i, ΔB_i] = M_i F_i [N_{1i}, N_{2i}], where M_i, N_{1i}, and N_{2i} are real constant matrices and F_i is an unknown time-varying matrix-valued function satisfying F_i^T F_i ≤ I, ∀i ∈ N+.

Remark 1: Note that, in practice, all the elements, or a part of them, in the TPM (2) are probably costly or even impossible to obtain. Thus, instead of putting great effort into measuring or estimating the TPM, it is necessary and significant, from a control perspective, to further conduct research on MJNNs with defective statistics of modes transitions.

In this brief, the statistics of modes transitions is considered to be defective. Specifically, some elements in the matrix Π are assumed not to be known exactly. They may be uncertain within given intervals offered from statistics, or they may not have such available intervals. We term the former "uncertain" elements, and the latter "unknown" ones. As described in [14], we assume that the TPM Π = [π_ij]_{N×N} belongs to a given polytope P with vertices Π_r, r = 1, 2, ..., M

    P ≜ { Π | Π = Σ_{r=1}^{M} α_r Π_r ;  α_r ≥ 0,  Σ_{r=1}^{M} α_r = 1 }        (4)

where Π_r = [π_ij^r]_{N×N}, i, j ∈ I, r = 1, ..., M, are given TPMs that may still contain unknown elements. It is worth emphasizing that in (4), the row-sum property of each TPM Π_r holds, and the corresponding property of the TPM Π is accordingly satisfied. For simplicity, ∀i ∈ I, we denote I = I_K^(i) ∪ I_UC^(i) ∪ I_UK^(i) as follows:

    I_K^(i) ≜ {j : π_ij is known},
    I_UC^(i) ≜ {j : π̃_ij is uncertain},
    I_UK^(i) ≜ {j : π̂_ij is unknown}.

Here, each uncertain element and each unknown element is labeled with the tilde "˜" and the hat "ˆ", respectively. Then, let π_UC^(i) ≜ Σ_{j ∈ I_UC^(i)} π̃_ij^r, ∀r = 1, ..., M, and π_K^(i) ≜ Σ_{j ∈ I_K^(i)} π_ij, respectively.
and the property that each row of a TPM sums to 1. However, the reason for differentiating uncertain elements from unknown ones is that uncertain elements with tighter intervals (not only the "natural intervals") carry more information about the statistics, so the TPM can be described more precisely.

The objective of this brief is to establish a stability criterion for (1) when the statistics of the mode transitions are defective as stated in (4). To proceed, we recall the essential assumption on the neuron activation function and the definition of asymptotic stability in the mean square for the underlying system.

Assumption 2: The neuron activation function in the MJNN (1) is monotonically increasing and bounded, and satisfies
\[
0 \le \frac{f_j(s_1) - f_j(s_2)}{s_1 - s_2} \le h, \quad \forall j = 1, \ldots, n
\]
where $s_1, s_2 \in \mathbb{R}$, $s_1 \ne s_2$, and $h$ is a positive constant.

Definition 1: The MJNN (1) is said to be asymptotically stable in the mean square if, for any solution $y(k)$ of (1), the following holds:
\[
\lim_{k \to \infty} E\big[\,|y(k)|^2\,\big] = 0.
\]

III. MAIN RESULTS

In this section, we derive a stability criterion for the discrete-time uncertain MJNN (1) with defective statistics of mode transitions (4), and simplify the criterion when the complex dynamics in (1) are reduced. The following theorem presents a sufficient condition for the asymptotic stability in the mean square of (1).

Theorem 1: Consider the MJNN (1) with the defective TPM (4). Suppose that Assumptions 1 and 2 hold. The corresponding system is asymptotically stable in the mean square if there exist a set of matrices $P_i > 0$, a diagonal matrix $L > 0$, and positive scalars $\mu^*$ and $\epsilon$, $\forall i \in \mathcal{I}$, such that
\[
\bar{P}^i < \mu^* I \tag{5}
\]
\[
\Xi_i = \begin{bmatrix} -P_s^i & P_s^i A_i & P_s^i B_i & P_s^i M_i \\ * & \Gamma & L + \epsilon N_{1i}^T N_{2i} & 0 \\ * & * & \Phi & 0 \\ * & * & * & -\epsilon I \end{bmatrix} < 0 \tag{6}
\]
where
\[
\begin{cases}
P_K^{(i)} = \sum_{j \in \mathcal{I}_K^{(i)}} \pi_{ij} P_j, \quad P_{UC}^{(i)} = \sum_{j \in \mathcal{I}_{UC}^{(i)}} \tilde{\pi}_{ij}^r P_j, \\
P_{UK}^{(i)} = \sum_{j \in \mathcal{I}_{UK}^{(i)}} \hat{\pi}_{ij} P_j, \\
\bar{P}^i = P_K^{(i)} + \sum_{r=1}^{M} \alpha_r \sum_{j \in \mathcal{I}_{UC}^{(i)}} \tilde{\pi}_{ij}^r P_j + P_{UK}^{(i)}, \\
P_s^i = P_K^{(i)} + P_{UC}^{(i)} + \big(1 - \pi_K^{(i)} - \pi_{UC}^{(i)}\big) P_j, \quad \forall j \in \mathcal{I}_{UK}^{(i)}
\end{cases} \tag{7}
\]
and
\[
\Gamma = \mu^* \rho I - P_i + \epsilon N_{1i}^T N_{1i}, \quad \Phi = -2h^{-1} L + \epsilon N_{2i}^T N_{2i}.
\]

Proof: By Assumption 2 and $f(0) = 0$, it is straightforward to show that $0 \le f(y_{ik})/y_{ik} \le h$ when $s_2 = 0$. Since $f(y_{ik})$ is assumed to be monotonically increasing with the initial condition $f(0) = 0$, one knows that $f(y_{ik}) > 0$ whenever $y_{ik} > 0$. Then, we can further show
\[
y_{ik} - h^{-1} f(y_{ik}) \ge 0. \tag{8}
\]
Multiplying (8) by $l_{ii} f(y_{ik})$ on the right, and since $l_{ii} > 0$, the above inequality is equivalent to $y_{ik}\, l_{ii} f(y_{ik}) - h^{-1} f(y_{ik})\, l_{ii} f(y_{ik}) \ge 0$. By denoting a positive definite matrix $L = \mathrm{diag}\{l_{11}, l_{22}, \ldots, l_{nn}\}$, $y_k = (y_{1k}, y_{2k}, \ldots, y_{nk})^T$, and $f(y_k) = (f(y_{1k}), f(y_{2k}), \ldots, f(y_{nk}))^T$, the following inequality holds:
\[
y_k^T L f(y_k) - h^{-1} f^T(y_k) L f(y_k) \ge 0. \tag{9}
\]
To derive the stability criterion, we introduce the following Lyapunov function candidate for (1), $V(y_k, k, r_k) = y_k^T P_i y_k$, $\forall r_k = i$, $i \in \mathcal{I}$, where
\[
\bar{P}^i = \sum_{j=1}^{N} \pi_{ij} P_j, \quad \tilde{A}_i = A_i + \Delta A_i, \quad \tilde{B}_i = B_i + \Delta B_i. \tag{10}
\]
By (9), it follows that
\begin{align*}
\Delta E &= E\big[V(y_{k+1}, k+1, r_{k+1}) \,\big|\, y_k, r_k = i\big] - V(y_k, k, r_k) \\
&= y_{k+1}^T \bar{P}^i y_{k+1} - y_k^T P_i y_k \\
&= \big(\tilde{A}_i y_k + \tilde{B}_i f(y_k)\big)^T \bar{P}^i \big(\tilde{A}_i y_k + \tilde{B}_i f(y_k)\big) + \sigma^T(y_k) \bar{P}^i \sigma(y_k) - y_k^T P_i y_k \\
&\le \big(\tilde{A}_i y_k + \tilde{B}_i f(y_k)\big)^T \bar{P}^i \big(\tilde{A}_i y_k + \tilde{B}_i f(y_k)\big) + \sigma^T(y_k) \bar{P}^i \sigma(y_k) - y_k^T P_i y_k \\
&\quad + 2 y_k^T L f(y_k) - 2 h^{-1} f^T(y_k) L f(y_k).
\end{align*}
By Assumption 1 and (5), it can be readily shown that $\sigma^T(y_k) \bar{P}^i \sigma(y_k) \le \mu^* \sigma^T(y_k) \sigma(y_k) \le \mu^* \rho\, y_k^T y_k$, and then
\[
\Delta E \le \big(\tilde{A}_i y_k + \tilde{B}_i f(y_k)\big)^T \bar{P}^i \big(\tilde{A}_i y_k + \tilde{B}_i f(y_k)\big) + y_k^T(\mu^* \rho I - P_i) y_k + 2 y_k^T L f(y_k) - 2 h^{-1} f^T(y_k) L f(y_k). \tag{11}
\]
Further, we denote
\[
\begin{cases}
\zeta_i = \tilde{A}_i y_k + \tilde{B}_i f(y_k) \\
\Theta_i = y_k^T(\mu^* \rho I - P_i) y_k + 2 y_k^T L f(y_k) - 2 h^{-1} f^T(y_k) L f(y_k).
\end{cases} \tag{12}
\]
Then, (11) becomes
\[
\Delta E \le \zeta_i^T \bar{P}^i \zeta_i + \Theta_i. \tag{13}
\]
Now, we decompose the defective TPM considered in this brief
\[
\bar{P}^i = \sum_{j=1}^{N} \pi_{ij} P_j = P_K^{(i)} + \sum_{r=1}^{M} \alpha_r \sum_{j \in \mathcal{I}_{UC}^{(i)}} \tilde{\pi}_{ij}^r P_j + \sum_{j \in \mathcal{I}_{UK}^{(i)}} \hat{\pi}_{ij} P_j
\]
where $\sum_{r=1}^{M} \alpha_r \tilde{\pi}_{ij}^r$, $\forall j \in \mathcal{I}_{UC}^{(i)}$, represents an uncertain element in the polytope uncertainty description. As $\sum_{r=1}^{M} \alpha_r = 1$ and the $\alpha_r$ can take values arbitrarily in $[0, 1]$, (13) implies that
\begin{align*}
\Delta E &\le \zeta_i^T \Big( P_K^{(i)} + \sum_{j \in \mathcal{I}_{UC}^{(i)}} \sum_{r=1}^{M} \alpha_r \tilde{\pi}_{ij}^r P_j + \sum_{j \in \mathcal{I}_{UK}^{(i)}} \hat{\pi}_{ij} P_j \Big) \zeta_i + \Theta_i \\
&= \sum_{r=1}^{M} \alpha_r \zeta_i^T \Big( P_K^{(i)} + \sum_{j \in \mathcal{I}_{UC}^{(i)}} \tilde{\pi}_{ij}^r P_j + \sum_{j \in \mathcal{I}_{UK}^{(i)}} \hat{\pi}_{ij} P_j \Big) \zeta_i + \Theta_i. \tag{14}
\end{align*}
Then, (14) holds if and only if, $\forall r = 1, \ldots, M$,
\begin{align*}
\Delta E &\le \zeta_i^T \Big( P_K^{(i)} + P_{UC}^{(i)} + \sum_{j \in \mathcal{I}_{UK}^{(i)}} \hat{\pi}_{ij} P_j \Big) \zeta_i + \Theta_i \\
&= \zeta_i^T \Big( P_K^{(i)} + P_{UC}^{(i)} + \big(1 - \pi_K^{(i)} - \pi_{UC}^{(i)}\big) \sum_{j \in \mathcal{I}_{UK}^{(i)}} \frac{\hat{\pi}_{ij}}{1 - \pi_K^{(i)} - \pi_{UC}^{(i)}}\, P_j \Big) \zeta_i + \Theta_i. \tag{15}
\end{align*}
Since
\[
0 \le \frac{\hat{\pi}_{ij}}{1 - \pi_K^{(i)} - \pi_{UC}^{(i)}} \le 1 \quad \text{and} \quad \sum_{j \in \mathcal{I}_{UK}^{(i)}} \frac{\hat{\pi}_{ij}}{1 - \pi_K^{(i)} - \pi_{UC}^{(i)}} = 1
\]
(15) becomes
\[
\Delta E \le \sum_{j \in \mathcal{I}_{UK}^{(i)}} \frac{\hat{\pi}_{ij}}{1 - \pi_K^{(i)} - \pi_{UC}^{(i)}} \Big( \zeta_i^T \big( P_K^{(i)} + P_{UC}^{(i)} + \big(1 - \pi_K^{(i)} - \pi_{UC}^{(i)}\big) P_j \big) \zeta_i + \Theta_i \Big).
\]
Thus, for $0 \le \hat{\pi}_{ij} \le 1 - \pi_K^{(i)} - \pi_{UC}^{(i)}$, the above inequality is equivalent to, $\forall j \in \mathcal{I}_{UK}^{(i)}$,
\[
\Delta E \le \zeta_i^T \big( P_K^{(i)} + P_{UC}^{(i)} + \big(1 - \pi_K^{(i)} - \pi_{UC}^{(i)}\big) P_j \big) \zeta_i + \Theta_i.
\]
Considering (12) and $P_s^i = P_K^{(i)} + P_{UC}^{(i)} + \big(1 - \pi_K^{(i)} - \pi_{UC}^{(i)}\big) P_j$, one knows that
\begin{align*}
\Delta E &\le \zeta_i^T P_s^i \zeta_i + \Theta_i \\
&= \big(\tilde{A}_i y_k + \tilde{B}_i f(y_k)\big)^T P_s^i \big(\tilde{A}_i y_k + \tilde{B}_i f(y_k)\big) + y_k^T(\mu^* \rho I - P_i) y_k + 2 y_k^T L f(y_k) - 2 h^{-1} f^T(y_k) L f(y_k) \\
&= y_k^T \big(\tilde{A}_i^T P_s^i \tilde{A}_i + \mu^* \rho I - P_i\big) y_k + f^T(y_k) \big(\tilde{B}_i^T P_s^i \tilde{B}_i - 2 h^{-1} L\big) f(y_k) + 2 y_k^T \big(\tilde{A}_i^T P_s^i \tilde{B}_i + L\big) f(y_k) \\
&= \xi_k^T \tilde{\Xi} \xi_k \tag{16}
\end{align*}
where
\[
\xi_k = \begin{bmatrix} y_k^T & f^T(y_k) \end{bmatrix}^T, \quad \tilde{\Xi} = \begin{bmatrix} \tilde{A}_i^T P_s^i \tilde{A}_i + \mu^* \rho I - P_i & \tilde{A}_i^T P_s^i \tilde{B}_i + L \\ * & \tilde{B}_i^T P_s^i \tilde{B}_i - 2 h^{-1} L \end{bmatrix}.
\]
By the Schur complement, (6) implies that, $\forall i \in \mathcal{I}$,
\[
\begin{bmatrix} -P_s^i & P_s^i A_i & P_s^i B_i \\ * & \Gamma & L + \epsilon N_{1i}^T N_{2i} \\ * & * & \Phi \end{bmatrix} + \epsilon^{-1} \tilde{P}_m^i M_i M_i^T \tilde{P}_m^{iT} < 0 \tag{17}
\]
where $\tilde{P}_m^i = [P_s^{iT}, 0, 0]^T$. Meanwhile, we denote $\Upsilon = [0, \Delta A_i, \Delta B_i]$, $\tilde{N} = [0, N_{1i}, N_{2i}]$, and
\[
\Xi = \begin{bmatrix} -P_s^i & P_s^i A_i & P_s^i B_i \\ * & \mu^* \rho I - P_i & L \\ * & * & -2 h^{-1} L \end{bmatrix}, \quad \Xi_\Delta = \tilde{P}_m^i \Upsilon + \Upsilon^T \tilde{P}_m^{iT}. \tag{18}
\]
Thus, by [6, Lemma 1], we can verify that
\[
\Xi_\Delta = \tilde{P}_m^i \Upsilon + \Upsilon^T \tilde{P}_m^{iT} = \tilde{P}_m^i M_i F_i \tilde{N} + \tilde{N}^T F_i^T M_i^T \tilde{P}_m^{iT} \le \epsilon \tilde{N}^T \tilde{N} + \epsilon^{-1} \tilde{P}_m^i M_i M_i^T \tilde{P}_m^{iT}.
\]
Then it follows from (17) and (18) that
\begin{align*}
\Xi + \Xi_\Delta &= \begin{bmatrix} -P_s^i & P_s^i A_i & P_s^i B_i \\ * & \mu^* \rho I - P_i & L \\ * & * & -2 h^{-1} L \end{bmatrix} + \tilde{P}_m^i \Upsilon + \Upsilon^T \tilde{P}_m^{iT} \\
&\le \begin{bmatrix} -P_s^i & P_s^i A_i & P_s^i B_i \\ * & \mu^* \rho I - P_i & L \\ * & * & -2 h^{-1} L \end{bmatrix} + \epsilon \tilde{N}^T \tilde{N} + \epsilon^{-1} \tilde{P}_m^i M_i M_i^T \tilde{P}_m^{iT} \\
&= \begin{bmatrix} -P_s^i & P_s^i A_i & P_s^i B_i \\ * & \Gamma & L + \epsilon N_{1i}^T N_{2i} \\ * & * & \Phi \end{bmatrix} + \epsilon^{-1} \tilde{P}_m^i M_i M_i^T \tilde{P}_m^{iT} < 0.
\end{align*}
By (10), we have
\[
\Xi + \Xi_\Delta = \begin{bmatrix} -P_s^i & P_s^i \tilde{A}_i & P_s^i \tilde{B}_i \\ * & \mu^* \rho I - P_i & L \\ * & * & -2 h^{-1} L \end{bmatrix} < 0
\]
which, by the Schur complement, implies that
\[
\tilde{\Xi} = \begin{bmatrix} \tilde{A}_i^T P_s^i \tilde{A}_i + \mu^* \rho I - P_i & \tilde{A}_i^T P_s^i \tilde{B}_i + L \\ * & \tilde{B}_i^T P_s^i \tilde{B}_i - 2 h^{-1} L \end{bmatrix} < 0. \tag{19}
\]
From (16) and (19), for a negative scalar $\delta$, we know
\[
\Delta E = E\big[V(y_{k+1}, k+1, r_{k+1}) \,\big|\, y_k, r_k = i\big] - V(y_k, k, r_k) \le \delta\, |\xi_k|^2
\]
which is equal to
\[
E[V(y_{k+1}, k+1, r_{k+1})] - E[V(y_k, k, r_k)] \le \delta\, E\big[|\xi_k|^2\big]. \tag{20}
\]
Given a positive integer $m$, summing both sides of (20) from zero to $m$ implies
\[
E[V(y_{m+1}, m+1, r_{m+1})] - E[V(y_0, 0, r_0)] \le \delta \sum_{k=0}^{m} E\big[|\xi_k|^2\big]
\]
which gives $-\delta \sum_{k=0}^{m} E\big[|\xi_k|^2\big] \le E[V(y_0, 0, r_0)]$. Letting $m \to \infty$, we know that the series $\sum_{k=0}^{\infty} E\big[|\xi_k|^2\big]$ is convergent, which means $\lim_{k \to \infty} E\big[|y_k|^2\big] = 0$; hence the proof is completed.

Remark 3: Note that the MJNN treated in Theorem 1 covers two simplified cases, i.e., the MJNN with only parameter uncertainties or with only stochastic perturbations, which we address as follows. The proofs of the corresponding corollaries can be obtained in the same vein as the proof of Theorem 1.

Case 1: If there are no parameter uncertainties $\Delta A_i$, $\Delta B_i$ in the MJNN, the system reduces to
\[
y(k+1) = A_i y(k) + B_i f(y(k)) + \sigma(y(k), k) w(k) \tag{21}
\]
where the system matrices $(A_i, B_i)$ are the same as those in (1). Then, we have the following corollary.
Corollary 1: Consider the MJNN (21) with the defective TPM (4). Suppose that Assumptions 1 and 2 hold. The corresponding system is asymptotically stable in the mean square if there exist a set of matrices $P_i > 0$, a diagonal matrix $L > 0$, and a positive scalar $\mu^*$, $\forall i \in \mathcal{I}$, such that $\bar{P}^i < \mu^* I$ and
\[
\Xi_i = \begin{bmatrix} -P_s^i & P_s^i A_i & P_s^i B_i \\ * & \mu^* \rho I - P_i & L \\ * & * & -2 h^{-1} L \end{bmatrix} < 0
\]
where the parameters $\bar{P}^i$ and $P_s^i$ are the same as those in (7).

Case 2: If there are no stochastic perturbations $\sigma(y(k), k) w(k)$ in the MJNN, the system reduces to
\[
y(k+1) = (A_i + \Delta A_i) y(k) + (B_i + \Delta B_i) f(y(k)) \tag{22}
\]
where the system matrices $(A_i + \Delta A_i, B_i + \Delta B_i)$ are the same as those in (1). Then, we have the following corollary.

Corollary 2: Consider the MJNN (22) with the defective TPM (4). Suppose that Assumptions 1 and 2 hold. The corresponding system is asymptotically stable in the mean square if there exist a set of matrices $P_i > 0$, a diagonal matrix $L > 0$, and positive scalars $\mu^*$ and $\epsilon$, $\forall i \in \mathcal{I}$, such that $\bar{P}^i < \mu^* I$ and
\[
\Xi_i = \begin{bmatrix} -P_s^i & P_s^i A_i & P_s^i B_i & P_s^i M_i \\ * & -P_i + \epsilon N_{1i}^T N_{1i} & L + \epsilon N_{1i}^T N_{2i} & 0 \\ * & * & \Phi & 0 \\ * & * & * & -\epsilon I \end{bmatrix} < 0
\]
where the parameters $\bar{P}^i$ and $P_s^i$ are the same as those in (7).

Remark 4: Note also that the elements of the defective TPM in Theorem 1, which comprise the three sorts of TPs, i.e., known, uncertain, and unknown, can reduce to various simplified cases (two sorts or one sort). Correspondingly, the composition of the parameters $\bar{P}^i$ and $P_s^i$ in (7) changes.

1) All the elements in the TPM are unknown. The corresponding system can be considered as the so-called switched NN under arbitrary switching, in terms of the analyses in [11]. Then we have
\[
\bar{P}^i = P_{UK}^{(i)}, \quad P_s^i = P_j, \quad \forall j \in \mathcal{I}_{UK}^{(i)}.
\]
2) The TPM only contains known and unknown elements [9], and we have
\[
\bar{P}^i = P_K^{(i)} + \sum_{j \in \mathcal{I}_{UK}^{(i)}} \hat{\pi}_{ij} P_j, \quad P_s^i = P_K^{(i)} + \big(1 - \pi_K^{(i)}\big) P_j, \quad \forall j \in \mathcal{I}_{UK}^{(i)}.
\]
3) The TPM only contains known and uncertain elements [10]; then we have
\[
\bar{P}^i = P_K^{(i)} + \sum_{r=1}^{M} \alpha_r \sum_{j \in \mathcal{I}_{UC}^{(i)}} \tilde{\pi}_{ij}^r P_j, \quad P_s^i = P_K^{(i)} + P_{UC}^{(i)}.
\]
4) All the elements are known. The corresponding system becomes the conventional MJNN with completely known TPM [17]
\[
\bar{P}^i = P_s^i = P_K^{(i)}.
\]
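Theorem 1 and both corollaries hinge on the Schur complement step used in the proof (passing between (6), (17), and (19)). The equivalence behind that step can be checked numerically; the following is a minimal sketch with randomly generated blocks (NumPy assumed), not the paper's LMIs themselves:

```python
import numpy as np

rng = np.random.default_rng(0)

def is_neg_def(M):
    # A symmetric matrix is negative definite iff all its eigenvalues are < 0.
    return bool(np.all(np.linalg.eigvalsh(M) < 0.0))

# Schur complement lemma: for symmetric M = [[A, B], [B^T, C]] with A < 0,
# M < 0 holds if and only if C - B^T A^{-1} B < 0.
for _ in range(100):
    R = rng.standard_normal((2, 2))
    A = -np.eye(2) - R @ R.T          # negative definite by construction
    B = rng.standard_normal((2, 2))
    S = rng.standard_normal((2, 2))
    C = 0.5 * (S + S.T)               # arbitrary symmetric block
    M = np.block([[A, B], [B.T, C]])
    schur = C - B.T @ np.linalg.inv(A) @ B
    assert is_neg_def(M) == is_neg_def(schur)
```

The proof applies exactly this equivalence with the $(1,1)$ block $-P_s^i$, which is negative definite since $P_s^i > 0$.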
Remark 5: Note that, as the level of defectiveness varies, it is intuitive to conjecture that there exists a monotonicity in the relevant system performance (here, the bound on the stochastic perturbations that the system can tolerate without becoming unstable), which we verify via the numerical examples in the next section.

IV. NUMERICAL EXAMPLES

In this section, three examples are presented to verify the theoretical findings. For brevity of description, we denote the $i$-th row of the $r$-th vertex in the polytope uncertainty description by $\Pi_r^{(i)}$, $\forall i \in \mathcal{I}$, $\forall r = 1, \ldots, M$.

Example 1: Consider a three-neuron MJNN with four jumping modes and the defective TPM (4), given by
\[
A_1 = \mathrm{diag}\{0.4, 0.3, 0.5\}, \quad A_2 = \mathrm{diag}\{0.4, 0.3, 0.3\}, \quad A_3 = \mathrm{diag}\{0.4, 0.9, 0.7\}, \quad A_4 = \mathrm{diag}\{0.4, 0.2, 0.7\}
\]
\[
B_1 = \begin{bmatrix} 0.19 & -0.21 & 0.09 \\ 0.00 & -0.31 & 0.19 \\ -0.20 & -0.10 & -0.20 \end{bmatrix}, \quad B_2 = \begin{bmatrix} 0.21 & -0.20 & 0.10 \\ 0.00 & -0.30 & 0.19 \\ -0.21 & -0.10 & -0.20 \end{bmatrix}
\]
\[
B_3 = \begin{bmatrix} 0.20 & -0.20 & 0.10 \\ 0.00 & -0.27 & 0.21 \\ -0.21 & -0.12 & -0.19 \end{bmatrix}, \quad B_4 = \begin{bmatrix} 0.10 & -0.20 & 0.21 \\ 0.10 & -0.20 & 0.12 \\ -0.10 & -0.12 & -0.30 \end{bmatrix}
\]
\[
M_i = 0.3I, \quad N_{1i} = 0.1I, \quad N_{2i} = 0.2I, \quad i = 1, 2, 3, 4, \quad h = 0.01, \quad \rho = 0.3.
\]
The TPM comprises five vertices $\Pi_r$, $r = 1, 2, \ldots, 5$, whose second rows $\Pi_r^{(2)}$ are given by
\[
\Pi_1^{(2)} = [\,?\ \ 0.15\ \ 0.30\ \ ?\,], \quad \Pi_2^{(2)} = [\,?\ \ 0.15\ \ 0.60\ \ ?\,], \quad \Pi_3^{(2)} = [\,?\ \ 0.45\ \ 0.30\ \ ?\,]
\]
\[
\Pi_4^{(2)} = [\,?\ \ 0.45\ \ 0.55\ \ ?\,], \quad \Pi_5^{(2)} = [\,?\ \ 0.40\ \ 0.60\ \ ?\,]
\]
and the other rows in the five vertices are defined with the same elements, $\forall r = 1, 2, \ldots, 5$:
\[
\Pi_r^{(1)} = [\,?\ \ 0.4\ \ ?\ \ 0.2\,], \quad \Pi_r^{(3)} = [\,?\ \ 0.2\ \ 0.5\ \ ?\,], \quad \Pi_r^{(4)} = [\,?\ \ 0.3\ \ ?\ \ ?\,].
\]
For simplicity, the TPM in the polytope uncertainty description can be rewritten in the following norm-bounded form:
\[
\begin{bmatrix} ? & 0.4 & ? & 0.2 \\ ? & [0.15, 0.45] & [0.3, 0.6] & ? \\ ? & 0.2 & 0.5 & ? \\ ? & 0.3 & ? & ? \end{bmatrix}. \tag{23}
\]
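As an illustration of the data of Example 1, one can simulate a sample path of (1). The activation function, the mode sequence, and the uncertainty realization $F_i = I$ below are assumptions for illustration only (the brief does not fix them), and the perturbation is switched off ($\sigma = 0$, which trivially satisfies (3)):

```python
import numpy as np

h = 0.01  # sector bound from Example 1

def f(y):
    # Assumed activation satisfying Assumption 2 with bound h
    # (the brief does not specify a particular f).
    return h * np.tanh(y)

A = [np.diag([0.4, 0.3, 0.5]), np.diag([0.4, 0.3, 0.3]),
     np.diag([0.4, 0.9, 0.7]), np.diag([0.4, 0.2, 0.7])]
B = [np.array([[0.19, -0.21, 0.09], [0.00, -0.31, 0.19], [-0.20, -0.10, -0.20]]),
     np.array([[0.21, -0.20, 0.10], [0.00, -0.30, 0.19], [-0.21, -0.10, -0.20]]),
     np.array([[0.20, -0.20, 0.10], [0.00, -0.27, 0.21], [-0.21, -0.12, -0.19]]),
     np.array([[0.10, -0.20, 0.21], [0.10, -0.20, 0.12], [-0.10, -0.12, -0.30]])]
# One admissible realization of the norm-bounded uncertainty (F_i = I):
dA = 0.3 * 0.1 * np.eye(3)  # M_i F_i N_1i with M_i = 0.3I, N_1i = 0.1I
dB = 0.3 * 0.2 * np.eye(3)  # M_i F_i N_2i with N_2i = 0.2I

rng = np.random.default_rng(1)
y = np.array([1.0, -1.0, 0.5])
mode = 0
for k in range(200):
    # Noise-free sample path: sigma = 0 satisfies (3) for any rho > 0.
    y = (A[mode] + dA) @ y + (B[mode] + dB) @ f(y)
    mode = int(rng.integers(4))  # arbitrary switching, for illustration only
```

On this sample path the state decays toward the origin, consistent with the mean-square stability verified below via Theorem 1.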
By Theorem 1, one can verify that (5)–(6) have a feasible solution, which shows that the given system is asymptotically stable in the mean square despite the defectiveness existing in the TPM (23).

Note that, as shown in (3), $\rho$ constrains the intensity of the stochastic perturbations. This means that a larger $\rho$ may cause the corresponding MJNN to become unstable. By Theorem 1, one can further obtain the relation between different values of $\rho$ and the stability of the resulting MJNN, as listed in Table I.

TABLE I
Stability of the MJNN Corresponding to Different ρ Values

  Value of ρ | Stability of MJNN
  0.3        | Stable
  0.4        | Stable
  0.5        | Unstable
  0.6        | Unstable

It is seen from Table I that a larger $\rho$, which allows the stochastic perturbations $\sigma(y(k), k)w(k)$ to be more intense, will lead to instability of the system. Thus a direct question is: what factor gives rise to the different maximum values of $\rho$ beyond which the corresponding MJNN becomes unstable? It is natural to conjecture that the defectiveness of the TPM has such a potential; that is, as the level of defectiveness varies, the maximum value of $\rho$ will change. The corresponding verification is given in Examples 2 and 3.

Example 2: Consider the MJNN in Example 1 and change the intervals of the uncertain TPs $\pi_{22}$ and $\pi_{23}$ in (23). The purpose here is to demonstrate the different behaviors of the underlying MJNN as the intervals of the uncertain TPs vary. Using the conditions in Theorem 1, we can obtain the maximum value of $\rho$ by solving the following minimization problem: min $1/\rho$ subject to LMIs (5) and (6). Given four different intervals of $\pi_{22}$ and $\pi_{23}$, the corresponding computation results are shown in Table II.

TABLE II
Maximum Value of ρ for Uncertain TPs with Different Intervals

  Interval of π22 | Interval of π23 | Maximum value of ρ
  [0.05, 0.85]    | [0.10, 0.90]    | 0.389
  [0.15, 0.75]    | [0.20, 0.80]    | 0.412
  [0.25, 0.65]    | [0.30, 0.70]    | 0.439
  [0.35, 0.55]    | [0.40, 0.60]    | 0.441

It can be seen that, as the intervals of the uncertain TPs $\pi_{22}$ and $\pi_{23}$ become smaller, the maximum value of $\rho$ increases, i.e., more intense stochastic perturbations are allowed. We now consider the more complex cases in Example 3, in which all three types of TPs are involved in the variations.

Example 3: Consider the MJNN in Example 1 with four different defective TPMs as listed in Table III.
From Case I to Case IV in Table III, the level of defectiveness decreases, which can be observed in three respects: 1) unknown elements turn into uncertain or even known ones;
TABLE III
Four Different Transition Probability Matrices

Case I: Completely unknown TPM
\[
\begin{bmatrix} ? & ? & ? & ? \\ ? & ? & ? & ? \\ ? & ? & ? & ? \\ ? & ? & ? & ? \end{bmatrix}
\]
Case II: Defective TPM 1
\[
\begin{bmatrix} ? & 0.4 & ? & 0.2 \\ 0.2 & ? & [0.3, 0.5] & 0.1 \\ ? & 0.2 & [0.1, 0.7] & ? \\ ? & ? & ? & ? \end{bmatrix}
\]
Case III: Defective TPM 2
\[
\begin{bmatrix} ? & 0.4 & ? & 0.2 \\ 0.2 & [0.15, 0.35] & [0.3, 0.5] & 0.1 \\ ? & 0.2 & [0.4, 0.6] & ? \\ 0.4 & 0.3 & 0.2 & 0.1 \end{bmatrix}
\]
Case IV: Completely known TPM
\[
\begin{bmatrix} 0.3 & 0.4 & 0.1 & 0.2 \\ 0.2 & 0.3 & 0.4 & 0.1 \\ 0.1 & 0.2 & 0.5 & 0.2 \\ 0.4 & 0.3 & 0.2 & 0.1 \end{bmatrix}
\]
TABLE IV
Maximum Value of ρ for Different Cases

  Case | Maximum value of ρ
  I    | 0.134
  II   | 0.296
  III  | 0.438
  IV   | 0.525
2) the intervals of the uncertain elements become tighter; and 3) the uncertain elements become known ones. In particular, Case I represents the so-called switched NN under arbitrary switching, and Case IV represents the conventional MJNN with a completely known TPM. The corresponding results can be seen in Table IV. From the computation results, it can also be seen that the lower the level of defectiveness of the TPM, the stronger the system's capability of tolerating stochastic perturbations while remaining stable. As seen from Example 1, the validity of Theorem 1 is demonstrated. It can also be concluded from Examples 2 and 3 that, as more statistics become available to the designers, the relevant system performance (here, the capability of tolerating stochastic perturbations) improves, as conjectured.

V. CONCLUSION

This brief dealt with a stability criterion for a class of uncertain MJNNs with defective statistics of mode transitions in the discrete-time domain. The defective TPs take account of recent studies, i.e., the so-called uncertain TPs and partially unknown TPs, in a composite way. By using the properties of the TPM and the convexity of the uncertain domains, a sufficient condition for the stability of the underlying system was established. Furthermore, a monotonicity between the level of defectiveness and the system's capability of tolerating stochastic perturbations was observed with respect to the maximum value of the given scalar ρ. Numerical examples were
provided to show the effectiveness of the developed results. It is worth mentioning that the consideration of the defective TPM can be further extended to other issues of MJNNs, such as MJNNs with time delays [17], [18], MJNNs in the continuous-time domain [19], etc.

REFERENCES

[1] Z. Wang, D. W. C. Ho, and X. Liu, "State estimation for delayed neural networks," IEEE Trans. Neural Netw., vol. 16, no. 1, pp. 279–284, Jan. 2005.
[2] P. Shi, "Filtering on sampled-data systems with parametric uncertainty," IEEE Trans. Autom. Control, vol. 43, no. 7, pp. 1022–1027, Jul. 1998.
[3] J. Zhang, P. Shi, J. Qiu, and H. Yang, "A new criterion for exponential stability of uncertain stochastic neural networks with mixed delays," Math. Comput. Model., vol. 47, nos. 9–10, pp. 1042–1051, May 2008.
[4] S. Mou, H. Gao, W. Qiang, and K. Chen, "New delay-dependent exponential stability for neural networks with time delay," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 2, pp. 571–576, Apr. 2008.
[5] R. Yang, H. Gao, and P. Shi, "Novel robust stability criteria for stochastic Hopfield neural networks with time delays," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 2, pp. 467–474, Apr. 2009.
[6] Y. Liu, Z. Wang, and X. Liu, "Robust stability of discrete-time stochastic neural networks with time-varying delays," Neurocomputing, vol. 71, nos. 4–6, pp. 823–833, Jan. 2008.
[7] Z. Wang, Y. Liu, L. Yu, and X. Liu, "Exponential stability of delayed recurrent neural networks with Markovian jumping parameters," Phys. Lett. A, vol. 356, nos. 4–5, pp. 346–352, Aug. 2006.
[8] H. Li, B. Chen, Q. Zhou, and W. Qian, "Robust stability for uncertain delayed fuzzy Hopfield neural networks with Markovian jumping parameters," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 1, pp. 94–102, Feb. 2009.
[9] L. Zhang and J. Lam, "Necessary and sufficient conditions for analysis and synthesis of Markov jump linear systems with incomplete transition descriptions," IEEE Trans. Autom. Control, vol. 55, no. 7, pp. 1695–1701, Jul. 2010.
[10] J. Xiong, J. Lam, H. Gao, and D. W. C. Ho, "On robust stabilization of Markovian jump systems with uncertain switching probabilities," Automatica, vol. 41, no. 5, pp. 897–903, May 2005.
[11] L. Zhang and E.-K. Boukas, "Stability and stabilization of Markovian jump linear systems with partly unknown transition probabilities," Automatica, vol. 45, no. 2, pp. 463–468, Feb. 2009.
[12] L. Zhang, E.-K. Boukas, and J. Lam, "Analysis and synthesis of Markov jump linear systems with time-varying delays and partially known transition probabilities," IEEE Trans. Autom. Control, vol. 53, no. 10, pp. 2458–2464, Nov. 2008.
[13] Z. Wu, H. Su, and J. Chu, "State estimation for discrete Markovian jumping neural networks with time delay," Neurocomputing, vol. 73, nos. 10–12, pp. 2247–2254, Jun. 2010.
[14] C. E. de Souza, A. Trofino, and K. A. Barbosa, "Mode-independent H∞ filters for Markovian jump linear systems," IEEE Trans. Autom. Control, vol. 51, no. 11, pp. 1837–1841, Nov. 2006.
[15] D. D. Šiljak and D. M. Stipanović, "Robust stabilization of nonlinear systems: The LMI approach," Math. Prob. Eng., vol. 6, no. 5, pp. 461–493, 2000.
[16] H. Gao and T. Chen, "New results on stability of discrete-time systems with time-varying state delay," IEEE Trans. Autom. Control, vol. 52, no. 2, pp. 328–334, Feb. 2007.
[17] Y. Liu, Z. Wang, J. Liang, and X. Liu, "Stability and synchronization of discrete-time Markovian jumping neural networks with mixed mode-dependent time delays," IEEE Trans. Neural Netw., vol. 20, no. 7, pp. 1102–1116, Jul. 2009.
[18] Y. Liu, Z. Wang, and X. Liu, "On synchronization of discrete-time Markovian jumping stochastic complex networks with mode-dependent mixed time-delays," Int. J. Mod. Phys. B, vol. 23, no. 3, pp. 411–434, Jan. 2009.
[19] Z. Wang, Y. Liu, and X. Liu, "On global asymptotic stability of neural networks with discrete and distributed delays," Phys. Lett. A, vol. 345, nos. 4–6, pp. 299–308, 2005.
Digital Object Identifier 10.1109/TNN.2010.2102871
Digital Object Identifier 10.1109/TNN.2010.2102870
IEEE Computational Intelligence Society Officers President GARY YEN School of Elect. and Comp. Eng. Oklahoma State Univ. Stillwater, OK 74078 E-mail:
[email protected]
President Elect MARIOS M. POLYCARPOU Dept. of Elect. and Comput. Eng. Univ. of Cyprus Nicosia 1678, Cyprus E-mail:
[email protected]
Vice President—Conferences GARY B. FOGEL Natural Selection Inc. San Diego, CA 92121 E-mail:
[email protected]
Vice President—Publications XIN YAO School of Computer Science The Univ. of Birmingham Birmingham B15 2TT, U.K. E-mail:
[email protected]
Vice President—Finances PIERO BONISSONE Computing and Decision Sciences GE Global Research Niskayuna, NY 12309 E-mail:
[email protected]
Vice President—Technical Activities HISAO ISHIBUCHI Dept. of Computer Science & Intelligent Systems Osaka Prefecture Univ. Sakai 599-8531, Japan E-mail:
[email protected]
Vice President—Member Activities PABLO ESTEVEZ Dept. Elect. Eng. Univ. of Chile Santiago, Chile E-mail:
[email protected]
Vice President—Education JENNIE SI Dept. of Elect. Eng. Arizona State Univ. Tempe, AZ 85287 E-mail:
[email protected]
Executive Administrator JO-ELLEN B. SNYDER IEEE Computational Intelligence Society Piscataway, NJ 08854 E-mail:
[email protected]
Technical Activities HISAO ISHIBUCHI Dept. of Computer Science & Intelligent Systems Osaka Prefecture Univ. Sakai 599-8531, Japan
Division X Director VINCENZO PIURI Dept. of Info. Technol. Univ. of Milan 26013 Crema, Italy
Standing Committee Chairs Conferences GARY B. FOGEL Natural Selection Inc. San Diego, CA 92121 Constitution & Bylaws DAVID B. FOGEL Lincoln Vale CA LP Natural Selection Inc. San Diego, CA 92121
Educational Activities JENNIE SI Dept. Elec. Eng. Arizona State Univ. Tempe, AZ 85287-5706
Member Activities PABLO ESTEVEZ Dept. Elect. Eng. Univ. of Chile Santiago, Chile
Publications XIN YAO School of Computer Science The Univ. of Birmingham Birmingham B15 2TT, U.K.
Finance PIERO P. BONISSONE Computing & Decision Sciences General Electric Global Research Niskayuna, NY 12309
Nominations DAVID B. FOGEL Lincoln Vale CA LP Natural Selection Inc. San Diego, CA 92121
Strategic Planning ENRIQUE RUSPINI Artificial Intelligence CTR SRI International Menlo Park, CA 94025-3493
IEEE CIS Technical Committees Neural Networks CESARE ALIPPI Politecnico di Milano Milano 20133, Italy
Computational Finance ROBERT GOLAN DBmind Technologies, Inc. Miami, FL 33140
Fuzzy Systems CHIN-TENG (CT) LIN National Chiao-Tung Univ. Hsinchu City 300, Taiwan
Emerging Technologies CHANG-SHING LEE Dept. of Computer Science and Information Engineering National Univ. of Tainan Tainan 700, Taiwan
Evolutionary Computation CARLOS A. COELLO COELLO CINVESTAV-IPN Dept. Computacion Mexico D.F. 07360, Mexico
Adaptive Dynamic Programming and Reinforcement Learning MARCO WIERING Dept. of Inf. & Comput. Sciences Univ. Utrecht Utrecht 3584CH, The Netherlands
Standards PLAMEN ANGELOV Dept. of Communication Systems Lancaster Univ. Lancaster LA1 4WA, U.K.
Bioinformatics and Bioengineering DANIEK ASHLOCK Univ. of Guelph Guelph, ON, Canada
Data Mining ALESSANDRO SPERDUTI Dipartimento di Matematica Pura ed Applicata Univ. of Padova Padova 35121, Italy
Games SUNG-BAE CHO Dept. of Computer Science Yonsei Univ., Seoul 120-749, Korea
Intelligent Systems Applications THANOS VASILAKOS ECE Dept National Tech. Univ. of Athens 15410 Athens, Greece Autonomous Mental Development MINORU ASADA Graduate School of Eng. Osaka Univ. Osaka, Japan
Editors IEEE Transactions on Evolutionary Computation GARRISON GREENWOOD Dept. of Elect. and Computer Eng. Portland State Univ. Portland, OR 97207-0751
IEEE Transactions on Fuzzy Systems CHIN-TENG (CT) LIN National Chiao-Tung Univ. Hsinchu City, 300, Taiwan
IEEE Transactions on Neural Networks DERONG LIU Dept. of Electrical and Computer Engineering Univ. of Illinois at Chicago, Chicago, IL 60607-7053
IEEE Transactions on Autonomous Mental Development ZHENGYOU ZHANG Microsoft Research Redmond, WA 98052
IEEE CIS Web Page HAIBO HE Dept. of Elect., Comp. and Biomedical Eng. University of Rhode Island Kingston, RI 02881
IEEE CIS E-letter ZENG-GUANG HOU Institute of Automation The Chinese Academy of Sciences Beijing 100080, China
IEEE CIS Magazine KAY CHEN TAN Dept. of Elect. and Comput. Eng. National Univ. of Singapore 117576, Singapore
IEEE Transactions on Computational Intelligence and AI in Games SIMON M. LUCAS School of Comput. Sci. and Electron. Eng. Univ. of Essex Colchester, Essex CO4 3SQ, UK
IEEE Computational Intelligence Society ADCOM Members Elected Members (2009–2013) JAMES C. BEZDEK (2009–2011) Computer Science Dept. Univ. of West Florida Pensacola, FL 32514
ENRIQUE RUSPINI (2009–2011) Artificial Intelligence CTR SRI International Menlo Park, CA 94025-3493
ROBERT KOZMA (2010–2012) Dept. of Mathematical Sciences Univ. of Memphis Memphis, TN 38152
JIM KELLER (2009–2011) Dept. of Elect. and Comput. Eng. Univ. of Missouri-Columbia Columbia, MO 65211
JACEK ZURADA (2009–2011) Dept. of Elect. and Comput. Eng. Univ. of Louisville Louisville, KY 40292
JULIA CHUNG (2009–2011) National Cheng-Kung Univ. Tainan, Taiwan 70101
OSCAR CORDON (2010–2012) Applications of Fuzzy Logic and Evolutionary Algorithms Research Unit European Centre for Soft Computing Mieres, 33600 Spain
NIKHIL R. PAL (2010–2012) Electronics & Communications Sciences Unit Indian Statistical Institute Calcutta 700 108, India
Digital Object Identifier 10.1109/TNN.2010.2102711
C. PRINCIPE, PH.D. (2010–2012) Computational NeuroEngineering Laboratory Univ. of Florida Gainesville, FL 32611 JOSE
LIPO WANG (2010–2012) School of Electrical and Electronic Engineering Nanyang Technological Univ. Singapore 639798 SIMON M. LUCAS (2011-2013) School of Comput. Sci. and Electron. Eng. Univ. of Essex Colchester, CO4 3SQ U.K.
JERRY MENDEL (2011-2013) Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089-2564 LUIS MAGDALENA (2011-2013) European Centre for Soft Computing Mieres, Asturias, Spain
JANUSZ KACPRYZYK (2011-2013) Systems Research Institute BERNADETTE BOUCHON-MEUNIER (2011-2013) Polish Academy of Sciences LIP6 Warsaw 601–447, Poland Université Pierre et Marie Curie 75016 Paris, France
Information for Authors The IEEE TRANSACTIONS ON NEURAL NETWORKS publishes high-quality papers in the theory, design, and application of neural networks, ranging from software to hardware. Emphasis will be given to artificial neural networks. Readers are encouraged to submit manuscripts which disclose significant technical achievements, indicate exploratory developments, or present significant application for neural networks.
English grammar and usage prior to submitting their manuscript for review or during the review process can go directly to http://www.prof-editing.com/ieee/ to submit a manuscript for copy editing. The SPi copy editors edit for grammar, usage, organization, and clarity, querying potentially substantive revisions as necessary. Cost estimates are available immediately online. Edited manuscripts are expected to be returned to the author within two weeks of submission.
A. Types of Contributions TNN publishes three types of articles: • Papers (Full Papers) • Brief Papers • Comments Papers and Communications Full Papers are characterized by novel contributions of archival nature in developing theories and/or innovative applications of neural networks and learning systems. The contribution should not be of incremental nature, but must present a well-founded and conclusive treatment of a problem. Well organized survey of literature on topics of current interest may also be considered. Usually a full paper will not exceed 12 pages of formatted text in the IEEE two-column style. Brief Papers report sufficiently interesting new theories and/or developments on previously published work in neural networks and related areas. For example, brief papers may report an extension of previous results or algorithms, innovative applications of a known approach to interesting problems, brief theoretical results, etc. The contribution should be conclusive and useful. A brief paper must not exceed 6 pages of formatted text in the IEEE two-column style. Comments Papers and Communications are short articles which may be commenting on an error one has found in, or a significant disagreement one has with, a previously published paper. Typically, a comments paper is assigned to the same Associate Editor who handled the published paper being commented on. If the Associate Editor who was handling the previously published paper is no longer available, the Editor-in-Chief will assign the comments paper to another Associate Editor whose expertise closely matches the paper’s topic. Comments papers and communications should comprise a significant contribution of interest to the TNN readership. The authors of the original paper may be invited to submit a rebuttal. A comments paper should be as concise as possible and must not exceed 3 pages formatted in the IEEE two-column style. 
During the review process, submitted manuscripts will NOT be transferred from the Full Paper category to the Brief Paper category after submission/review. It would be the responsibility of authors to decide the category of their manuscript at the time of submission. If a manuscript is reviewed as a Paper and at the end of the review process, Reviewers/Associate Editor/Editor-in-Chief find it not suitable as a full Paper but is a potential candidate for a Brief Paper, then the manuscript has to be resubmitted as a Brief Paper after revision, if authors desire to do so. Review management for Papers and Brief Papers is under the direction of an Associate Editor, who will normally solicit four reviews and wait for at least three responses before a decision is reached. To avoid delay in processing your paper, please follow closely these guidelines.
D. Style for Manuscript The IEEE TRANSACTIONS ON NEURAL NETWORKS follows the format standards of the IEEE. The IEEE "Information for Authors" kit (PDF, 755 KB) can be downloaded from: http://www.ieee.org/portal/cms_docs_iportals/iportals/publications/authors/ transjnl/auinfo07.pdf. Here are some of the general guidelines. A list of keywords and an abstract (described below) are required for all manuscripts submitted to this journal. When submitting a new article through the Manuscript Central system, you may choose your own keywords related to the submitted manuscript. The submitted manuscript must be in the following format: • PDF format; • Singled-spaced, double column, standard IEEE published format. All pages should be numbered. Provide an abstract of reasonable length that is an informative summary of the paper, including any important results found or conclusions drawn. Authors are encouraged to put detailed derivations in appendixes. Except in unusual cases, manuscripts over 12 pages (for full papers), 6 pages (for brief papers) and 3 pages (for comments papers) typed single spaced in standard IEEE two-column style, including figures, appendices, etc., will not be accepted for review. E. Page Charges After a manuscript has been accepted for publication, the author’s company or institution will be requested to pay a charge of $110 per printed page to cover part of the cost of publication. Page charges for this TRANSACTIONS, like those for journals of other professional societies, are not obligatory nor is their payment a prerequisite for publication. The author will receive 100 free reprints without covers if the page charge is honored. F. Copyright It is the policy of the IEEE to own copyright to the technical contributions it publishes. To comply with the IEEE copyright policy, authors are required to sign the IEEE Copyright Transfer Form before publication in either the print or electronic medium. The form is provided upon approval of the manuscript. 
Authors must submit a signed copy of this form with their final manuscripts (after a manuscript is accepted for publication).
B. Submission of Manuscripts
To avoid delay in the processing of your paper, please follow these guidelines closely. Submission and review of new manuscripts is now done through Manuscript Central, the IEEE's online submission and review system. Please log on to mc.manuscriptcentral.com/tnn and follow the directions to create an account (if a first-time user) and to submit your manuscript. If the manuscript is printable (all fonts embedded), it will be entered into the review process. You will be able to check on the status of your manuscript during the review process. The IEEE TRANSACTIONS ON NEURAL NETWORKS is primarily devoted to archival reports of work that has not been published elsewhere. Specifically, conference records and book chapters that have been published are not acceptable unless and until they have been significantly enhanced. In special circumstances or on exceptional occasions, the Editor-in-Chief may deem a contribution noteworthy enough to be exempted from this policy. Authors will be asked to confirm that the work being submitted has not been published elsewhere and is not currently under review by another publication. If either of these conditions is not met or is subsequently violated, the article will be disqualified from possible publication in TNN. Plagiarism in any form will be considered a serious breach of professional conduct, with potentially severe ethical and legal consequences as defined in the IEEE PSPB operational manual, which can be downloaded from: http://www.ieee.org/portal/cms_docs_iportals/iportals/publications/PSPB/opsmanual.pdf
G. Submission of Final Manuscript
Page Numbers: Number all pages, including illustrations, in a single series, with no omitted numbers. Figures should be identified with the figure number.
References and Captions: Put all references at the end of your paper in IEEE style (see the "Information for Authors" kit above). Do not include figure captions on the illustrations themselves. Figure captions should be sufficiently clear that the figures can be understood without detailed reference to the accompanying text. Axes of graphs should have self-explanatory labels, not just symbols.
Illustrations and Photos: All figures will be printed in black and white unless specifically requested otherwise. Color figures involve a cost to be borne by the authors; the exact amount depends on the number of color figures (the total cost includes a flat fee plus a fee per figure). Authors may use color figures, keeping in mind that in IEEE Xplore they will appear in color but in the print copy they will be in black and white. Therefore, when referring to figures in the text, authors should not refer to color, but to other attributes of the figures. Letters should be large enough to be readily legible when the figures are reduced to one- or two-column width, as much as a 4:1 reduction from the original.
Manuscript and Electronic File: For the final printed production of the manuscript, the author will need to provide a single zip file containing all the source files of the peer-approved version. Please be certain that changes made to your paper version are incorporated into your electronic version. Check that your files are complete, including abstract, text, references, footnotes, biographies (for Papers), and figure captions. IEEE can process most commercial software programs, but not page-layout programs. Do not send PostScript files. The preferred programs are TeX, LaTeX, and Word (use standard macros).
An IEEE LaTeX style file can be obtained by visiting the IEEE Author Digital Tool Box website, http://www.ieee.org/publications_standards/publications/authors/authors_journals.html, where standard IEEE LaTeX and Microsoft Word templates can be found by scrolling down the page, by e-mail at [email protected], or by fax at (+1 732) 562 0545.
C. Professional Editing Services
TNN sometimes receives submissions that suffer from poor English usage and readability; such manuscripts are often rejected for that reason alone. Authors may now, at their own cost, engage SPi Publisher Services for pre-submission professional editing. An author willing to get assistance with
Digital Object Identifier 10.1109/TNN.2010.2102712